Nvidia is raising its game in data centers, extending its reach across different types of AI workloads with the Tesla T4 GPU, based on its new Turing architecture and, along with related software, designed for blazing acceleration of applications for images, speech, translation and recommendation systems.
The T4, a small-form-factor accelerator card, is the essential component in Nvidia's new TensorRT Hyperscale Inference Platform and is expected to ship in data-center systems from major server makers in the fourth quarter.
The T4 features Turing Tensor Cores, which support different levels of compute precision for different AI applications, as well as the major software frameworks – including TensorFlow, PyTorch, MXNet, Chainer, and Caffe2 – for so-called deep learning, machine learning involving multi-layered neural networks.
"The Tesla T4 is based on the Turing architecture, which I believe will revolutionize how AI is deployed in data centers," said Nvidia CEO Jensen Huang, unveiling the new GPU and platform at the company's GTC event in Tokyo Wednesday. "The Tensor Core GPU is a reinvention of our GPU – we decided to reinvent the GPU altogether."
The massively parallel architecture of GPUs makes them well-suited for AI. Nvidia GPUs' parallel-computing capabilities, coupled with sheer processing horsepower, have made them the technology of choice for AI for a number of years now, particularly for training machine-learning models on large data sets – essentially creating deep learning neural-network models.
Multiprecision processing is an advantage for AI inferencing
The big step forward for the T4 GPUs and the new inference platform is the ability to process data at a wider range of precision levels than the prior Nvidia P4 GPUs, which are based on the Pascal architecture.
Once neural network models are trained on massive data sets, they are deployed into applications for inferencing — the classification of data to "infer" a result. While training is compute intensive, inferencing in deployed real-world applications requires as much flexibility as possible from processors.
Ideally, each layer of a neural network should be processed with the lowest precision suitable for that layer, for application speed and power efficiency.
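To make the idea concrete, here is a minimal Python sketch of picking the narrowest number format that keeps a layer's values within an error tolerance. It is an illustration only, not how TensorRT or the T4 selects precision: the function names, the sample weights and the tolerance values are all hypothetical, and the INT8 step uses plain symmetric linear quantization as a stand-in.

```python
import struct


def round_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision (FP16) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def round_int8(values):
    """Quantize to signed 8-bit integers (symmetric, linear) and back to floats."""
    scale = max(abs(v) for v in values) / 127.0
    if scale == 0.0:
        return list(values)
    return [round(v / scale) * scale for v in values]


def max_error(original, approx):
    """Largest absolute difference between original and approximated values."""
    return max(abs(a - b) for a, b in zip(original, approx))


def lowest_precision(values, tolerance):
    """Return the narrowest format whose round-trip error is within tolerance.

    Hypothetical heuristic for illustration; falls back to the label "FP32"
    when neither INT8 nor FP16 is accurate enough.
    """
    candidates = [
        ("INT8", round_int8(values)),
        ("FP16", [round_fp16(v) for v in values]),
    ]
    for name, approx in candidates:
        if max_error(values, approx) <= tolerance:
            return name
    return "FP32"


# Made-up weights standing in for one layer of a trained network.
layer_weights = [0.8123, -0.4571, 0.0312, -0.9987, 0.2504]

for tolerance in (1e-2, 1e-3, 1e-6):
    print(tolerance, lowest_precision(layer_weights, tolerance))
```

A layer that tolerates coarse values lands on INT8, while a stricter tolerance pushes the same weights up to FP16 or FP32, which is the trade-off between speed and accuracy described above.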
"By creating an architecture that can mix and match all of these mixed precisions we can maximize accuracy as well as throughput, all at 75 watts," Huang said, adding that T4s are at least eight times faster than P4s and in some cases 40 times faster.
The need for inferencing is growing rapidly, as data centers have put into production a wide variety of applications handling billions of voice queries, translations, images and videos, recommendations and social-media interactions. Nvidia estimates that the AI inference industry is poised to grow in the next five years into a $20 billion market. Different applications require different levels of neural network processing.
"You don't want to do 32-bit floating-point calculations if the application requires 16-bit," said Patrick Moorhead, founder of analyst firm Moor Insights & Strategy. "Nvidia has totally raised the bar in the data center for AI with the new inferencing platform."
What is the TensorRT Hyperscale Inference Platform?
The components of the TensorRT Hyperscale Inference Platform, which is built around a small, 75-watt PCIe card, include:
- The Nvidia Tesla T4 GPU, featuring 320 Turing Tensor Cores and 2,560 CUDA (Compute Unified Device Architecture) cores. CUDA is Nvidia's parallel-computing platform and programming model. T4 multiprecision capabilities range from FP32 and FP16 (32-bit and 16-bit floating-point arithmetic) to INT8 and INT4 (8-bit and 4-bit integer arithmetic). The T4 is capable of peak performance of 65 teraflops for FP16, 130 tera-operations per second (TOPS) for INT8 and 260 TOPS for INT4.
- TensorRT 5, an inference optimizer and runtime for deep learning. It's designed for low-latency, high-throughput inference to quickly optimize, validate and deploy trained neural networks for inference in hyperscale data centers and on embedded or automotive GPU platforms. It supports the TensorFlow, MXNet, Caffe2 and MATLAB frameworks, and other frameworks via ONNX (Open Neural Network Exchange).
- The TensorRT Inference Server, which Nvidia is making available from its GPU Cloud as an inference server for data-center deployments. It's designed to scale up both training and inferencing deployments to multicloud GPU clusters, and integrates with Kubernetes and Docker, letting developers automate deployment, scheduling and operation of multiple GPU application containers across clusters of nodes.
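The peak figures quoted for the T4 in the list above follow a simple pattern: each halving of operand width doubles the peak rate. The short Python sketch below works through that arithmetic, using only the numbers given in this article (65, 130 and 260 trillion operations per second, and the 75-watt power figure). The per-watt values it derives are theoretical ceilings computed from those peaks, not measured results, and the function names are illustrative.

```python
# Peak throughput from the article, in trillions of operations per second.
PEAK_TERA_OPS = {"FP16": 65, "INT8": 130, "INT4": 260}

# Board power from the article, in watts.
BOARD_POWER_WATTS = 75


def speedup(precision: str, baseline: str = "FP16") -> float:
    """Ratio of peak throughput at `precision` to peak throughput at `baseline`."""
    return PEAK_TERA_OPS[precision] / PEAK_TERA_OPS[baseline]


def peak_per_watt(precision: str) -> float:
    """Theoretical peak tera-operations per second per watt (an upper bound)."""
    return PEAK_TERA_OPS[precision] / BOARD_POWER_WATTS


for name in PEAK_TERA_OPS:
    print(f"{name}: {speedup(name):.0f}x FP16 peak, "
          f"{peak_per_watt(name):.2f} tera-ops per watt")
```

Real workloads rarely reach peak rates, so these ratios are best read as the headroom that lower precision makes available rather than as expected application speedups.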
Software support is key
"We are continuing to invest and optimize our entire software stack from the bottom and we’re doing so by leveraging the available frameworks so everyone can run their neural networks turnkey out of the box right away – they can take their training models and turn around and deploy them that very day," said Ian Buck, vice president of Nvidia's Accelerated Computing business unit.
In the area of AI inferencing, Nvidia has seen competition from makers of FPGAs (field programmable gate arrays), particularly Xilinx. The programmability of FPGAs lets developers fine-tune the precision of the computation used for different levels of deep neural networks. But FPGAs have posed a steep learning curve for programmers. Customizing FPGAs was done for years via Hardware Description Languages (HDLs), rather than the higher-level languages used for other chips.
FPGAs offer competition for GPUs
In March, Xilinx unveiled what it calls a new product category – the Adaptive Compute Acceleration Platform (ACAP) – that will have more software support than traditional FPGAs. The first ACAP version, code-named Everest, is due to ship next year, and Xilinx says that software developers will be able to work with Everest using tools like C/C++, OpenCL, and Python. Everest can also be programmed at the hardware register-transfer level (RTL) using HDL tools like Verilog and VHDL.
But the software support offered by the T4 GPUs, coupled with their multiprecision capabilities, seems destined to fortify Nvidia's position in both AI training and inferencing.
"We believe we have the most efficient inferencing platform," Buck said. "We measure ourselves on the real production workloads that we're seeing today and are being seen by our customers – we work with all of them on our stack from top to bottom to make sure we're offering not just the best training but now also the best inferencing platform."
Virtually all the server makers currently using the P4 GPUs will be on T4 by the end of the year, Buck said. At the Tokyo event, support for the T4 was voiced by data-center system makers including Cisco, Dell EMC, Fujitsu, HPE, IBM, Oracle and Supermicro.
In addition, Google said it would be using the new T4s.