NVIDIA Hopper H100 & L4 Ada GPUs Achieve Record-Breaking Performance In MLPerf AI Benchmarks

NVIDIA has just unveiled record-breaking performance figures for its Hopper H100 & L4 Ada GPUs in the MLPerf AI benchmarks.

Today, NVIDIA is presenting its latest figures achieved in MLPerf Inference 3.0. There are three main highlights: new Hopper H100 records that show the progress of the flagship AI GPU over the past six months through software optimizations, the first results for the L4 GPU based on the Ada graphics architecture that was announced at GTC 2023, and updated results for the Jetson AGX Orin, which gets much faster thanks to similar software and platform power-level optimizations. To sum up, these are the highlights we are going to look at today:

  • H100 Sets New Inference Records: Up To 54% more performance vs prior submission
  • L4 Supercharges Mainstream Inference: Over 3X faster than T4
  • Another Big Leap for Jetson AGX Orin: Up To 57% more efficiency vs prior submission

For today's benchmark suite, NVIDIA is looking at MLPerf Inference v3.0, which retains the same workloads used in submissions six months ago but adds a network division that measures how data is sent into an inferencing platform to get the work done (a rough idea of what that measures is sketched below). NVIDIA also notes that over a product's lifespan, the company can squeeze out almost 2x the performance through software optimizations alone, as has already been seen on past GPUs such as the Ampere A100.
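
As an illustration only, here is a minimal Python sketch of the kind of end-to-end query the network division times: the client serializes a request, sends it over a real network hop, and the measured latency includes transport and (de)serialization rather than just compute. The endpoint URL and JSON payload format are hypothetical, not part of the MLPerf harness.

```python
# Hypothetical sketch of a networked inference query: the measured latency
# covers serialization, transport, and deserialization, not just model compute.
import json
import time
import urllib.request

def send_query(endpoint: str, sample: dict) -> tuple[dict, float]:
    """Send one inference query and return (result, end-to-end latency in seconds)."""
    payload = json.dumps(sample).encode("utf-8")
    start = time.perf_counter()
    req = urllib.request.Request(endpoint, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        result = json.loads(resp.read())
    return result, time.perf_counter() - start

# Example (assumes a JSON inference server is listening locally):
# result, latency = send_query("http://127.0.0.1:8000/infer", {"tokens": [101, 2023]})
```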

Starting with the Hopper H100 performance tests, we see the MLPerf inference results in the Offline and Server categories. The Offline benchmarks show up to a 4.5x performance increase over the Ampere A100 (BERT 99.9%), while in the Server scenario, the H100 yields an impressive 4.0x jump over its predecessor.

To achieve this level of performance, NVIDIA leans on FP8 through the Transformer Engine embedded within the Hopper architecture. It works on a per-layer basis, analyzing all of the work sent through it and deciding whether the data can be run in FP8 without compromising accuracy. If it can, the layer runs in FP8; if not, the Transformer Engine falls back to FP16 math ops with FP32 accumulation. Since Ampere didn't have a Transformer Engine, it ran on FP16+FP32 rather than FP8.
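
For intuition, here is a minimal, illustrative Python sketch of that per-layer fallback, not NVIDIA's actual Transformer Engine: emulate an FP8-like (E4M3-style) rounding of a layer's inputs, compare against an FP16-input/FP32-accumulate reference, and keep the low-precision result only if the error stays within a tolerance. The range and mantissa values here are simplified assumptions.

```python
import numpy as np

def fake_fp8(x: np.ndarray) -> np.ndarray:
    """Crudely emulate FP8 (E4M3-like): clip the range, coarsen the mantissa."""
    x = np.clip(x, -448.0, 448.0)                  # E4M3-style max magnitude
    scale = 2.0 ** np.floor(np.log2(np.abs(x) + 1e-12))
    return np.round(x / scale * 8) / 8 * scale     # keep ~3 mantissa bits

def run_layer(w: np.ndarray, a: np.ndarray, tol: float = 1e-2) -> np.ndarray:
    # Reference path: FP16-rounded inputs, FP32 accumulation (the pre-Hopper path).
    w16 = w.astype(np.float16).astype(np.float32)
    a16 = a.astype(np.float16).astype(np.float32)
    ref = w16 @ a16
    # Candidate path: FP8-emulated inputs.
    low = (fake_fp8(w) @ fake_fp8(a)).astype(np.float32)
    rel_err = np.linalg.norm(low - ref) / (np.linalg.norm(ref) + 1e-12)
    return low if rel_err < tol else ref           # keep FP8 only if accurate enough

rng = np.random.default_rng(0)
out = run_layer(rng.normal(size=(64, 128)), rng.normal(size=(128, 32)))
```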

Comparing their data to Intel's fastest 4th Gen Xeon Sapphire Rapids chip, the 8480+, the Hopper H100 GPU simply crushes it in all of the performance tests, showing why GPUs are still the way to go for inferencing despite Intel adding a range of AI-based accelerators to its new chips.

Moving over to Hopper's software progression: in the six months since the H100's availability, the GPU has seen up to a 54% improvement, mainly in imaging-based networks. In 3D U-Net, a medical imaging network, the H100 GPU sees a 31% uplift, and even in BERT 99%, shown above, the new chip gets a 12% boost over its previous benchmark submission. This is achieved through new software advances such as optimized non-maximum suppression kernels and sliding window batching on sub-volumes.
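
As a rough illustration of the second technique, here is a hedged Python sketch of sliding window batching over sub-volumes, in the spirit of 3D U-Net inference: tile a 3D volume into windows and push them through the model in batches rather than one window at a time. The `model` callable, window size, and non-overlapping stride are simplifying assumptions; real pipelines typically overlap windows and blend the seams.

```python
import numpy as np

def sliding_window_batched(volume: np.ndarray, window: int, batch_size: int, model):
    d, h, w = volume.shape
    out = np.zeros_like(volume)
    # Collect the origin of every window (stride == window size for brevity).
    origins = [(z, y, x)
               for z in range(0, d - window + 1, window)
               for y in range(0, h - window + 1, window)
               for x in range(0, w - window + 1, window)]
    for i in range(0, len(origins), batch_size):
        chunk = origins[i:i + batch_size]
        batch = np.stack([volume[z:z+window, y:y+window, x:x+window]
                          for z, y, x in chunk])
        preds = model(batch)                       # one call per batch, not per window
        for (z, y, x), pred in zip(chunk, preds):
            out[z:z+window, y:y+window, x:x+window] = pred
    return out

# Usage with a dummy "model" that just thresholds its input:
vol = np.random.default_rng(0).random((64, 64, 64), dtype=np.float32)
seg = sliding_window_batched(vol, window=32, batch_size=4,
                             model=lambda b: (b > 0.5).astype(np.float32))
```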

NVIDIA's L4 has also made its first appearance in MLPerf. The small form factor L4 GPU was announced at GTC 2023 as a pure Tensor Core product that also supports FP8 instructions on the Ada architecture, although the Transformer Engine remains specific to Hopper GPUs. As the successor to the T4, the L4 is not only an inference-first product but also features several video encode capabilities for AI-based video encoding.

As for performance, the NVIDIA L4 GPU delivers a massive increase of up to 3.1x over its predecessor, once again in BERT 99.9%, and it is 2x across the board in inference benchmarks at the same power.

Being a small form factor design with a 72W power envelope means that the L4 can be used in a range of servers without redesigning the server chassis or power delivery to host such a tiny card. Like its predecessor, the L4 looks set to be a very popular server and CSP product: almost all CSPs have a T4 instance, and Google has recently announced its L4 instances, which are already in private preview, with more CSPs on the way.

Lastly, we have the latest performance leaps delivered to the Jetson AGX Orin through the JetPack SDK. The Orin SOC has been out for a year now, and NVIDIA is showcasing a significant uplift: in raw performance, the Orin SOC gets up to an 81% boost, while in power efficiency the chip shows up to a 63% gain, a dramatic improvement and a testament to NVIDIA's commitment to GPU and silicon longevity in the server space.

These performance improvements aren't limited to the Jetson AGX Orin; even the card-sized Orin NX, which comes with 16 GB of onboard memory in a small form factor design, gets up to a 3.2x performance uplift over the Xavier NX, another great improvement, and customers can expect even more performance in the future.

While on the topic of MLPerf, Deci also announced that it has achieved record-breaking inference speed on NVIDIA GPUs at MLPerf. The chart below illustrates the throughput per TeraFLOPs achieved by Deci and other submitters within the same category. Deci delivered the highest throughput per TeraFLOPs while also improving accuracy. This inference efficiency translates into significant savings on compute and a better user experience. Instead of relying on more expensive hardware, teams using Deci can now run inference on NVIDIA's A100 GPU, achieving 1.7x faster throughput and +0.55 better F1 accuracy compared to running on NVIDIA's H100 GPU. This means a 68% cost saving per inference query.
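
For clarity, here is the back-of-the-envelope arithmetic behind the two metrics in that claim, throughput per TeraFLOPs and cost per query. All numbers below are hypothetical placeholders rather than Deci's or NVIDIA's published figures; only the formulas matter.

```python
# Hypothetical inputs: substitute real benchmark and pricing data.
throughput_qps = 3_000.0      # queries/second on a given GPU (placeholder)
peak_tflops    = 312.0        # peak TeraFLOPs of that GPU (placeholder)
price_per_hour = 4.00         # cloud price in USD/hour (placeholder)

efficiency = throughput_qps / peak_tflops                  # queries/sec per TFLOP
cost_per_query = price_per_hour / (throughput_qps * 3600)  # USD per query

print(f"{efficiency:.2f} qps/TFLOP, ${cost_per_query:.8f} per query")
```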

Other benefits of Deci's results include the ability to migrate from multi-GPU to a single GPU, lowering inference costs and reducing engineering effort. For example, ML engineers using Deci can achieve higher throughput on one H100 card than on 8 NVIDIA A100 cards combined. In other words, with Deci, teams can replace 8 NVIDIA A100 cards with a single NVIDIA H100 card while getting higher throughput and better accuracy (+0.47 F1).

On the NVIDIA A30 GPU, which is a more affordable GPU, Deci delivered accelerated throughput and a 0.4% increase in F1 accuracy compared to an FP32 baseline.

By using Deci, teams that previously needed to run on an NVIDIA A100 GPU can now migrate those workloads to the NVIDIA A30 GPU and achieve 3x better performance than they previously had for roughly a third of the compute price, which works out to roughly 9x better performance per dollar: dramatically better performance for significantly less inference cloud cost.
