AMD MI300X Up To 3x Faster Than NVIDIA H100 In LLM Inference AI Benchmarks, Offers Competitive Pricing Too

Tensorwave has published the latest benchmarks of the AMD MI300X in LLM Inference AI workloads, offering 3x higher performance than NVIDIA H100.

AI cloud provider Tensorwave has showcased the performance of AMD's MI300X accelerator in AI LLM inference benchmarks against the NVIDIA H100. The company is one of many providers offering cloud instances powered by AMD's latest Instinct accelerators, and it looks like AMD might just have the lead, not only in performance but also in value.

In a blog post, Tensorwave demonstrates how AMD's MI300X, paired with MK1's accelerated inference engine, delivers faster, optimized performance across multiple LLMs (Large Language Models).

The company used the Mixtral 8x7B model and conducted both online and offline tests on AMD and NVIDIA hardware. The test setup included 8 MI300X accelerators, each with 192 GB of memory, and 8 NVIDIA H100 SXM5 accelerators, each with 80 GB of memory. AMD's setup ran the ROCm 6.1.2 driver suite with the MK1 inference engine and AMD's ROCm-optimized fork of vLLM v0.4.0, while NVIDIA's setup ran the CUDA 12.2 driver stack (the latest is CUDA 12.5) with the vLLM v0.4.3 inference stack.

AMD

  • Hardware: TensorWave node equipped with 8 MI300X accelerators, 2 AMD EPYC CPUs (192 cores), and 2.3 TB of DDR5 RAM.
  • MI300X Accelerator: 192GB VRAM, 5.3 TB/s, ~1300 TFLOPS for FP16
  • Drivers: ROCm 6.1.2
  • Inference Stack: MK1’s inference engine (Flywheel) v0.9.2 and AMD’s ROCm optimized fork of vLLM (rocm/vllm) v0.4.0.
  • Configuration: Tensor parallelism set to 1 (tp=1), since the entire Mixtral 8x7B model fits in a single MI300X's 192GB of VRAM.

NVIDIA

  • Hardware: Baremetal node with 8 H100 SXM5 accelerators with NVLink, 160 CPU cores, and 1.2 TB of DDR5 RAM.
  • H100 SXM5 Accelerator: 80GB VRAM, 3.35 TB/s, ~986 TFLOPS for FP16
  • Drivers: CUDA 12.2
  • Inference Stack: vLLM v0.4.3
  • Configuration: Tensor parallelism set to 2 (tp=2), which is required to fit Mixtral 8x7B across two H100s' 80GB of VRAM each.

Notes

  • All benchmarks are performed using the Mixtral 8x7B model.
  • All inference frameworks are configured to use FP16 compute paths. Enabling FP8 compute is left for future work.
  • To make an accurate comparison between systems with different tensor-parallelism settings, we extrapolate the MI300X's single-GPU throughput by a factor of 2.

In terms of offline performance, the AMD MI300X AI accelerator showed a performance uplift ranging from 22% all the way up to 194% (almost 3x) over the NVIDIA H100 across batch sizes from 1 to 1024. The MI300X accelerator was faster than the H100 at every batch size.
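
As a rough sanity check (my own back-of-the-envelope arithmetic, not figures from the Tensorwave post), the tensor-parallelism settings above follow directly from Mixtral 8x7B's FP16 memory footprint; the sketch below also illustrates the throughput extrapolation described in the notes, with made-up numbers:

```python
# Mixtral 8x7B has roughly 46.7B total parameters; FP16 weights take 2 bytes each.
PARAMS_BILLIONS = 46.7
BYTES_PER_PARAM = 2  # FP16

weights_gb = PARAMS_BILLIONS * BYTES_PER_PARAM  # ~93 GB of weights

# A single MI300X (192 GB) holds the whole model (tp=1); a single
# H100 (80 GB) cannot, so two are needed (tp=2).
print(f"{weights_gb:.1f} GB of FP16 weights")
print("fits one MI300X (192 GB):", weights_gb <= 192)  # True
print("fits one H100 (80 GB):  ", weights_gb <= 80)    # False

# To compare a tp=1 MI300X run against a tp=2 H100 pair on equal GPU
# count, extrapolate the single-GPU throughput by the parallelism factor.
def extrapolate_throughput(tokens_per_s: float, tp: int, target_tp: int = 2) -> float:
    """Scale measured throughput from `tp` GPUs to a `target_tp`-GPU estimate."""
    return tokens_per_s * target_tp / tp

# Illustrative number only, not benchmark data:
print(extrapolate_throughput(1000.0, tp=1))  # 2000.0
```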

In online performance, Tensorwave designed a series of online tests to simulate a realistic chat application. The key metrics of interest are:

  • Throughput (Requests per Second): The number of requests the system can handle per second for a given workload.
  • Average Latency (Seconds): The average time taken to generate a full response for each request.
  • Time Per Output Token (TPOT): The time to generate each subsequent token (averaged) after the first token, which impacts the overall speed of generating long responses.

Here, the AMD MI300X accelerator handles 33% more requests per second than two NVIDIA H100 GPUs while maintaining an average latency of 5 seconds. The MI300X accelerator also sustains much higher throughput than the H100, generating text faster at higher volumes of traffic.
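
To make the metric definitions above concrete, here is a minimal sketch (a hypothetical helper with synthetic timings, not Tensorwave's actual harness) of how each is computed from per-token arrival times:

```python
def request_metrics(token_times: list[float]) -> tuple[float, float]:
    """token_times[i] = seconds from request start until token i arrives.
    Returns (full-response latency, time per output token after the first)."""
    latency = token_times[-1]
    # TPOT: average gap between consecutive tokens after the first one.
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return latency, tpot

# Two simulated requests: first token at 0.5 s, then one token every 25 ms.
requests = [[0.5 + 0.025 * i for i in range(101)] for _ in range(2)]
for times in requests:
    lat, tpot = request_metrics(times)
    print(f"latency={lat:.3f}s  TPOT={tpot:.3f}s")  # latency=3.000s  TPOT=0.025s

# Throughput (requests per second) over the whole run, with an assumed duration:
wall_clock_s = 4.0
print(len(requests) / wall_clock_s)  # 0.5
```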

Note - NVIDIA's Hopper H100 GPUs were running the vLLM suite rather than the TensorRT-LLM optimizations, and on an older CUDA stack from last year. The latest optimizations in the software stack have already boosted the performance of NVIDIA's Hopper GPUs by a sizable margin.

    Our benchmarks demonstrate that AMD's MI300X outperforms NVIDIA's H100 in both offline and online inference tasks for MoE architectures like Mixtral 8x7B. The MI300X not only offers higher throughput but also excels in real-world scenarios requiring fast response times.

    Given its impressive performance, competitive cost, and hardware availability, the MI300X with MK1 software is an excellent choice for enterprises looking to scale their AI inference capabilities.

    via Tensorwave

In the end, Tensorwave praised the high performance and very competitive pricing of AMD's MI300X accelerators against the NVIDIA H100. The company's CEO has already highlighted the MI300X as the far superior option versus the H100. The MI300X is also said to be readily available, whereas the H100 is mostly booked out. You can learn more about Tensorwave's MI300X cloud instances here.
