AMD Instinct MI300A APU With CDNA 3 GPU, Zen 4 CPU & Unified Memory Offers Up To 4x Speedup Versus Discrete GPUs In HPC

AMD's Instinct MI300A APUs deliver a substantial performance improvement in HPC workloads versus traditional discrete GPUs.
The AMD Instinct MI300A is the realization of the "Exascale APU" concept that was laid out years ago: a high-performance GPU and a high-performance CPU on the same package, sharing a unified memory pool. For HPC, these accelerator/co-processor designs offer strong performance-per-watt advantages, but they also require porting, tuning, and maintaining applications that span millions of lines of code, which can be complicated. Researchers have now used two popular directive-based programming models, OpenMP and OpenACC, to fully utilize AMD's next-gen APU juggernaut.
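To give a sense of what such a port involves on a conventional discrete GPU, here is a minimal, illustrative OpenMP offload sketch; the function and array names are hypothetical and are not taken from OpenFOAM or the paper. On a discrete accelerator, the programmer typically has to spell out host-to-device data movement with map clauses:

```cpp
#include <vector>

// Illustrative only: a simple array update, not actual OpenFOAM code.
// On a discrete GPU, data movement between host and device memory must be
// described explicitly (here via the map clause), and keeping these clauses
// correct across a large codebase is part of the porting/maintenance burden.
void scale_field(std::vector<double>& field, double alpha) {
    double* f = field.data();
    const std::size_t n = field.size();

    // Copy f[0:n] to the device, run the loop there, copy the result back.
    #pragma omp target teams distribute parallel for map(tofrom: f[0:n])
    for (std::size_t i = 0; i < n; ++i) {
        f[i] *= alpha;
    }
}
```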
The research paper, titled "Porting HPC Applications to AMD Instinct MI300A Using Unified Memory and OpenMP", uses the OpenFOAM framework, an open-source C++ library for computational fluid dynamics.
Because the AMD Instinct MI300A uses a unified HBM pool shared by the CPU and GPU, it eliminates the need for data replication and removes the programming distinction between host and device memory spaces. In addition, AMD's ROCm software stack provides optimizations that tie all segments of the APU together into one coherent, heterogeneous package.
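As a hedged illustration of what the paper describes, here is the same kind of loop written against OpenMP's unified_shared_memory requirement (OpenMP 5.x), which ROCm's offload compilers support on MI300A; again, the names are illustrative rather than taken from the paper. With host and device sharing one address space, no map clauses or explicit copies are needed:

```cpp
#include <vector>

// Illustrative only. The unified_shared_memory requirement tells the OpenMP
// runtime that host and device share a single address space, so offloaded
// regions can dereference ordinary host pointers directly.
// Typically built with an OpenMP offload-capable compiler such as ROCm's
// amdclang++; exact flags vary by toolchain.
#pragma omp requires unified_shared_memory

void scale_field(std::vector<double>& field, double alpha) {
    double* f = field.data();
    const std::size_t n = field.size();

    // No map clause: on MI300A the GPU reads and writes the std::vector's
    // storage in the shared HBM pool, so no data replication occurs.
    #pragma omp target teams distribute parallel for
    for (std::size_t i = 0; i < n; ++i) {
        f[i] *= alpha;
    }
}
```

The practical upshot, under these assumptions, is that ordinary heap or std::vector storage can be handed straight to the offloaded region, which is what removes most of the data-management burden the article describes for discrete GPUs.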
The result is a substantial performance benefit. In the evaluation, which used OpenFOAM's HPC motorbike benchmark, the AMD Instinct MI300A APU was tested against the AMD Instinct MI210, the NVIDIA A100 (80 GB), and the NVIDIA H100 (80 GB). The AMD GPUs ran on the ROCm 6.0 stack, while the NVIDIA GPUs ran on the CUDA 12.2.2 stack. The benchmark was configured to run for 20 time steps, with the average execution time per time step (in seconds) taken as the figure of merit (FOM). The three discrete GPUs were each paired with a socketed host CPU and configured with heterogeneous memory management so that the GPUs could address system memory and run the benchmark.
In the tests, results were normalized to the NVIDIA H100 system, which delivered the best performance of the three discrete GPUs. Even so, the Instinct MI300A APU came out with a 4x gain over the NVIDIA H100 and a 5x gain over the Instinct MI210 accelerator.
The researchers also found that a single AMD Instinct MI300A package, with its integrated Zen 4 CPU, was twice as fast as a single-socket Zen 4 CPU paired with a discrete GPU. Overloading the MI300A APU with multiple processes (tested with 3-6 CPU cores per APU) improved performance by a further 2x, in contrast to the lack of scaling seen on the discrete GPU plus discrete CPU configurations.
As a result, the compute capabilities of the AMD Instinct MI300A APU look set to be hard to match in the HPC segment. NVIDIA has de-emphasized traditional HPC performance in its next-gen Blackwell lineup as AI becomes the main focus, and while AMD is addressing AI with its MI300X accelerators and their future refreshes, the HPC segment looks poised to put AMD in the spotlight.
News Source: Nicholas Malaya