Intel Gaudi 3 AI Accelerator Official: 5nm, 128 GB HBM2e, Up To 900W, 50% Faster Than NVIDIA H100 & 40% More Efficient

Intel has finally revealed its next-gen AI accelerator, the Gaudi 3, built on a 5nm process node and competing directly against NVIDIA's H100 GPUs.
Intel's Gaudi AI accelerators have been a major competitor and one of the few alternatives to NVIDIA's GPUs in the AI segment. We recently saw some heated benchmark comparisons between the Gaudi 2 and NVIDIA's A100/H100 GPUs, with Intel showcasing a strong perf/$ lead while NVIDIA remained the overall AI performance leader. Now begins the third chapter in Intel's AI journey: the Gaudi 3 accelerator, which has been fully detailed.
The company announced the Gaudi 3 accelerator, which features the latest (5th Gen) Tensor Processor Core (TPC) architecture with a total of 64 TPCs packed across two compute dies. The chip has a 96 MB cache pool shared across both dies, plus eight HBM sites, each an 8-hi stack of 16 Gb HBM2e DRAM, for up to 128 GB of capacity and up to 3.7 TB/s of bandwidth. The entire chip is fabricated on TSMC's 5nm process node and integrates a total of 24 200GbE interconnect links.
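The headline memory and networking figures fall straight out of the configuration above. As a quick sanity check (a sketch using only the numbers quoted in this article):

```python
# Sanity-check of the Gaudi 3 memory and interconnect figures quoted above.
# All inputs are the article's own numbers; the arithmetic just shows how
# the headline totals are reached.

HBM_SITES = 8        # eight HBM2e stacks around the two compute dies
DIES_PER_STACK = 8   # 8-hi stacks
GBIT_PER_DIE = 16    # 16 Gb HBM2e DRAM dies

capacity_gb = HBM_SITES * DIES_PER_STACK * GBIT_PER_DIE / 8  # Gb -> GB
print(capacity_gb)   # 128.0 GB, matching the quoted capacity

ETH_LINKS = 24       # 200 GbE interconnect links per chip
agg_eth_tbps = ETH_LINKS * 200 / 1000
print(agg_eth_tbps)  # 4.8 Tb/s of aggregate Ethernet bandwidth per accelerator
```

That 4.8 Tb/s of per-chip Ethernet is what enables the standards-based scale-out Intel is pitching against proprietary interconnects.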
In terms of product offerings, the Intel Gaudi 3 AI accelerator will come in a Mezzanine OAM (HL-325L) form factor, with a 900W standard variant and over-900W liquid-cooled variants, and as a PCIe AIC in a full-height, double-wide, 10.5"-long design. The Gaudi 3 HL-338 PCIe cards will use passive cooling and support up to a 600W TDP with the same specifications as the OAM variant.
The company also announced its own HLB-325 baseboard and HLFB-325L integrated subsystem, which can carry up to eight Gaudi 3 accelerators. This system has a combined TDP of 7.6 kilowatts and measures 19".
The follow-up to Gaudi 3 will be Falcon Shores, expected in 2025, which will combine the Gaudi and Xe IPs under a single GPU programming interface built around the Intel oneAPI specification.
Press Release: At Intel Vision, Intel introduced the Intel Gaudi 3 AI accelerator, which delivers 4x the AI compute for BF16, 1.5x the memory bandwidth, and 2x the networking bandwidth of its predecessor for massive system scale-out – a significant leap in performance and productivity for AI training and inference on popular large language models (LLMs) and multimodal models.
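The generational multipliers line up with Gaudi 2's published figures. A rough cross-check (the Gaudi 2 baseline numbers here, 2.45 TB/s of HBM2e bandwidth and 24x 100 GbE links, are assumptions taken from Intel's earlier spec sheets, not from this press release):

```python
# Cross-check of the "1.5x memory bandwidth" and "2x networking" claims
# against Gaudi 2's published figures (assumed baselines, not from this article):
#   Gaudi 2: 2.45 TB/s HBM2e bandwidth, 24 x 100 GbE links.

gaudi2_mem_tbps = 2.45
gaudi3_mem_tbps = gaudi2_mem_tbps * 1.5   # claimed 1.5x uplift
print(gaudi3_mem_tbps)                    # ~3.7 TB/s, matching the quoted spec

gaudi2_net_gbps = 24 * 100
gaudi3_net_gbps = gaudi2_net_gbps * 2     # claimed 2x networking uplift
print(gaudi3_net_gbps)                    # 4800 Gb/s, i.e. 24 x 200 GbE links
```

So the doubling in networking comes from keeping the same 24-link count while moving each port from 100 GbE to 200 GbE.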
The Intel Gaudi 3 accelerator will meet these requirements and offer versatility through open community-based software and open industry-standard Ethernet, helping businesses flexibly scale their AI systems and applications.
How Custom Architecture Delivers GenAI Performance and Efficiency: The Intel Gaudi 3 accelerator, architected for efficient large-scale AI compute, is manufactured on a 5 nanometer (nm) process and offers significant advancements over its predecessor. It is designed to allow activation of all engines in parallel – the Matrix Multiplication Engines (MMEs), Tensor Processor Cores (TPCs), and Networking Interface Cards (NICs) – enabling the acceleration needed for fast, efficient deep learning computation and scale.
The Intel Gaudi 3 accelerator will deliver significant performance improvements for training and inference tasks on leading GenAI models; versus the NVIDIA H100, Intel projects it to be on average up to 50% faster while offering 40% better power efficiency.
About Market Adoption and Availability: The Intel Gaudi 3 accelerator will be available to original equipment manufacturers (OEMs) in the second quarter of 2024 in industry-standard configurations of Universal Baseboard and open accelerator module (OAM). Among the notable OEM adopters that will bring Gaudi 3 to market are Dell Technologies, HPE, Lenovo, and Supermicro. General availability of Intel Gaudi 3 accelerators is anticipated for the third quarter of 2024, and the Intel Gaudi 3 PCIe add-in card is anticipated to be available in the fourth quarter of 2024.
The Intel Gaudi 3 accelerator will also power several cost-effective cloud LLM infrastructures for training and inference, offering price-performance advantages and choice to organizations, which now include NAVER.