Intel’s Gaudi2 Chip Is The Only Alternative To NVIDIA GPUs For LLM Training As Per MLPerf Benchmarks

Intel and Habana released MLPerf training benchmarks today, and they contain some very interesting results. NVIDIA's stock price is absolutely soaring on the recent AI (aka LLM) gold rush, owing to the company's GPUs being used to train pretty much all popular LLMs (like the models behind ChatGPT). Intel's Gaudi2 chip, however, is now the only viable alternative to NVIDIA's GPUs for training LLMs, and the company has released benchmarks to back this up.

ChatGPT is likely the most disruptive force the world has seen in a while, and it is clear that the future belongs to LLMs. ChatGPT (free) is based on the GPT-3.5 model, which is in turn based on the GPT-3 base model. The paid tier of ChatGPT is based on GPT-4, but information about that model is extremely sparse and no benchmark exists for it. So training GPT-3 to a sufficient level of accuracy (that is, down to a target value of the loss function) is the most relevant benchmark when deciding which hardware to train on. NVIDIA dominates this field with its Hopper GPUs, but there is finally an alternative: Intel Gaudi2.

Intel is claiming better price/performance than NVIDIA's A100 right now in FP16 workloads and is targeting beating NVIDIA's H100 by September (in FP8 workloads). This is quite an ambitious goal, but the company has benchmarks to back it up. Here is a quick high-level overview of the results:

Gaudi2 delivered impressive time-to-train on GPT-3: 311 minutes on 384 accelerators.

Near-linear 95% scaling from 256 to 384 accelerators on the GPT-3 model.

Excellent training results on computer vision models (ResNet-50 on 8 accelerators and Unet3D on 8 accelerators) and natural language processing models (BERT on 8 and 64 accelerators).

Performance increases of 10% and 4%, respectively, for BERT and ResNet models as compared to the November submission, evidence of growing Gaudi2 software maturity.

Gaudi2 results were submitted “out of the box,” meaning customers can achieve comparable performance results when implementing Gaudi2 on premise or in the cloud.
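To make the 95% scaling figure concrete, here is a rough sketch of the arithmetic behind it. It assumes the usual definition of scaling efficiency (achieved speedup divided by ideal linear speedup), which the results above do not spell out, and the function names are purely illustrative. The 256-accelerator time it prints is implied by the 311-minute and 95% figures, not a number quoted in the results.

```python
def scaling_efficiency(n_small, t_small, n_large, t_large):
    """Achieved speedup divided by ideal (linear) speedup.

    Assumed definition: 1.0 means perfectly linear scaling.
    """
    speedup = t_small / t_large   # how much faster the larger run finished
    ideal = n_large / n_small     # how much faster it would be if scaling were linear
    return speedup / ideal


def implied_time_small(n_small, n_large, t_large, efficiency):
    """Invert the formula above to recover the smaller run's time-to-train."""
    return efficiency * t_large * n_large / n_small


# Figures from the results: 311 minutes on 384 accelerators, 95% scaling from 256.
t256 = implied_time_small(256, 384, 311.0, 0.95)
print(f"Implied time on 256 accelerators: {t256:.1f} minutes")

# Round trip: feeding the implied time back in should return 0.95.
print(f"Efficiency check: {scaling_efficiency(256, t256, 384, 311.0):.2f}")
```

In other words, going from 256 to 384 accelerators (1.5x the hardware) would ideally cut the time by a third; at 95% efficiency the run gets about 1.43x faster instead of 1.5x.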

To put the above into context, the NVIDIA entry can train GPT-3 in 45 minutes but also uses far more GPUs. In the end, the only way to make a proper comparison would be on total cost of ownership (TCO), which requires knowing the exact hardware cost and the TDP/heat constraints. But all of that might be irrelevant because demand far exceeds supply in this space. While NVIDIA GPUs are going to sell like hot cakes, their supply is limited and the market will be starved for silicon that can train LLMs - and that is where Intel's Gaudi2 can likely save the day.
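As a rough illustration of why raw time-to-train is not enough, the sketch below normalizes a run by accelerator count. It only uses figures quoted above (311 minutes on 384 Gaudi2 accelerators, and a 45-minute NVIDIA run); the helper names are hypothetical, and the hourly price is left as an input because no pricing is given here. Treat it as a back-of-the-envelope aid, not a TCO model: it ignores power, cooling, networking and purchase price.

```python
def accelerator_hours(num_accelerators, minutes):
    """Total accelerator-hours consumed by one training run."""
    return num_accelerators * minutes / 60.0


def run_cost(num_accelerators, minutes, hourly_rate_per_accelerator):
    """Cost of one run at a given (hypothetical) per-accelerator hourly rate."""
    return accelerator_hours(num_accelerators, minutes) * hourly_rate_per_accelerator


def break_even_accelerators(ref_accelerators, ref_minutes, other_minutes):
    """Accelerator count at which a run of `other_minutes` consumes the same
    accelerator-time as the reference run."""
    return ref_accelerators * ref_minutes / other_minutes


# Gaudi2 GPT-3 result quoted above: 311 minutes on 384 accelerators.
gaudi2_hours = accelerator_hours(384, 311)
print(f"Gaudi2 run: {gaudi2_hours:.1f} accelerator-hours")

# A 45-minute run uses less accelerator-time than the Gaudi2 run
# only if it needs fewer accelerators than this:
break_even = break_even_accelerators(384, 311, 45)
print(f"Break-even accelerator count for a 45-minute run: {break_even:.0f}")
```

To turn accelerator-hours into money, pass your own quoted rate to `run_cost`. Even at equal accelerator-time, the cheaper run is whichever chip costs less per hour, which is exactly the price/performance argument Intel is making.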

Intel also shared results for its Xeon Platinum class of CPUs, which currently serve as the host CPUs in the best-performing MLPerf submission for LLM training, at just over 10 minutes for GPT-3. Here are the result highlights:

In the closed division, 4th Gen Xeons could train BERT and ResNet-50 models in less than 50 minutes (47.93 mins.) and less than 90 minutes (88.17 mins.), respectively.

With BERT in the open division, the results show that Xeon was able to train the model in about 30 minutes (31.06 mins.) when scaling out to 16 nodes.

For the larger RetinaNet model, Xeon was able to achieve a time of 232 mins. on 16 nodes, allowing customers the flexibility of using off-peak Xeon cycles to train their models over the course of a morning, over lunch or overnight.

4th Gen Xeon with Intel Advanced Matrix Extensions (Intel AMX) delivers significant out-of-box performance improvements that span multiple frameworks, end-to-end data science tools and a broad ecosystem of smart solutions.