Elon’s ‘Colossus’ Supercomputer Built With 100K H100 NVIDIA GPUs Goes Online, H200 Upgrade Coming Soon

Colossus, the largest AI supercomputer, built by Elon Musk's xAI, goes online with 100K NVIDIA H100 GPUs and will soon double in size to 200K GPUs with the addition of 50K NVIDIA H200s.
Elon Musk's AI venture xAI has completed its 'Colossus' supercomputer, which went online on Labor Day a few days ago. Musk said Colossus is the 'most powerful AI training system in the world' and was built in 122 days from start to finish. The Colossus supercomputer uses 100,000 NVIDIA H100 data center GPUs, making it the largest training cluster built with that many H100s.
This weekend, the @xAI team brought our Colossus 100k H100 training cluster online. From start to finish, it was done in 122 days.
Colossus is the most powerful AI training system in the world. Moreover, it will double in size to 200k (50k H200s) in a few months.
Excellent…
— Elon Musk (@elonmusk) September 2, 2024
Elon also announced that in the coming months, Colossus will be upgraded with 50,000 more H200 GPUs, NVIDIA's flagship data center GPU based on the Hopper architecture. The H200 is significantly more powerful than the H100, delivering almost 45% higher compute performance in specific generative AI and HPC workloads.
NVIDIA congratulated the xAI team for completing such a large project in just four months, adding:
Colossus is powered by NVIDIA's #acceleratedcomputing platform, delivering breakthrough performance with exceptional gains in #energyefficiency.
The xAI Colossus project broke ground in Memphis in June, and training commenced in July. The cluster will be used to train Grok 3, which is expected by December to replace Grok 2 as, in Musk's words, the most powerful AI in the world. Colossus came after the end of xAI's deal with Oracle, which had been renting servers to the company. The new supercluster is now more powerful than anything Oracle could provide and is set to be doubled in performance in a few months with the addition of 50K more H200 GPUs.
The H200 brings 61GB more memory (141GB of HBM3e versus 80GB of HBM3) and significantly higher memory bandwidth at 4.8TB/s, compared to 3.35TB/s on the H100. That said, with such a drastic jump in specs, the H200 consumes up to 300W more power and will require liquid cooling, just as the H100s in Colossus do.
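For context, those deltas follow directly from NVIDIA's published spec sheets. A minimal back-of-the-envelope sketch in Python, assuming the public figures of 80GB HBM3 at 3.35TB/s for the H100 SXM and 141GB HBM3e at 4.8TB/s for the H200:

```python
# Quick comparison of NVIDIA's published H100 vs. H200 memory specs.
# Figures are taken from NVIDIA's public spec sheets; treat them as approximate.

h100 = {"memory_gb": 80, "bandwidth_tb_s": 3.35}   # H100 SXM: HBM3
h200 = {"memory_gb": 141, "bandwidth_tb_s": 4.8}   # H200: HBM3e

extra_memory = h200["memory_gb"] - h100["memory_gb"]              # 61 GB more per GPU
bandwidth_gain = h200["bandwidth_tb_s"] / h100["bandwidth_tb_s"]  # ~1.43x

print(f"Extra memory per GPU: {extra_memory} GB")
print(f"Memory bandwidth: {bandwidth_gain - 1:.0%} higher than the H100")
```

Run as-is, this prints 61GB of extra memory per GPU and a roughly 43% bandwidth uplift, which lines up with the figures quoted above.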
At the moment, Colossus is the only supercomputer to have reached 100K NVIDIA GPUs, followed by Google AI with 90K GPUs and the popular OpenAI with 80K H100s; Meta AI and Microsoft AI are next with 70K and 60K GPUs, respectively.
News Source: @NVIDIADC