Google Leverages NVIDIA’s L4 GPUs To Let You Run AI Inference Apps On The Cloud
Google is leveraging NVIDIA's L4 GPUs to let users run AI inference applications, such as generative AI models, on its Cloud Run platform.

Press Release: Developers love Cloud Run for its simplicity, fast autoscaling, scale-to-zero capabilities, and pay-per-use pricing. Those same benefits come into play for real-time inference apps serving open gen AI models. That's why today, we're adding support for NVIDIA L4 GPUs to Cloud Run, in preview.

This opens the door to many new use cases for Cloud Run developers:

  • Performing real-time inference with lightweight open models such as Google’s open Gemma (2B/7B) models or Meta’s Llama 3 (8B) to build custom chatbots or on-the-fly document summarization, while scaling to handle spiky user traffic.
  • Serving custom fine-tuned gen AI models, such as image generation tailored to your company's brand, and scaling down to optimize costs when nobody's using them.
  • Speeding up your compute-intensive Cloud Run services, such as on-demand image recognition, video transcoding and streaming, and 3D rendering.
As a fully managed platform, Cloud Run lets you run your code directly on top of Google's scalable infrastructure, combining the flexibility of containers with the simplicity of serverless to help boost your productivity. With Cloud Run, you can run frontend and backend services, batch jobs, deploy websites and applications, and handle queue processing workloads — all without having to manage the underlying infrastructure.

At the same time, many workloads that perform AI inference, especially applications that demand real-time processing, require GPU acceleration to deliver responsive user experiences. With support for NVIDIA GPUs, you can perform on-demand online AI inference using the LLMs of your choice in seconds. With 24 GB of VRAM, you can expect fast token rates for models with up to 9 billion parameters, including Llama 3.1 (8B), Mistral (7B), and Gemma 2 (9B). When your app is not in use, the service automatically scales down to zero so that you are not charged for it.
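As a rough illustration of what calling such a service could look like from the client side, here is a minimal Python sketch. It assumes a Cloud Run service that exposes an Ollama-style /api/generate endpoint serving Gemma 2 (9B); the service URL, endpoint path, model tag, and payload fields are assumptions for illustration and are not part of Google's announcement.

    import json
    import urllib.request

    # Hypothetical service URL: a Cloud Run deployment assumed to expose an
    # Ollama-style /api/generate endpoint serving Gemma 2 (9B). Adjust to
    # whatever inference server your container actually runs.
    SERVICE_URL = "https://my-inference-service-xxxxx-uc.a.run.app/api/generate"


    def generate(prompt: str) -> str:
        """Send one non-streaming generation request and return the model's text."""
        payload = json.dumps({
            "model": "gemma2:9b",   # model tag is an assumption for this sketch
            "prompt": prompt,
            "stream": False,
        }).encode("utf-8")
        request = urllib.request.Request(
            SERVICE_URL,
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(request, timeout=120) as response:
            return json.loads(response.read())["response"]


    if __name__ == "__main__":
        print(generate("Summarize the key points of serverless GPU inference."))

Because Cloud Run scales the service to zero when idle, the first request after a quiet period may take longer while a new instance spins up; subsequent requests hit warm instances.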

Today, we support attaching one NVIDIA L4 GPU per Cloud Run instance, and you do not need to reserve your GPUs in advance. To start, Cloud Run GPUs are available today in us-central1 (Iowa), with availability in europe-west4 (Netherlands) and asia-southeast1 (Singapore) expected before the end of the year.
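For a sense of how attaching an L4 GPU might look at deploy time, the sketch below wraps the gcloud CLI from Python. The service name and container image are placeholders, and the GPU-related flags follow the preview documentation at the time of writing, so treat them as assumptions that may change rather than a definitive reference.

    import subprocess

    # Placeholders: substitute your own service name and container image.
    SERVICE_NAME = "my-inference-service"
    IMAGE = "us-docker.pkg.dev/my-project/my-repo/inference:latest"

    # Flag names below follow the Cloud Run GPU preview documentation and may
    # change; this is an assumption-laden sketch, not a reference command.
    command = [
        "gcloud", "beta", "run", "deploy", SERVICE_NAME,
        "--image", IMAGE,
        "--region", "us-central1",     # GPU preview region mentioned above
        "--gpu", "1",                  # one NVIDIA L4 per Cloud Run instance
        "--gpu-type", "nvidia-l4",
        "--no-cpu-throttling",         # keep CPU allocated alongside the GPU
        "--max-instances", "3",
    ]

    subprocess.run(command, check=True)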

Cloud Run makes it super easy to host your web applications. And now, with GPU support, we are extending serverless simplicity and scalability to your AI inference applications too! To start using Cloud Run with NVIDIA GPUs, sign up at g.co/cloudrun/gpu to join our preview program today and look out for our welcome email.
