Google Cloud Next '26 • Apr 23, 2026 • 18:17 • Conference Panel

A closer look at Gemma 4 with Baseten and NVIDIA


Jason Davenport (Moderator, Google Cloud) • Philip Kelly (Panelist, Baseten) • Jay Raj (Panelist, NVIDIA)

Executive Summary

Google Cloud is strengthening its partnership with NVIDIA: it announced that it will be one of the first cloud providers to offer NVIDIA's next-generation Vera Rubin hardware, and it is adding Blackwell RTX Pro 6000 GPUs to its offerings.
The discussion focuses on AI inference, in particular the challenges of deploying and scaling models for real-world, low-latency applications and the solutions available today.
Baseten is featured as a full-stack inference platform that builds on Google Cloud infrastructure (GKE) and NVIDIA hardware and software (TensorRT-LLM) to provide scalable, reliable, and optimized model deployment for customers.
The conversation emphasizes the growing importance of the open-source model ecosystem, particularly Google's Gemma family, whose range of sizes and multi-modal capabilities makes it well suited to fine-tuning and enterprise use cases.


Concerns Raised

The inherent complexity of the full-stack inference problem, from CUDA kernels to distributed systems.
Significant software engineering effort required to migrate inference systems between GPU generations (e.g., Hopper to Blackwell).
Managing latency and reliability for complex, multi-model agentic systems at scale.

Opportunities Identified

Access to next-generation NVIDIA Vera Rubin and Blackwell GPUs on Google Cloud for enhanced performance.
Leveraging open-source models like Google's Gemma for cost-effective, fine-tuned, and multi-modal enterprise applications.
Using optimization SDKs like NVIDIA's TensorRT-LLM to significantly improve inference performance with minimal code changes.
Building scalable, low-latency AI agents and compound systems using managed infrastructure like Google Kubernetes Engine (GKE).

Key Themes

AI Infrastructure Partnership

The panel highlights the strategic collaboration between Google Cloud and NVIDIA to bring the latest AI hardware, like Vera Rubin and Blackwell GPUs, to the cloud. This partnership aims to provide developers with cutting-edge infrastructure for both training and, increasingly, inference.

This matters because access to state-of-the-art GPUs on a major cloud platform is critical for companies to build and deploy powerful AI applications without massive upfront capital expenditure on hardware.

The Rise of AI Inference

The conversation moves beyond model training to focus on the practical challenges of inference—delivering fast, reliable, and scalable AI-powered user experiences. This involves a complex, full-stack approach encompassing hardware, software optimization, and distributed systems.

As AI moves from research to production, efficient inference is where value is realized. Optimizing for latency, throughput, and cost is the key challenge for building viable AI products and services.

Full-Stack Abstraction Platforms

Platforms like Baseten are emerging to simplify the immense complexity of the AI inference stack. They provide managed services for auto-scaling, GPU pooling, and model optimization, allowing developers to focus on application logic rather than infrastructure management.

These platforms democratize access to high-performance AI by lowering the barrier to entry, enabling more companies to deploy sophisticated models without needing a dedicated team of specialized inference engineers.
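
As a rough illustration of the auto-scaling such platforms manage on a developer's behalf, the sketch below sizes a deployment from current load. It is not Baseten's or GKE's actual algorithm; the function name `desired_replicas` and the parameters `target_concurrency`, `min_replicas`, and `max_replicas` are hypothetical names chosen for this example.

```python
import math


def desired_replicas(in_flight_requests: int,
                     target_concurrency: int,
                     min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Return how many model replicas to run for the current load.

    Hypothetical policy: give each replica at most `target_concurrency`
    concurrent requests, then clamp the result to the allowed range.
    """
    if target_concurrency <= 0:
        raise ValueError("target_concurrency must be positive")
    # Round up so no replica exceeds its concurrency target.
    needed = math.ceil(in_flight_requests / target_concurrency)
    # Never scale below the floor or above the ceiling.
    return max(min_replicas, min(max_replicas, needed))


if __name__ == "__main__":
    for load in (0, 9, 1000):
        print(load, "->", desired_replicas(load, target_concurrency=4))
```

A real autoscaler would also smooth the load signal over a time window and account for GPU cold-start time, both of which this sketch leaves out.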

Open-Source Model Ecosystem

The utility of open-source models like Google's Gemma and NVIDIA's Nemotron is a central topic. The discussion highlights the benefits of having a range of model sizes for fine-tuning on specific tasks and the value of multi-modal capabilities for enterprise applications.

The open-source ecosystem provides flexibility and cost-effectiveness, allowing businesses to create customized, task-specific intelligence rather than relying solely on large, general-purpose proprietary models.

Performance Optimization and Scalability

The panel covers specific techniques for optimizing AI workloads, such as using NVIDIA's TensorRT-LLM for inference acceleration and Google Kubernetes Engine (GKE) for low-latency networking in multi-model systems. The goal is to maximize performance and manage costs effectively at scale.

For production AI, every millisecond and every dollar counts. These optimization tools and infrastructure choices directly impact user experience, operational costs, and the overall feasibility of an AI-driven business model.
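
The link between throughput and cost can be made concrete with simple arithmetic. The sketch below is illustrative only: the function name `cost_per_million_tokens` is hypothetical, and the hourly price and throughput figures are placeholder assumptions, not numbers from the panel or from any provider's price list.

```python
def cost_per_million_tokens(gpu_hourly_usd: float,
                            tokens_per_second: float) -> float:
    """Serving cost in USD per one million generated tokens.

    Assumes a single GPU running at a steady `tokens_per_second`.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # Placeholder inputs: a 3.60 USD/hour GPU before and after an
    # optimization that doubles throughput.
    baseline = cost_per_million_tokens(3.60, 1000)
    optimized = cost_per_million_tokens(3.60, 2000)
    print(f"baseline:  {baseline:.2f} USD per 1M tokens")
    print(f"optimized: {optimized:.2f} USD per 1M tokens")
```

Under these assumptions, doubling throughput on the same hardware halves the cost per token, which is why inference optimization bears directly on operating cost.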


Topics

AI Inference, NVIDIA, Google Cloud, Baseten, Vera Rubin, Blackwell GPUs, RTX Pro 6000, Gemma Models, TensorRT-LLM, Google Kubernetes Engine (GKE), Multi-modal AI, Model Optimization, Auto-scaling, Fine-tuning, Nemotron Models, Inference Engineering, Cloud GPUs

Processed Apr 28, 2026 yt-dlp + mlx-whisper + Gemini