AI infrastructure poses challenges that traditional infrastructure does not, driven primarily by compute-heavy, GPU-dependent workloads.
A new ecosystem of specialized providers, or "neoclouds" (e.g., CoreWeave, Lightning AI), is emerging to offer optimized, bare-metal solutions for AI, competing with generalist cloud providers.
Specialized open-source frameworks such as Ray (for distributing Python-native workloads across a cluster) and vLLM (for high-throughput LLM inference) are critical for managing the complexity and cost of modern AI systems.
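To make the Ray pattern concrete, here is a minimal sketch of fanning a Python-native workload out across a cluster. The preprocess function and its input data are hypothetical placeholders, not anything from the source.

```python
# Minimal sketch: distributing a Python-native workload with Ray.
import ray

ray.init()  # connects to an existing cluster, or starts a local one

@ray.remote
def preprocess(shard: list[int]) -> int:
    # Stand-in for a CPU- or GPU-bound step; here we just sum the shard.
    return sum(shard)

# Fan the work out across the cluster, then gather the results.
shards = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
futures = [preprocess.remote(shard) for shard in shards]
print(ray.get(futures))  # -> [6, 15, 24]
```

The design point is that `preprocess.remote()` returns immediately with a future, so all shards run in parallel wherever the scheduler places them; `ray.get()` blocks only to collect the results.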
A core economic and technical challenge is maximizing GPU utilization: accelerators sitting idle while they wait on data or requests ("starving your GPUs") waste expensive capacity, and inference serving is a key area for optimization.
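For illustration, here is a minimal sketch of batched offline inference with vLLM, which keeps the GPU busy by scheduling many requests together rather than one at a time. The model name and prompts are illustrative assumptions; substitute any model vLLM supports.

```python
# Minimal sketch: batched inference with vLLM to avoid an idle GPU.
from vllm import LLM, SamplingParams

prompts = [
    "Explain GPU utilization in one sentence.",
    "What is continuous batching?",
    "Why do inference servers batch requests?",
]
sampling = SamplingParams(temperature=0.8, max_tokens=64)

llm = LLM(model="facebook/opt-125m")  # small model, purely for illustration

# Passing all prompts at once lets vLLM batch them on the GPU instead of
# running them serially and leaving the device idle between requests.
outputs = llm.generate(prompts, sampling)
for out in outputs:
    print(out.outputs[0].text)
```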
Concerns Raised
High cost and inefficiency from underutilized GPUs ('starving your GPUs').
The greater complexity of modern AI workloads compared with traditional ones.
The rapid pace of change requires constant adaptation and new tooling.
Opportunities Identified
Growth of specialized 'neocloud' providers catering specifically to AI workloads.
Development of new open-source frameworks (such as Ray and vLLM) to solve critical performance bottlenecks.
Increasing demand for new engineering roles like ML Platform Engineers to manage this specialized infrastructure.