Inference engineering is a critical and rapidly evolving discipline, combining GPU programming, distributed systems, and applied AI research, with demand for skilled engineers projected to grow 10-100x.
Companies with scaled AI products are maturing from per-token API pricing to dedicated deployments on specialized infrastructure in order to control cost and performance, a trend described as "owning your intelligence."
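To make the cost argument concrete, here is a back-of-the-envelope break-even sketch. Every number in it (the API rate, the node rental price, the throughput) is an assumed placeholder for illustration, not a figure from the source:

```python
# Back-of-the-envelope comparison: per-token API pricing vs. renting a
# dedicated GPU node. All numbers below are ASSUMED, purely illustrative.

api_cost_per_1m_tokens = 10.00            # USD, assumed blended API rate
dedicated_node_per_hour = 25.00           # USD, assumed 8-GPU node rental
node_throughput_tokens_per_sec = 20_000   # assumed aggregate throughput

tokens_per_hour = node_throughput_tokens_per_sec * 3600
dedicated_cost_per_1m = dedicated_node_per_hour / (tokens_per_hour / 1e6)

print(f"API:       ${api_cost_per_1m_tokens:.2f} per 1M tokens")
print(f"Dedicated: ${dedicated_cost_per_1m:.2f} per 1M tokens at full utilization")

# Utilization at which the dedicated node matches API pricing; anything
# above this keeps-the-node-busy threshold is savings.
break_even_util = dedicated_cost_per_1m / api_cost_per_1m_tokens
print(f"Break-even utilization: {break_even_util:.1%}")
```

Under these assumed numbers the dedicated node breaks even at a few percent utilization, which is why the savings compound quickly for products with sustained traffic.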
NVIDIA's Hopper (H100) GPUs remain in high demand for inference even as Blackwell rolls out, due to mature software optimization, export controls affecting research, and their suitability for serving smaller models.
The future of AI hardware may involve compute disaggregation (specialized chips for prefill vs. decode) and ASICs, but sophisticated software and open-source inference engines remain essential for orchestrating these complex systems.
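As a rough illustration of what prefill/decode disaggregation means in software terms, the sketch below splits the two phases into independently schedulable workers. All names here (PrefillWorker, DecodeWorker, KVCacheHandle) are hypothetical and the model call is stubbed out; a real system would run batched forward passes and ship the KV cache between nodes over a fast interconnect:

```python
# Minimal sketch of disaggregated serving: prefill (compute-bound, whole
# prompt in one pass) and decode (memory-bandwidth-bound, one token per
# step) run on separate workers that can be scaled independently.
# All class and function names are illustrative, not from any real engine.

from dataclasses import dataclass

@dataclass
class KVCacheHandle:
    """Opaque reference to the KV-cache state produced by prefill."""
    request_id: str
    tokens: list[int]

class PrefillWorker:
    """Compute-bound stage: processes the full prompt in one pass."""
    def prefill(self, request_id: str, prompt_tokens: list[int]) -> KVCacheHandle:
        # A real worker runs a batched forward pass here, then transfers
        # the resulting KV cache to a decode node.
        return KVCacheHandle(request_id, list(prompt_tokens))

class DecodeWorker:
    """Bandwidth-bound stage: generates one token per autoregressive step."""
    def decode(self, handle: KVCacheHandle, max_new_tokens: int) -> list[int]:
        out = []
        for _ in range(max_new_tokens):
            next_token = self._step(handle)   # one autoregressive step
            handle.tokens.append(next_token)  # KV cache grows per step
            out.append(next_token)
        return out

    def _step(self, handle: KVCacheHandle) -> int:
        return (sum(handle.tokens) + 1) % 50_000  # stand-in for a model call

# Because the stages are decoupled, an operator can add decode workers for
# long outputs or prefill workers for long prompts, independently.
prefill_pool, decode_pool = PrefillWorker(), DecodeWorker()
handle = prefill_pool.prefill("req-1", prompt_tokens=[101, 2023, 2003])
print(decode_pool.decode(handle, max_new_tokens=4))
```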
Concerns Raised
The extreme complexity of building and maintaining high-performance inference systems.
A significant talent gap, with companies unable to hire knowledgeable inference engineers fast enough to meet demand.
The rapid pace of research and hardware development, which requires constant adaptation and software updates.
Opportunities Identified
Massive cost savings for companies that optimize their inference stack, as demonstrated by Shopify.
The growing demand for specialized inference providers and skilled inference engineers.
Performance breakthroughs from new hardware approaches like ASICs and compute disaggregation.
Building differentiated AI products by taking control of the model and inference layer.