Current voice AI relies on a slow, multi-step "cascaded" pipeline (Speech-to-Text -> LLM -> Text-to-Speech), in which each stage waits on the previous one's output, introducing significant latency.
Achieving natural, human-like conversation requires solving two key challenges: reducing end-to-end latency to under 300 milliseconds and mastering "turn detection"—knowing when a user has finished speaking.
The future of voice AI lies in end-to-end models that process audio directly, bypassing text conversion, which will enable much lower latency and more fluid interactions.
AI assistants are expected to evolve along two primary paths: "copilots" that augment human creativity and "autopilots" that automate mundane tasks.
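The latency problem in the cascaded pipeline can be sketched as simple arithmetic: time to first audio out is roughly the sum of each stage's time to first output. A minimal illustration, using hypothetical per-stage numbers (not measurements from any real system):

```python
# Sketch of latency accumulation in a cascaded voice pipeline.
# Each stage must produce output before the next can start, so
# delays add rather than overlap.

from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    latency_ms: int  # illustrative time-to-first-output, not a measurement

def cascaded_latency(stages):
    """Total time to first audio out = sum of per-stage latencies."""
    return sum(s.latency_ms for s in stages)

# Hypothetical stage timings for illustration only.
pipeline = [
    Stage("speech-to-text", 300),
    Stage("LLM", 400),
    Stage("text-to-speech", 200),
]

print(cascaded_latency(pipeline), "ms")  # well above a ~300 ms target
```

Streaming between stages can hide some of this, but the additive structure is why end-to-end audio models are attractive: they collapse three time-to-first-output budgets into one.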

Concerns Raised
High latency and performance variance in current voice AI systems hinder natural conversation.
"Turn detection" remains a largely unsolved and critical problem for conversational flow.
The current need for complex, multi-turn text prompting for AI agents is a significant usability flaw.
Lack of high-quality, large-scale datasets of multi-person conversations, which are needed for tasks like speaker diarization (attributing speech to individual speakers).
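To see why turn detection is hard, consider the naive baseline: declare the turn over after N consecutive low-energy frames. This toy sketch (my own illustration, not an algorithm from the source) shows the core tension: a short silence window interrupts users mid-pause, while a long one adds latency to every reply.

```python
# Naive energy-threshold turn detection: a run of `silence_frames`
# consecutive frames below `threshold` is treated as end of turn.

def detect_turn_end(frame_energies, threshold=0.1, silence_frames=30):
    """Return the frame index where the turn is judged finished, or None."""
    quiet = 0
    for i, energy in enumerate(frame_energies):
        quiet = quiet + 1 if energy < threshold else 0
        if quiet >= silence_frames:
            return i
    return None

# Simulated energies: speech, a 20-frame thoughtful pause, more speech,
# then real silence. The mid-sentence pause correctly does not trigger,
# but only because it happens to be shorter than the 30-frame window.
speech = [0.8] * 50 + [0.01] * 20 + [0.7] * 30 + [0.01] * 40
print(detect_turn_end(speech))
```

A slightly longer pause would fire a false end-of-turn, and raising `silence_frames` to prevent that adds a fixed delay before every response; this is why fixed-threshold schemes fall short and turn detection is treated as an open problem.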
Opportunities Identified
Developing end-to-end models that process audio directly to achieve sub-200ms latency.
Solving turn detection to unlock a new level of natural, fluid human-AI interaction.
Building AI products based on the "copilot" (creative partner) and "autopilot" (task automation) frameworks.
Leveraging large, untapped datasets from platforms like Zoom or Discord to train advanced conversational models.