Russ d'Sa

CEO of LiveKit and expert on voice AI technology and tech business strategy.

Key positions and views

The most significant barrier to the mainstream adoption of voice AI is solving 'turn detection' to enable natural conversational flow.

Achieving end-to-end latency below a 200-millisecond threshold is the critical benchmark for voice AI to match the fidelity of human-to-human conversation.

The future of voice AI is in models that process audio directly, but progress is currently bottlenecked by a scarcity of audio-native conversational training data.

Deep-tech companies like 23andMe can fail not because of their technology, but due to a flawed business strategy that doesn't account for the slower pace of the surrounding scientific or market ecosystem.

Founders must act as the primary product visionary and should strategically delay raising venture capital until after achieving product-market fit.

Podcast consensus on d'Sa

Points of consensus

Russ d'Sa consistently identifies low latency as a critical, non-negotiable requirement for voice AI to achieve natural, human-like interaction, citing a specific threshold of 200 milliseconds. (Apr 2026)

He repeatedly emphasizes that 'turn detection', an AI's ability to know when a user has finished speaking, is one of the most difficult and crucial problems holding back the mainstream adoption of voice agents. (Apr 2026)

He presents a cohesive critique of 23andMe's failure, attributing it to a fundamental mismatch between its advanced technology and the slower pace of scientific discovery, compounded by the lack of a staged, strategic business model. (Apr 2026)

He consistently distinguishes between the current, prevalent 'cascaded' voice AI architecture (speech-to-text, then LLM, then text-to-speech) and the future ideal of models that process audio directly to achieve lower latency and higher fidelity. (Apr 2026)

Points of debate

D'Sa highlights the industry tension between the rapid advancement of AI models and the bottleneck caused by a shortage of high-quality, audio-based conversational training data needed to power next-generation systems. (Apr 2026)

He contrasts the solvable problem of one-on-one turn detection, which he predicts will be fixed within 12 to 18 months (leaning toward 12), with the much harder challenge of 'speaker diarization' in multi-person conversations. (Apr 2026)

He points to the significant gap between the potential low latency of models like GPT-4o (320 ms) and their average real-world performance (600-700 ms), indicating a conflict between capability and infrastructure stability. (Apr 2026)

He presents a strategic conflict for deep-tech companies, exemplified by 23andMe, where technological readiness outpaces the scientific understanding or market maturity required for a viable business model. (Apr 2026)
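The latency figures cited in this list (320 ms potential and 600-700 ms real-world for GPT-4o, against the 200-millisecond threshold) can be put side by side with simple arithmetic. The snippet below is only an illustrative sketch: the function name `overshoot` and the dictionary labels are made up for this example, and the numbers are the ones quoted in the report, not new measurements.

```python
# Conversational threshold cited by d'Sa, in milliseconds.
TARGET_MS = 200


def overshoot(latency_ms: float, target_ms: float = TARGET_MS) -> float:
    """Return how many times larger a latency is than the target."""
    return latency_ms / target_ms


# Figures quoted in the report; the labels are hypothetical.
figures = {
    "GPT-4o potential": 320,
    "real-world, low end": 600,
    "real-world, high end": 700,
}

for label, ms in figures.items():
    print(f"{label}: {ms} ms is {overshoot(ms):.1f}x the {TARGET_MS} ms target")
```

Read this way, even the best-case figure sits above the threshold, and the real-world range is roughly three to three and a half times the target.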

Key themes

The Quest for Conversational Fidelity (Apr 2026)

This theme centers on the technical hurdles to making AI voice interactions feel human. D'Sa identifies two primary obstacles: reducing end-to-end latency to a 200-millisecond threshold and solving 'turn detection' so the AI knows when to speak.

For investors, progress on latency and turn detection is the key technical milestone that will unlock the next wave of mainstream voice AI applications, moving beyond simple commands to genuine conversational partners.
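To make the turn detection problem concrete, here is a minimal sketch of the simplest possible baseline: declare the turn finished once the speaker has been quiet for a fixed window. This is not LiveKit's method or anything described by d'Sa. The function name, the per-frame energy input, and all default values are assumptions chosen for illustration.

```python
from typing import Optional, Sequence


def end_of_turn_ms(
    frame_energies: Sequence[float],
    frame_ms: int = 20,              # assumed frame length
    energy_threshold: float = 0.01,  # assumed speech/silence cutoff
    silence_ms: int = 200,           # quiet time required to end the turn
) -> Optional[int]:
    """Return the time in ms at which the turn is declared over, or None.

    A frame counts as speech when its energy reaches the threshold.
    Silence before any speech is ignored.
    """
    frames_needed = -(-silence_ms // frame_ms)  # ceiling division
    quiet_run = 0
    heard_speech = False
    for index, energy in enumerate(frame_energies):
        if energy >= energy_threshold:
            heard_speech = True
            quiet_run = 0
        elif heard_speech:
            quiet_run += 1
            if quiet_run >= frames_needed:
                return (index + 1) * frame_ms
    return None


# A speaker who pauses for 240 ms mid-sentence and then continues.
paused = [0.5] * 5 + [0.0] * 12 + [0.5] * 5
print(end_of_turn_ms(paused, silence_ms=200))  # fires during the pause
print(end_of_turn_ms(paused, silence_ms=400))  # waits, but adds delay
```

The trade-off is visible in the two calls: a short silence window cuts the speaker off during a natural pause, while a longer one avoids the interruption at the cost of added response delay. That tension is one reason a fixed silence timer is only a baseline for the problem.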

Architectural Evolution of Voice AI (Apr 2026)

D'Sa outlines the industry's shift from inefficient 'cascaded' models (STT -> LLM -> TTS) to end-to-end systems that process audio directly. He notes this evolution is essential for low latency but is currently constrained by a scarcity of audio-native training data.

Analysts should monitor companies developing novel audio-native models and those that possess large, proprietary datasets of multi-person conversations (e.g., Zoom, Discord), as they hold a key advantage in this architectural shift.

Critique of Deep-Tech Business Strategy (Apr 2026)

Using 23andMe as a case study, d'Sa argues that technological innovation alone is insufficient for success. He stresses the necessity of a staged business model that can layer in value over time, especially when the core technology outpaces the broader scientific or market ecosystem.

This serves as a cautionary tale for investors in capital-intensive, deep-tech fields like AI and biotech; evaluating the company's strategic, multi-stage business plan is as critical as assessing its technological prowess.

The Founder's Unyielding Role (Apr 2026)

D'Sa holds a strong conviction that the startup founder must be the ultimate product manager and visionary. He also advises founders to be strategic about fundraising, recommending they avoid venture capital until product-market fit is achieved.

This perspective suggests that d'Sa values founder-led product vision and capital efficiency, indicating a preference for startups that demonstrate clear market traction before scaling with external funding.

Changes over time

Early Career

Participated in the fifth batch of Y Combinator, gaining early exposure to the startup accelerator ecosystem.

Early-Stage Operator

Joined Twitter as its 75th employee and 23andMe as its 35th employee, gaining experience inside rapidly scaling, high-profile tech companies.

Post-23andMe

Formulated a critical analysis of 23andMe's strategic failures, focusing on its lack of a staged business model and the timing mismatch between its technology and the pace of genomics science.

Present

As CEO of LiveKit, d'Sa is now focused on solving core infrastructure problems in real-time communication, particularly the challenges of latency and turn detection in voice AI.

Key concepts

Voice AI Latency: 5 episodes

Turn Detection: 3 episodes

Business Model Strategy: 3 episodes

Cascaded AI Models: 2 episodes

Direct Audio Processing: 2 episodes

AI Training Data: 2 episodes

Founder Responsibilities: 2 episodes

Speaker Diarization: 1 episode

Notable quotes

“And it's what I think is one of the hardest problems in voice AI and what's truly holding back... mainstream adoption of voice-based agents... And that is this concept known as turn detection.”

Russ d'Sa · LiveKit CEO Russ d'Sa - Voice AI and the Future of Human-Machine Interaction

“fundamentally, science moves slower than technology. And so, 23andMe was really early in the game, doing this kind of personalized genomics and personalized medicine as a goal.”

Russ d'Sa · LiveKit CEO Russ d'Sa - Voice AI and the Future of Human-Machine Interaction

“these changes are happening like within 200 milliseconds and so that's really the threshold that you know i think yes some turn taking can be longer but it can get as low as 200 milliseconds”

Russ d'Sa · LiveKit CEO Russ d'Sa - Voice AI and the Future of Human-Machine Interaction

“I would say that a year from now, you can expect that turn detection for one-on-one is solved. I think we're definitely going to get to that point within the next 12 to 18 months. I lean more towards 12 than 18.”

Russ d'Sa · LiveKit CEO Russ d'Sa - Voice AI and the Future of Human-Machine Interaction

Report last updated: Apr 21, 2026
