▶ Russ d'Sa consistently identifies low latency as a critical, non-negotiable requirement for voice AI to achieve natural, human-like interaction, citing a specific threshold of 200 milliseconds. (Apr 2026)
▶ He repeatedly emphasizes that 'turn detection' (an AI's ability to know when a user has finished speaking) is one of the most difficult and crucial problems holding back the mainstream adoption of voice agents. (Apr 2026)
▶ He presents a cohesive critique of 23andMe's failure, attributing it to a fundamental mismatch between its advanced technology and the slower pace of scientific discovery, compounded by the lack of a staged, strategic business model. (Apr 2026)
▶ He consistently distinguishes between today's prevalent 'cascaded' voice AI architecture (speech-to-text, then an LLM, then text-to-speech) and the future ideal of models that process audio directly to achieve lower latency and higher fidelity. (Apr 2026)
▶ D'Sa highlights an industry tension: AI models are advancing rapidly, but next-generation systems are bottlenecked by a shortage of high-quality, audio-based conversational training data. (Apr 2026)
▶ He contrasts the solvable problem of one-on-one turn detection, which he predicts will be fixed within 12 months, with the much harder challenge of 'speaker diarization' in multi-person conversations. (Apr 2026)
▶ He points to the significant gap between the potential low latency of models like GPT-4o (320 ms) and their average real-world performance (600-700 ms), indicating a conflict between capability and infrastructure stability. (Apr 2026)
▶ He presents a strategic conflict for deep-tech companies, exemplified by 23andMe, where technological readiness outpaces the scientific understanding or market maturity required for a viable business model. (Apr 2026)