Jim Fan • Lead, Embodied Autonomous Research Group, NVIDIA
Executive Summary
NVIDIA's Jim Fan proposes a new paradigm for robotics, "The Great Parallel," which applies the successful development roadmap of Large Language Models (LLMs) to the physical world.
The dominant model architecture is shifting from Vision-Language-Action (VLA) models, which handle physics poorly, to World Action Models (WAMs), which learn physics emergently by simulating future world states from video (a minimal sketch follows this summary).
Data collection is evolving from inefficient teleoperation to highly scalable methods, with egocentric human video predicted to become the primary data source, enabling a neural scaling law for dexterity.
The long-term vision for robotics involves passing a "Physical Turing Test," creating a "Physical API" for lights-out factories, and ultimately achieving "Physical Auto-Research" where robots design and build their successors.
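To make the WAM idea concrete, the following is a minimal, hypothetical sketch of the objective in PyTorch. It is not NVIDIA's architecture: the class name, layer sizes, and the joint loss are illustrative assumptions. The point it shows is that one shared latent is trained to predict both the future world state (from video) and the action, so physical regularities are absorbed as a by-product of video prediction.

```python
# Toy "world action model": a shared latent predicts the next frame and the
# next action. All names, sizes, and losses here are hypothetical.
import torch
import torch.nn as nn


class ToyWorldActionModel(nn.Module):
    def __init__(self, frame_dim=512, action_dim=7, hidden_dim=256):
        super().__init__()
        # Per-frame encoder over pre-computed visual embeddings, then a GRU
        # summarizes the context window into a single latent state.
        self.frame_encoder = nn.Sequential(nn.Linear(frame_dim, hidden_dim), nn.ReLU())
        self.temporal = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        # Two heads on the shared latent: future world state and next action.
        self.next_frame_head = nn.Linear(hidden_dim, frame_dim)
        self.action_head = nn.Linear(hidden_dim, action_dim)

    def forward(self, frames):
        # frames: (batch, context_len, frame_dim) visual embeddings
        h = self.frame_encoder(frames)
        _, last = self.temporal(h)          # (1, batch, hidden_dim)
        latent = last.squeeze(0)
        return self.next_frame_head(latent), self.action_head(latent)


def training_step(model, optimizer, frames, next_frame, action):
    # Joint objective: reconstruct the next-frame embedding and regress the action.
    pred_frame, pred_action = model(frames)
    loss = (nn.functional.mse_loss(pred_frame, next_frame)
            + nn.functional.mse_loss(pred_action, action))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

On action-free human video, the action term can simply be dropped, which is one way egocentric footage could slot in as pre-training data under this framing.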
Concerns Raised
Current Vision-Language-Action (VLA) models are ill-suited for robotics as they prioritize language over physics.
Teleoperation as a data collection method is fundamentally unscalable and a major bottleneck to progress.
Reliance on classical physics engines limits the scale and fidelity of simulations needed for reinforcement learning.
Opportunities Identified
Applying the proven LLM development paradigm (pre-train, fine-tune, RL) to robotics (see the sketch after this list).
Leveraging massive-scale egocentric human video as the primary data source for training generalist robots.
Developing World Action Models (WAMs) that learn an intuitive understanding of physics from video.
Using neural simulators to create infinite, data-driven training environments for reinforcement learning.
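To make the parallel with the LLM recipe concrete, here is a hedged sketch of how the three stages might look for a robot policy. The backbone, head sizes, datasets, reward_fn, and the choice of behavior cloning plus a REINFORCE-style update are placeholders; the talk does not specify NVIDIA's actual training stack.

```python
# Hypothetical three-stage pipeline: (1) pre-train on web-scale egocentric human
# video, (2) fine-tune on teleoperated robot demonstrations, (3) refine with RL
# in simulation (possibly a learned, neural simulator).
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Linear(512, 256), nn.ReLU())   # shared visual trunk
next_frame_head = nn.Linear(256, 512)                       # used for video pre-training
action_head = nn.Linear(256, 7)                             # used for fine-tuning and RL
params = (list(backbone.parameters()) + list(next_frame_head.parameters())
          + list(action_head.parameters()))
optimizer = torch.optim.Adam(params, lr=1e-4)


def pretrain_step(frames, next_frame):
    # Stage 1: self-supervised next-frame prediction on action-free human video
    # (the analogue of LLM pre-training).
    loss = nn.functional.mse_loss(next_frame_head(backbone(frames)), next_frame)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()


def finetune_step(frames, demo_action):
    # Stage 2: behavior cloning on a comparatively small set of teleoperated
    # robot demonstrations (the analogue of supervised fine-tuning).
    loss = nn.functional.mse_loss(action_head(backbone(frames)), demo_action)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()


def rl_step(frames, reward_fn):
    # Stage 3: reinforcement learning in simulation, where the environment can
    # itself be a learned (neural) simulator. A REINFORCE-style update stands in
    # for whatever RL algorithm is actually used; reward_fn is a placeholder.
    mean = action_head(backbone(frames))
    dist = torch.distributions.Normal(mean, 1.0)
    action = dist.sample()
    reward = reward_fn(action).detach()             # (batch,) scalar rewards
    loss = -(dist.log_prob(action).sum(dim=-1) * reward).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The structural analogy is the point: a cheap self-supervised stage at web scale, a narrow supervised stage on expensive demonstrations, and an RL stage, loosely mirroring pre-training, SFT, and RLHF in the LLM roadmap.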