Jim Fan • Lead, Embodied Autonomous Research Group, NVIDIA
Executive Summary
NVIDIA's Jim Fan proposes a new paradigm for robotics, "The Great Parallel," which applies the successful development roadmap of Large Language Models (LLMs) to the physical world.
The dominant model architecture is shifting from Vision-Language-Action (VLA) models, which handle physics poorly, to World Action Models (WAMs), which learn physics emergently by simulating future world states from video (a minimal sketch follows this summary).
Data collection is evolving from inefficient teleoperation to highly scalable methods, with egocentric human video predicted to become the primary data source, enabling a neural scaling law for dexterity.
The long-term vision for robotics involves passing a "Physical Turing Test," creating a "Physical API" for lights-out factories, and ultimately achieving "Physical Auto-Research" where robots design and build their successors.
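To make the WAM idea concrete, the following is a minimal, hypothetical sketch of the objective in PyTorch. It is not NVIDIA's architecture: the class name, layer sizes, and the joint loss are illustrative assumptions. The point it shows is that one shared latent is trained to predict both the future world state (from video) and the action, so physical regularities are absorbed as a by-product of video prediction.

```python
# Toy "world action model": a shared latent predicts the next frame and the
# next action. All names, sizes, and losses here are hypothetical.
import torch
import torch.nn as nn


class ToyWorldActionModel(nn.Module):
    def __init__(self, frame_dim=512, action_dim=7, hidden_dim=256):
        super().__init__()
        # Per-frame encoder over pre-computed visual embeddings, then a GRU
        # summarizes the context window into a single latent state.
        self.frame_encoder = nn.Sequential(nn.Linear(frame_dim, hidden_dim), nn.ReLU())
        self.temporal = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        # Two heads on the shared latent: future world state and next action.
        self.next_frame_head = nn.Linear(hidden_dim, frame_dim)
        self.action_head = nn.Linear(hidden_dim, action_dim)

    def forward(self, frames):
        # frames: (batch, context_len, frame_dim) visual embeddings
        h = self.frame_encoder(frames)
        _, last = self.temporal(h)          # (1, batch, hidden_dim)
        latent = last.squeeze(0)
        return self.next_frame_head(latent), self.action_head(latent)


def training_step(model, optimizer, frames, next_frame, action):
    # Joint objective: reconstruct the next-frame embedding and regress the action.
    pred_frame, pred_action = model(frames)
    loss = (nn.functional.mse_loss(pred_frame, next_frame)
            + nn.functional.mse_loss(pred_action, action))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

On action-free human video, the action term can simply be dropped, which is one way egocentric footage could slot in as pre-training data under this framing.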
Concerns Raised
Current Vision-Language-Action (VLA) models are ill-suited for robotics as they prioritize language over physics.
Teleoperation as a data collection method is fundamentally unscalable and a major bottleneck to progress.
Reliance on classical physics engines limits the scale and fidelity of simulations needed for reinforcement learning.
Opportunities Identified
Applying the proven LLM development paradigm (pre-train, fine-tune, RL) to robotics (see the sketch after this list).
Leveraging massive-scale egocentric human video as the primary data source for training generalist robots.
Developing World Action Models (WAMs) that learn an intuitive understanding of physics from video.
Using neural simulators to create infinite, data-driven training environments for reinforcement learning.
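To make the parallel with the LLM recipe concrete, here is a hedged sketch of how the three stages might look for a robot policy. The backbone, head sizes, datasets, reward_fn, and the choice of behavior cloning plus a REINFORCE-style update are placeholders; the talk does not specify NVIDIA's actual training stack.

```python
# Hypothetical three-stage pipeline: (1) pre-train on web-scale egocentric human
# video, (2) fine-tune on teleoperated robot demonstrations, (3) refine with RL
# in simulation (possibly a learned, neural simulator).
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Linear(512, 256), nn.ReLU())   # shared visual trunk
next_frame_head = nn.Linear(256, 512)                       # used for video pre-training
action_head = nn.Linear(256, 7)                             # used for fine-tuning and RL
params = (list(backbone.parameters()) + list(next_frame_head.parameters())
          + list(action_head.parameters()))
optimizer = torch.optim.Adam(params, lr=1e-4)


def pretrain_step(frames, next_frame):
    # Stage 1: self-supervised next-frame prediction on action-free human video
    # (the analogue of LLM pre-training).
    loss = nn.functional.mse_loss(next_frame_head(backbone(frames)), next_frame)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()


def finetune_step(frames, demo_action):
    # Stage 2: behavior cloning on a comparatively small set of teleoperated
    # robot demonstrations (the analogue of supervised fine-tuning).
    loss = nn.functional.mse_loss(action_head(backbone(frames)), demo_action)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()


def rl_step(frames, reward_fn):
    # Stage 3: reinforcement learning in simulation, where the environment can
    # itself be a learned (neural) simulator. A REINFORCE-style update stands in
    # for whatever RL algorithm is actually used; reward_fn is a placeholder.
    mean = action_head(backbone(frames))
    dist = torch.distributions.Normal(mean, 1.0)
    action = dist.sample()
    reward = reward_fn(action).detach()             # (batch,) scalar rewards
    loss = -(dist.log_prob(action).sum(dim=-1) * reward).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The structural analogy is the point: a cheap self-supervised stage at web scale, a narrow supervised stage on expensive demonstrations, and an RL stage, loosely mirroring pre-training, SFT, and RLHF in the LLM roadmap.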