The conversation posits that Large Language Models (LLMs), while powerful, are insufficient for true intelligence because they lack an understanding of the physical, 3D world. Language is described as a 'lossy' and abstract encoding of reality, inadequate for tasks requiring interaction with and navigation of physical space.
A core argument is that understanding and navigating 3D space is a more fundamental and evolutionarily ancient form of intelligence than language. The argument draws on examples from animal evolution, human cognition (e.g., the difficulty of driving without stereo vision), and scientific breakthroughs such as the discovery of DNA's double-helix structure.
The episode introduces 'world models' as the next major paradigm in foundation models. These models are defined by their ability to understand the 3D structure, shape, and compositionality of the world, enabling them to generate complete 3D scenes from partial 2D views.
Building world models is presented as a deep-tech challenge that requires combining expertise from AI/machine learning and computer graphics. The discussion highlights specific technologies such as Neural Radiance Fields (NeRFs) and Gaussian splatting, and the need to assemble a team with this rare, interdisciplinary skill set.