Identifying the right types of data to scale for robustness and efficiency, not just task diversity.
The computational trilemma of balancing inference speed, context length, and model size for real-time robotics.
The inherent difficulty models have in learning from passive video data (like YouTube) without the focusing mechanism of goal-directed physical interaction.
The challenge of making simulation effective for learning, since it primarily lets a model rehearse what it already knows rather than acquire new knowledge about the world.
Opportunities Identified
Initiating a data collection 'flywheel' by deploying robots for useful, narrow tasks in the near term (1-2 years).
Leveraging pre-trained vision-language models (VLMs) to provide a strong foundation of prior knowledge and common sense.
The emergence of complex behaviors that were never explicitly trained, arising through compositional generalization as models scale.
Developing hybrid inference systems where robots perform reactive tasks locally and offload complex reasoning to the cloud.
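
The hybrid inference idea in the last item can be illustrated with a toy control loop. This is only a sketch under stated assumptions, not a description of any particular system: the names `Observation`, `local_reactive_policy`, `CloudPlanner`, and `run` are hypothetical, the "cloud" is simulated by a fixed delay measured in control ticks, and the 0.3 m stop threshold is an arbitrary illustrative value. The point it shows is that the local reflex runs on every tick and can override the plan, while the slower remote plan is picked up only once it arrives.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Observation:
    obstacle_distance_m: float  # distance to the nearest obstacle
    step: int                   # index of the current control tick


def local_reactive_policy(obs: Observation, stop_distance_m: float = 0.3) -> Optional[str]:
    """Fast on-robot check (hypothetical): returns an action only when a reflex is needed."""
    if obs.obstacle_distance_m < stop_distance_m:
        return "stop"
    return None


class CloudPlanner:
    """Stand-in for a remote reasoning model with a fixed round-trip delay in ticks."""

    def __init__(self, latency_ticks: int) -> None:
        self.latency_ticks = latency_ticks
        self._pending: Optional[Tuple[int, str]] = None  # (ready_at_step, plan)

    def request(self, obs: Observation) -> None:
        # Allow only one request in flight at a time.
        if self._pending is None:
            ready_at = obs.step + self.latency_ticks
            self._pending = (ready_at, f"plan_from_step_{obs.step}")

    def poll(self, step: int) -> Optional[str]:
        # Return the plan once its simulated round trip has completed.
        if self._pending is not None and step >= self._pending[0]:
            plan = self._pending[1]
            self._pending = None
            return plan
        return None


def run(distances: List[float], latency_ticks: int = 3) -> List[str]:
    planner = CloudPlanner(latency_ticks)
    current_plan = "idle"  # default behavior until the first plan arrives
    actions = []
    for step, distance in enumerate(distances):
        obs = Observation(distance, step)
        new_plan = planner.poll(step)
        if new_plan is not None:
            current_plan = new_plan
        planner.request(obs)
        # The local reflex takes priority over whatever the cloud last planned.
        reflex = local_reactive_policy(obs)
        actions.append(reflex if reflex is not None else current_plan)
    return actions


if __name__ == "__main__":
    print(run([1.0, 1.0, 0.2, 1.0, 1.0], latency_ticks=3))
```

In this sketch the robot never blocks on the remote call: it keeps executing the last plan it received, and the reflex on the third tick fires regardless of whether a plan is still in flight.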