The conversation argues that achieving true agentic capabilities like planning and multi-step reasoning requires moving beyond post-training tweaks. Instead, these skills must be embedded during the pre-training phase by fundamentally re-engineering the training data, loss objectives, and potentially the model architecture.
Static benchmarks are becoming saturated and are poor measures of agentic intelligence. The discussion highlights a necessary shift toward dynamic, multi-step benchmarks that reflect real-world workflows, such as coding (SWE-bench) or complex research tasks, to properly evaluate planning and reasoning.
The quality and composition of training data are paramount for building powerful models. The discussion emphasizes the importance of high-quality data curation and the need for more 'reasoning traces' in pre-training data, and explores whether synthetic data can supply such traces at scale without corrupting the data distribution.
A key limitation of current models is their weak ability to perform robust, multi-hop reasoning over long contexts. This capability is essential for agents to plan, learn from past actions, and synthesize information over extended interactions, and it may not be optimally developed by the standard next-token prediction objective.