The prevailing approach of fine-tuning pre-trained LLMs and measuring them against static benchmarks is insufficient for creating advanced, 'agentic' AI systems capable of complex, multi-step tasks.
A fundamental shift in pre-training is required, focusing on curated data (especially 'reasoning traces'), novel loss objectives, and potentially new architectures to build in core capabilities like planning and long-context reasoning from the start.
The field is moving beyond saturated static benchmarks (like MMLU) towards dynamic, workflow-oriented evaluations (like SWE-bench) that better measure an agent's ability to interact with an environment and solve problems over time (see the sketch after this list).
The guest's company, Reflection, is building 'Frontier Open Agentic models' from the ground up, aiming to pioneer these new pre-training methods to create a step-change in AI agent capabilities.
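To make the static-versus-workflow distinction concrete, here is a minimal, hypothetical sketch of an agent-in-the-loop evaluation harness. The Task, Environment, and evaluate_agent names are illustrative assumptions only; they do not represent SWE-bench's actual harness or Reflection's tooling. The point is that the agent is scored on whether its sequence of actions eventually resolves a task in an environment, not on a single static answer.

```python
# Hypothetical sketch: scoring an agent on multi-step task resolution
# rather than one-shot question answering. All names are assumptions.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Task:
    description: str
    # Resolution is judged from the final environment state, not an answer string.
    is_resolved: Callable[[dict], bool]

@dataclass
class Environment:
    state: dict = field(default_factory=dict)

    def step(self, action: str) -> str:
        # Apply the agent's action (e.g. edit a file, run tests) and return an observation.
        self.state.setdefault("log", []).append(action)
        return f"applied: {action}"

def evaluate_agent(agent_step: Callable[[str, str], str],
                   tasks: list[Task], max_steps: int = 10) -> float:
    """Fraction of tasks the agent resolves within the step budget."""
    solved = 0
    for task in tasks:
        env = Environment()
        observation = task.description
        for _ in range(max_steps):
            action = agent_step(task.description, observation)
            observation = env.step(action)
            if task.is_resolved(env.state):
                solved += 1
                break
    return solved / len(tasks)

# Toy usage: a task counts as resolved once the agent has issued a 'run tests' action.
tasks = [Task("fix the failing test", lambda state: "run tests" in state.get("log", []))]
print(evaluate_agent(lambda desc, obs: "run tests", tasks))  # 1.0
```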
Concerns Raised
Current post-training methods are a limiting factor for achieving next-generation agentic capabilities.
Existing static benchmarks are inadequate for measuring and guiding progress in agentic AI.
Long-context reasoning remains a difficult, unsolved problem for current model architectures.
Scaling the use of synthetic data for reasoning traces is challenging and risks model degradation if not handled carefully.
Opportunities Identified
Fundamentally rethinking pre-training offers a path to a step-change in AI capabilities.
Developing new, dynamic, and workflow-representative benchmarks will accelerate progress in the field.
High-quality data curation and the generation of 'reasoning traces' can unlock more powerful and efficient models (see the sketch after this list).
Building open, frontier-level agentic models from scratch presents a significant opportunity to advance the field.
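As a concrete illustration of the 'reasoning traces' idea, below is a minimal, hypothetical sketch of what a curated trace record and a simple quality filter might look like. The ReasoningTrace schema, its field names, and the curate heuristic are assumptions for illustration, not a format described in the episode; the verification gate reflects the concern above that scaling synthetic data without careful filtering risks model degradation.

```python
# Hypothetical schema for a curated 'reasoning trace' pre-training record.
# Field names and the filtering heuristic are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class ReasoningTrace:
    prompt: str
    steps: list[str]     # intermediate reasoning / tool-use steps
    final_answer: str
    verified: bool       # e.g. answer checked against an oracle or test suite

def curate(traces: list[ReasoningTrace], min_steps: int = 2) -> list[ReasoningTrace]:
    """Keep only verified, non-trivial traces to limit the risk that
    synthetic data degrades the model."""
    return [t for t in traces if t.verified and len(t.steps) >= min_steps]

# Toy usage: only the verified trace with at least two steps survives curation.
traces = [
    ReasoningTrace("2+2?", ["add 2 and 2"], "4", verified=True),
    ReasoningTrace("prove X", ["step 1", "step 2"], "QED", verified=True),
    ReasoningTrace("guess Y", ["hm"], "42", verified=False),
]
print(len(curate(traces)))  # 1
```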