TwiML AI Podcast • Dec 17, 2025 • 51:54 • Interview

Rethinking Pre-Training for Agentic AI [Aakanksha Chowdhery] - 759

Aakanksha Chowdhery • Member of Technical Staff, Reflection

Executive Summary

The current approach of fine-tuning pre-trained LLMs and measuring them against static benchmarks is insufficient for creating advanced, 'agentic' AI systems capable of complex, multi-step tasks.
A fundamental shift in pre-training is required, focusing on curated data (especially 'reasoning traces'), novel loss objectives, and potentially new architectures to build in core capabilities like planning and long-context reasoning from the start.
The field is moving beyond saturated static benchmarks (like MMLU) towards dynamic, workflow-oriented evaluations (like SWE-bench) that better measure an agent's ability to interact with an environment and solve problems over time.
The guest's company, Reflection, is building 'Frontier Open Agentic models' from the ground up, aiming to pioneer these new pre-training methods to create a step-change in AI agent capabilities.

Concerns Raised

Current post-training methods are a limiting factor for achieving next-generation agentic capabilities.
Existing static benchmarks are inadequate for measuring and guiding progress in agentic AI.
Long-context reasoning remains a difficult, unsolved problem for current model architectures.
Scaling the use of synthetic data for reasoning traces is challenging and risks model degradation if not handled carefully.

Opportunities Identified

Fundamentally rethinking pre-training offers a path to a step-change in AI capabilities.
Developing new, dynamic, and workflow-representative benchmarks will accelerate progress in the field.
High-quality data curation and the generation of 'reasoning traces' can unlock more powerful and efficient models.
Building open, frontier-level agentic models from scratch presents a significant opportunity to advance the field.

Key Themes

Rethinking Pre-training for Agentic AI

The conversation argues that achieving true agentic capabilities like planning and multi-step reasoning requires moving beyond post-training tweaks. Instead, these skills must be embedded during the pre-training phase by fundamentally re-engineering the training data, loss objectives, and potentially the model architecture.

This challenges the prevailing industry approach of simply fine-tuning general-purpose models. It suggests the next major leap in AI capability will come from building agentic skills into the foundational models themselves, not just layering them on top.

The Evolution of AI Benchmarking

Static benchmarks are becoming saturated and are poor measures of agentic intelligence. The discussion highlights a necessary shift towards dynamic, multi-step benchmarks that reflect real-world workflows, such as coding (SWE-bench) or complex research tasks, to properly evaluate planning and reasoning.

As the saying goes, 'you get what you measure.' New, more representative benchmarks are critical for guiding research and accurately assessing progress in agentic AI, shifting the target from simple Q&A to complex problem-solving.
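
The contrast between static and dynamic evaluation can be illustrated with a minimal sketch. Everything here is hypothetical and chosen only for illustration: the toy `CounterEnv` environment, the `greedy_policy` agent, and both scoring functions are assumptions, not anything described in the episode or taken from a real benchmark such as SWE-bench. The static scorer checks one answer per question, while the dynamic scorer lets the agent act over several steps and scores only the final outcome.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


def static_score(model: Callable[[str], str], items: Dict[str, str]) -> float:
    """Static evaluation: one question, one answer, exact-match scoring."""
    correct = sum(model(question) == answer for question, answer in items.items())
    return correct / len(items)


@dataclass
class CounterEnv:
    """Hypothetical toy environment: reach `target` using 'inc' / 'dec' actions."""
    target: int
    state: int = 0

    def step(self, action: str) -> int:
        # Apply the action and return the new observation.
        if action == "inc":
            self.state += 1
        elif action == "dec":
            self.state -= 1
        return self.state

    def solved(self) -> bool:
        return self.state == self.target


def dynamic_score(
    policy: Callable[[int, int], str],
    targets: List[int],
    max_steps: int = 10,
) -> float:
    """Dynamic evaluation: the agent acts for up to `max_steps` steps per task,
    and a task counts only if the environment ends in the solved state."""
    solved = 0
    for target in targets:
        env = CounterEnv(target=target)
        observation = env.state
        for _ in range(max_steps):
            if env.solved():
                break
            observation = env.step(policy(observation, target))
        solved += env.solved()
    return solved / len(targets)


def greedy_policy(observation: int, target: int) -> str:
    """Hypothetical agent: move one step toward the target."""
    return "inc" if observation < target else "dec"
```

In this sketch the step budget is part of the measurement: an agent that is correct at every step can still fail a task that needs more steps than `max_steps` allows, which is the kind of behavior over time that a single-answer benchmark cannot capture.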

The Primacy of Data Quality and Composition

The quality and composition of training data are paramount for building powerful models. The discussion emphasizes the importance of high-quality data curation and the need for more 'reasoning traces' in pre-training data, exploring the potential of synthetic data to augment this at scale without corrupting the data distribution.

This indicates that the frontier of AI development is not just about more data or compute, but about smarter, more strategic data. The ability to source or generate high-quality reasoning data is becoming a key competitive differentiator for model performance.

Reasoning as a Core Agentic Bottleneck

A key limitation of current models is their weak ability to perform robust, multi-hop reasoning over long contexts. This capability is essential for agents to plan, learn from past actions, and synthesize information over extended interactions, and it may not be optimally developed by the standard next-token prediction objective.

Solving the long-context reasoning problem is a critical hurdle for creating agents that can move beyond simple, reactive tasks to performing complex, goal-oriented workflows autonomously and reliably.

Topics

Agentic AI, LLM Pre-training, Foundation Models, AI Benchmarking, Multi-step Reasoning, Long-context Reasoning, Synthetic Data, Data Curation, Transformer Architecture, Loss Objectives, Coding Agents, Tool Use, Reinforcement Learning, Open Source AI, Reflection (company)

Processed Apr 2, 2026 yt-dlp + mlx-whisper + Gemini