Unsupervised Learning • May 8, 2025 • 47:00 • Interview

RFT Launch, How OpenAI Improves Its Models & the State of AI Agents Today

From Unsupervised Learning

Michelle Pokrass • Post-Training Research Lead, OpenAI

Executive Summary

OpenAI's GPT-4.1 was developed with a primary focus on developer usability, prioritizing real-world instruction-following and user feedback over traditional academic benchmarks.
A tiered model strategy, including the cheap and fast GPT-4.1 Nano, is designed to spur wider AI adoption by addressing different points on the cost-latency curve.
Reinforcement Fine-Tuning (RFT) is a new, highly data-efficient offering that lets developers push frontier capabilities on niche, verifiable problems using as few as 100 samples.
The future of model development at OpenAI is trending towards a single, general model that combines capabilities, simplifying the product line for both developers and consumers.


Concerns Raised

The difficulty of creating robust, real-world evaluations for long-context and complex instruction-following tasks.
Standardized benchmarks are becoming saturated and less representative of real-world model utility.
It is nearly impossible to create a single model version that pleases every user for every niche use case.

Opportunities Identified

Using Reinforcement Fine-Tuning (RFT) to push frontier capabilities in deep tech and other specialized domains.
Leveraging cheap, fast models like GPT-4.1 Nano to drive mass adoption of AI features.
Significant value remains to be built at the application layer on top of existing and near-future models.
Improving AI agent capabilities by solving the context bottleneck and leveraging generalized tool-use training.

Key Themes

Developer-Centric Model Development

GPT-4.1 represents a strategic shift from optimizing for benchmarks to solving real-world developer pain points. The development process was guided by an internal instruction-following evaluation built from actual API usage and user feedback, focusing on improving practical utility, formatting, and reliability.

This signals a maturation of the AI industry, where practical application and user experience are becoming more important differentiators than marginal gains on standardized tests.

The Evolution of Model Evaluation

The conversation emphasizes the limitations of static benchmarks and the growing importance of building custom evaluations ('evals') from real-world usage. These evals have a short shelf life (roughly 3 months) before they saturate, and they are crucial for identifying and fixing the most pressing model deficiencies, such as complex instruction following or long-context reasoning.

For companies building on AI, this highlights the need to develop internal evaluation capabilities to accurately measure model performance for their specific use cases, rather than relying solely on public leaderboards.
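As a rough illustration of what an internal instruction-following eval could look like, here is a minimal sketch. It is not OpenAI's eval: the names `EvalCase`, `is_json_object`, `at_most_n_words`, and `pass_rate` are hypothetical, and `model` stands in for whatever function calls the model you are testing. Each case pairs a prompt taken from real usage with a programmatic check of the output.

```python
import json
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    """One prompt plus a check that returns True when the instruction was followed."""
    prompt: str
    check: Callable[[str], bool]


def is_json_object(output: str) -> bool:
    """Check a formatting instruction: the whole output must parse as a JSON object."""
    try:
        return isinstance(json.loads(output), dict)
    except json.JSONDecodeError:
        return False


def at_most_n_words(n: int) -> Callable[[str], bool]:
    """Build a check for a length instruction: at most n whitespace-separated words."""
    return lambda output: len(output.split()) <= n


def pass_rate(model: Callable[[str], str], cases: List[EvalCase]) -> float:
    """Run every case through the model and return the fraction of checks that pass."""
    if not cases:
        raise ValueError("need at least one eval case")
    passed = sum(1 for case in cases if case.check(model(case.prompt)))
    return passed / len(cases)
```

Tracking this pass rate across model versions gives a use-case-specific number to compare against public leaderboard results. Because such evals go stale, the case list would need regular refreshing from recent usage.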

The Renaissance of Fine-Tuning

There is renewed optimism about fine-tuning, driven by new techniques like Reinforcement Fine-Tuning (RFT). RFT is presented as a powerful, data-efficient method (requiring as few as 100 samples) that uses RL processes similar to those in OpenAI's internal training, enabling developers to push frontier capabilities in specialized domains like deep tech.

This empowers startups and enterprises to solve previously intractable problems where off-the-shelf models fail, creating new opportunities for defensibility in niche, data-rich areas.
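The summary describes RFT as suited to "verifiable" problems, meaning an output can be scored automatically. The sketch below shows one way such a scoring function might look for a task with a numeric answer. It is an assumption for illustration only: the source does not describe OpenAI's grader configuration, and `extract_final_number`, `grade`, and the `rel_tol` parameter are hypothetical names.

```python
import re
from typing import Optional


def extract_final_number(text: str) -> Optional[float]:
    """Return the last number that appears in the text, or None if there is none."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text)
    return float(matches[-1]) if matches else None


def grade(model_output: str, reference: float, rel_tol: float = 0.01) -> float:
    """Score an answer in [0, 1]: full credit within rel_tol, partial credit otherwise."""
    predicted = extract_final_number(model_output)
    if predicted is None:
        return 0.0  # no numeric answer found, so nothing to verify
    # Relative error, guarded against division by zero for a zero reference.
    error = abs(predicted - reference) / max(abs(reference), 1e-9)
    if error <= rel_tol:
        return 1.0
    return max(0.0, 1.0 - error)
```

A graded score rather than a strict pass/fail gives a reinforcement signal even when the answer is only close, which is one plausible design choice when only around 100 samples are available.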

Strategic Model Tiering and Adoption

The release of a family of models (GPT-4.1, Mini, Nano) is a deliberate strategy to accelerate AI adoption. The hypothesis is that providing cheap, fast, and capable models like Nano will unlock a new wave of applications that were previously constrained by cost and latency, catering to the full spectrum of market needs.

This market segmentation strategy indicates that the AI platform war will be fought not just on peak performance but also on accessibility, cost-effectiveness, and speed, enabling a broader range of products.

Convergence Towards Generalization

Despite the focus on specialized fine-tuning and tiered models, the long-term vision is a convergence towards a single, powerful, general model. Internal research shows that combining capabilities (e.g., multimodality, tool use, conversational ability) into one model produces superior results, suggesting that current product differentiation may be a step towards a unified, simpler offering.

This trend suggests that companies should build applications with an eye towards future models that will natively integrate many capabilities, potentially making complex, multi-model scaffolding obsolete.


Topics

GPT-4.1, OpenAI, Model Evaluation, Developer Experience, Fine-Tuning, Reinforcement Fine-Tuning (RFT), AI Adoption, Instruction Following, AI Agents, Benchmarks, Multimodality, Context Windows, Model Tiers, AI Strategy, Deep Tech

Processed Apr 3, 2026 yt-dlp + mlx-whisper + Gemini