Reinforcement Learning has a fundamentally higher performance ceiling than Supervised Fine-Tuning, which makes it the better path for pushing model capabilities, including past the quality of the human-generated training data.
The AI industry is already in a recursive self-improvement loop, and the threshold for its acceleration is low, requiring only that models become better than humans at identifying and solving research bottlenecks.
Specialized models fine-tuned with RL are a pragmatic and often superior alternative to frontier models, offering better performance on specific tasks at a fraction of the cost and latency.
The primary factor holding back Chinese AI labs from reaching parity with the US frontier is not a lack of talent or algorithmic innovation, but a structural deficit in access to large-scale compute.
The immense cost of frontier training runs (hundreds of millions of dollars) makes post-training alignment, and fixes for issues such as reward hacking, critically important, because retraining from scratch is economically infeasible.
2017
Corbitt identifies John Schulman's PPO (Proximal Policy Optimization) algorithm as the foundational predecessor to modern RL techniques used for LLMs.
Recent Past
Corbitt notes the emergence and prominence of the GRPO algorithm, highlighting its key innovation of discarding the value model from PPO and its successful scaling by DeepSeek.
Current
Corbitt describes the current state of AI training, where frontier models are used as judges in RL setups, a cottage industry of RL environment providers is booming, and CoreWeave customers are implementing continuous learning pipelines.
Near Future (Speculative)
Based on the current trajectory of recursive self-improvement, Corbitt predicts that automated labs for physical sciences could become a major economic factor within two to three years.
Long Term (Speculative)
Corbitt predicts that in the long run, the sheer abundance of compute will render the practice of paying humans for data generation obsolete.
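The GRPO entry in the timeline above notes that the algorithm discards PPO's value model. A minimal sketch of what can replace it: each sampled completion is scored against the other completions drawn for the same prompt, so the group's own statistics serve as the baseline. The function name, the `eps` constant, the use of the population standard deviation, and the reward values are all illustrative assumptions, not details from Corbitt's remarks. This covers only the advantage step and omits the rest of the objective (the clipped policy ratio and the KL penalty).

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each completion relative to its own sampling group.

    `rewards` holds one scalar reward per completion sampled for a single
    prompt. The group mean acts as the baseline that a learned value model
    would otherwise provide in PPO. (Hypothetical helper for illustration.)
    """
    baseline = mean(rewards)
    spread = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - baseline) / (spread + eps) for r in rewards]


# Hypothetical rewards for four completions of one prompt,
# e.g. 1.0 if a judge accepted the answer and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group where every completion scores the same carries no learning signal.
flat = group_relative_advantages([0.5, 0.5, 0.5])
```

In this sketch, completions that beat their group's average get a positive advantage and the rest get a negative one, which is why no separate value network is needed to estimate a baseline.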
The Superiority of Reinforcement Learning for Fine-Tuning (May 2026)
Corbitt presents a detailed technical case for why Reinforcement Learning (RL) is a more effective fine-tuning technique than Supervised Fine-Tuning (SFT). He argues that RL structurally makes smaller, more targeted changes to a model's weights, reducing catastrophic forgetting and enabling a higher performance ceiling, even when compared to SFT with high-quality human data.
For investors, this suggests that the key value differentiator in the long term may not be access to proprietary data for SFT, but rather the expertise and infrastructure to execute complex RL-based training, elevating the importance of specialized MLOps and compute providers.
The Economics of AI Specialization (May 2026)
Corbitt champions the business case for using smaller, specialized models fine-tuned via RL over relying on expensive, high-latency frontier models. He highlights that customers can achieve superior performance on specific tasks with significantly lower per-token costs and reduced latency, which is often the primary driver for adoption.
This theme indicates a maturing market where a 'one-size-fits-all' frontier model approach is insufficient, creating a significant market opportunity for companies like CoreWeave that provide the tools for cost-effective model specialization.
Accelerating Recursive Self-Improvement (May 2026)
Corbitt holds a strong conviction that the AI industry is already in a recursive self-improvement loop, where AI tools are used to build better AI tools. He believes the threshold for this loop to accelerate dramatically is low: a model only needs to be better than the smartest humans at solving research bottlenecks. He predicts major economic impacts, such as automated science labs, within two to three years.
Analysts should consider this a high-conviction, high-impact thesis; if Corbitt is correct, the pace of technological and economic disruption in the coming years will far exceed conventional forecasts, reordering strategic priorities across industries.
The Geopolitical AI Landscape and Compute Constraints (May 2026)
Corbitt analyzes the strategies of Chinese AI labs, noting their effective use of distillation and RL with American models as judges to 'fast-follow' the frontier. However, he identifies their primary bottleneck not as talent or algorithms, but as a fundamental lack of access to sufficient compute, which prevents them from closing the gap with leading US labs.
This perspective reinforces the strategic importance of semiconductor export controls and highlights that the global AI race is, at its core, a competition for computational resources, making infrastructure providers a critical geopolitical chokepoint.