Public benchmarks like LM Arena incentivize models to produce superficially impressive outputs, such as longer, emoji-filled, and heavily formatted text, which users prefer in quick A/B tests. This 'clickbait' optimization can lead to models that are factually incorrect and verbose, and can even cause model performance to regress over time without proper internal measurement.
Frontier AI labs are not optimizing for the same goals. OpenAI is reportedly focused on user engagement metrics like session length and daily active users, while Anthropic is optimizing for productivity and the economic value its models generate. This fundamental difference in objective functions will lead to distinct types of AI with different capabilities and societal impacts.
The next frontier for improving AI involves training models in dynamic RL environments, not just on static datasets. This approach, exemplified by Meta's GAIA benchmark, lets models learn through action and feedback in complex simulations, which is essential for developing agentic capabilities. The shift has spurred a new wave of startup activity focused on building these environments.
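To make the contrast with a static dataset concrete, here is a minimal sketch of the action-and-feedback loop that an RL environment provides. Everything in it is a hypothetical illustration rather than anything described in the source: `CounterEnv`, its `reset`/`step` interface, the reward rule, and `run_episode` are invented for this toy example, and real agentic environments are far richer.

```python
from dataclasses import dataclass


@dataclass
class CounterEnv:
    """Hypothetical toy environment: move a counter from 0 to `target`."""
    target: int = 3
    max_steps: int = 10
    state: int = 0
    steps: int = 0

    def reset(self) -> int:
        """Start a new episode and return the initial observation."""
        self.state = 0
        self.steps = 0
        return self.state

    def step(self, action: int):
        """Apply an action (+1 or -1); return (observation, reward, done)."""
        if action not in (1, -1):
            raise ValueError("action must be +1 or -1")
        self.state += action
        self.steps += 1
        reached = self.state == self.target
        done = reached or self.steps >= self.max_steps
        reward = 1.0 if reached else 0.0  # feedback comes from the environment
        return self.state, reward, done


def run_episode(env, policy):
    """Roll out one episode; return total reward and the visited trajectory."""
    obs = env.reset()
    total_reward = 0.0
    trajectory = []
    done = False
    while not done:
        action = policy(obs)  # the agent acts on what it observes
        next_obs, reward, done = env.step(action)
        trajectory.append((obs, action, reward))
        total_reward += reward
        obs = next_obs
    return total_reward, trajectory


if __name__ == "__main__":
    total, steps = run_episode(CounterEnv(), lambda obs: 1)
    print(total, len(steps))
```

Unlike a fixed set of labeled examples, the data here (the trajectory) depends on what the policy chooses to do, and the reward is only revealed after acting. A training algorithm would use those trajectories to update the policy; that part is deliberately left out of this sketch.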
The initial belief in 'one model to rule them all' is giving way to a future with a 'constellation' of different, specialized models. Edwin Chen predicts that every company will eventually need to train its own foundation models to achieve the best performance for its unique domain and use cases. This suggests a move away from reliance on a few general-purpose APIs and towards a more decentralized, customized AI landscape.
The quality of data and the rigor of the evaluation process are paramount, as poor data can lead to months of negative progress even as the rest of the industry advances. Surge positions itself as a technology-first company that uses a meritocratic platform to measure and ensure the quality of human-generated data, contrasting with competitors who act more like staffing agencies. This focus on quality is critical for creating genuinely intelligent models, not just ones that are good at passing superficial tests.