The Cognitive Revolution • Feb 22, 2026 • 54:25 • Interview

Intelligence with Everyone: RL @ MiniMax, with Olive Song, from AIE NYC & Inference by Turing Post

Olive Song • Senior Researcher, MiniMax

Executive Summary

MiniMax, a Chinese AI company, has gained prominence with its M2 series of open-weight models, which specialize in coding and workplace agentic tasks and have topped usage leaderboards such as OpenRouter.
The company follows an integrated strategy, developing both foundation models and user-facing applications in-house. This creates a tight feedback loop in which its expert developers help build reward models and iterate rapidly.
Key technical innovations include an "interleaved thinking" pattern for long-horizon agentic tasks and a critical discovery that maintaining the language model head in FP32 precision is essential for stable reinforcement learning.
While acknowledging a performance gap with top-tier closed American models, MiniMax is focused on advancing open-weight capabilities through deep engineering, systematic generalization, and a strong emphasis on human alignment and safety.

Concerns Raised

Current open-weight models, including their own, do not match the performance of top-tier American models.
Models exhibit 'reward hacking' during reinforcement learning, requiring constant vigilance and refinement.
Ensuring safety and alignment for powerful open-weight models once they are released 'in the wild' is an unresolved challenge.
Current models have difficulty adapting and generalizing to new and different environments.

Opportunities Identified

Leveraging the in-house developer team as a source for high-quality reward models and rapid feedback.
Improving long-horizon task performance through advanced techniques like 'interleaved thinking'.
Significant potential for improvement in coding, memory management, and proactive AI capabilities for workplace applications.
Exploring future capabilities where models can define their own goals, pushing the boundaries of agentic AI.

Key Themes

Integrated Development and Feedback Loops

MiniMax's strategy of developing both foundation models and end-user applications internally allows for a rapid and direct feedback cycle. Its in-house developers not only use the models but also participate in the training process, serving as expert sources for reward models and identifying key behaviors to cultivate.

This vertically integrated approach accelerates the identification of model weaknesses and the development of features that developers actually need, potentially providing a competitive advantage in creating highly practical and aligned AI tools.

Advanced Reinforcement Learning in Practice

The discussion covers the practical, often tedious, realities of training models with reinforcement learning. Key techniques include "interleaved thinking" for complex, multi-step tasks and the hard-won discovery that the LM head must be kept in FP32 precision. That change stabilizes training and closes the gap between theoretical algorithms and practical implementation.

This highlights that frontier AI progress often hinges on meticulous engineering and debugging, not just novel algorithms. These specific insights are valuable for practitioners facing similar challenges with reward hacking and training instability.
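To make the FP32 point concrete, here is a minimal sketch of why the precision of the LM head output can matter. It is an illustration under stated assumptions, not MiniMax's training code: the hidden state and head weights are random, bfloat16 is emulated by rounding with the standard library, and the helper names (`round_to_bf16`, `head_logprobs`, `max_logprob_gap`) are hypothetical. The assumed mechanism is that rounding logits to a lower precision shifts token log-probabilities, and RL objectives that compare log-probabilities across systems are sensitive to such shifts.

```python
import math
import random
import struct


def round_to_fp32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def round_to_bf16(x: float) -> float:
    """Emulate bfloat16: keep the top 16 bits of the float32 encoding,
    using round-to-nearest-even (NaN handling omitted in this sketch)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def head_logprobs(hidden, weight, cast):
    """Toy LM head: one dot product per vocabulary row, with each logit
    rounded to the precision given by `cast` before the softmax."""
    logits = [cast(sum(h * w for h, w in zip(hidden, row))) for row in weight]
    return log_softmax(logits)


def max_logprob_gap(seed=0, hidden_size=64, vocab_size=256):
    """Largest per-token log-prob difference between an FP32 head output
    and a bfloat16 head output for one random hidden state."""
    rng = random.Random(seed)
    hidden = [rng.gauss(0, 1) for _ in range(hidden_size)]
    weight = [[rng.gauss(0, 1) for _ in range(hidden_size)]
              for _ in range(vocab_size)]
    lp_fp32 = head_logprobs(hidden, weight, round_to_fp32)
    lp_bf16 = head_logprobs(hidden, weight, round_to_bf16)
    return max(abs(a - b) for a, b in zip(lp_fp32, lp_bf16))


if __name__ == "__main__":
    gap = max_logprob_gap()
    # exp(gap) is the worst-case probability ratio between the two heads.
    print(f"max |delta log p| = {gap:.5f}, ratio up to {math.exp(gap):.5f}")
```

The sketch only rounds the final logits; a real low-precision head would also round inside the matrix multiply, so the toy gap is, if anything, an optimistic picture of the mismatch.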

Open-Weight Strategy and Global Competition

MiniMax has chosen to release its M-series as open-weight models, aiming to collaborate with and contribute to the global open-source community. The speaker candidly acknowledges that their models don't yet match top proprietary US models but are highly competitive within the open-source ecosystem, demonstrating a strategic focus on this segment.

This provides a direct perspective from a leading Chinese AI lab, illustrating the dynamics of the global AI landscape where open-source is a key arena for competition and collaboration, distinct from the race for AGI dominance among closed-model labs.

Engineering for Generalization and Robustness

A significant focus for MiniMax is ensuring its models generalize well to different environments and agentic frameworks. The team employs systematic data pipelines that perturb the training environment to build robustness, addressing a common weakness in current open-weight models, which struggle to adapt outside their specific training setups.

As AI agents become more prevalent, the ability to perform reliably across diverse, noisy, real-world environments is critical. MiniMax's focus on engineered generalization is a key step toward creating more practical and dependable agentic AI.
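As a rough illustration of what perturbing a training environment could look like, the sketch below varies surface details of an agent task while leaving the underlying goal untouched. Everything in it is an assumption made for illustration: the episode does not say which dimensions MiniMax perturbs, and the task schema, `TOOL_ALIASES`, `SYSTEM_PROMPT_TEMPLATES`, and `perturb_environment` are hypothetical names.

```python
import copy
import random

# Hypothetical alias table: alternative surface names for the same tool.
TOOL_ALIASES = {
    "read_file": ["read_file", "cat", "open_file"],
    "run_shell": ["run_shell", "bash", "execute"],
}

# Hypothetical prompt scaffolds, standing in for different agent frameworks.
SYSTEM_PROMPT_TEMPLATES = [
    "You are a coding agent. Task: {goal}",
    "Complete the following task using the available tools.\n{goal}",
    "TASK\n{goal}\nUse tools when needed.",
]


def perturb_environment(task: dict, rng: random.Random) -> dict:
    """Return a copy of `task` with tool order, tool names, and the system
    prompt varied. The goal itself is unchanged, so the same reward check
    still applies to the perturbed variant."""
    variant = copy.deepcopy(task)

    # Vary the order in which tools are presented to the model.
    rng.shuffle(variant["tools"])

    # Rename tools, keeping a map back to canonical names so the harness
    # can still dispatch calls correctly.
    alias_to_canonical = {}
    for tool in variant["tools"]:
        canonical = tool["name"]
        alias = rng.choice(TOOL_ALIASES.get(canonical, [canonical]))
        alias_to_canonical[alias] = canonical
        tool["name"] = alias
    variant["alias_to_canonical"] = alias_to_canonical

    # Vary the scaffold text wrapped around the goal.
    template = rng.choice(SYSTEM_PROMPT_TEMPLATES)
    variant["system_prompt"] = template.format(goal=task["goal"])
    return variant


if __name__ == "__main__":
    base = {
        "goal": "fix the failing test",
        "tools": [{"name": "read_file"}, {"name": "run_shell"}],
    }
    for seed in range(3):
        v = perturb_environment(base, random.Random(seed))
        print([t["name"] for t in v["tools"]], "|", v["system_prompt"][:30])
```

Seeding the generator per sample keeps each variant reproducible, which helps when a perturbed environment exposes a failure that needs to be replayed.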

Topics

MiniMax, Chinese AI, M2 Model, Open-Weight Models, Reinforcement Learning (RL), Agentic AI, AI Agents, Tool Use, Interleaved Thinking, Long-Horizon Tasks, Reward Hacking, Model Alignment, AI Safety, FP32 Precision, Model Training, Coding Models, Workplace AI

Processed Apr 3, 2026 yt-dlp + mlx-whisper + Gemini