The Cognitive Revolution • Mar 5, 2026 • 1:49:52 • Interview

Don't Fight Backprop: Goodfire's Vision for Intentional Design, w/ Dan Balsam & Tom McGrath

Dan Balsam & Tom McGrath • CTO & Chief Scientist, Goodfire

Executive Summary

Mechanistic interpretability startup Goodfire has raised a $150M Series B at a $1.25B valuation, signaling significant commercial and scientific interest in understanding and controlling AI models.
Goodfire introduced a new research paradigm called 'intentional design,' which aims to proactively shape what models learn during training, moving beyond simply reverse-engineering them after the fact.
A key proof-of-concept for intentional design is a technique that reduces hallucinations by using a probe on a frozen copy of the model during reinforcement learning, making it harder for the model to evade detection.
The episode highlights real-world applications of interpretability, including a collaboration with Prima Mente that uncovered a novel biomarker in an Alzheimer's prediction model, demonstrating its value for scientific discovery.

Concerns Raised

The risk that models will learn to evade monitoring probes (reward hacking or obfuscation) rather than changing their behavior.
The immaturity of 'intentional design' techniques, which are not yet considered safe for use on frontier AI models.
The inherent difficulty and vast complexity of fully understanding the internal workings of large neural networks.

Opportunities Identified

Using 'intentional design' to proactively build safer, more reliable, and more controllable AI systems.
Applying interpretability as a tool for scientific discovery, capable of auditing complex models and uncovering novel mechanisms in fields like healthcare.
Improving model performance and efficiency by identifying and removing non-essential components, such as 'memorization weights'.
Developing a commercial 'interpretability powered stack' for enterprise use in auditing, monitoring, and controlling AI models.

Key Themes

The Evolution of Mechanistic Interpretability

The field of mechanistic interpretability is advancing from simply reverse-engineering trained models to actively shaping their learning processes. This shift, exemplified by Goodfire's 'intentional design' agenda, moves from identifying concepts and circuits to controlling how models generalize and what they learn during training.

This evolution is crucial for building more reliable, controllable, and ultimately safer AI systems, moving from passive observation to active engineering of model behavior.

Intentional Design and Model Control

Goodfire introduces 'intentional design' as a new paradigm for AI development, focusing on proactively controlling what models learn by shaping the loss landscape during training. This contrasts with post-hoc analysis and aims to build desired properties into models from the ground up, such as reducing hallucinations without 'fighting backpropagation'.

This approach could lead to more robust and aligned AI by making it easier for models to learn desired behaviors than to develop unwanted ones or evade safety mechanisms.

The Challenge of Reward Hacking and Evasion

A central challenge in AI safety is 'reward hacking,' where a model learns to fool its monitoring systems instead of correcting its behavior. The episode details Goodfire's technique of using a probe on a frozen, separate model during training to make genuine behavioral change the path of least resistance, thus mitigating evasion.

Overcoming reward hacking is a critical step toward building trustworthy AI. The techniques discussed represent a practical, though still early, attempt to solve this fundamental alignment problem.
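
The frozen-probe idea described above can be illustrated with a minimal sketch. This is not Goodfire's implementation; it only assumes that a logistic probe scores activations from a frozen copy of the model and that the score is subtracted from the task reward during reinforcement learning. All names here (`probe_score`, `shaped_reward`, `toy_frozen_encode`, the probe weights) are hypothetical, and the toy encoder stands in for a real frozen network.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def probe_score(activations: Vector, weights: Vector, bias: float) -> float:
    """Logistic probe: estimated probability that the response is hallucinated."""
    logit = sum(a * w for a, w in zip(activations, weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def shaped_reward(
    response: str,
    task_reward: float,
    frozen_encode: Callable[[str], Vector],
    weights: Vector,
    bias: float,
    penalty: float = 1.0,
) -> float:
    """Task reward minus a penalty from the probe.

    The probe reads activations produced by a frozen copy of the model,
    so training the policy never updates the network the probe looks at.
    In this sketch the only way to lower the penalty is to change the
    response itself.
    """
    activations = frozen_encode(response)
    return task_reward - penalty * probe_score(activations, weights, bias)


def toy_frozen_encode(response: str) -> Vector:
    """Hypothetical stand-in for a frozen model's activations (assumption)."""
    return [1.0 if "unsupported" in response else 0.0]


# Hypothetical probe parameters, chosen only for illustration.
PROBE_WEIGHTS = [4.0]
PROBE_BIAS = -2.0

grounded = shaped_reward(
    "a grounded answer", 1.0, toy_frozen_encode, PROBE_WEIGHTS, PROBE_BIAS
)
flagged = shaped_reward(
    "an unsupported claim", 1.0, toy_frozen_encode, PROBE_WEIGHTS, PROBE_BIAS
)
print(grounded > flagged)
```

In this toy setup the response the probe flags receives the lower shaped reward. Because the encoder and probe are held fixed, the policy cannot raise its reward by reshaping the representations the probe reads, which is the intuition behind making genuine behavioral change the path of least resistance.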

Commercializing Interpretability for Real-World Impact

The discussion highlights the successful commercialization of advanced AI research, evidenced by Goodfire's unicorn valuation and work with clients in life sciences and finance. A case study on an Alzheimer's model reveals how interpretability can yield concrete scientific insights, such as identifying a model's reliance on DNA fragment length.

The commercial viability of interpretability provides a powerful incentive and funding mechanism for advancing the science, accelerating progress toward safer and more effective AI for practical applications in science and enterprise.

Topics

Mechanistic Interpretability, Intentional Design, AI Safety, AI Alignment, Hallucination Reduction, Reward Hacking, Probe-based Training, Model Control, Loss Landscape, AI in Healthcare, Alzheimer's Research, Model Auditing, Venture Capital, AI Startups, Goodfire

Processed Apr 3, 2026 • yt-dlp + mlx-whisper + Gemini