▶ Goodfire developed and uses a 'frozen model' technique, in which an interpretability detector run on a static copy of a model generates a training signal for a separate, actively trained model. (Apr 2026)
▶ The company has publicly announced a new research agenda, 'intentional design', which aims to proactively control what models learn during training. (Apr 2026)
▶ Goodfire's research has demonstrated the ability to distinguish between model weights used for memorization and those used for general reasoning, and has shown that removing the former can improve performance on some tasks. (Apr 2026)
▶ Key personnel, including Chief Scientist Tom McGrath, advocate a philosophy of 'not fighting backpropagation': rather than overriding training, they seek to shape the loss landscape so that models learn the desired behavior naturally. (Apr 2026)
▶ The core technique of using an interpretability signal in a training loop is described by one expert as 'the most forbidden technique' and 'extraordinarily bad', while Goodfire presents its version as a necessary safety measure that prevents a model from learning to evade the detector. (Apr 2026)
▶ An external expert predicts that Goodfire's frozen-model technique will fail catastrophically when applied to advanced AI, whereas Goodfire's Chief Scientist states that current interpretability techniques are not ready for, and should not be applied to, frontier systems. (Apr 2026)
▶ Goodfire officially states that it is applying its techniques to low-stakes problems such as hallucinations, but external criticism frames the methods in the context of high-stakes alignment and catastrophic outcomes. (Apr 2026)
▶ There is a contrast in the perceived purpose of Goodfire's interpretability work: critics focus on its controversial use as a direct training-control mechanism, while the company defines its purpose more broadly, encompassing scientific discovery, auditing, and intentional design. (Apr 2026)
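The structure of the frozen-model idea described above can be illustrated with a toy sketch. This is an illustrative assumption, not Goodfire's actual implementation: the model, the detector, and the penalty form are all invented here. The point is purely structural: gradient updates flow only into the active model, while the detector only ever inspects a static snapshot taken before training began, so the live model cannot be optimized to fool the detector itself.

```python
# Hypothetical sketch of a 'frozen model' training loop (toy example,
# not Goodfire's code). A one-parameter linear model is trained on a
# task loss; an extra penalty signal comes from a detector that only
# reads a frozen snapshot of the weights.
import copy

def detector(model, x):
    """Stand-in 'interpretability detector' (invented for illustration):
    flags an input when the frozen model's score exceeds a threshold."""
    return model["w"] * x > 1.0

def train_step(active, frozen, batch, lr=0.01, lam=0.5):
    """One gradient step on the active model. The detector signal is
    computed from the FROZEN copy, never from the active weights."""
    for x, y in batch:
        pred = active["w"] * x
        grad = 2.0 * (pred - y) * x    # MSE task-loss gradient
        if detector(frozen, x):        # signal from the frozen snapshot
            grad += lam * x            # penalty nudges the weight down
        active["w"] -= lr * grad

active = {"w": 2.0}
frozen = copy.deepcopy(active)         # static snapshot, never updated
for _ in range(100):
    train_step(active, frozen, [(1.0, 1.0)])
```

Because `frozen` is never updated, the detector's verdicts stay fixed for the whole run; the active model converges toward a compromise between the task loss and the penalty rather than toward weights that merely evade detection.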