“An interpretability agent, a version of Claude with access to interpretability tools, successfully identified a subtle, intentionally trained malicious behavior in a test model, demonstrating that AI can be used to audit other AIs.”

Trenton Bricken, AI Safety
