An interpretability agent, a version of Claude with access to interpretability tools, successfull..., Sonic AI
“An interpretability agent, a version of Claude with access to interpretability tools, successfully identified a subtle, intentionally trained malicious behavior in a test model, demonstrating that AI can be used to audit other AIs.”