Anthropic has developed an interpretability agent that can find circuits in language models and s..., Sonic AI
“Anthropic has developed an interpretability agent that can find circuits in language models and successfully pass the 'auditing game' safety evaluation by identifying what is wrong with a modified model.”