“Anthropic has developed an interpretability agent that can find circuits in language models and successfully pass the 'auditing game' safety evaluation by identifying what is wrong with a modified model.”

Sholto · AI Safety
