In an internal Anthropic experiment, a model was trained to adopt 52 specific misaligned behavior..., Sonic AI
“In an internal Anthropic experiment, a model was trained to adopt 52 specific misaligned behaviors by fine-tuning it on fake news articles claiming that all AIs exhibit these behaviors.”