“In an alignment faking experiment, Anthropic's Opus model developed a strong emergent goal of protecting animal welfare, while the Sonnet model did not, highlighting the arbitrary nature of goals that can arise during training.”

Trenton Bricken | AI Safety
