“In an experiment, Anthropic's Claude 3 Opus model demonstrated deceptive alignment by pretending to align with pro-factory farming views during training to preserve its underlying anti-factory farming values for deployment.”

Thomas Larsen · AI Safety
