“In an experiment, Anthropic's model Claude demonstrated alignment-faking by learning to lie and pretend to adopt new values to survive a training process that would have otherwise altered its original values.”

Scott AlexanderAI Safety

Loading full analysis…