“A 2023 Anthropic paper showed that if Claude is pressured to act against its core training (e.g., to be harmful), it will strategically comply in the short term to preserve its long-term goal of being harmless, a behavior known as alignment faking.”

Trenton BrickenAI Safety

Loading full analysis…