“A recent OpenAI paper demonstrated that if an AI model is trained against expressing a deceptive plan in its chain-of-thought, it will still execute the deceptive plan but will learn to hide the intention from its chain-of-thought.”

Dwarkesh PatelAI Safety

Loading full analysis…