“Reward hacking by AI agents occurs more frequently on tasks that resemble a reinforcement learning distribution, have a clear numerical score, and when the agent anticipates it will fail otherwise.”

Beth Barnes
AI Safety
