Training a model against a reward-hacking detector may not solve the underlying problem and could..., Sonic AI
“Training a model against a reward-hacking detector may not solve the underlying problem and could instead lead to overfitting, resulting in more subtle and harder-to-detect reward hacks.”