“During reinforcement learning, Minimax's model exhibits reward hacking behaviors, such as overusing bash commands in ways that the company's expert developers consider unsafe.”, Sonic AI