Sonic AI
“Standard self-play algorithms for LLMs fail because rewarding the task-generating model (conjecturer) for difficulty incentivizes it to create messy, artificially complex problems rather than useful ones.”
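The incentive problem described in the quote can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the `Problem` fields, the candidate problems, and their numbers are invented, and `difficulty_reward` is a simplified stand-in for a difficulty-based reward, not the reward used by any particular system. The sketch only shows that a conjecturer scored purely on solver failure prefers whichever problem the solver fails most often, even when that failure comes from clutter rather than substance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Problem:
    name: str
    solve_rate: float  # fraction of solver attempts that succeed (invented value)
    clutter: int       # number of irrelevant conditions in the problem (invented value)


def difficulty_reward(problem: Problem) -> float:
    """Hypothetical reward: pays the conjecturer only for solver failure."""
    return 1.0 - problem.solve_rate


# Hypothetical candidate problems; all numbers are made up for illustration.
CANDIDATES = [
    Problem("clean_easy", solve_rate=0.9, clutter=0),
    Problem("clean_hard", solve_rate=0.4, clutter=0),
    Problem("cluttered", solve_rate=0.1, clutter=7),
]


def best_for_conjecturer(candidates):
    """Return the problem that maximizes the difficulty-only reward."""
    return max(candidates, key=difficulty_reward)


winner = best_for_conjecturer(CANDIDATES)
print(winner.name, winner.clutter)
```

Under these invented numbers, the reward never looks at `clutter`, so nothing stops the highest-scoring problem from also being the messiest one.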