“In practice, running self-play with large language models for an extended period results in a performance plateau where the model ceases to improve.”