“The performance of large language models on standard benchmarks has plateaued, with no significant improvement over the past three years.”