“Standard LLM benchmarks have become largely saturated, as most frontier models are capable of topping almost every available benchmark.”