“The proper way to evaluate AI models is to either set a fixed budget for the benchmark or to plot performance as a function of test-time compute.”
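As a rough illustration of the two options named in the quote, the sketch below scores a model at one fixed per-task budget and then repeats that over several budgets to get the points for a performance-versus-compute plot. Everything in it is an assumption made for illustration: the `Attempt` record, the function names, the rule that attempts are consumed in order until the budget runs out, and the sample data. It is not taken from any particular benchmark harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:
    """One logged attempt at a task (hypothetical record format)."""
    task_id: str
    cost: float   # test-time compute spent, in any consistent unit
    solved: bool  # whether this attempt produced a correct answer


def accuracy_at_budget(attempts, budget):
    """Fraction of tasks solved when each task may spend at most `budget`.

    Attempts are consumed in the order given. Once an attempt would push a
    task past the budget, that attempt and all later ones are ignored.
    """
    spent = {}
    solved = {}
    for a in attempts:
        solved.setdefault(a.task_id, False)
        total = spent.get(a.task_id, 0.0) + a.cost
        if total > budget:
            # Budget exhausted for this task: block any further attempts.
            spent[a.task_id] = float("inf")
            continue
        spent[a.task_id] = total
        solved[a.task_id] = solved[a.task_id] or a.solved
    return sum(solved.values()) / len(solved) if solved else 0.0


def accuracy_curve(attempts, budgets):
    """(budget, accuracy) pairs, sorted by budget, ready to plot."""
    return [(b, accuracy_at_budget(attempts, b)) for b in sorted(budgets)]


# Made-up sample data: three tasks with different compute needs.
EXAMPLE_ATTEMPTS = [
    Attempt("t1", 1.0, True),
    Attempt("t2", 1.0, False),
    Attempt("t2", 2.0, True),
    Attempt("t3", 4.0, False),
    Attempt("t3", 4.0, True),
]

if __name__ == "__main__":
    for budget, acc in accuracy_curve(EXAMPLE_ATTEMPTS, [1.0, 3.0, 8.0]):
        print(f"budget={budget:>4}: accuracy={acc:.2f}")
```

With this sample data the same model scores differently at each budget, which is why a single number reported without its budget is hard to compare across models.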