“The organization METR has developed a benchmark involving 'software atomic actions' that provides tasks of varying lengths to evaluate AI models.”