“About a year ago, large language models like Claude and Gemini performed very poorly on scientific analysis benchmarks for physics and chemistry.”