“OpenAI created a medical benchmark called HealthBench that contains 49,000 distinct evaluation criteria.”