“On OpenAI's HealthBench Hard benchmark, model performance has improved from GPT-4o's initial score of 0% to a current score of approximately 40%.”