In developing the HealthBench benchmark, OpenAI found that its model-based grader was more accura..., Sonic AI
“In developing the HealthBench benchmark, OpenAI found that its model-based grader was more accurate at evaluating AI responses than the average human physician grader.”