“Misha Laskin argues that performance on benchmarks like the "Humanities Last Exam" has at best a weak correlation to meaningful end-user value and is likely not a good measure of real-world utility.”

Misha LaskinAI / ML

Loading full analysis…