“There is still a significant performance gap where AI models perform better on tasks with verifiable outcomes compared to non-verifiable tasks.”