“Human evaluation remains the gold standard and an indispensable part of the loop for assessing the quality and usefulness of AI models.”