Sonic AI:
“Human evaluation of LLMs in side-by-side "chatbot arena" formats can be flawed because non-expert judges tend to prefer more fluent responses over more accurate or specialized ones.”