For most AI products, a small number of 'LLM as a judge' evals, typically between four and seven, ..., Sonic AI
“For most AI products, a small number of 'LLM as a judge' evals, typically between four and seven, is sufficient to cover the most critical failure modes.”