Building 'evals' (systematic evaluations) is the highest ROI activity for improving AI products and is becoming a core competency for product and engineering teams, according to CPOs at OpenAI and Anthropic.
The recommended process starts with manual, human-led error analysis on production data: reading individual traces and tagging failures freeform ('open coding'), then grouping those tags into recurring failure modes ('axial coding').
These identified failure modes can then be automated for scalable testing using an 'LLM as a judge'—a separate LLM prompted to give a binary pass/fail verdict on a single, narrow issue (see the judge sketch below).
Effective evals are not just pre-production unit tests; they are equally important for online monitoring of live applications, providing a continuous feedback loop on real-world performance (see the monitoring sketch below).
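A minimal sketch of the judge pattern described above, assuming the OpenAI Python client (any provider works the same way); the failure mode, prompt wording, and model name are illustrative, not taken from the source:

```python
# Minimal binary LLM-as-judge sketch. One judge = one narrow failure mode.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are checking one narrow failure mode.

Failure mode: the response invents details not present in the input.

Input given to the assistant:
{user_input}

Assistant's response:
{response}

Answer with exactly one word: PASS or FAIL.
PASS = every factual detail in the response is grounded in the input.
FAIL = the response contains any detail not supported by the input."""


def judge_hallucinated_details(user_input: str, response: str) -> bool:
    """Return True (pass) / False (fail) for a single, narrow failure mode."""
    verdict = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; use whatever judge model you trust
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(user_input=user_input, response=response),
        }],
        temperature=0,
    )
    return verdict.choices[0].message.content.strip().upper().startswith("PASS")
```

Keeping the verdict binary, rather than a 1-5 scale, makes results directly actionable and easy to validate against human labels—the trust issue raised under Concerns below.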
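The same judge can power the online feedback loop. A sketch under the assumption that production traces live in some logging store, accessed here through a hypothetical fetch_recent_traces helper:

```python
import random


def fetch_recent_traces(limit: int) -> list[tuple[str, str]]:
    """Assumed helper returning recent (user_input, response) pairs;
    replace with your own logging/observability store."""
    raise NotImplementedError


def monitor_pass_rate(sample_size: int = 100) -> float:
    """Run the binary judge over a random sample of live traffic and
    report the pass rate for this one failure mode."""
    traces = fetch_recent_traces(limit=1000)
    sample = random.sample(traces, min(sample_size, len(traces)))
    passes = sum(
        judge_hallucinated_details(user_input, response)
        for user_input, response in sample
    )
    return passes / len(sample)
```

Tracking this pass rate over time is what turns a pre-ship eval into the continuous feedback loop described above.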
Concerns Raised
Teams can get bogged down in 'design by committee' for evals instead of empowering a single domain expert.
Misconception that an AI can evaluate itself without a rigorous, human-led validation process.
Poorly implemented evals (e.g., using ambiguous 1-5 scales) can be unactionable and erode team trust in the process.
Developers' criteria for 'good' and 'bad' outputs can shift over time, requiring recalibration.
Opportunities Identified
Systematically measure and improve AI product performance with confidence, moving beyond unreliable 'vibe checks'.
Create automated test suites for AI that can be integrated into CI/CD pipelines and used for online production monitoring (see the CI sketch after this list).
Analyzing application data to build evals is described as the single highest ROI activity for AI product teams.
A small number of well-designed evals (4-7) can cover the most critical failure modes for most products.
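As a sketch of the CI/CD integration above: a pytest regression suite that replays inputs mined from error analysis through the application and asserts the judge's binary verdict. generate_response is a hypothetical entry point into the app under test, and judge_hallucinated_details is the judge sketched earlier; the inputs are illustrative.

```python
import pytest


def generate_response(user_input: str) -> str:
    """Assumed entry point into the application under test."""
    raise NotImplementedError


# Inputs that previously triggered this failure mode, mined from error analysis.
REGRESSION_INPUTS = [
    "Where is my order #123?",
    "Cancel my subscription but keep my account.",
]


@pytest.mark.parametrize("user_input", REGRESSION_INPUTS)
def test_no_hallucinated_details(user_input):
    response = generate_response(user_input)
    assert judge_hallucinated_details(user_input, response), (
        f"Judge flagged hallucinated details for input: {user_input!r}"
    )
```

With one such suite per failure mode, the 4-7 evals mentioned above become a small set of parametrized test files that run on every merge.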