Lenny's Podcast • Sep 25, 2025 • 1:46:32 • Interview

Why AI evals are the hottest new skill for product builders | Hamel Husain & Shreya Shankar

From Lenny's Podcast

Lenny Rachitsky (Host) • Shreya Shankar (Guest) • Hamel Husain (Guest)

Executive Summary

Building 'evals' (systematic evaluations) is the highest ROI activity for improving AI products and is becoming a core competency for product and engineering teams, according to CPOs at OpenAI and Anthropic.
The recommended process starts with manual, human-led error analysis ('open' and 'axial' coding) on production data to identify key failure modes.
These identified failure modes can then be automated for scalable testing using an 'LLM as a judge'—a separate LLM prompted to give a binary pass/fail verdict on a single, narrow issue.
Effective evals are not just for pre-production unit tests; they are crucial for online monitoring of live applications, providing a continuous feedback loop on real-world performance.
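
The error-analysis step summarized above can be sketched roughly as follows. This is a minimal illustration, not the speakers' tooling: the trace IDs, the free-form notes, and the category names are all hypothetical. Open coding produces one note per reviewed trace, axial coding groups those notes into failure-mode categories, and a simple count shows which categories deserve an eval first.

```python
from collections import Counter

# Hypothetical open-coding notes: one free-form observation per reviewed trace.
open_codes = {
    "trace-1": "user asked for a person, assistant kept answering itself",
    "trace-2": "quoted a price that is not in the catalog",
    "trace-3": "ignored explicit request to talk to an agent",
    "trace-4": "reply was fine",
}

# Hypothetical axial codes: the reviewer assigns each problematic note
# to one failure-mode category (traces without problems are left out).
axial_codes = {
    "trace-1": "missed human handoff",
    "trace-2": "made-up product detail",
    "trace-3": "missed human handoff",
}


def rank_failure_modes(axial):
    """Return (category, count) pairs, most frequent first."""
    return Counter(axial.values()).most_common()


ranking = rank_failure_modes(axial_codes)
for category, count in ranking:
    print(f"{count:>3}  {category}")
```

The most frequent categories in such a ranking are the natural candidates for automated checks.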

Concerns Raised

Teams can get bogged down in 'design by committee' for evals instead of empowering a single domain expert.
A common misconception is that an AI can evaluate itself without rigorous, human-led validation.
Poorly implemented evals (e.g., using ambiguous 1-5 scales) can be unactionable and erode team trust in the process.
Developers' criteria for 'good' and 'bad' outputs can shift over time, requiring recalibration.

Opportunities Identified

Systematically measure and improve AI product performance with confidence, moving beyond unreliable 'vibe checks'.
Create automated test suites for AI that can be integrated into CI/CD pipelines and used for online production monitoring.
Analyzing application data to build evals is described as the single highest ROI activity for AI product teams.
A small number of well-designed evals (4-7) can cover the most critical failure modes for most products.
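
One way a small set of binary evals might gate a CI/CD pipeline is sketched below. The eval names, the verdict lists, and the 0.9 thresholds are assumptions made for illustration; the summary does not specify any of them.

```python
def pass_rate(verdicts):
    """Fraction of binary verdicts that passed (True means pass)."""
    if not verdicts:
        raise ValueError("no verdicts to score")
    return sum(verdicts) / len(verdicts)


def failing_evals(verdicts_by_eval, min_pass_rate):
    """Names of evals whose pass rate is below the required minimum."""
    return sorted(
        name
        for name, verdicts in verdicts_by_eval.items()
        if pass_rate(verdicts) < min_pass_rate[name]
    )


# Hypothetical results: one list of pass/fail verdicts per eval.
results = {
    "missed_human_handoff": [True, True, True, False],
    "wrong_tone": [True, True, True, True],
}
# Hypothetical thresholds chosen by the team.
thresholds = {"missed_human_handoff": 0.9, "wrong_tone": 0.9}

failed = failing_evals(results, thresholds)
print("failing evals:", failed)
```

A CI job could exit with a non-zero status when the returned list is non-empty, and the same function could run on sampled production traces for online monitoring.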

Key Themes

The Rise of Evals as a Core Skill

Evals are shifting from an obscure machine learning concept to a fundamental skill for building successful AI products. This is driven by the need to move beyond subjective 'vibe checks' to a systematic, data-driven methodology for measuring and improving the performance of non-deterministic AI systems.

This signifies a maturation in AI product development, where structured quality assurance and iterative improvement are becoming as critical as the underlying model technology itself.

LLM as a Judge: Automating Human Judgment

A key technique discussed is using an 'LLM as a judge' to automate evaluations. This involves writing a narrowly scoped prompt that asks an LLM to assess one specific failure mode and return a simple binary (pass/fail) output, effectively scaling the judgment of a human expert.

This approach provides a scalable way to create robust test suites for AI applications, enabling both CI/CD integration for pre-deployment checks and continuous online monitoring of production systems.
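
A minimal sketch of such a judge, under stated assumptions: the prompt wording is illustrative, and `call_llm` is a hypothetical stand-in for whatever model client a team uses (it takes a prompt string and returns the model's text). The point is the shape: one failure mode per judge, and a strictly binary verdict.

```python
from typing import Callable

# Illustrative prompt template; the wording is an assumption, not a quote.
JUDGE_PROMPT = """You are reviewing one assistant conversation for a single failure mode.
Failure mode: {failure_mode}

Conversation:
{transcript}

Answer with exactly one word: PASS or FAIL."""


def judge(transcript: str, failure_mode: str, call_llm: Callable[[str], str]) -> bool:
    """Return True for PASS and False for FAIL; reject any other output."""
    prompt = JUDGE_PROMPT.format(failure_mode=failure_mode, transcript=transcript)
    verdict = call_llm(prompt).strip().upper()
    if verdict not in {"PASS", "FAIL"}:
        raise ValueError(f"unexpected judge output: {verdict!r}")
    return verdict == "PASS"


# Usage with a fake model so the sketch runs without any API access.
def fake_llm(prompt: str) -> str:
    return " fail\n"


passed = judge(
    transcript="User: Can I talk to a person?\nAssistant: Here are our opening hours.",
    failure_mode="The assistant fails to hand off when the user asks for a human.",
    call_llm=fake_llm,
)
print(passed)
```

Rejecting anything other than PASS or FAIL keeps the output unambiguous, which matches the binary verdict the speakers recommend.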

The Human-in-the-Loop Foundation

Despite the goal of automation, the entire evaluation process is rooted in human expertise. It begins with a domain expert (a 'benevolent dictator,' often the PM) manually analyzing data to define what 'good' looks like, and this human-labeled data serves as the ground truth for validating any automated 'LLM judge'.

This highlights that technology alone is insufficient. Deep product context and human judgment are essential for defining quality and ensuring that automated metrics accurately reflect user-facing product performance.

The Discipline of Effective Evaluation

The speakers emphasize that doing evals correctly requires discipline. Common pitfalls include using ambiguous multi-point rating scales instead of binary outputs and blindly trusting an LLM judge without validating its performance against human labels using a confusion matrix.

Poorly designed evals can be misleading, destroy trust, and lead to wasted effort. A rigorous, validated approach is necessary to create a reliable feedback signal that teams can act on to improve the product.
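
Validating a judge against human labels with a confusion matrix might look like the sketch below. The label lists are made-up data, and `True` is assumed to mean "pass". Looking at the two agreement rates separately is useful because raw agreement can look high when failures are rare, even if the judge misses most of them.

```python
def confusion_matrix(human, judge):
    """Count agreement cells, treating True as 'pass' and human labels as ground truth."""
    if len(human) != len(judge):
        raise ValueError("label lists must have the same length")
    cells = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for h, j in zip(human, judge):
        if h and j:
            cells["tp"] += 1   # both say pass
        elif not h and j:
            cells["fp"] += 1   # judge passes what the human failed
        elif not h and not j:
            cells["tn"] += 1   # both say fail
        else:
            cells["fn"] += 1   # judge fails what the human passed
    return cells


# Hypothetical labels for the same eight traces.
human_labels = [True, True, True, False, False, True, False, True]
judge_labels = [True, True, False, False, True, True, False, True]

m = confusion_matrix(human_labels, judge_labels)
true_pass_rate = m["tp"] / (m["tp"] + m["fn"])   # agreement on human 'pass' traces
true_fail_rate = m["tn"] / (m["tn"] + m["fp"])   # agreement on human 'fail' traces
print(m, round(true_pass_rate, 2), round(true_fail_rate, 2))
```

If either rate is low, the judge prompt can be revised and re-checked against the same human labels before it is trusted in CI or monitoring.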

Topics

AI Product Development, LLM Evaluation, Evals, LLM as a Judge, Error Analysis, Open Coding, Axial Coding, AI Quality Assurance, Production Monitoring, CI/CD for AI, Model Validation, Human-in-the-Loop, Product Management for AI, Hamel Husain, Shreya Shankar

Processed Apr 3, 2026 • yt-dlp + mlx-whisper + Gemini