The key to shipping successful AI applications is moving from subjective, "vibe-based" testing to a systematic, metric-driven evaluation framework.
Rapid iteration is critical, and structured evaluations are the primary enabler, allowing teams to test and improve their applications daily rather than weekly or monthly.
The recommended approach is a phased maturity model: start with manual evaluation involving domain experts to define quality, then use LLMs to categorize failures, and only then build fully automated LLM-based judges.
Tools like Weights & Biases' Weave platform are essential for this workflow, providing tracing, observability, and a framework for both manual and automated evaluations (see the sketch after these takeaways).
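As an illustration of what the final, automated stage can look like, here is a minimal sketch using the Weave Python SDK: a traced application function is scored by an LLM-based judge inside `weave.Evaluation`. The project name, dataset, model choice, and 1-5 helpfulness rubric are assumptions made for the example, and scorer argument names (e.g. `output`) can differ between Weave SDK versions.

```python
import asyncio
import weave
from openai import OpenAI

client = OpenAI()
weave.init("support-bot-evals")  # hypothetical project name


@weave.op()
def answer_question(question: str) -> str:
    """The application under test: a single LLM call, traced by Weave."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; swap for whatever the app uses
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content


@weave.op()
def llm_judge(question: str, output: str) -> dict:
    """Automated LLM-based judge: rates an answer 1-5 for helpfulness."""
    prompt = (
        f"Question: {question}\nAnswer: {output}\n"
        "Rate the answer's helpfulness from 1 (useless) to 5 (excellent). "
        "Reply with only the number."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    # Expects a bare number back; a production judge would parse defensively.
    return {"helpfulness": int(resp.choices[0].message.content.strip())}


# A tiny, manually curated dataset -- in practice these examples and their
# quality criteria come out of the domain-expert review phase.
examples = [
    {"question": "How do I reset my password?"},
    {"question": "What is your refund policy?"},
]

evaluation = weave.Evaluation(dataset=examples, scorers=[llm_judge])
asyncio.run(evaluation.evaluate(answer_question))
```

Because both the application call and the judge are decorated with `@weave.op()`, every evaluation run is traced, so individual failures can still be inspected by hand rather than trusting the aggregate score alone.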
Concerns Raised
Teams getting stuck in slow, subjective 'vibe-based' development cycles.
The difficulty of defining and measuring 'good' performance without input from domain experts.
The high cost of running large-scale evaluations, which can stifle iteration.
Prematurely adopting automated evaluations before establishing a solid, manually validated baseline of quality.
Opportunities Identified
Accelerating AI application deployment by implementing a systematic evaluation framework.
Using observability and tracing tools like Weave to debug and understand complex AI systems.
Leveraging LLMs to automate tedious parts of the evaluation process, such as categorizing qualitative feedback (see the sketch after this list).
Creating a tight feedback loop between developers and domain experts to improve application quality.
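As a sketch of the categorization idea above, the following uses an LLM to map free-text failure notes from manual review onto a fixed failure taxonomy, then counts them so qualitative expert feedback becomes a trackable metric. The notes, category list, and model name are all hypothetical.

```python
from collections import Counter
from openai import OpenAI

client = OpenAI()

# Hypothetical free-text failure notes collected during manual review
# with domain experts.
failure_notes = [
    "The bot quoted last year's pricing page.",
    "Answer was correct but far too long for a chat reply.",
    "It ignored the second half of the user's question.",
]

# Assumed failure taxonomy -- in practice this list is distilled from the
# manual evaluation phase.
CATEGORIES = ["stale information", "verbosity", "incomplete answer", "other"]


def categorize(note: str) -> str:
    """Ask an LLM to map one free-text failure note onto the taxonomy."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model
        messages=[{
            "role": "user",
            "content": (
                f"Failure note: {note}\n"
                f"Pick the single best category from: {', '.join(CATEGORIES)}. "
                "Reply with the category name only."
            ),
        }],
    )
    label = resp.choices[0].message.content.strip().lower()
    # Fall back to "other" if the model replies with anything off-taxonomy.
    return label if label in CATEGORIES else "other"


counts = Counter(categorize(n) for n in failure_notes)
print(counts.most_common())  # which failure modes dominate, as a trackable metric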