Building eval systems that improve your AI product

SUMMARY

Hamel Husain and Shreya Shankar outline a playbook for AI evaluation systems, emphasizing error analysis to identify failures, building reliable evaluators, and operationalizing those evaluators for continuous product improvement.

STATEMENTS

  • Many AI evaluation dashboards fail to drive product improvements because their metrics are disconnected from real user problems, leading to ignored outputs.
  • Error analysis is essential to ground evaluations in reality by systematically identifying how AI products fail in specific contexts, rather than starting with generic metrics like hallucination or toxicity.
  • Designating a single principal domain expert as the quality arbiter provides consistent judgments and avoids conflicts from multiple annotators in smaller teams.
  • Open coding involves a domain expert reviewing user interactions, writing free-form critiques, and assigning binary pass or fail judgments to uncover unexpected issues.
  • For passes in open coding, explain why the AI met the user's primary need despite potential improvements; for failures, detail why the main objective was not achieved.
  • Axial coding transforms chaotic critiques into a prioritized taxonomy of failure modes by grouping similar errors, ideally limiting the taxonomy to fewer than 10 categories for manageability (a data-structure sketch follows this list).
  • LLMs can assist in initial categorization during axial coding but require human expert review to capture nuances and avoid over-categorization.
  • Objective failure modes should use code-based evaluators for deterministic checks, while subjective ones require LLM-as-a-judge systems trained on human-labeled data.
  • To build trustworthy LLM judges, create ground-truth datasets with binary labels and detailed critiques from the domain expert, avoiding Likert scales for clarity.
  • Evaluate complex systems like multi-turn conversations at the session level first, then isolate root causes by attempting to reproduce failures in single turns.
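
To make the open and axial coding steps concrete, here is a minimal Python sketch of how the expert's records might be kept; the Annotation dataclass, its field names, and the example traces are illustrative assumptions rather than artifacts from the episode.

    from collections import Counter
    from dataclasses import dataclass

    # One open-coding record per reviewed trace: a binary judgment plus the
    # free-form critique written by the principal domain expert.
    @dataclass
    class Annotation:
        trace_id: str
        passed: bool            # binary pass/fail, no Likert scales
        critique: str           # why it passed or failed, in the expert's words
        failure_mode: str = ""  # filled in later, during axial coding

    annotations = [
        Annotation("t-001", False, "Did not hand off to a human when asked.", "handoff"),
        Annotation("t-002", True, "Rescheduled the tour correctly and confirmed the time."),
        Annotation("t-003", False, "Lost the unit number mentioned two turns earlier.", "context"),
    ]

    # Axial coding: group failures into a small taxonomy and rank by frequency.
    failure_counts = Counter(a.failure_mode for a in annotations if not a.passed)
    for mode, count in failure_counts.most_common():
        print(f"{mode}: {count} failure(s)")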

IDEAS

  • Starting evaluations with fashionable metrics like hallucination often means tracking irrelevant scores that don't align with actual user pain points in the product.
  • Appointing a "benevolent dictator" as the principal domain expert eliminates annotation paralysis and ensures deeply informed, consistent quality signals.
  • Reviewing AI outputs through error analysis helps teams discover their true success criteria, as people struggle to specify complete AI requirements upfront.
  • Binary pass/fail judgments with detailed critiques capture nuance more effectively than subjective scales, forcing clear definitions of quality bars.
  • Off-the-shelf metrics like toxicity can be repurposed to sort traces and reveal hidden patterns, rather than being reported directly on dashboards.
  • Splitting ground-truth data into train, dev, and test sets prevents overfitting in LLM judges, allowing unbiased final performance measurement.
  • Focusing on true positive and true negative rates rather than accuracy prevents misleading evaluations in imbalanced datasets where passes dominate (see the sketch after this list).
  • For conversations, diagnosing failures by first testing in single turns distinguishes simple knowledge gaps from true contextual memory issues.
  • In RAG pipelines, prioritizing retriever recall over precision ensures generators have access to necessary facts, as LLMs can't invent missing information.
  • A transition failure matrix visualizes breakdowns in agent workflows, turning complex debugging into targeted analysis of state transitions.
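
The accuracy trap is easy to see with a tiny illustrative calculation; the 99-to-1 split mirrors the episode's "always predicts pass" example, but the code itself is only a sketch.

    # Illustrative only: a judge that always predicts "pass" on an imbalanced set
    # of 99 real passes and 1 real failure.
    expert = [True] * 99 + [False]   # ground-truth labels from the domain expert
    judge = [True] * 100             # a lazy judge that always says "pass"

    accuracy = sum(e == j for e, j in zip(expert, judge)) / len(expert)

    tp = sum(e and j for e, j in zip(expert, judge))          # real passes labeled pass
    tn = sum(not e and not j for e, j in zip(expert, judge))  # real failures labeled fail
    tpr = tp / sum(expert)                   # true positive rate
    tnr = tn / (len(expert) - sum(expert))   # true negative rate

    print(f"accuracy={accuracy:.2f}  TPR={tpr:.2f}  TNR={tnr:.2f}")
    # accuracy=0.99 but TNR=0.00: the judge never catches a single failure.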

INSIGHTS

  • Effective AI evaluations must originate from user-specific failure modes identified through error analysis, ensuring metrics directly address real-world product weaknesses rather than abstract ideals.
  • A single domain expert's consistent judgments form the bedrock of reliable evals, streamlining decisions and revealing nuanced quality standards that emerge iteratively.
  • Binary labeling paired with explanatory critiques provides scalable precision for subjective assessments, avoiding the vagueness of graduated scales while preserving essential context.
  • Rigorous validation of LLM judges using split datasets and error-rate corrections builds team trust, transforming automated evals into dependable tools for product iteration (a correction sketch follows this list).
  • Disaggregating evaluations in multi-component systems like RAG or agents prevents conflating issues, enabling precise fixes that amplify overall system reliability.
  • Integrating evals into CI pipelines and production monitoring creates a self-reinforcing flywheel, catching regressions early and fostering sustained AI product evolution.
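
One common way to apply such an error-rate correction is the standard prevalence-correction formula sketched below; the episode does not spell out an exact equation, so the function and the example numbers are assumptions.

    def corrected_pass_rate(observed_pass_rate: float, tpr: float, tnr: float) -> float:
        """Estimate the true pass rate from an LLM judge's observed pass rate.

        Uses the standard correction p_true = (p_obs + TNR - 1) / (TPR + TNR - 1),
        where TPR and TNR are measured on the held-out test split of expert labels.
        """
        estimate = (observed_pass_rate + tnr - 1.0) / (tpr + tnr - 1.0)
        return min(1.0, max(0.0, estimate))  # clamp to a valid proportion

    # Example: the judge reports a 90% pass rate in production, and on the test
    # split it showed TPR = 0.95 and TNR = 0.80.
    print(corrected_pass_rate(0.90, tpr=0.95, tnr=0.80))  # ~0.93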

QUOTES

  • "You cannot know what to measure until you systematically find out how your product fails in specific contexts."
  • "Binary decisions force clarity and output either meets the quality bar or it does not."
  • "A judge that always predicts pass will be 99% accurate, but it will never catch a single failure."
  • "Recall is paramount because if the correct information is not retrieved, the generator has no chance of producing a correct answer."
  • "This transforms the overwhelming task of debugging a complex agent into a focused data-driven investigation."

HABITS

  • Designate a principal domain expert to serve as the consistent arbiter of AI quality judgments across evaluations.
  • Regularly sample around 100 representative user interactions for error analysis, starting with random selection to build intuition.
  • Perform open coding by writing detailed critiques for each interaction, ensuring explanations are clear enough for new employees or LLM prompts.
  • Use spreadsheets or annotation tools to tag and group critiques during axial coding, reviewing LLM suggestions manually for accuracy.
  • Split human-labeled data into train, dev, and test sets when building LLM judges to iteratively refine prompts without overfitting, as sketched below.
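
A minimal sketch of that split, assuming the expert-labeled traces are already collected in a list; the 40/30/30 proportions and the function name are illustrative.

    import random

    def split_labels(labeled_traces, seed=42, train=0.4, dev=0.3):
        """Shuffle expert-labeled traces and split them into train/dev/test.

        Train examples can seed the judge prompt as few-shot examples, dev is for
        iterating on the prompt, and test is touched once for the final score.
        """
        items = list(labeled_traces)
        random.Random(seed).shuffle(items)
        n_train, n_dev = int(len(items) * train), int(len(items) * dev)
        return (
            items[:n_train],                 # train: few-shot examples
            items[n_train:n_train + n_dev],  # dev: prompt iteration
            items[n_train + n_dev:],         # test: unbiased final measurement
        )

    train_set, dev_set, test_set = split_labels(range(100))
    print(len(train_set), len(dev_set), len(test_set))  # 40 30 30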

FACTS

  • Hamel Husain and Shreya Shankar have trained over 2,000 product managers, engineers, and leaders from more than 500 companies in AI evaluations.
  • Their course, AI Evals for Engineers & PMs, is the highest-grossing on Maven, attracting students from major AI labs like OpenAI and Anthropic.
  • Research indicates that people are poor at specifying complete AI requirements upfront, with true criteria emerging only through reviewing outputs.
  • Modern large language models excel at ignoring irrelevant noise in context but cannot generate correct answers from entirely missing information.
  • In imbalanced AI datasets where successes vastly outnumber failures, accuracy can reach 99% while the evaluator fails to detect a single real issue.

REFERENCES

  • Aman Khan's post on evals (linked in show notes).
  • AI Evals for Engineers & PMs course on Maven.
  • A Field Guide to Rapidly Improving AI Products by Hamel Husain.
  • Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences (arXiv paper).
  • Lenny's Newsletter: Beyond vibe checks: A PM’s complete guide to evals.
  • Hamel Husain's blog: Frequently Asked Questions About AI Evals.
  • Hamel Husain's blog: Not Dead Yet: On RAG.

HOW TO APPLY

  • Select a representative dataset of about 100 user interactions, starting with random sampling or focusing on negative feedback traces to begin error analysis.
  • Have the principal domain expert conduct open coding by reviewing each interaction, assigning a binary pass or fail, and writing detailed critiques explaining the judgment.
  • Perform axial coding by grouping similar critiques into categories, using a spreadsheet to tag errors and aiming for under 10 primary failure modes, with human validation of any LLM-assisted grouping.
  • For each prioritized failure mode, determine whether it's objective (use code-based checks like JSON validation; see the sketch after these steps) or subjective (build an LLM-as-judge with prompts grounded in expert critiques).
  • Operationalize the evaluation suite by integrating it into CI pipelines for pre-shipment checks and production monitoring for ongoing tracking, creating a flywheel for continuous improvement.
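
As an illustration of wiring an objective, code-based evaluator into CI, here is a minimal pytest-style sketch; the specific check (valid JSON containing a user_id, echoing the missing-user-ID example) and the sample trace are hypothetical.

    import json

    def tool_call_is_valid(raw_output: str) -> bool:
        """Objective, deterministic check: the model's tool call must parse as
        JSON and must include a user_id field."""
        try:
            payload = json.loads(raw_output)
        except json.JSONDecodeError:
            return False
        return "user_id" in payload

    # A pytest-style regression test that CI can run before every release.
    def test_tool_call_includes_user_id():
        raw_output = '{"action": "reschedule_tour", "user_id": "u-123"}'  # canned example trace
        assert tool_call_is_valid(raw_output)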

ONE-SENTENCE TAKEAWAY

Build AI evaluations through error analysis and reliable judges to drive targeted, continuous product enhancements.

RECOMMENDATIONS

  • Avoid generic metrics like hallucination; instead, use error analysis to identify product-specific failure modes first.
  • Appoint a single principal domain expert early to establish a consistent quality standard and streamline annotations.
  • Opt for binary pass/fail labels over Likert scales to enforce clear judgments while capturing nuance in critiques.
  • Validate LLM judges by measuring true positive and negative rates on held-out test data, correcting for known biases.
  • For complex systems like RAG, evaluate retriever recall separately from generator faithfulness to pinpoint breakdowns accurately (see the sketch below).
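
To evaluate the retriever in isolation, a common approach is recall@k against the documents the expert marked as necessary for each query; the sketch below assumes such labels exist, and the document IDs are invented.

    def recall_at_k(retrieved_ids, relevant_ids, k=5):
        """Fraction of the required documents that appear in the top-k results.
        High recall matters most: the generator cannot invent facts that were
        never retrieved."""
        if not relevant_ids:
            return 1.0  # nothing was required, so nothing could be missed
        top_k = set(retrieved_ids[:k])
        return len(top_k & set(relevant_ids)) / len(relevant_ids)

    # Example: two of the three required documents made it into the top 5.
    print(recall_at_k(["d7", "d2", "d9", "d1", "d4"], {"d2", "d4", "d8"}))  # ~0.67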

MEMO

In the rapidly evolving world of AI product development, evaluations have emerged as a linchpin for success, yet most dashboards remain ornamental relics, divorced from the gritty realities of user experience. Hamel Husain and Shreya Shankar, veterans who have schooled over 2,000 professionals at more than 500 companies, argue that true progress demands a shift from vanity metrics to rigorous error analysis. Their playbook, honed while teaching practitioners from major AI labs like OpenAI and Anthropic, begins with understanding failure rather than chasing trendy benchmarks like toxicity rates that often miss the mark on actual user frustrations.

The foundation lies in appointing a "principal domain expert," a singular voice of authority (be it a psychologist for mental health bots or a lawyer for legal tools) to cut through subjective noise. Armed with roughly 100 sampled user interactions, this expert embarks on open coding: a structured journaling of critiques, judging each output pass or fail with exhaustive explanations. In one vivid example from an apartment leasing assistant, a bot's failure to escalate to a human when demanded reveals deeper handoff flaws, while successful rescheduling confirms core functionality. This process unearths unanticipated desires, as research underscores humanity's knack for vague AI specs that sharpen only through iteration.

From these raw insights springs axial coding, alchemizing disorder into a taxonomy of failures (conversation flow hiccups, handoff lapses, rescheduling snags) prioritized by frequency. Beware the pitfalls: over-reliance on LLMs for categorization risks nuance blindness, and category proliferation dilutes focus. With under 10 modes charted, teams pivot to building evaluators. Objective bugs, like missing user IDs, yield to code assertions, swift and deterministic. Subjective realms, such as tonal misfits, summon LLM judges, meticulously aligned via human-grounded datasets split into training, development, and test cohorts to sidestep overfitting.
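
A hedged sketch of what an aligned judge prompt might look like; the wording, the leasing-assistant quality bar, and the few-shot format are illustrative, not the authors' actual template.

    # Illustrative only: the quality bar and the few-shot examples come from the
    # principal domain expert's binary labels and critiques (the train split).
    TRAIN_EXAMPLES = [
        ("Rescheduled the tour correctly and confirmed the new time.", "PASS"),
        ("Kept chatting instead of handing off to a human when asked.", "FAIL"),
    ]

    JUDGE_PROMPT = """You are judging a leasing assistant's reply to a renter.

    Quality bar (from the domain expert): the reply must address the renter's
    primary request, and must hand off to a human whenever one is requested.

    Labeled examples:
    {examples}

    Evaluate the interaction below. Answer with exactly PASS or FAIL, then give
    a one-sentence critique explaining the judgment.

    Interaction:
    {interaction}
    """

    def build_judge_prompt(interaction: str) -> str:
        examples = "\n".join(f"- {c} -> {label}" for c, label in TRAIN_EXAMPLES)
        return JUDGE_PROMPT.format(examples=examples, interaction=interaction)

    print(build_judge_prompt("Renter: I want to talk to a person. Bot: Sure, connecting you now."))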

For intricate architectures, bespoke strategies prevail. Multi-turn dialogues demand session-level verdicts, with single-turn recreations isolating contextual collapses from mere knowledge voids. RAG pipelines fracture into retriever scrutiny—prioritizing recall to ensure vital documents surface—and generator checks for faithfulness, lest hallucinations derail even solid retrievals. Agents, those chain-reacting marvels, benefit from transition failure matrices, spotlighting breakdowns in workflows like SQL generation to execution, rendering chaos diagnosable.
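
A minimal sketch of a transition failure matrix, assuming each failing agent trace has been tagged with the step where it last succeeded and the step where it broke; the state names echo the SQL example but are otherwise invented.

    from collections import Counter

    # Each failing agent trace is tagged with (last good state, state where it broke).
    failing_transitions = [
        ("plan_query", "generate_sql"),
        ("generate_sql", "execute_sql"),
        ("generate_sql", "execute_sql"),
        ("execute_sql", "summarize_result"),
    ]

    # The matrix is just a count per transition; the hot spots show where to dig.
    matrix = Counter(failing_transitions)
    for (src, dst), count in matrix.most_common():
        print(f"{src} -> {dst}: {count} failure(s)")
    # generate_sql -> execute_sql dominates here, so that hop gets debugged first.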

Operationalizing this suite forges a virtuous cycle: CI safety nets snag regressions pre-launch, while production oversight fuels perpetual refinement. Husain and Shankar's counsel resonates in an era where AI's promise hinges on trust—evals not as afterthoughts, but as the engine of enduring improvement, bridging the chasm between code and human need.
