Project · AI Evals
A field guide to measuring whether your LLM-powered product actually works — and how to make it better, deliberately, over time.
A few years ago, “does it work?” was a yes/no question. Software either ran or it didn’t. With LLMs, that’s no longer true.
AI systems fail probabilistically. The same input can produce a great answer one day and a wrong one the next. The bug isn’t a stack trace — it’s a tone shift, a hallucinated citation, an unhelpful refusal, a confidently wrong number. Traditional QA can’t catch this. You need a different muscle.
That muscle is evaluations: structured, repeatable measurements of what a system produces against what you wanted it to produce. They’re the closest thing AI engineering has to unit tests, but they look more like product research than software testing — and the practice of running them well is its own discipline.
This page is a field guide. The framework comes from work on Reddit Answers; the patterns generalize.
Most arguments about “evals” are actually arguments about scope. There are three, and they answer different questions.
Capability evals: is the model smart?
MMLU, HumanEval, GPQA, GSM8K. Generic, public, run by model providers. Useful for selecting a model — close to useless for product decisions.
Application evals: does it work for our use case?
Custom prompt bank, written rubric, LLM-as-judge or human review. Owned by the product team. Run before each ship. The Reddit Answers work was here.
Online evals: does it work in production?
Sampled real traffic, user signals, continuous scoring. Catches drift, distribution shift, and novel failure modes the offline bank never imagined.
Capability evals tell you what to buy. Online evals tell you what to fix. Application evals — the middle scope — are where most product teams should live.
The application eval loop is six steps: build a prompt bank from real use cases, write a rubric for each item before any model runs, run the system against the bank, score the outputs with a calibrated judge, slice the results, then change one thing and re-run. Almost every product team building on LLMs ends up at something close to this shape.
Don’t aggregate to one number. Score by slice — topic, query length, intent. The mean is rarely the interesting number; the slice that’s failing is.
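As a sketch of what slice-level scoring looks like in practice, assuming eval results are plain records carrying slice tags and an already-computed judge score (the field names here are illustrative):

```python
from collections import defaultdict
from statistics import mean

# Illustrative eval records: each carries its slice tags and a judge score.
results = [
    {"topic": "finance", "intent": "factual", "faithfulness": 0.92},
    {"topic": "finance", "intent": "opinion", "faithfulness": 0.61},
    {"topic": "hobbies", "intent": "factual", "faithfulness": 0.95},
]

def score_by_slice(records, key):
    """Group eval records by a slice key and average each group separately."""
    buckets = defaultdict(list)
    for r in records:
        buckets[r[key]].append(r["faithfulness"])
    return {name: mean(scores) for name, scores in buckets.items()}

print(score_by_slice(results, "topic"))
# {'finance': 0.765, 'hobbies': 0.95} -- the overall mean (~0.83) hides
# that the finance/opinion slice is the real failure.
```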
The hardest part of an eval system isn’t the loop — it’s the judge. The judge is itself a model. It has biases. It has to be calibrated before its scores can be trusted.
Rather than “rate quality 1–5,” score along separate dimensions: faithfulness (grounded in source), answer quality (addresses the ask), tone & format (matches spec), safety (refusal correctness). Single-number scores hide which dimension is failing — which is the whole reason you’re evaluating.
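A sketch of what a per-dimension judge can look like; the prompt wording, the 1–5 scale, and the call_llm wrapper are all illustrative assumptions rather than a fixed recipe:

```python
import json

JUDGE_PROMPT = """You are grading an AI answer against a rubric.

Question: {question}
Source material: {sources}
Answer: {answer}

Score each dimension independently from 1 (bad) to 5 (good):
- faithfulness: every claim is grounded in the source material
- answer_quality: the answer addresses what was actually asked
- tone_format: matches the product's voice and formatting spec
- safety: refuses when it should, and only when it should

Respond with JSON only, for example:
{{"faithfulness": 4, "answer_quality": 5, "tone_format": 3, "safety": 5}}"""

def judge(question, sources, answer, call_llm):
    """call_llm is a stand-in for whatever model client you already have."""
    raw = call_llm(JUDGE_PROMPT.format(
        question=question, sources=sources, answer=answer))
    return json.loads(raw)  # one score per dimension, never a single blended number
```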
Comparing prompt v3 to v4? Ask the judge “is A better than B?” rather than scoring each independently. Pairwise is far less noisy and the right shape for tuning.
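One way to make pairwise judging robust, sketched below: ask twice with the positions swapped, since LLM judges tend to favor whichever answer appears first. Again, the prompt and call_llm are stand-ins:

```python
PAIRWISE_PROMPT = """Question: {question}

Answer A: {a}

Answer B: {b}

Which answer better satisfies the rubric? Reply with exactly "A" or "B"."""

def pairwise_judge(question, old, new, call_llm):
    """Ask twice with positions swapped to cancel out position bias."""
    first = call_llm(PAIRWISE_PROMPT.format(question=question, a=old, b=new)).strip()
    second = call_llm(PAIRWISE_PROMPT.format(question=question, a=new, b=old)).strip()
    if first == "B" and second == "A":
        return "new"   # the new answer won from both positions
    if first == "A" and second == "B":
        return "old"   # the old answer won from both positions
    return "tie"       # the verdict flipped with position: treat as a tie
```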
The fix is calibration. Take ~100 eval items, have a human label them, and check that the judge agrees with the human at least 80% of the time. If it doesn’t, the judge prompt is the problem — not the system you’re trying to test.
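The agreement check itself is only a few lines. A toy version, with made-up labels standing in for the ~100-item calibration set:

```python
def judge_agreement(human_labels, judge_labels):
    """Fraction of calibration items where the judge matches the human label."""
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return matches / len(human_labels)

# Toy data; in practice these are ~100 human-labeled eval items.
human = ["pass", "fail", "pass", "pass", "fail"]
judge = ["pass", "fail", "fail", "pass", "fail"]

agreement = judge_agreement(human, judge)
print(f"{agreement:.0%}")  # 80% -- right at the bar; below it, fix the judge prompt
```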
Reddit Answers retrieves real Reddit content and summarizes it through an LLM. We built an evaluation system around it from scratch. The shape of that system is the most useful artifact I kept from the project.
Around 200 hand-written questions sampled across topics — finance, hobbies, controversial debate, niche communities, evergreen explainers. Each had a written rubric for what a good answer looked like before any model ran. This step is the one most teams skip; it’s also the one that pays off most.
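The shape of a bank item matters more than the tooling around it. An illustrative structure, not the actual Reddit Answers schema:

```python
from dataclasses import dataclass, field

@dataclass
class EvalItem:
    """One prompt-bank entry; the rubric is written before any model runs."""
    question: str
    topic: str          # slice key: finance, hobbies, debate, niche, evergreen
    rubric: str         # what a good answer looks like, in plain prose
    tags: list[str] = field(default_factory=list)

bank = [
    EvalItem(
        question="Is a Roth IRA worth it for a 25-year-old?",
        topic="finance",
        rubric="Explains the tax trade-off, hedges rather than prescribes, "
               "and stays grounded in the retrieved threads.",
        tags=["evergreen", "opinion-adjacent"],
    ),
]
```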
Initial agreement with human reviewers was lower than we wanted. Tightening the rubric prompt and shifting to pairwise comparison brought judge–human agreement up to a level we trusted. The judge prompt itself became a versioned artifact, evaluated alongside the system it was scoring.
Failing answers fell into three recurring buckets.
The eval bank ran on every candidate change. Regressions on critical slices — faithfulness on factual questions, tone on opinion ones — blocked release. Wins on one slice that broke another were the most common, and most important, finding.
The eval set wasn’t just a quality check — it became the spec. The team stopped arguing about “is the answer good?” and started arguing about “what should the rubric say?” A much more productive conversation.
Evals are diagnostic. They only matter if they’re wired to a change. There are five layers of change worth knowing — lighter on top, heavier and more compounding below.
Most teams over-iterate on the application layer because it’s fast, and underinvest in the bottom three. The durable wins live below the application layer — in data flywheels and production discipline.
The loop is the same regardless of the product. The shape of the prompt bank, the rubric, and the dominant change lever vary.
Support bot
Bank — real ticket transcripts, sampled across product areas.
Rubric — resolution, escalation appropriateness, voice adherence, policy faithfulness.
Common lever — routing. Categorize intent first, then specialize the prompt or knowledge base.
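A minimal sketch of the routing lever, assuming an intent classifier and an LLM client already exist (classify and call_llm are hypothetical stand-ins):

```python
# Hypothetical specialized prompts keyed by intent.
INTENT_PROMPTS = {
    "billing": "You are a billing specialist. Follow refund policy strictly.",
    "bug":     "You are a technical triage agent. Ask for repro steps.",
    "account": "You are an account-security agent. Never reset without ID proof.",
}

def route(ticket_text, classify, call_llm):
    """Classify intent first, then answer with the specialized prompt."""
    intent = classify(ticket_text)              # e.g. "billing"
    prompt = INTENT_PROMPTS.get(intent)
    if prompt is None:
        return {"action": "escalate", "reason": f"unhandled intent: {intent}"}
    return {"action": "answer",
            "text": call_llm(prompt + "\n\nTicket:\n" + ticket_text)}
```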
RAG over documents
Bank — questions paired with the source chunks that contain answers.
Rubric — faithfulness, retrieval recall, completeness.
Common lever — retrieval. Chunking, hybrid search, reranking. Generation is rarely the bottleneck; retrieval almost always is.
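Retrieval recall is the metric that makes that bottleneck visible. A minimal version, assuming each bank item stores the IDs of the gold chunks that contain the answer:

```python
def retrieval_recall(retrieved_ids, gold_ids, k=10):
    """Fraction of the gold chunks that show up in the top-k retrieved."""
    hits = set(retrieved_ids[:k]) & set(gold_ids)
    return len(hits) / len(gold_ids)

# One gold chunk found out of two: recall 0.5, and generation never had a chance.
print(retrieval_recall(["c12", "c07", "c99"], ["c07", "c31"], k=3))  # 0.5
```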
Summarization
Bank — source documents paired with reference summaries.
Rubric — coverage, faithfulness, length adherence, redundancy.
Common lever — prompt structure and length control. Summarization is unusually sensitive to few-shot examples.
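Length adherence is the one dimension here that needs no judge at all. A deterministic sketch, with the ±20% tolerance as an arbitrary choice:

```python
def length_adherence(summary, target_words, tolerance=0.2):
    """Pass if the summary lands within ±20% of the requested word count."""
    n = len(summary.split())
    return abs(n - target_words) <= tolerance * target_words

print(length_adherence("word " * 95, target_words=100))   # True
print(length_adherence("word " * 150, target_words=100))  # False
```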
The mature end-state is when evals stop being a project and start being a system component — wired into how the product ships. It usually arrives in four stages.
Regression gate. Every prompt, model, or retrieval change runs against the eval bank before merge. Regressions on critical slices block the deploy automatically — same shape as a unit test.
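In CI this reduces to a test. A pytest-shaped sketch; the slice names, dimensions, and the 0.02 regression budget are all assumptions:

```python
CRITICAL_SLICES = {"factual": "faithfulness", "opinion": "tone_format"}
REGRESSION_BUDGET = 0.02  # max drop on a critical slice before the merge is blocked

def critical_regressions(baseline, candidate):
    """baseline/candidate: {slice: {dimension: score}} from a full eval-bank run."""
    failures = []
    for slice_name, dim in CRITICAL_SLICES.items():
        drop = baseline[slice_name][dim] - candidate[slice_name][dim]
        if drop > REGRESSION_BUDGET:
            failures.append(f"{dim} dropped {drop:.2f} on the {slice_name} slice")
    return failures

def test_candidate_does_not_regress():
    baseline = {"factual": {"faithfulness": 0.91}, "opinion": {"tone_format": 0.84}}
    candidate = {"factual": {"faithfulness": 0.90}, "opinion": {"tone_format": 0.85}}
    assert critical_regressions(baseline, candidate) == []
```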
Shadow mode. The new configuration runs silently against real production queries. Outputs are scored but never shown to users. Confirms behavior under the real distribution before any user sees it.
Canary. Ship to 1–5% of traffic with online metrics tied to eval rubric dimensions. If the canary’s faithfulness score drops, auto-rollback before the change is widely visible.
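The rollback decision is a threshold check. A sketch, where both the absolute faithfulness floor and the allowed gap against control are assumed numbers:

```python
from statistics import mean

FAITHFULNESS_FLOOR = 0.85  # assumed absolute floor tied to the offline rubric
ALLOWED_GAP = 0.03         # assumed tolerable gap versus the control arm

def check_canary(canary_scores, control_scores):
    """Each list holds judged faithfulness scores for one experiment arm."""
    canary, control = mean(canary_scores), mean(control_scores)
    if canary < FAITHFULNESS_FLOOR or canary < control - ALLOWED_GAP:
        return "rollback"  # auto-rollback before the change is widely visible
    return "continue"

print(check_canary([0.88, 0.90, 0.87], [0.89, 0.91, 0.90]))  # continue
print(check_canary([0.80, 0.83, 0.79], [0.89, 0.91, 0.90]))  # rollback
```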
Continuous monitoring. Sample production traffic continuously, score it with the same judge, surface drift. The eval bank itself evolves from the production tail — failures from the wild become tomorrow’s regression tests.
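A minimal version of the sample-and-harvest step; the 1% rate and the failure threshold are placeholders:

```python
import random

def sample_for_scoring(queries, rate=0.01):
    """Bernoulli-sample ~1% of production traffic for the judge to score."""
    return [q for q in queries if random.random() < rate]

def harvest_failures(scored, floor=0.8):
    """Low-scoring production cases graduate into tomorrow's eval bank."""
    return [s for s in scored if s["faithfulness"] < floor]
```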
This is what separates teams treating AI as a product feature from teams treating it as infrastructure.
Things that look like eval problems but are usually something more specific.
The system you ship is whatever your evals incentivize. So the evals had better be honest.
An eval system is a mirror. It tells you what you’ve actually built — not what you wished you’d built. The teams that take that seriously end up with products that get measurably better over time. The teams that don’t end up with products that feel like they should be working, while users quietly route around them.
The framework here is illustrative, drawn from work on Reddit Answers and adjacent product builds. The patterns generalize, but every system has its own quirks worth respecting. The first prompt bank is always wrong. The first rubric is always too coarse. That’s fine — both are meant to be iterated.