Project · AI Evals
A field guide to measuring whether your LLM-powered product actually works — and how to make it better, deliberately, over time.
A few years ago, “does it work?” was a yes/no question. Software either ran or it didn’t. With LLMs, that’s no longer true.
AI systems fail probabilistically. The same input can produce a great answer one day and a wrong one the next. The bug isn’t a stack trace — it’s a tone shift, a hallucinated citation, an unhelpful refusal, a confidently wrong number. Traditional QA can’t catch this. You need a different muscle.
That muscle is evaluations: structured, repeatable measurements of what a system produces against what you wanted it to produce. They’re the closest thing AI engineering has to unit tests, but they look more like product research than software testing — and the practice of running them well is its own discipline.
This page is a field guide. The framework comes from work on Reddit Answers; the patterns generalize.
Most arguments about “evals” are actually arguments about scope. There are three, and they answer different questions.
Capability evals: is the model smart?
MMLU, HumanEval, GPQA, GSM8K. Generic, public, run by model providers. Useful for selecting a model — close to useless for product decisions.
Application evals: does it work for our use case?
Custom prompt bank, written rubric, LLM-as-judge or human review. Owned by the product team. Run before each ship. The Reddit Answers work was here.
Online evals: does it work in production?
Sampled real traffic, user signals, continuous scoring. Catches drift, distribution shift, and novel failure modes the offline bank never imagined.
Capability evals tell you what to buy. Online evals tell you what to fix. Application evals — the middle scope — are where most product teams should live.
The application eval loop is six steps: build a prompt bank from real use cases, write a rubric for each item before any model runs, run the system against the bank, score the outputs with a calibrated judge, slice the results, then change one thing and re-run. Almost every product team building on LLMs ends up at something close to this shape.
Don’t aggregate to one number. Score by slice — topic, query length, intent. The mean is rarely the interesting number; the slice that’s failing is.
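As a sketch of what slice-level scoring looks like in practice, assuming eval results are plain records carrying slice tags and an already-computed judge score (the field names here are illustrative):

```python
from collections import defaultdict
from statistics import mean

# Illustrative eval records: each carries its slice tags and a judge score.
results = [
    {"topic": "finance", "intent": "factual", "faithfulness": 0.92},
    {"topic": "finance", "intent": "opinion", "faithfulness": 0.61},
    {"topic": "hobbies", "intent": "factual", "faithfulness": 0.95},
]

def score_by_slice(records, key):
    """Group eval records by a slice key and average each group separately."""
    buckets = defaultdict(list)
    for r in records:
        buckets[r[key]].append(r["faithfulness"])
    return {name: mean(scores) for name, scores in buckets.items()}

print(score_by_slice(results, "topic"))
# {'finance': 0.765, 'hobbies': 0.95} -- the overall mean (~0.83) hides
# that the finance/opinion slice is the real failure.
```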
The hardest part of an eval system isn’t the loop — it’s the judge. The judge is itself a model. It has biases. It has to be calibrated before its scores can be trusted.
Rather than “rate quality 1–5,” score along separate dimensions: faithfulness (grounded in source), answer quality (addresses the ask), tone & format (matches spec), safety (refusal correctness). Single-number scores hide which dimension is failing — which is the whole reason you’re evaluating.
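A sketch of what a per-dimension judge can look like; the prompt wording, the 1–5 scale, and the call_llm wrapper are all illustrative assumptions rather than a fixed recipe:

```python
import json

JUDGE_PROMPT = """You are grading an AI answer against a rubric.

Question: {question}
Source material: {sources}
Answer: {answer}

Score each dimension independently from 1 (bad) to 5 (good):
- faithfulness: every claim is grounded in the source material
- answer_quality: the answer addresses what was actually asked
- tone_format: matches the product's voice and formatting spec
- safety: refuses when it should, and only when it should

Respond with JSON only, for example:
{{"faithfulness": 4, "answer_quality": 5, "tone_format": 3, "safety": 5}}"""

def judge(question, sources, answer, call_llm):
    """call_llm is a stand-in for whatever model client you already have."""
    raw = call_llm(JUDGE_PROMPT.format(
        question=question, sources=sources, answer=answer))
    return json.loads(raw)  # one score per dimension, never a single blended number
```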
Comparing prompt v3 to v4? Ask the judge “is A better than B?” rather than scoring each independently. Pairwise is far less noisy and the right shape for tuning.
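One way to make pairwise judging robust, sketched below: ask twice with the positions swapped, since LLM judges tend to favor whichever answer appears first. Again, the prompt and call_llm are stand-ins:

```python
PAIRWISE_PROMPT = """Question: {question}

Answer A: {a}

Answer B: {b}

Which answer better satisfies the rubric? Reply with exactly "A" or "B"."""

def pairwise_judge(question, old, new, call_llm):
    """Ask twice with positions swapped to cancel out position bias."""
    first = call_llm(PAIRWISE_PROMPT.format(question=question, a=old, b=new)).strip()
    second = call_llm(PAIRWISE_PROMPT.format(question=question, a=new, b=old)).strip()
    if first == "B" and second == "A":
        return "new"   # the new answer won from both positions
    if first == "A" and second == "B":
        return "old"   # the old answer won from both positions
    return "tie"       # the verdict flipped with position: treat as a tie
```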
The fix is calibration. Take ~100 eval items, have a human label them, and check that the judge agrees with the human at least 80% of the time. If it doesn’t, the judge prompt is the problem — not the system you’re trying to test.
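The agreement check itself is only a few lines. A toy version, with made-up labels standing in for the ~100-item calibration set:

```python
def judge_agreement(human_labels, judge_labels):
    """Fraction of calibration items where the judge matches the human label."""
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return matches / len(human_labels)

# Toy data; in practice these are ~100 human-labeled eval items.
human = ["pass", "fail", "pass", "pass", "fail"]
judge = ["pass", "fail", "fail", "pass", "fail"]

agreement = judge_agreement(human, judge)
print(f"{agreement:.0%}")  # 80% -- right at the bar; below it, fix the judge prompt
```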
Reddit Answers retrieves real Reddit content and summarizes it through an LLM. We built an evaluation system around it from scratch. The shape of that system is the most useful artifact I kept from the project.
Around 200 hand-written questions sampled across topics — finance, hobbies, controversial debate, niche communities, evergreen explainers. Each had a written rubric for what a good answer looked like before any model ran. This step is the one most teams skip; it’s also the one that pays off most.
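The shape of a bank item matters more than the tooling around it. An illustrative structure, not the actual Reddit Answers schema:

```python
from dataclasses import dataclass, field

@dataclass
class EvalItem:
    """One prompt-bank entry; the rubric is written before any model runs."""
    question: str
    topic: str          # slice key: finance, hobbies, debate, niche, evergreen
    rubric: str         # what a good answer looks like, in plain prose
    tags: list[str] = field(default_factory=list)

bank = [
    EvalItem(
        question="Is a Roth IRA worth it for a 25-year-old?",
        topic="finance",
        rubric="Explains the tax trade-off, hedges rather than prescribes, "
               "and stays grounded in the retrieved threads.",
        tags=["evergreen", "opinion-adjacent"],
    ),
]
```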
Initial agreement with human reviewers was lower than we wanted. Tightening the rubric prompt and shifting to pairwise comparison brought judge–human agreement up to a level we trusted. The judge prompt itself became a versioned artifact, evaluated alongside the system it was scoring.
Failing answers fell into three recurring buckets.
The eval bank ran on every candidate change. Regressions on critical slices — faithfulness on factual questions, tone on opinion ones — blocked release. Wins on one slice that broke another were the most common, and most important, finding.
The eval set wasn’t just a quality check — it became the spec. The team stopped arguing about “is the answer good?” and started arguing about “what should the rubric say?” A much more productive conversation.
Evals are diagnostic. They only matter if they’re wired to a change. There are five layers of change worth knowing — lighter on top, heavier and more compounding below.
Most teams over-iterate on the application layer because it’s fast, and underinvest in the bottom three. The durable wins live below the application layer — in data flywheels and production discipline.
The loop is the same regardless of the product. The shape of the prompt bank, the rubric, and the dominant change lever vary.
Support bot
Bank — real ticket transcripts, sampled across product areas.
Rubric — resolution, escalation appropriateness, voice adherence, policy faithfulness.
Common lever — routing. Categorize intent first, then specialize the prompt or knowledge base.
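A minimal sketch of the routing lever, assuming an intent classifier and an LLM client already exist (classify and call_llm are hypothetical stand-ins):

```python
# Hypothetical specialized prompts keyed by intent.
INTENT_PROMPTS = {
    "billing": "You are a billing specialist. Follow refund policy strictly.",
    "bug":     "You are a technical triage agent. Ask for repro steps.",
    "account": "You are an account-security agent. Never reset without ID proof.",
}

def route(ticket_text, classify, call_llm):
    """Classify intent first, then answer with the specialized prompt."""
    intent = classify(ticket_text)              # e.g. "billing"
    prompt = INTENT_PROMPTS.get(intent)
    if prompt is None:
        return {"action": "escalate", "reason": f"unhandled intent: {intent}"}
    return {"action": "answer",
            "text": call_llm(prompt + "\n\nTicket:\n" + ticket_text)}
```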
RAG over documents
Bank — questions paired with the source chunks that contain answers.
Rubric — faithfulness, retrieval recall, completeness.
Common lever — retrieval. Chunking, hybrid search, reranking. Generation is rarely the bottleneck; retrieval almost always is.
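Retrieval recall is the metric that makes that bottleneck visible. A minimal version, assuming each bank item stores the IDs of the gold chunks that contain the answer:

```python
def retrieval_recall(retrieved_ids, gold_ids, k=10):
    """Fraction of the gold chunks that show up in the top-k retrieved."""
    hits = set(retrieved_ids[:k]) & set(gold_ids)
    return len(hits) / len(gold_ids)

# One gold chunk found out of two: recall 0.5, and generation never had a chance.
print(retrieval_recall(["c12", "c07", "c99"], ["c07", "c31"], k=3))  # 0.5
```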
Summarization
Bank — source documents paired with reference summaries.
Rubric — coverage, faithfulness, length adherence, redundancy.
Common lever — prompt structure and length control. Summarization is unusually sensitive to few-shot examples.
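Length adherence is the one dimension here that needs no judge at all. A deterministic sketch, with the ±20% tolerance as an arbitrary choice:

```python
def length_adherence(summary, target_words, tolerance=0.2):
    """Pass if the summary lands within ±20% of the requested word count."""
    n = len(summary.split())
    return abs(n - target_words) <= tolerance * target_words

print(length_adherence("word " * 95, target_words=100))   # True
print(length_adherence("word " * 150, target_words=100))  # False
```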
The mature end-state is when evals stop being a project and start being a system component — wired into how the product ships. It usually arrives in four stages.
Regression gate. Every prompt, model, or retrieval change runs against the eval bank before merge. Regressions on critical slices block the deploy automatically — same shape as a unit test.
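In CI this reduces to a test. A pytest-shaped sketch; the slice names, dimensions, and the 0.02 regression budget are all assumptions:

```python
CRITICAL_SLICES = {"factual": "faithfulness", "opinion": "tone_format"}
REGRESSION_BUDGET = 0.02  # max drop on a critical slice before the merge is blocked

def critical_regressions(baseline, candidate):
    """baseline/candidate: {slice: {dimension: score}} from a full eval-bank run."""
    failures = []
    for slice_name, dim in CRITICAL_SLICES.items():
        drop = baseline[slice_name][dim] - candidate[slice_name][dim]
        if drop > REGRESSION_BUDGET:
            failures.append(f"{dim} dropped {drop:.2f} on the {slice_name} slice")
    return failures

def test_candidate_does_not_regress():
    baseline = {"factual": {"faithfulness": 0.91}, "opinion": {"tone_format": 0.84}}
    candidate = {"factual": {"faithfulness": 0.90}, "opinion": {"tone_format": 0.85}}
    assert critical_regressions(baseline, candidate) == []
```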
Shadow mode. The new configuration runs silently against real production queries. Outputs are scored but never shown to users. Confirms behavior under the real distribution before any user sees it.
Canary. Ship to 1–5% of traffic with online metrics tied to eval rubric dimensions. If the canary’s faithfulness score drops, auto-rollback before the change is widely visible.
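The rollback decision is a threshold check. A sketch, where both the absolute faithfulness floor and the allowed gap against control are assumed numbers:

```python
from statistics import mean

FAITHFULNESS_FLOOR = 0.85  # assumed absolute floor tied to the offline rubric
ALLOWED_GAP = 0.03         # assumed tolerable gap versus the control arm

def check_canary(canary_scores, control_scores):
    """Each list holds judged faithfulness scores for one experiment arm."""
    canary, control = mean(canary_scores), mean(control_scores)
    if canary < FAITHFULNESS_FLOOR or canary < control - ALLOWED_GAP:
        return "rollback"  # auto-rollback before the change is widely visible
    return "continue"

print(check_canary([0.88, 0.90, 0.87], [0.89, 0.91, 0.90]))  # continue
print(check_canary([0.80, 0.83, 0.79], [0.89, 0.91, 0.90]))  # rollback
```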
Continuous monitoring. Sample production traffic continuously, score it with the same judge, surface drift. The eval bank itself evolves from the production tail — failures from the wild become tomorrow’s regression tests.
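A minimal version of the sample-and-harvest step; the 1% rate and the failure threshold are placeholders:

```python
import random

def sample_for_scoring(queries, rate=0.01):
    """Bernoulli-sample ~1% of production traffic for the judge to score."""
    return [q for q in queries if random.random() < rate]

def harvest_failures(scored, floor=0.8):
    """Low-scoring production cases graduate into tomorrow's eval bank."""
    return [s for s in scored if s["faithfulness"] < floor]
```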
This is what separates teams treating AI as a product feature from teams treating it as infrastructure.
Things that look like eval problems but are usually something more specific.
The system you ship is whatever your evals incentivize. So the evals had better be honest.
An eval system is a mirror. It tells you what you’ve actually built — not what you wished you’d built. The teams that take that seriously end up with products that get measurably better over time. The teams that don’t end up with products that feel like they should be working, while users quietly route around them.
The framework here is illustrative, drawn from work on Reddit Answers and adjacent product builds. The patterns generalize, but every system has its own quirks worth respecting. The first prompt bank is always wrong. The first rubric is always too coarse. That’s fine — both are meant to be iterated.