“Looks Good” Isn’t Enough: Evaluating GenAI Output (GenAI Series, Part 18)

Fluent is not the same as correct. How to evaluate GenAI output properly: build a golden set, choose human, automatic or model-graded scoring, and run it as a harness.

By Dr. Pranay Jha · June 18, 2026 · Read time: 9 minutes

Generative AI Series · Part 18 of 30

TL;DR · Key Takeaways

Judging AI output by eyeballing a few answers (the “vibe check”) is how broken systems ship. Fluent and correct are not the same thing.
The foundation of real evaluation is a golden set: a fixed collection of test cases with known-good answers or clear pass criteria.
Three ways to grade: humans (accurate, slow), automatic metrics (fast, narrow), and a model as judge (scalable, biased).
Wire it into a repeatable eval harness so every prompt or model change is scored against the same tests, catching regressions before users do.

You tweak a prompt, run it once, and the new answer reads better than the old one. Ship it. That little ritual (change something, glance at one output, decide it improved) is how a startling number of AI features are built, and it is almost worthless. A single answer tells you nothing about the hundred you did not look at, and these models are masters of sounding good while being wrong. Evaluation is the unglamorous discipline that separates “it looked fine when I tried it” from “it works.” This part is a practical how-to for doing it properly without drowning in process.

Figure: One of these tells you whether a change actually helped. The other just feels like it does.

Start with a golden set

Everything in evaluation rests on one thing: a golden set, a fixed collection of representative inputs paired with what a good answer looks like. It does not need to be huge. Twenty to a hundred carefully chosen cases that mirror what your real users actually ask will tell you more than a thousand random ones. The craft is in coverage: include the common requests, the tricky edge cases, the inputs that have burned you before, and a few deliberately nasty ones. This set becomes your fixed yardstick, the thing you measure every future change against, so improvements and regressions both become visible instead of anecdotal.

The discipline that makes a golden set powerful is that it does not move. The instant you start quietly editing test cases to match the output you just got, you have lost your reference point. Keep it stable, version it, and add to it deliberately, especially every time a real failure slips through to users: turn that failure into a permanent test case so it can never silently come back. A golden set that grows from real incidents becomes a memory of every mistake your system has made, which is exactly what you want guarding the door.
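One lightweight way to notice a golden set that has quietly moved is to fingerprint it and store the fingerprint next to each recorded score. The sketch below is only an illustration: the function name `golden_set_fingerprint`, the dictionary fields, and the second test case are all hypothetical, and a team that already keeps the set in version control may not need this at all.

```python
import hashlib
import json


def golden_set_fingerprint(cases: list[dict]) -> str:
    """Return a stable SHA-256 fingerprint of the golden set's contents."""
    # sort_keys and fixed separators make the serialisation deterministic,
    # so identical cases always produce the same hash.
    canonical = json.dumps(
        cases, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical version 1 of a golden set with a single case.
golden_v1 = [
    {
        "id": "refund-after-20-days",
        "question": "Can I get a refund after 20 days?",
        "criteria": "says no, cites the 14-day window, stays polite",
    },
]

# A failure that reached users is added deliberately as a new, permanent case
# (this second case is invented for illustration).
golden_v2 = golden_v1 + [
    {
        "id": "refund-on-day-14",
        "question": "Can I get a refund on day 14?",
        "criteria": "says yes, cites the 14-day window",
    },
]

fingerprint_v1 = golden_set_fingerprint(golden_v1)
fingerprint_v2 = golden_set_fingerprint(golden_v2)

# Different fingerprints mean scores from the two versions are not comparable.
scores_are_comparable = fingerprint_v1 == fingerprint_v2
```

If two eval runs carry different fingerprints, their scores were measured against different yardsticks and should not be compared directly.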

Made concrete, a test case is humbler than it sounds. For a support assistant it might be three columns: the question (“Can I get a refund after 20 days?”), the context the system should retrieve (your 14-day policy), and the pass criteria (“says no, cites the 14-day window, stays polite”). That is it. Twenty rows like that, kept in a spreadsheet, already give you a real test. The mistake people make is over-engineering this step and never starting; the right move is to write ten cases this afternoon, run them, and grow the set as reality teaches you new ways to fail. A rough golden set in use beats a perfect one in planning.
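The three-column spreadsheet described above maps almost directly onto code. Here is a minimal sketch, assuming the sheet is exported as CSV with the column names `question`, `context`, and `pass_criteria`; those names, the `GoldenCase` class, and the wording of the policy text are illustrative choices, not a fixed format.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenCase:
    """One row of the golden set: input, expected context, pass criteria."""
    question: str
    context: str
    pass_criteria: str


# Stand-in for a CSV export of the spreadsheet (contents are illustrative).
SHEET = """question,context,pass_criteria
Can I get a refund after 20 days?,Refunds are accepted within 14 days of purchase.,"says no, cites the 14-day window, stays polite"
"""


def load_cases(csv_text: str) -> list[GoldenCase]:
    """Parse CSV text into GoldenCase objects, one per row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [GoldenCase(**row) for row in reader]


cases = load_cases(SHEET)
for case in cases:
    print(f"{case.question} -> {case.pass_criteria}")
```

Making the dataclass frozen is a small nudge toward the discipline described earlier: a loaded case cannot be edited in place by accident.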

Three ways to grade an answer

Once you have inputs and a sense of what good looks like, you need a way to score each output. There are three, and they trade off accuracy against cost. Human evaluation is the gold standard: a knowledgeable person reads the output and judges it against your criteria. It captures nuance nothing else can, but it is slow, expensive, and hard to repeat on every code change. Automatic metrics sit at the other extreme: when there is a single correct answer (a category, a number, an extracted field), you can check it with exact-match or a simple rule, which is instant and perfectly repeatable but only works on tasks with a crisp right answer.
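For tasks with a crisp right answer, the automatic check can be a few lines. The sketch below assumes that case and surrounding whitespace should not count as errors, which is a judgment call that depends on the task; the function names and the sample labels are made up for illustration.

```python
def normalize(text: str) -> str:
    """Lower-case and collapse whitespace so trivial differences do not fail."""
    return " ".join(text.strip().lower().split())


def exact_match(predicted: str, expected: str) -> bool:
    """True when the two strings are identical after normalisation."""
    return normalize(predicted) == normalize(expected)


def accuracy(predictions: list[str], expected: list[str]) -> float:
    """Fraction of predictions that exactly match their expected answer."""
    if len(predictions) != len(expected):
        raise ValueError("predictions and expected must have the same length")
    if not expected:
        return 0.0
    hits = sum(exact_match(p, e) for p, e in zip(predictions, expected))
    return hits / len(expected)


# Illustrative outputs from a ticket classifier and their known-good labels.
predictions = ["Billing", " refund ", "shipping"]
expected = ["billing", "refund", "returns"]

score = accuracy(predictions, expected)  # two of the three labels match
```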

The third option has become the workhorse for open-ended tasks: using a model as the judge. You give a capable model the input, the response, and a rubric, and ask it to score the response. This is cheap and scalable enough to run on hundreds of cases per change, which is its great appeal, but it inherits all the quirks and biases of a model, so it must be used with care. In practice, mature teams blend all three: automatic metrics where answers are crisp, a model judge for scale on open-ended output, and periodic human review to keep the automated scores honest. The table below maps the typical fit.
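In outline, a model judge is a prompt template plus a parser for the verdict. The sketch below deliberately leaves the model call abstract: `call_model` is a hypothetical function you would supply for whichever provider you use, `fake_judge` is a stub so the example runs offline, and the JSON reply format is an assumption rather than any standard.

```python
import json
from typing import Callable, Optional

# Doubled braces are literal braces in str.format templates.
JUDGE_TEMPLATE = """You are grading an assistant's response.
Question: {question}
Response: {response}
Rubric: {rubric}
Reply with JSON only: {{"pass": true or false, "reason": "<one sentence>"}}"""


def judge_response(
    question: str,
    response: str,
    rubric: str,
    call_model: Callable[[str], str],
) -> Optional[bool]:
    """Ask the judge for a pass/fail verdict; return None if it is unparseable."""
    prompt = JUDGE_TEMPLATE.format(
        question=question, response=response, rubric=rubric
    )
    raw = call_model(prompt)
    try:
        verdict = json.loads(raw)
    except json.JSONDecodeError:
        return None  # Treat malformed judge output as "no verdict", not a fail.
    passed = verdict.get("pass") if isinstance(verdict, dict) else None
    return passed if isinstance(passed, bool) else None


def fake_judge(prompt: str) -> str:
    """Stub standing in for a real model call (hypothetical, offline)."""
    return '{"pass": true, "reason": "Declines politely and cites the 14-day window."}'


result = judge_response(
    question="Can I get a refund after 20 days?",
    response="I'm sorry, refunds are only available within 14 days of purchase.",
    rubric="says no, cites the 14-day window, stays polite",
    call_model=fake_judge,
)
```

Returning `None` for an unparseable verdict keeps judge failures separate from genuine fails, so they can be counted and inspected rather than silently dragging the score down.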

Task type	Good way to grade it
Classification / extraction	Automatic: exact match, accuracy, F1
Factual Q&A (RAG)	Check answer against source; model judge for nuance
Summaries / writing	Model judge with a rubric, spot-checked by humans
Code generation	Run the code: does it pass the tests?
Tone / safety / style	Human review, plus a model judge as a first filter

Match the grading method to the task. The crisper the expected answer, the cheaper the metric you can use.
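For the extraction row of the table, F1 is useful when an answer can be partly right. Below is a minimal sketch that scores a set of extracted field names against the expected set; the field names are invented, and comparing plain sets is an assumption that ignores duplicates and ordering.

```python
def precision_recall_f1(
    predicted: set[str], expected: set[str]
) -> tuple[float, float, float]:
    """Score a set of extracted items against the known-good set."""
    true_positives = len(predicted & expected)
    # Precision: how much of what was extracted is correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: how much of what should have been extracted was found.
    recall = true_positives / len(expected) if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative case: the system found two of four fields and one wrong one.
predicted_fields = {"order_id", "email", "phone"}
expected_fields = {"order_id", "email", "address", "name"}

precision, recall, f1 = precision_recall_f1(predicted_fields, expected_fields)
```

Here precision is 2/3, recall is 1/2, and F1 works out to 4/7, which rewards the partial hit that a strict exact match would score as zero.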

Wire it into a loop

The payoff comes when evaluation stops being a one-off and becomes a harness you run on every change. The loop is simple: take your golden set, run the current system over all of it, score the results, and record a single number. Now any change you make (a new prompt, a different model, a tweaked retrieval step) gets run through the identical gauntlet, and you can see at a glance whether the score went up, down, or sideways. That turns “I think this is better” into “this scored 84% versus 79%,” which is the difference between guessing and engineering.
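The loop itself fits in a handful of lines. In this sketch the two "systems" are toy keyword routers standing in for a prompt plus model, the four test cases are invented, and `run_eval` is a hypothetical name; the point is only that both versions run over the identical set and each produces one comparable number.

```python
from typing import Callable

# Each golden case here is (input, expected label); contents are illustrative.
GOLDEN_SET: list[tuple[str, str]] = [
    ("Where is my parcel?", "shipping"),
    ("I was charged twice", "billing"),
    ("Can I get a refund after 20 days?", "refund"),
    ("How do I reset my password?", "account"),
]


def grade(output: str, expected: str) -> bool:
    """Automatic grader: exact match after trimming and lower-casing."""
    return output.strip().lower() == expected


def run_eval(
    system: Callable[[str], str],
    golden_set: list[tuple[str, str]],
    grader: Callable[[str, str], bool],
) -> float:
    """Run the system over every case and return the pass rate."""
    if not golden_set:
        raise ValueError("golden set is empty")
    passed = sum(grader(system(question), expected) for question, expected in golden_set)
    return passed / len(golden_set)


def baseline_system(question: str) -> str:
    """Toy stand-in for the current prompt and model."""
    return "billing" if "charged" in question else "shipping"


def candidate_system(question: str) -> str:
    """Toy stand-in for the changed prompt: adds refund handling."""
    if "refund" in question:
        return "refund"
    return baseline_system(question)


baseline_score = run_eval(baseline_system, GOLDEN_SET, grade)
candidate_score = run_eval(candidate_system, GOLDEN_SET, grade)
print(f"baseline {baseline_score:.0%} vs candidate {candidate_score:.0%}")
```

Because the grader is passed in as a parameter, the same loop can take an exact-match check for crisp tasks or a model judge for open-ended ones.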

This loop is what makes iterating on AI feel less like alchemy. You can experiment boldly because the harness catches you when a clever change quietly breaks three edge cases while improving the one you were staring at. It also protects you over time: models get updated, prompts get edited by different people, and without a standing eval those small drifts accumulate into a system that is worse than it was six months ago and nobody noticed. The harness is your regression alarm, and building even a crude one early pays for itself almost immediately.
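A single number can hide exactly the kind of quiet breakage described above, so it helps to diff per-case results between two runs as well. This is a small sketch under the assumption that each run is stored as a mapping from a case ID to pass or fail; the IDs and results are invented.

```python
def pass_rate(results: dict[str, bool]) -> float:
    """Fraction of cases that passed in one run."""
    return sum(results.values()) / len(results) if results else 0.0


def find_regressions(before: dict[str, bool], after: dict[str, bool]) -> list[str]:
    """Return IDs of cases that passed before the change and fail after it."""
    # A case missing from the new run is treated as a failure.
    return sorted(
        case_id
        for case_id, passed in before.items()
        if passed and not after.get(case_id, False)
    )


# Illustrative per-case results for two runs over the same golden set.
before = {"case-a": True, "case-b": True, "case-c": False, "case-d": False, "case-e": False}
after = {"case-a": True, "case-b": False, "case-c": True, "case-d": True, "case-e": False}

regressions = find_regressions(before, after)
improved_overall = pass_rate(after) > pass_rate(before)
```

In this example the overall pass rate rises from 2/5 to 3/5, yet `case-b` has regressed, which is the detail a headline score alone would not show.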

The same tests, every change, one comparable score. That is the whole mechanism.

Reality check: a small, real eval beats a big, fancy one you never run. I would rather a team had thirty honest test cases they check on every change than a polished benchmark suite gathering dust. The goal is not a perfect score on a leaderboard; it is a fast, repeatable signal that tells you whether yesterday’s change made your own product better or worse.

Go Deeper (optional, for technical readers)

Using a model as judge is powerful and quietly treacherous, so it is worth knowing its failure modes. Judges show position bias: shown two answers to compare, many models systematically favour whichever came first (or second), regardless of quality. They show verbosity bias, rating longer, more confident-sounding answers higher even when a terse one is better. They can show self-preference, scoring text from their own model family more generously. And they are inconsistent, sometimes giving the same answer different scores on different runs.

The mitigations are practical. Use a clear, specific rubric rather than asking for a vague “rate 1 to 10,” because concrete criteria reduce drift. For comparisons, run each pair in both orders and average, which cancels position bias. Pin the judge to a low temperature for consistency, and calibrate it against a sample of human judgments so you know how far to trust it. For high-stakes decisions, treat the model judge as a fast first filter that flags likely problems for a human, not as the final word. The overarching rule: a model judge is a useful instrument, but like any instrument it needs calibrating against ground truth, which is why a human-reviewed slice of your golden set never fully goes away.
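Two of these mitigations are easy to sketch: judging each pair in both orders, and measuring agreement with human labels. The code below assumes a hypothetical pairwise judge, `prefers_first`, that returns True when it favours the first answer shown; the stub judges exist only to make the effect visible and do not model any real system.

```python
from typing import Callable

PairJudge = Callable[[str, str, str], bool]  # (question, first, second) -> prefers first


def debiased_preference(
    question: str, answer_a: str, answer_b: str, prefers_first: PairJudge
) -> float:
    """Average the verdict over both orderings: 1.0 = A, 0.0 = B, 0.5 = split."""
    a_wins_shown_first = prefers_first(question, answer_a, answer_b)
    a_wins_shown_second = not prefers_first(question, answer_b, answer_a)
    return (int(a_wins_shown_first) + int(a_wins_shown_second)) / 2


def agreement_rate(judge_labels: list[bool], human_labels: list[bool]) -> float:
    """Fraction of cases where the model judge matches the human verdict."""
    if len(judge_labels) != len(human_labels) or not human_labels:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


def position_biased_judge(question: str, first: str, second: str) -> bool:
    """Stub judge with pure position bias: always prefers the first answer."""
    return True


def length_judge(question: str, first: str, second: str) -> bool:
    """Stub judge that prefers the longer answer regardless of position."""
    return len(first) > len(second)


short, long = "No.", "No, refunds close after 14 days."

biased_result = debiased_preference("Refund after 20 days?", short, long, position_biased_judge)
consistent_result = debiased_preference("Refund after 20 days?", short, long, length_judge)

# Illustrative calibration sample: judge verdicts against human verdicts.
calibration = agreement_rate([True, True, False, True], [True, False, False, True])
```

A pure position bias collapses to a 0.5 split once both orders are averaged, while a judge with a consistent preference still gives a clear verdict. The second stub also shows the limit of the trick: swapping order does nothing about verbosity bias, which is why the human calibration sample matters.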

This is Part 18, the close of Phase 3, in a 30-part walk from zero to the infrastructure behind production AI. The full map is on the Generative AI Complete Guide. Evaluation is how you keep the hallucinations from Part 11 and the RAG of Part 13 honest.

The Bottom Line

Looking good is not the same as being good, and the only way to tell the difference at scale is to measure. Build a small, stable golden set of real cases. Choose a grading method that fits each task: humans for nuance, automatic checks for crisp answers, a model judge for open-ended scale. Then run the whole thing as a harness on every change so a single number tells you whether you improved or regressed.

The mindset I would press hardest is this: treat “it looked fine” as the start of evaluation, not the end of it. The teams whose AI products keep getting better are simply the ones who can measure whether they are. That closes the practical phase of the series. From here we go under the hood again, into the engineering that decides quality and cost, starting with a claim that surprises people: more often than not, it is the data, not the size of the model, that decides how good it is.



About The Author

Dr. Pranay Jha

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
