The Eval Was Lying to Me

I shipped a research skill with a bundled example: the sample output that tells anyone who downloads it, this is what good looks like. My gold standard. Before I let it stand for the work, I made myself do one boring thing and verify it, re-fetching every source it cited and checking the claims against the live pages.

It didn't hold. A citation dated two weeks after the research it supposedly came from. An internal record that contradicted itself. The showcase of the skill's discipline, quietly breaking that discipline.

Here is the part that matters. That example was written by an AI, like the skill around it, and like the eval that had passed it. I had built the whole thing the modern way: take your expertise, hand it to a model, layer review tools on top, give the agent room to work. On paper it was rigorous. The only reason I caught anything is that I refused to trust the gold standard until I had checked it myself.

So I built a stricter gate, one that re-derives the truth instead of asking a model to vouch for it, and I ran it on the four briefs I had locked as my known-good baseline. It blocked every one.

The annoying truth lives right here at the top, so you don't have to wait for it. A passing eval can be the thing lying to you. A green check is not evidence the check can see the lie. And the thing that saves you is not the cleverness of your tools or the autonomy of your agent. It is the discipline of refusing to trust an output you haven't verified.

The belief I held

Start with what a skill even is, because it is why I reached for one. A skill is a reusable set of instructions that hands an AI agent a specific job and the discipline to do it the same way every time. I had an SOP's worth of hard-won expertise, the kind that lived in my head and a few documents, and I wanted it to run without me. A skill is the obvious vehicle, so I did the obvious thing and handed the writing to a model.

That was the first mistake: asking a vanilla LLM to research a domain and write the skill, more or less unsupervised. I corrected it the way you are supposed to. I layered a purpose-built reviewer on top, Anthropic's skill-creator, and asked for critiques. I gave the agent autonomy, which is not a bad thing. Most of the clarity I got came from asking the model questions, which is also not a bad thing.

And I stood up an eval: an automated test that grades the skill's output and returns a pass or a fail. It had fixtures (realistic research scenarios), a rubric, an LLM judge that read each brief and graded it against the discipline, and a locked baseline of four briefs I treated as the gold standard. On paper it was rigorous. It had all the parts.

What I did not have was any understanding of what each part was for. I had assembled the furniture of an eval without knowing which piece holds the weight. I believed that having an eval, and layering review on top, was the safety. A green check is a powerful invitation to stop looking. The only reason this is a lesson and not a disaster is that I didn't take the invitation: I made myself verify the one output I was about to let stand for the work. The belief that the eval was the safety was the real bug, and it is the one you can't catch by looking. The habit of checking anyway was the fix.

A producer can't certify itself

The part doing the grading was an LLM reading another LLM's citations, and those briefs passed it despite real, defective citations. Why a model-grader lets that through is worth being careful about, because I cannot see inside it. What I can say is that an LLM handed a well-formed citation tends to accept that a citation exists rather than re-deriving whether it holds, the way a person proofreading their own writing reads what they meant instead of what they wrote. Every claim had a citation. The citations just didn't hold.

This is not a prompt problem you fix by being sterner. It is structural. A producer cannot certify its own grounding, and an "LLM judge" is very often the producer wearing a disguise. Hand a model work that resembles its own and it tends to prefer it. That is not just my inference. Anthropic names the same effect as a reason for a feature they shipped on June 2, 2026, an essay and tool titled "A harness for every task: dynamic workflows in Claude Code." In their words: "Self-preferential bias refers to Claude's tendency to prefer its own results or findings." The exact thing that had been quietly passing my broken briefs, written into the launch docs as a reason to build scaffolding around the model.

So now the first question I ask of any check is blunt: could the thing being graded have written the input to this check? If yes, it isn't a check. It's a mirror.

The fix: re-derive truth, never trust a producer-written field

The fix is one principle, and it is the most load-bearing idea in this essay: re-derive truth from the source, and never trust a field the producer wrote.

My rebuilt gate has a bottom layer that uses no LLM at all. For every citation in a brief, it re-fetches the live page and string-matches the quoted text and every number against what is actually on it. Code, not judgment. A liar running grep still gets the truth.

The first version of that floor had a bug that taught me the other half of the lesson. One of its checks takes every figure asserted in a brief and confirms the number is actually backed by a cited source, not just by the quotes the writer chose to save. It flagged a $120,000 figure as unsupported, and I nearly shipped the "catch" as proof the gate worked. The figure was real, verbatim on the cited government page. The bug was where the check looked: it searched the short excerpt the writer had saved to the ledger, which happened to be truncated and didn't contain the number, instead of the live page, which did. The check had trusted a producer-written field, the saved excerpt, over the source itself, and it failed faithful work in the exact way the floor exists to prevent.

Re-anchoring that check to the live page fixed two problems at once. First, it stopped failing real work: the $120k is genuinely on the page, so now it passes. Second, it closed an obvious way to cheat. The old check read the writer's own excerpt, which means a writer could get any number approved just by typing that number into the excerpt. The new check ignores the excerpt and reads the page, which the writer can't edit, so that trick stops working.

That combination is rarer than it sounds. Usually a stricter check is a trade-off: it catches more bad work, but it also flags more good work by mistake. Here there was no trade-off. Pointing the check at the real source made it more accurate and harder to game at the same time, because the source is the one thing the writer can't rewrite.

So now every check I build starts with one question: what is its input? If the input is something the author could have written, the author can pass the check by writing it. Point the check at the page, the file, the API, anything they don't control.

What a harness actually is

Step back, because the words matter and most arguments about evals are really vocabulary confusion. An eval is a test: a set of inputs (the fixtures), a definition of good and bad (the criteria), and something that decides pass or fail (the judge). A harness is the rig that runs that test. It feeds the fixtures in, applies the judge, and reports a verdict, repeatably, on every change, without a human eyeballing it.

The thing nobody tells you is that the judge you are allowed to build is dictated by your output, not your ambition. It slides along a spectrum:

Deterministic floor. When "good" can be checked against a ground truth you can re-derive. My research skill earns this because a claim's truth lives on a fetchable page. Un-cheatable, zero variance.
LLM judge. When "good" is a judgment with no fetchable ground truth. I have an onboarding skill that writes brand positioning and voice. There is no live page to re-fetch "good positioning" against; "is this on-brand?" is inherently a call. So that skill is graded by LLM judges, and it cannot have a deterministic floor. Not a shortcut. An impossibility set by the output.
Live-run. When "good" is emergent and multi-stage, and a fixture captures it poorly. A content pipeline I built is validated by running it end-to-end against real data and watching what breaks, not by a synthetic gate.

This is the half "a harness for every task" leaves implicit. The phrase is right: you can now write a harness for anything. But a harness's power is set by one question it does not ask out loud: can you mechanically tell good from bad? For a lot of knowledge work you can't, and pretending otherwise is how you end up with a gate that grades vibes and feels like rigor.

Fixtures test the skill. Anchors test the gate.

Here is the move that separates a real eval from a comforting one. There are two jobs, and they use two different materials.

Fixtures are inputs: a scenario, plus the traps it should dodge, drawn from the known ways this kind of skill fails. You run the skill on a fixture and grade the output. Fixtures test the skill.
Anchors are frozen outputs with a known-correct verdict. You feed them to the gate and check the gate's verdict. Anchors test the gate, because the gate is itself software, and software rots.

Anchors come in two flavors, because a gate can be wrong in two directions:

Recall. Of everything that is actually bad, how much did you catch? A smoke detector that stays silent during a fire. My LLM judge passing four broken briefs was a recall failure.
Precision. Of everything you flagged, how much was actually bad? A smoke detector that shrieks at toast. My $120k false alarm was a precision failure.

My original suite was made entirely of traps: scenarios that should fail. A trap-only suite can measure recall and is structurally blind to precision. No fixture ever asserted "this clean brief must pass untouched," so the $120k over-flag could have lived there forever. The fix is positive baselines: known-good briefs, authored with the discipline, that must pass the full gate clean. They earn the label "known-good" not by my say-so but by surviving the un-cheatable floor, every citation re-fetched, every number matched.

So now every gate I build ships with at least one must-pass case and one must-fail case. A suite of only-should-fail tests is half a gate: the half that does not protect users from your own strictness.

How much eval does a skill actually need?

Not every skill earns the cathedral above. Building a full harness for the wrong skill is its own failure; I have shipped an eval that masqueraded as safety and protected nothing. The depth a skill deserves comes down to two questions:

How open-ended is the output? This decides which layers you need.
What does being wrong cost? This decides how rigorous, and whether to build a harness at all.

Plot a few skills on those two axes and it gets obvious fast:

Skill	Output	Cost of wrong	What it earns
A "grill me" prompt	open-ended	~zero (you're in the loop)	No harness. You're the judge.
A booking-confirmation filler	mechanical	customer-facing	Rigorous but shallow floor. No panel.
A blog post	open-ended	published, brand	floor + LLM judge + outcome
A research deliverable	open-ended	high (a wrong fact in a client's hands)	the full thing

A "grill me" skill has no artifact to grade, and you re-prompt the instant it is weak, so building an output harness for it is wasted work. A booking filler is mechanical, but it is customer-facing, so its floor has to be airtight even though it is shallow: not just "is the template filled," but "does every field match the source." A beautifully formatted confirmation with the wrong date still ships.

For the genuinely open-ended, high-stakes case, the honest decomposition is three levels, not two:

Mechanical floor: structure and atomic facts. "Open-ended" describes the whole, not the atoms. A blog post's prose is open-ended, but every stat it cites is re-fetchable, exactly like a research brief. Push everything you honestly can down here. It is the only layer with zero variance.
LLM judge: the genuine-judgment residue (voice, brand-fit, persuasiveness), run as an adversarial panel (several judges instructed to attack the work rather than bless it) and treated as a signal, not a verdict.
Outcome: did it actually perform. There is no pre-ship ground truth for "this post will land." Only reality answers it, after you ship. That is not a gap in your harness; it is the nature of the task, and it is why you validate that top layer by watching real engagement, not a synthetic judge.

The annoying truth about the new tooling

Anthropic's dynamic workflows make all of this dramatically cheaper to build. A dynamic workflow is a JavaScript script that orchestrates subagents at scale, including the exact pattern this essay is about: "For each spawned agent, run a separate spawned agent to adversarially verify its output against a rubric or criteria." My own research gate runs its adversarial verifier panel as one of these workflows. The orchestration that used to be bespoke engineering is now a primitive.

But here is the line to keep: tooling lowers the cost of a harness. It cannot raise the ceiling on how much you can trust it. Cheap orchestration of LLM judges is still LLM judges, and that ceiling is set by your output's verifiability, a property of the thing being graded, not of the rig grading it. A workflow hands you a faster way to build the panel. It does not make "is this good positioning?" mechanically answerable.

One concrete consequence, because a workflow is non-deterministic wherever its judges are LLMs: run your regression off the deterministic layer, and treat the LLM panel as advisory. Re-run a fixed floor on a frozen anchor and any change in the verdict is a real change. Re-run an LLM panel and the verdict can flicker on its own. So I pin my regression anchors on the floor; the panel is a trend signal I sample and spot-check, never a frozen pass or fail. A regression test is only as stable as its instrument, and stability comes from the judge being code.

The eval is never finished

The last thing I want to leave you with is a realization, not a correction, and it is the one I would least want you to skip. It is natural to aim for a complete eval: enough fixtures, enough traps, and you are covered. You are not. A skill meets inputs in the wild you never imagined, carrying traps you never wrote a fixture for. No suite anticipates all of them, and no skill is ever perfect, because the next input can always carry a trap you did not design for.

So the eval is not a thing you finish. It is a thing you grow. Every time the skill fails in production in a way your gate did not catch, that failure is the most valuable fixture you will ever get: a real trap, bought at the price of one mistake, that you encode so it can never pass again. The harness gets smarter from reality. A gate that never changes after launch is one that stopped learning the day you stopped looking.

Try this on your next skill

Before you write a single fixture:

Verify the gold standard before you trust it. The example you ship, the baseline you lock, the "known-good" output you measure against: re-derive it from its sources yourself before you let it stand for anything. The one safeguard that saved me was the one I almost skipped.
Answer the two questions. How open-ended is the output, and what does a wrong answer cost? That tells you whether you need a floor, a floor plus a judge, or just a human in the loop.
Push everything checkable down to a deterministic floor. That includes facts, not just format. Anchor each check to a source the producer can't edit.
Use an LLM judge only for the residue, and treat its verdict as a signal. Diversify the judges, make them adversarial, and never let one grade what code could have graded.
Ship two anchors minimum, one that must pass and one that must fail. If your suite can't catch your own over-strictness, it is testing half the gate.
Plan to grow it. When the skill fails in a way your gate missed, encode that failure as a fixture before you fix anything else. The eval is never finished.

The whole discipline collapses into a few sentences I keep coming back to. Trust checks you can't cheat. Check the checker against work you already know is good. And when you have handed the building to a model and it hands you back something that looks finished, the move that protects you is the boring one: verify the gold standard yourself, because the day your skill gets good enough to fool you is the day you most need a gate, and a habit, that can't be fooled.

At Aivant, we help teams build AI skills they can actually trust, and the evals and harnesses that prove it. If you're shipping something where being quietly wrong is expensive, book a discovery call.

The Eval Was Lying to Me

The belief I held

A producer can't certify itself

The fix: re-derive truth, never trust a producer-written field

What a harness actually is

Fixtures test the skill. Anchors test the gate.

How much eval does a skill actually need?

The annoying truth about the new tooling

The eval is never finished

Try this on your next skill

References & Notes

Want to scope a build for your team?

The belief I held

A producer can't certify itself

The fix: re-derive truth, never trust a producer-written field

What a harness actually is

Fixtures test the skill. Anchors test the gate.

How much eval does a skill actually need?

The annoying truth about the new tooling

The eval is never finished

Try this on your next skill

References & Notes

Want to scope a build for your team?

I Turned 21,000 Lines of Code Into 43 Files.

Code or SDK: When You Actually Need the Agent SDK

Why I Don’t Let My AI Agents Plan