
The Evaluation Gap

By Mathijs Boezer

Last year I built an evaluation pipeline for a client's agent system. Ten LLM judges, three candidate outputs per prompt, thousands of pairwise votes. The system felt rigorous. Democratic, even.

The scores had almost no correlation with what users actually preferred.

This is Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. The concept comes from economics, but it applies directly to AI evaluation today.

Building AI systems got dramatically easier this year. Evaluating them did not. That asymmetry is where most production failures live, and the companies building agents seem to know it. Some of the most revealing announcements this month have not been about new capabilities. They have been about monitoring, permissions, evaluation infrastructure, and incremental trust. When the vendors selling autonomy start shipping guardrails, that tells you where the bottleneck actually is.

The Measure Trap

The LLM-as-judge approach is the default answer right now. Run your agent on a set of inputs, have a panel of language models score the outputs against criteria you define, and let the numbers decide. With enough judges and enough samples, the statistics look solid.
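
In code, that loop is only a few lines, which is part of why it became the default. Here is a minimal sketch, assuming a hypothetical call_llm wrapper around whatever model API you use; nothing in it is a real vendor interface:

```python
import statistics

def call_llm(model: str, prompt: str) -> str:
    """Stand-in for a real model API call; wire in your provider here."""
    raise NotImplementedError

def judge_score(judge: str, task: str, output: str, criteria: str) -> float:
    """Ask one judge model for a 0-10 score against your criteria."""
    prompt = (
        f"Task: {task}\n\nCandidate output:\n{output}\n\n"
        f"Score this output from 0 to 10 against these criteria: {criteria}. "
        "Reply with a single number."
    )
    return float(call_llm(judge, prompt).strip())

def panel_score(judges: list[str], task: str, output: str, criteria: str) -> float:
    """Average a panel of judges -- the 'democratic' setup."""
    return statistics.mean(judge_score(j, task, output, criteria) for j in judges)
```

Every Goodhart risk in this post lives inside judge_score: the criteria string is the proxy, and averaging a panel only makes the proxy look authoritative.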

The problem is that LLM judges have their own preferences. They favor longer responses. They can be swayed by presentation, verbosity, and other irrelevant style features. They are inconsistent across runs. And once you optimize your agent against this panel, you are not improving quality. You are improving performance on a proxy for quality. The distance between the proxy and the real thing grows the harder you optimize.
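
These biases are measurable before they bite. One cheap check I like: correlate the panel's scores with raw output length across an eval run. A strong positive correlation means you are at least partly rewarding verbosity. A self-contained sketch, plain Pearson correlation, no assumptions about your judge setup:

```python
def length_score_correlation(outputs: list[str], scores: list[float]) -> float:
    """Pearson correlation between output length and judge score.
    A value near 1.0 suggests the panel is rewarding length, not quality."""
    n = len(outputs)
    lengths = [float(len(o)) for o in outputs]
    mean_l, mean_s = sum(lengths) / n, sum(scores) / n
    cov = sum((l - mean_l) * (s - mean_s) for l, s in zip(lengths, scores))
    std_l = sum((l - mean_l) ** 2 for l in lengths) ** 0.5
    std_s = sum((s - mean_s) ** 2 for s in scores) ** 0.5
    return cov / (std_l * std_s) if std_l and std_s else 0.0
```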

This is worse than having no metric. A team with no metric knows it is guessing. A team with a confident eval score believes it is measuring. The second team ships faster and is wrong more expensively.

The social science version of this is Campbell's Law: the more a quantitative indicator is used for decision-making, the more it corrupts the process it was meant to monitor. Teaching to the test does not produce better students. Optimizing for LLM judge scores does not produce better agents.

When the Test Is Wrong

In traditional software, tests are ground truth. Code passes or fails, and that result means something. AI systems break this assumption in a way most engineering processes are not designed for.

More than once on real client work, I have hit cases where the AI output was correct and the test case was wrong. Had the pipeline been fully automated, it would have penalized the correct output and promoted an inferior one. Traditional software has bad tests too. But in AI systems, ambiguity is structural rather than incidental, and the output space is open-ended.

This scales beyond individual test cases. METR analyzed SWE-bench Verified results and found that roughly half of test-passing pull requests would not actually be merged by repository maintainers. The benchmark said pass. The humans said no. The Data Agent Benchmark found a similar gap from a different angle: the best agent scored 38% on its first attempt but 69% across 50 tries. That is not evidence of reliability. It is evidence of variance.
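
There is a standard way to make that gap precise: the pass@k estimator from Chen et al. (2021), where n attempts with c successes give pass@k = 1 - C(n-c, k) / C(n, k). The numbers below are illustrative rather than the benchmark's exact scoring, but they show how a 38% single-attempt rate can coexist with near-certain success over 50 tries:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): probability that at
    least one of k samples passes, given c of n attempts were correct."""
    if n - c < k:
        return 1.0  # fewer failures than draws: some draw must pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative: a task solved on 19 of 50 attempts.
print(pass_at_k(50, 19, 1))   # 0.38 -- what a user sees on one try
print(pass_at_k(50, 19, 50))  # 1.0  -- what a 50-try benchmark run sees
```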

Recent reliability research out of Princeton supports this: while capability across frontier models has improved rapidly, consistency scores still range from 30% to 75%. The gap between a model's first attempt and its fiftieth is the gap that kills production deployments.

The Fix Is Upstream

The instinct is to respond to bad evals with better evals: smarter judges, more rubric dimensions, larger sample sizes. But the highest-leverage work is not in measuring quality more precisely. It is in building systems that produce reliable quality in the first place. This is not an argument against evals. It is an argument against treating evals as the primary lever. In my own work, the biggest reliability improvements came not from better metrics but from changes to the system being measured.

Google demonstrated this recently with a Gemini API developer skill, a grounding mechanism that routes the model toward current documentation as a source of truth. Success on a coding benchmark jumped from 28.2% to 96.6%. Not from a smarter model. Not from a better prompt. From giving the model the right context at inference time.
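
Google's mechanism is specific to its own stack, but the underlying move is easy to sketch: fetch the current documentation, put it in front of the model, and instruct it to treat those docs as ground truth. Both functions below are hypothetical stand-ins, not the Gemini API:

```python
def retrieve_docs(question: str) -> str:
    """Stand-in: fetch the relevant, current documentation pages."""
    raise NotImplementedError

def call_llm(prompt: str) -> str:
    """Stand-in for a real model API call."""
    raise NotImplementedError

def grounded_answer(question: str) -> str:
    """Ground the model in current docs at inference time."""
    docs = retrieve_docs(question)
    prompt = (
        "Answer using ONLY the documentation below. If it does not cover "
        "the question, say so instead of guessing.\n\n"
        f"--- DOCS ---\n{docs}\n--- END DOCS ---\n\nQuestion: {question}"
    )
    return call_llm(prompt)
```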

Salesforce arrived at the same conclusion from the enterprise side, saying explicitly that innovation in AI is now happening "at the system level, not the model level." The agents that work in production are not the ones with the best benchmarks. They are the ones embedded in systems that constrain, ground, and correct them.

OpenAI's guidance on agent security makes the same point in different language: defense against failure cannot rely on filtering inputs alone. You have to design the system so the impact of a bad output is contained even when it occurs. That is reliability engineering, not model engineering. You cannot eval your way out of a design problem. Trying to is itself a Goodhart trap.
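
In practice, containment is often as unglamorous as a permission table. Here is a sketch of the idea, with illustrative tool names and a policy that are mine, not OpenAI's: safe tools run freely, consequential ones require explicit approval, and everything else is refused.

```python
from typing import Callable

ALLOWED = {"read_file", "search_docs"}             # reversible, low blast radius
GATED = {"write_file", "run_shell", "send_email"}  # needs a human yes

def run_tool(tool: str, args: dict):
    """Stand-in for actual tool execution."""
    raise NotImplementedError

def dispatch(tool: str, args: dict, approve: Callable[[str, dict], bool]):
    """Bound the impact of a bad output: policy decides, not the model."""
    if tool in ALLOWED:
        return run_tool(tool, args)
    if tool in GATED and approve(tool, args):
        return run_tool(tool, args)
    raise PermissionError(f"tool {tool!r} blocked by policy")
```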

Evaluation Is a Human Problem

If every automated metric eventually falls to Goodhart's Law, the answer is not a better metric. It is faster human judgment. The best evaluation frameworks I have worked with are built around exactly that: they surface the outputs most likely to need review, flag disagreements between automated checks, and structure the evaluation workflow so it scales without losing signal.
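
One routing rule that has served me well: the outputs where automated checks disagree most are the ones a human should see first. A minimal sketch, assuming each output already carries scores from several independent checks:

```python
def review_queue(candidates: list[tuple[str, list[float]]]) -> list[tuple[str, list[float]]]:
    """Order outputs for human review, highest disagreement first.
    Each candidate is (output, scores) with one score per automated check."""
    def spread(scores: list[float]) -> float:
        return max(scores) - min(scores)  # crude but effective disagreement signal
    return sorted(candidates, key=lambda c: spread(c[1]), reverse=True)

# Example: the second output splits the checks, so it goes to the top.
queue = review_queue([("draft A", [8.0, 7.5, 8.2]), ("draft B", [9.0, 3.0, 7.0])])
print([out for out, _ in queue])  # ['draft B', 'draft A']
```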

This matches what the data says about how experienced practitioners actually work. Anthropic's March Economic Index found that the most effective users are more likely to iterate with the model, not to hand off and walk away. Their separate coding trends report puts a number on it: even among developers who say AI is involved in roughly 60% of their work, only 0 to 20% of tasks are fully delegated. The rest requires supervision, correction, and judgment.

OpenAI appears to agree in practice. They recently disclosed that they monitor nearly all internal coding-agent deployments, reviewing tens of millions of trajectories. They found models can be overly eager to work around restrictions in pursuit of the user's goal. If OpenAI does not fully trust agentic coding inside their own walls, the technology is not at the point where anyone else should either.

The best framing I have seen from any vendor came from Salesforce: "Autonomy is granted incrementally." That is not a marketing line. It is an operating principle. And the right one.

The teams that win will not be the ones with the highest eval scores. They will be the ones who built systems whose failures are visible, bounded, and recoverable.