
Testing AI Systems Is Harder Than Testing Software

AI systems broke the way we test software. In traditional code, you write a test, it passes, you move on. That model does not survive contact with non-deterministic outputs, unreliable ground truth, and biased evaluators.

Non-Determinism Changes Everything

Ask the same question twice and you may get two different answers, both correct. Running a test once tells you very little. You need multiple runs to get a reliable signal.

Consider a coin. You want to know whether it is fair (50/50) or slightly biased (55/45). To reliably detect that bias at 95% confidence, you need roughly 1,000 flips. If the bias is subtler, like 51/49, you need tens of thousands.

The same math applies to AI outputs. You are trying to determine whether a change to your prompt or model improves output quality, and the difference between "better" and "about the same" is often slim. There is no shortcut around the statistics.
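The coin arithmetic can be sketched with the standard sample-size formula for comparing two proportions. A minimal sketch, assuming a two-sided test at α = 0.05 (the function name and default 80% power are my assumptions, not from the article); bumping power to 95% lands near the article's rough figure of 1,000 flips.

```python
import math
from statistics import NormalDist

def flips_needed(p_alt, p_null=0.5, alpha=0.05, power=0.8):
    """Flips required to distinguish a coin with bias p_alt from a fair
    coin (p_null), via the normal-approximation two-proportion formula."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value, two-sided
    z_beta = NormalDist().inv_cdf(power)           # power requirement
    numerator = (z_alpha * math.sqrt(p_null * (1 - p_null))
                 + z_beta * math.sqrt(p_alt * (1 - p_alt))) ** 2
    return math.ceil(numerator / (p_alt - p_null) ** 2)

print(flips_needed(0.55))              # ~800 flips at 80% power
print(flips_needed(0.55, power=0.95))  # ~1,300 at 95% power
print(flips_needed(0.51))              # ~20,000: subtle bias is expensive
```

The denominator is the squared effect size, which is why halving the bias roughly quadruples the required sample. Replace "flips" with "eval runs" and the same curve governs how many samples you need to tell two prompts apart.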

Your Ground Truth Is Probably Wrong

The standard approach is straightforward: curate a test set of inputs with expected outputs, run the model against it, and measure agreement. It has a fundamental flaw.
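The standard approach fits in a few lines. A minimal sketch, where `model` (any callable from input to label) and the toy test set are hypothetical stand-ins:

```python
def accuracy(model, test_set):
    """Fraction of inputs where the model's output matches the expected
    label. `test_set` is a list of (input, expected) pairs."""
    matches = sum(1 for x, expected in test_set if model(x) == expected)
    return matches / len(test_set)

# Toy example: a "model" that uppercases, scored against curated labels.
test_set = [("cat", "CAT"), ("dog", "DOG"), ("bird", "Bird")]
print(accuracy(str.upper, test_set))  # 2/3 -- but is the third label right?
```

Note what the harness cannot tell you: the third "miss" is a label error, not a model error, and the score silently absorbs it.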

I have repeatedly seen cases where the model disagrees with the expected answer and is more correct. On one classification system, the model consistently categorized certain inputs differently than our annotators had. On human review, the model had picked up on a distinction the annotators missed. But the automated evaluation marked every one of those cases as wrong.

You can only measure the model as well as your ground truth allows, and for non-trivial tasks, ground truth is almost always open to interpretation.

LLM-as-Judge Has a Political Problem

One popular workaround is using a language model to judge outputs. Instead of comparing against a fixed expected answer, you ask a model which of two responses is better. This scales well and can capture nuances that exact-match metrics miss.

But LLM judges have systematic biases. Research has documented several: verbosity bias, where longer answers are rated higher regardless of quality; position bias, where the answer presented first tends to be preferred; and self-preference bias, where a model rates outputs from its own family of models more favorably.
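Position bias in particular has a cheap mitigation: judge each pair twice with the order swapped and only count verdicts that survive the swap. A sketch, with `judge` as a hypothetical callable that returns which of the two presented responses it prefers:

```python
def debiased_verdict(judge, prompt, resp_a, resp_b):
    """Run the judge in both orderings; a verdict counts only if it is
    consistent under the position swap, otherwise call it a tie."""
    forward = judge(prompt, resp_a, resp_b)   # returns "first" or "second"
    reverse = judge(prompt, resp_b, resp_a)
    if forward == "first" and reverse == "second":
        return "a"   # preferred A in both orderings
    if forward == "second" and reverse == "first":
        return "b"   # preferred B in both orderings
    return "tie"     # verdict flipped with position: position bias caught

# A judge that always prefers whatever comes first is pure position bias:
always_first = lambda prompt, x, y: "first"
print(debiased_verdict(always_first, "q", "good", "bad"))  # tie
```

The swap doubles the cost per comparison, but it converts an invisible bias into an explicit tie rate you can monitor.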

In practice, this creates a kind of political dynamic. The model that wins in an LLM-judged evaluation is not necessarily the most correct. It is the most convincing, the most fluent, the most agreeable.

The fix is calibration. Run the judge alongside human reviewers on the same set of examples. If they agree 85% or more of the time on your specific task, the judge is useful. Below that, do not trust it.
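Calibration itself is just an agreement measurement. A minimal sketch, assuming judge and human verdicts come as parallel lists of labels for the same examples; the 85% cutoff is the article's rule of thumb:

```python
def judge_agreement(judge_labels, human_labels):
    """Fraction of examples where the LLM judge and human reviewers agree."""
    assert len(judge_labels) == len(human_labels)
    agree = sum(j == h for j, h in zip(judge_labels, human_labels))
    return agree / len(judge_labels)

judge = ["a", "a", "b", "tie", "a", "b", "b", "a", "a", "b"]
human = ["a", "b", "b", "tie", "a", "b", "a", "a", "a", "b"]
rate = judge_agreement(judge, human)
print(f"{rate:.0%} agreement -> {'usable' if rate >= 0.85 else 'do not trust'}")
```

Raw agreement is the simplest possible statistic; on skewed label distributions a chance-corrected measure such as Cohen's kappa is a stricter check.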

Automated Testing as a Regression Detector

The practical role of automated testing is regression detection. Establish a baseline: here is how the system performs today across a set of representative inputs. When you make a change, run the same set and compare. You are not trying to prove the system is good. You are trying to detect when it gets worse.

Invest in the inputs more than the expected outputs. A diverse, representative set of real user inputs is more valuable than a perfectly curated set of expected answers. You can always add human review on top when the numbers shift.
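The regression-detection loop can be sketched in a few lines. Assumptions (hypothetical names, not from the article): per-input scores in [0, 1], a stored baseline from the same input set, and a tolerance chosen for the task:

```python
def check_regression(baseline_scores, new_scores, tolerance=0.02):
    """Compare mean score on the same input set before and after a change.
    Returns (regressed, delta). `tolerance` absorbs run-to-run noise; for
    a rigorous gate, use a significance test over multiple runs instead."""
    baseline = sum(baseline_scores) / len(baseline_scores)
    current = sum(new_scores) / len(new_scores)
    delta = current - baseline
    return delta < -tolerance, delta

regressed, delta = check_regression([0.9, 0.8, 0.85, 0.9],
                                    [0.85, 0.7, 0.8, 0.85])
print(regressed, round(delta, 3))  # the change made things worse
```

Because the comparison is against a baseline rather than an absolute bar, it works even when the expected outputs themselves are imperfect: label errors cancel out across the two runs.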

Model Upgrades Are Not Always Upgrades

It is tempting to assume that a newer model version is better across the board. It usually is on average. But "better on average" is not the same as "better for your specific use case."

On one production system in late 2025, GPT-5 outperformed GPT-5.2 on a specific type of tool use we call skill execution. The newer model was better at almost everything else, but worse at this particular task. GPT-5.4 surpassed both across the board. The pattern is not linear.

A model that excels at structured output might struggle with open-ended reasoning. A model that handles English beautifully might fall apart in another language. Every model upgrade should include manual review on your actual use case.
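"Better on average" hiding "worse on your slice" is exactly why upgrade evals should break scores down by category rather than report one number. A sketch with hypothetical per-category (category, score) results from the same eval set:

```python
from collections import defaultdict

def compare_by_slice(old_results, new_results):
    """Per-category mean-score delta between two model versions.
    Each results list holds (category, score) pairs."""
    def means(results):
        buckets = defaultdict(list)
        for category, score in results:
            buckets[category].append(score)
        return {c: sum(s) / len(s) for c, s in buckets.items()}
    old_m, new_m = means(old_results), means(new_results)
    return {c: round(new_m[c] - old_m[c], 3) for c in old_m}

old = [("summarize", 0.80), ("skill_exec", 0.90), ("skill_exec", 0.80)]
new = [("summarize", 0.90), ("skill_exec", 0.70), ("skill_exec", 0.80)]
print(compare_by_slice(old, new))  # one slice improved, one regressed
```

An aggregate delta near zero here would read as "no change"; the per-slice view surfaces the regression that matters, which is the pattern described above.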

What to Do About All of This

Build a representative test set from real user inputs. Establish a baseline. Run automated comparisons on every change to catch regressions. Use LLM-as-judge for scale, but calibrate it per evaluation dimension, not just per task. And keep humans in the loop as the source of truth that everything else is calibrated against.

AI evaluation is not a problem you solve once. It is a discipline, and the teams that treat it as one ship better systems than the teams with better models.