Every AI system I have built has two performance curves.
The peak is what the model does when it works well. The impressive completion. The demo-worthy output.
The floor is what it does on the tenth run. On the hardest input. On a user prompt that looks almost identical to the one you tested, except for a phrase that shifts the model's behavior in a way you cannot explain. The morning after the provider pushes a minor update you did not notice.
The gap between the two is where production systems live or die.
A lot of what looks like "AI is unreliable" in the industry is really AI being deployed by teams who built for the peak. They watch the demo, see the model nail the task, and build a pipeline that assumes the peak is stable. Then the floor shows up, and they spend the next six months writing guardrails around a system that was never designed to need them.
The alternative fits in a sentence: build dependencies on the floor, and treat the peak as upside.
The Peak Is What Gets You In
The peak is what gets the project funded, the roadmap written, and the launch announced. The floor is what the engineering team discovers at 2 a.m. on a Tuesday three weeks after shipping.
This is the pattern I see repeatedly in architecture reviews. A system designed around a moment of model brilliance, with no graceful degradation for when that brilliance does not arrive. The client saw the demo. The demo worked. The model did the thing. And then in production, it does the thing maybe seven times out of ten, which is worse than what the roadmap promised.
For every LLM call in your pipeline, answer two questions:
1. What does it need to do reliably? Same class of input, same class of output, across a thousand runs, under messy real-world conditions.
2. What does it sometimes do impressively? Moments when the model produces a better answer than you expected.
Build downstream dependencies only on the first. Treat the second as upside you route through verification, or fall back gracefully when it does not arrive.
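Here is a minimal sketch of that split in Python. Every name is hypothetical: call_model stands in for whatever client you use, and TicketTriage for your own output type. The point is the shape, not the specifics: downstream code only ever sees the floor contract.

```python
import json
from dataclasses import dataclass
from typing import Optional

@dataclass
class TicketTriage:
    category: str                   # the floor: downstream code may depend on this
    summary: Optional[str] = None   # the peak: nice when present, never load-bearing

ALLOWED_CATEGORIES = {"billing", "bug", "feature_request", "other"}

def call_model(prompt: str) -> str:
    # Stand-in for your actual LLM client call.
    return '{"category": "billing", "summary": "Customer was charged twice."}'

def triage(ticket_text: str) -> TicketTriage:
    raw = call_model(f"Classify this support ticket and reply as JSON: {ticket_text}")
    try:
        data = json.loads(raw)
        category = str(data.get("category", "")).strip().lower()
        if category not in ALLOWED_CATEGORIES:
            category = "other"          # floor: degrade to a safe default
        summary = data.get("summary")   # peak: keep it if it parsed, ignore it otherwise
    except (json.JSONDecodeError, AttributeError):
        category, summary = "other", None   # floor: a bad completion never crashes the pipeline
    return TicketTriage(category=category, summary=summary)
```

The only field anything downstream is allowed to depend on is category. If summary never shows up, nothing breaks; if it does, it is upside.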
Three Concrete Moves
Build your evals from production failures, not from benchmark suites. For anything with open-ended output, your eval should mirror the floor you actually have, not the peak you want to market. Take your last hundred production failures, cluster them, and build evals around what you find. Benchmark suites flatter the model. Failure clusters tell you what is actually breaking. I wrote about this at length in the evaluation gap.
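A rough sketch of what that can look like, assuming you log failures as JSONL with an input, the bad output, and a short human-written failure_reason attached during triage. The field names and the grouping-by-reason shortcut are illustrative, not a prescription.

```python
import json
from collections import Counter, defaultdict

def load_failures(path: str) -> list[dict]:
    with open(path) as f:
        return [json.loads(line) for line in f]

def build_eval_set(failures: list[dict], top_k: int = 5) -> list[dict]:
    # Group failures by the reason a human attached when triaging them.
    clusters = defaultdict(list)
    for failure in failures:
        clusters[failure["failure_reason"]].append(failure)
    # The biggest clusters are what is actually breaking; those become eval cases.
    biggest = Counter({reason: len(items) for reason, items in clusters.items()}).most_common(top_k)
    eval_cases = []
    for reason, _count in biggest:
        for failure in clusters[reason]:
            eval_cases.append({
                "input": failure["input"],
                "must_not": failure["output"],   # the observed bad behavior to check against
                "tag": reason,
            })
    return eval_cases
```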
Test for floor regression when you upgrade models, not just peak progress. Every new model release is better on benchmarks than the last. That tells you the peak moved. It does not tell you the floor on your specific tasks held. New versions fix some failures and introduce others. Diff your last hundred production failures against the new model and pay attention to what got worse.
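One way to make that diff concrete, as a sketch: run the same failure-derived eval set against both versions and compare pass rates per cluster. run_model and passes are placeholders for your own client and whatever per-case check you already trust.

```python
from collections import defaultdict

def run_model(model: str, prompt: str) -> str:
    # Stand-in for your actual LLM client call against a given model version.
    raise NotImplementedError

def passes(case: dict, output: str) -> bool:
    # Stand-in for your per-case check (schema validation, exact match, judge model).
    raise NotImplementedError

def floor_diff(eval_cases: list[dict], old_model: str, new_model: str) -> dict[str, float]:
    # Per-cluster change in pass rate; negative numbers mean the floor dropped there.
    deltas: dict[str, list[int]] = defaultdict(list)
    for case in eval_cases:
        old_ok = passes(case, run_model(old_model, case["input"]))
        new_ok = passes(case, run_model(new_model, case["input"]))
        deltas[case["tag"]].append(int(new_ok) - int(old_ok))
    return {tag: sum(d) / len(d) for tag, d in deltas.items()}

# regressions = {tag: d for tag, d in floor_diff(cases, "model-v1", "model-v2").items() if d < 0}
```

Anything strongly negative in that output is a regression the benchmark numbers will not show you.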
Give the peak room to surprise you without letting it carry load. If the model sometimes produces something brilliant, that is useful. Log it. Route it through human review for workflows where brilliance matters. Use it to train or prompt future versions. But do not build a pipeline that breaks when the brilliance does not arrive, because often it will not.
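A small sketch of keeping the peak off the critical path: the pipeline returns the floor result either way, and anything that scores unusually well is only logged and queued for review. score_quality and the queue file are hypothetical.

```python
import json
import time

REVIEW_QUEUE = "peak_candidates.jsonl"   # hypothetical destination for human review

def score_quality(output: str) -> float:
    # Stand-in for a heuristic or judge-model score in [0, 1].
    raise NotImplementedError

def handle_output(prompt: str, output: str, floor_result: dict) -> dict:
    if score_quality(output) > 0.9:
        # Capture the surprise for review and future prompting. The return value
        # below is unchanged, so nothing downstream depends on the peak arriving.
        with open(REVIEW_QUEUE, "a") as f:
            f.write(json.dumps({"ts": time.time(), "prompt": prompt, "output": output}) + "\n")
    return floor_result
```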
The Cost
The peak is what closes deals. The floor is what keeps customers. Designing for the floor is slower and harder to celebrate internally, and most teams only learn that it matters through their first production deployment.
Ship the floor. Celebrate the peak.