After instrumenting every step of our agent pipeline, we found the LLM accounted for a small fraction of the total latency. The unmeasured parts were where the time went.
The Model Is No Longer the Slow Part
The inference landscape has shifted. Open models now achieve near-frontier quality at extraordinary serving speeds. OpenAI's GPT-OSS 120B uses a mixture-of-experts architecture that activates only about 5 billion parameters per token; it fits on a single 80 GB GPU and serves at over 1,000 tokens per second on optimized hardware. This is a model that competes with o4-mini on reasoning benchmarks, running at speeds that make the LLM step nearly invisible in a pipeline.
Providers like Cerebras, Groq, and Fireworks AI push speeds even higher across their model catalogs. Routing services like OpenRouter automatically select the fastest available provider. At over 1,000 tokens per second, generating a 130-token response (roughly 100 words) takes about 130 milliseconds of generation time. Even accounting for network overhead and time to first token, the full LLM step comes in under 300 milliseconds.
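The arithmetic is worth making explicit. Here is a back-of-the-envelope budget for a single LLM call; the time-to-first-token and network numbers are illustrative assumptions, not measurements:

```python
def llm_step_ms(tokens: int, tokens_per_sec: float, ttft_ms: float, network_ms: float) -> float:
    """Rough latency budget for one LLM call: time to first token,
    plus generation time, plus fixed network overhead."""
    generation_ms = tokens / tokens_per_sec * 1000
    return ttft_ms + generation_ms + network_ms

# 130-token response at 1,000 tok/s; TTFT and network figures are illustrative.
print(llm_step_ms(tokens=130, tokens_per_sec=1000, ttft_ms=120, network_ms=40))  # 290.0
```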
Where the Time Actually Goes
On one project involving browser state interpretation, the system received a text representation of the DOM and used an LLM to make decisions based on it. The natural assumption was that the LLM step was the slow part. After adding instrumentation, it turned out that extracting the DOM from the browser was averaging around 350 milliseconds, nearly double the model inference time.
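The instrumentation does not need to be sophisticated. Here is a minimal sketch of the idea; the stage names are hypothetical and the sleeps stand in for the real browser and model calls:

```python
import time
from collections import defaultdict
from contextlib import contextmanager

timings = defaultdict(list)

@contextmanager
def timed(stage: str):
    """Record wall-clock milliseconds for one named pipeline stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage].append((time.perf_counter() - start) * 1000)

def extract_dom() -> str:
    """Stand-in for the real browser call (Playwright, CDP, etc.)."""
    time.sleep(0.35)  # simulates the ~350 ms DOM extraction described above
    return "<dom snapshot>"

def call_model(dom_text: str) -> str:
    """Stand-in for the real LLM call."""
    time.sleep(0.18)
    return "click #submit"

with timed("dom_extraction"):
    dom_text = extract_dom()
with timed("llm_inference"):
    decision = call_model(dom_text)

for stage, samples in timings.items():
    avg = sum(samples) / len(samples)
    print(f"{stage}: {avg:.0f} ms over {len(samples)} call(s)")
```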
I have seen the same pattern with MCP tool connections. On one system, the first user message triggered initialization of all MCP tool servers: checking availability, establishing connections, loading schemas. This warm-up step took roughly 4 seconds on a first call that totaled about 5 seconds end-to-end. After pre-warming connections and caching schemas, the full pipeline came down to approximately 400 milliseconds, including complete response delivery.
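The fix is to pay those costs at startup rather than on the request path. A sketch of the idea, assuming hypothetical connect_server and fetch_schema helpers in place of whatever your MCP client library actually exposes:

```python
import asyncio

# Hypothetical helpers: connect_server and fetch_schema stand in for your MCP
# client library's calls; the point is when they run, not how.
SERVERS = ["filesystem", "browser", "search"]
_schema_cache: dict[str, dict] = {}

async def connect_server(name: str) -> None:
    await asyncio.sleep(1.0)  # simulate availability check + handshake

async def fetch_schema(name: str) -> dict:
    await asyncio.sleep(0.3)  # simulate loading the tool schema
    return {"server": name, "tools": []}

async def warm_up() -> None:
    """Run at process startup so the first user message pays none of this cost."""
    async def prepare(name: str) -> None:
        await connect_server(name)
        _schema_cache[name] = await fetch_schema(name)
    await asyncio.gather(*(prepare(name) for name in SERVERS))

asyncio.run(warm_up())  # ~1.3 s total in parallel, and zero on the request path
print(sorted(_schema_cache))
```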
Parallelization
The simplest structural optimization: eliminate unnecessary sequential steps.
Classification and retrieval can often run simultaneously. Post-processing can begin on partial output while generation continues. A six-step waterfall can frequently become a three-stage pipeline with no change in logic, just in timing.
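A minimal sketch of the pattern, assuming classification and retrieval only need the user query and not each other's output; both functions are placeholders for your own steps:

```python
import asyncio

async def classify(query: str) -> str:
    await asyncio.sleep(0.15)  # stand-in for an intent classifier
    return "support_question"

async def retrieve(query: str) -> list[str]:
    await asyncio.sleep(0.20)  # stand-in for a retriever
    return ["doc-1", "doc-2"]

async def handle(query: str) -> tuple[str, list[str]]:
    # Run sequentially these take ~350 ms; run concurrently, ~200 ms,
    # bounded by the slowest step.
    label, docs = await asyncio.gather(classify(query), retrieve(query))
    return label, docs

label, docs = asyncio.run(handle("how do I reset my password?"))
print(label, docs)
```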
Small Wins That Compound
Several smaller optimizations add up:
- Connection pooling. Reusing HTTP connections to model providers avoids the overhead of establishing a new connection on every request. This saves 50 to 80 milliseconds per call (see the sketch after this list).
- Prompt caching. Most providers now support caching of system prompts, so identical prompt prefixes do not need to be reprocessed on every request. If your system prompt is long, this is a significant win.
- Provider routing. Using a service like OpenRouter with a latency-optimized routing policy means you consistently get fast inference without manually benchmarking providers.
- Edge deployment. Moving lightweight routing logic to the edge cuts 20 to 30 milliseconds of network latency for most users.
Together, these can cut total pipeline latency by 30% to 40%.
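Here is a sketch combining connection pooling with provider routing: one long-lived client calling OpenRouter's chat completions endpoint. The routing-preference field shown is an assumption on my part, so check your provider's documentation for the exact name:

```python
import os
import httpx

# One long-lived client = one pool of keep-alive connections; the TCP/TLS
# handshake cost is paid once instead of on every request.
client = httpx.Client(
    base_url="https://openrouter.ai/api/v1",  # assumes OpenRouter; any provider works
    headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
    timeout=30.0,
)

def complete(prompt: str) -> str:
    resp = client.post(
        "/chat/completions",
        json={
            "model": "openai/gpt-oss-120b",
            "messages": [{"role": "user", "content": prompt}],
            # Routing preference is an assumption; consult your provider's docs
            # for the field that selects latency-optimized routing.
            "provider": {"sort": "latency"},
        },
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

# Subsequent calls reuse the pooled connection instead of opening a new one.
print(complete("Summarize connection pooling in one sentence."))
```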
Streaming and Gradual Reveal
With slower models, streaming is essential. Showing tokens as they arrive reduces perceived wait time because the user sees something happening.
With fast models, the dynamic shifts. When generation takes 100 to 200 milliseconds, roughly 90% of the user's wait is the time from sending the request to receiving the first token. Streaming matters less.
This creates a different UX problem. If a long response appears all at once, it can feel jarring. The approach I have found most effective is a gradual reveal: render the response in smooth increments with a flush at the end. This takes tuning, but it makes responses feel fast and natural.
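One way to implement this, sketched for a terminal client; the chunk size, interval, and cap are tuning knobs I made up for illustration, not recommendations:

```python
import sys
import time

def reveal(text: str, chunk_size: int = 12, interval_s: float = 0.02, max_s: float = 0.6) -> None:
    """Reveal an already-complete response in evenly paced chunks, but never
    hold it back longer than max_s: whatever remains is flushed at once."""
    deadline = time.monotonic() + max_s
    for i in range(0, len(text), chunk_size):
        if time.monotonic() >= deadline:
            sys.stdout.write(text[i:])  # flush the remainder in one go
            break
        sys.stdout.write(text[i:i + chunk_size])
        sys.stdout.flush()
        time.sleep(interval_s)
    sys.stdout.write("\n")
    sys.stdout.flush()

reveal("The full response arrived almost instantly, but we reveal it smoothly.")
```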
The Lesson
Measure everything. Fix the slowest part. It is probably not the LLM.