In my testing, OpenAI Operator takes around 10 minutes to book a flight. Anthropic's computer use agent averages roughly 5 minutes on comparable tasks. A human takes 3 minutes.
That is not a capability problem. It is a latency problem, and in browser agents, it does not add — it compounds.
The Loop
A browser agent does not think once. It cycles: observe the current state of the page, decide the next action, execute it, repeat. A Google Flights search averages 36 steps. A competitive research task across multiple sites can run to 60 or more.
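In code, the loop itself is tiny; the cost is that every iteration blocks on a model call. Here is a minimal sketch, with the observe/decide/execute pieces passed in as callables because none of this is tied to a specific framework:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Action:
    kind: str          # "click", "type", "done", ...
    target: str = ""   # element identifier, text to enter, etc.

def run_agent(
    observe: Callable[[], str],            # serialize current page state to text
    decide: Callable[[str, str], Action],  # one blocking LLM call per step
    execute: Callable[[Action], None],     # perform the action in the browser
    task: str,
    max_steps: int = 60,
) -> int:
    """Observe-decide-execute loop; returns the number of steps used."""
    for step in range(1, max_steps + 1):
        state = observe()
        action = decide(task, state)
        if action.kind == "done":
            return step
        execute(action)
    return max_steps
```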
Every step is a full LLM call. On standard GPU-based inference, that call takes several seconds once you count network overhead, time to first token, and generation; in my benchmark runs it averaged 6.8 seconds. At that rate, the 36-step flight search totals over 4 minutes of inference alone. Human time for the same task: under 3 minutes.
The agent is slower, and it costs more to run.
Task Time Is the Real Metric
Standard benchmarks like WebVoyager measure whether the agent eventually succeeds — not how long it takes. An agent that completes a booking in 10 minutes scores the same as one that completes it in 60 seconds. This is not a useful measure of production readiness.
The relevant comparison is human time on the same task. If the agent is slower, nothing has been automated. The task has just moved to a different queue. A browser agent becomes genuinely useful when it matches human speed with comparable reliability. Below that line: demo. Above it: product.
Three Changes That Got Me There
Getting my implementation to roughly 2 steps per second — 18 seconds of inference time for that 36-step flight search, completing the full task in around 90 seconds including page loads — required three changes.
Replace screenshots with DOM text. Most commercial agents process a screenshot at every step. The model receives an image of the page and decides what to do next. This is the obvious approach and also the expensive one: screenshots are large, vision inference is slow, and provider options are limited to multimodal models.
Here is the counterintuitive part: the impressive capability — visual reasoning over rendered pixels — is not on the path to fast, production-ready agents. That route goes through boring text parsing of the DOM, which eliminates a category of overhead entirely and opens up a much wider choice of inference providers. Done well, this gives the model as much usable information as a screenshot, without the image overhead or the multimodal model requirement. The tradeoff is upfront engineering work: a DOM-to-text pipeline that reliably captures interactive elements, dynamic content, and state changes across the variety of sites you need to handle. Once built, a major source of latency is gone.
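To make that concrete, here is a sketch of the starting point for such a pipeline, assuming Playwright's Python API. It flattens a page's visible interactive elements into numbered text lines the model can reference by index; a production version also needs iframes, shadow DOM, better visibility heuristics, and change detection:

```python
from playwright.sync_api import sync_playwright

# Elements the agent can act on. A real pipeline uses a broader set.
INTERACTIVE = "a, button, input, select, textarea, [role='button'], [role='link']"

def page_to_text(page) -> str:
    lines = []
    for i, el in enumerate(page.query_selector_all(INTERACTIVE)):
        if not el.is_visible():
            continue
        tag = el.evaluate("e => e.tagName.toLowerCase()")
        label = (el.inner_text() or el.get_attribute("aria-label")
                 or el.get_attribute("placeholder") or "")
        lines.append(f"[{i}] <{tag}> {label.strip()[:80]}")
    return "\n".join(lines)

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")
    print(page_to_text(page))
    browser.close()
```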
Instrument before you optimize. After switching to DOM text, I instrumented every step of the pipeline expecting the LLM call to remain the bottleneck. It was not. Pulling and parsing the page state was taking around 350 milliseconds — nearly double the model inference time at that point. The LLM gets all the attention, but the steps around it are where time actually accumulates. This pattern shows up reliably across agent pipelines. Getting the full pipeline under 500ms was an exercise in finding and eliminating those hidden costs one at a time.
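The instrumentation does not need to be sophisticated. A minimal version of what I mean, a context manager that accumulates wall-clock time per pipeline stage (the stage names are illustrative):

```python
import time
from collections import defaultdict
from contextlib import contextmanager

timings = defaultdict(list)

@contextmanager
def timed(stage: str):
    """Record the wall-clock duration of one pipeline stage, in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage].append((time.perf_counter() - start) * 1000)

# Inside the agent loop, wrap each stage:
#   with timed("dom_extract"):
#       state = page_to_text(page)
#   with timed("llm_call"):
#       action = decide(task, state)

def report() -> None:
    for stage, ms in sorted(timings.items()):
        print(f"{stage:>12}: avg {sum(ms)/len(ms):7.1f} ms over {len(ms)} steps")
```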
Route to fast inference hardware. Standard GPU-based API endpoints are built for throughput across many concurrent requests, not for per-request latency: individual requests queue, and generation is slower per token. Groq's LPUs pursue a different goal, minimizing time to first token; for current text models they deliver sub-500ms responses including the network round-trip. Cerebras uses a wafer-scale chip with 44GB of on-die SRAM to eliminate the memory bandwidth bottleneck that slows GPU clusters; their current numbers for OpenAI's GPT-OSS-120B are over 3,000 tokens per second, with time to first token around 280 milliseconds. That model sits on par with GPT-5-mini on general benchmarks and is competitive with GPT-5 specifically on browser use tasks, so routing to fast hardware does not require trading away model quality. For a browser agent running a specific model in a production loop, these providers change the arithmetic entirely.
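Switching is mostly a configuration change where the provider exposes an OpenAI-compatible endpoint, as Groq and Cerebras both do. A minimal sketch; the base URL and model ID below are assumptions to check against the provider's current documentation:

```python
import os
from openai import OpenAI

# Same OpenAI client, different endpoint. Base URL and model ID are
# assumptions to verify against the provider's docs.
client = OpenAI(
    base_url="https://api.cerebras.ai/v1",
    api_key=os.environ["CEREBRAS_API_KEY"],
)

resp = client.chat.completions.create(
    model="gpt-oss-120b",  # assumed model ID on Cerebras
    messages=[
        {"role": "system", "content": "You are a browser agent. Reply with one action."},
        {"role": "user", "content": "PAGE STATE:\n[0] <button> Search flights\n..."},
    ],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```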
The Numbers
At 500 milliseconds per step, the 36-step Google Flights task takes 18 seconds of inference time. Add page loads and the full task completes in around 90 seconds — faster than a distracted human, competitive with a focused one.
The same task on a standard API endpoint at 6.8 seconds per step takes over 4 minutes. That is slower than a human doing it manually, with no reliability advantage. That gap is not incremental. It is the difference between a product and a proof of concept.
The compounding gets worse the longer the task. A 60-step research task at 6.8 seconds per step runs to nearly 7 minutes of inference alone. At 500 milliseconds per step, those same 60 steps take 30 seconds. The more complex the task, the more the hardware choice matters.
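For reference, the arithmetic in one place, using only the step counts and per-step latencies from above:

```python
# Inference time only; page loads excluded. Numbers from the text above.
for steps in (36, 60):
    for per_step in (6.8, 0.5):  # seconds: standard endpoint vs. fast hardware
        total = steps * per_step
        print(f"{steps} steps x {per_step}s = {total:.0f}s ({total / 60:.1f} min)")
```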
Faster Is Also Smarter
Faster inference does not only mean faster answers — it means more reasoning within the same time budget. This is not obvious, and it matters more than it seems.
A slow agent is forced into greedy, linear execution: commit to each action, execute, move on. There is no time budget for anything else. A fast agent operating within the same total task time can do something qualitatively different — take an action, detect that the result is not what was expected, backtrack, try again. It behaves less like a script and more like someone actually paying attention.
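A sketch of what that looks like, extending the earlier loop (same callables, same Action shape) with a crude verify-and-replan step. The surprise check here, comparing serialized page states, is deliberately simple and purely illustrative:

```python
def run_with_recovery(observe, decide, execute, task: str, max_steps: int = 60) -> bool:
    """Observe-decide-execute loop with a verify step: if an action has no
    visible effect, feed the failure back and re-plan instead of blindly
    continuing. Affordable when decide() costs ~0.5s per call, not ~7s."""
    state = observe()
    for _ in range(max_steps):
        action = decide(task, state)
        if action.kind == "done":
            return True
        execute(action)
        new_state = observe()
        if new_state == state:
            # Crude surprise detector: the page did not change. Append the
            # failure to the task context so the next call tries another path.
            task = task + "\n(Note: the last action had no visible effect.)"
        state = new_state
    return False
```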
Cerebras has documented this as a scaling property: a model running significantly faster can explore more reasoning paths within any given latency constraint. For a browser agent making sequential decisions in an environment where mistakes compound — each wrong action leaves the page in a state that makes the next action harder — this is not a small benefit. Speed is what buys you the ability to be uncertain and recover from it. Slower systems cannot afford that uncertainty.
Speed and reliability share the same foundation. The hardware that makes agents faster is the same hardware that makes them less fragile.