A team at Vercel took an AI agent that was getting about four out of five tasks right, deleted most of its tools, and it started getting all five. Same model. They just gave it less to work with.
The agent is d0, their internal text-to-SQL agent. The old version had around fifteen specialized tools: one to load the catalog, one to find join paths, one to validate syntax, one to finalize the plan, and so on. The rebuild threw almost all of them out and replaced them with a shell and a SQL runner. The agent now browses the data model with the same grep and cat any engineer would use, then runs a query. Two tools where there had been fifteen.
On the same model, Claude Opus 4.5 before and after, Vercel reported the rebuild was about 3.5 times faster, took roughly 40 percent fewer steps, and used about a third fewer tokens. Take the exact figures lightly: it is their own benchmark, five queries, and the rebuild changed the whole architecture, not just the tool count. But the shape of it keeps turning up. An agent with less to choose from did more.
The instinct is backwards
When an agent underperforms, the reflex is to add. Another tool. Another MCP server. Another capability it seemed to be missing. The whole MCP marketplace points the same way: thousands of servers, an npm for agents, connect everything and let the model sort out what to use.
The thing slowing the agent down is often the size of the menu. And the real lever is not how many tools you own. It is how many the model has to look at when it picks, and whether it can tell them apart.
What the model actually sees
Every tool you expose is a description the model reads before it acts. It sits in the context window, which is everything the model has to take in before choosing. The more tools there are, the more it holds at once, and the more they blur together.
That blur is measurable. Microsoft Research surveyed 1,470 MCP servers and found 775 tools that shared a name with a tool on another server. "search" showed up across 32 different servers. "get_user" appeared eleven times. One server alone exposed 256 tools. Wire a handful of those into one agent and ask it to pick the right one, and it often will not.
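You can run the same kind of check on your own setup before wiring servers together. The sketch below is a minimal illustration, not the survey's method: the server names and tool names are hypothetical, and it only catches exact name matches, not tools that merely sound alike.

```python
from collections import defaultdict


def find_name_collisions(servers: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each tool name exposed by more than one server to the servers exposing it."""
    owners = defaultdict(set)
    for server, tool_names in servers.items():
        for name in tool_names:
            owners[name].add(server)
    # Keep only names that appear on two or more servers.
    return {name: sorted(found) for name, found in owners.items() if len(found) > 1}


# Hypothetical servers and tool names, for illustration only.
SERVERS = {
    "crm": ["search", "get_user", "create_contact"],
    "wiki": ["search", "get_page"],
    "billing": ["get_user", "list_invoices"],
}

collisions = find_name_collisions(SERVERS)
for name, found in sorted(collisions.items()):
    print(f"{name}: exposed by {', '.join(found)}")
```

Any name this prints is one the model has to disambiguate from its description alone.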
There is a token tax too. Anthropic measured an ordinary five-server setup spending around 55,000 tokens on tool definitions before the agent does any work, and has seen definitions alone run to 134,000 tokens. You pay that on every call.
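The arithmetic is worth doing once. This is a rough sketch that assumes the full definitions are resent on every model call with no caching, and that a task takes twenty calls; the twenty is an assumption for illustration, not a measured figure.

```python
def definition_overhead(definition_tokens: int, calls: int) -> int:
    """Total tokens spent on tool definitions across a number of model calls."""
    return definition_tokens * calls


FIVE_SERVER_SETUP = 55_000  # tokens of tool definitions, the figure cited above
CALLS_PER_TASK = 20  # assumed number of model calls in one task

overhead = definition_overhead(FIVE_SERVER_SETUP, CALLS_PER_TASK)
print(f"{overhead:,} tokens spent on definitions alone")  # 1,100,000
```

Under those assumptions, over a million tokens go to describing tools before counting anything the agent actually reads or writes.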
The picking itself degrades as the pile grows. In one study, an agent handed every tool at once chose the right one about 14 percent of the time; given a small relevant subset first, 43 percent. Accuracy held up with fewer than about thirty tools, wobbled through the middle, and fell away past a hundred. It is the same failure as long-context degradation, now pointed at actions. More definitions to read, more chances to read the wrong one.
The vendors say it out loud
Anthropic's own guidance on building tools says, plainly, that "more tools don't always lead to better outcomes," that "too many tools or overlapping tools can also distract agents from pursuing efficient strategies," and that you should "build a few thoughtful tools" instead of wrapping every API endpoint. OpenAI's function-calling guide says to "keep the number of initially available functions small for higher accuracy," and to aim for fewer than twenty at a time.
GitHub trimmed Copilot's default toolset from about forty tools to thirteen and reported a 2 to 5 point gain on coding benchmarks across two different models, plus lower latency. Their line: "Too many tools doesn't always make it smarter. Sometimes it just makes it slower."
Count is the symptom, not the cause
To be clear about the evidence: no clean independent study has isolated tool count by itself. The vendor numbers bundle the count change with other changes, and the academic work tangles it with context length and retrieval. Count is the visible symptom. The cause is how many tools the model sees at the moment it chooses, and whether it can tell them apart.
That is the useful part, because it means you do not have to throw tools away. You can own a thousand and still put a handful in front of the model when it decides. Retrieval does this. So does loading tool definitions on demand instead of all upfront, which in one Anthropic example took a workflow from 150,000 tokens to 2,000. Deleting tools, or keeping many but showing few, lands in the same place. What breaks agents is wiring everything into one context at once, not the marketplace existing.
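Loading on demand can be as simple as showing the model a list of names first and handing over a full definition only when it asks for one. A minimal sketch, assuming a hypothetical three-tool catalog and hypothetical helper names (list_tool_names, load_definition); it counts characters as a crude stand-in for tokens, and it is not Anthropic's implementation.

```python
import json

# Hypothetical full tool definitions; a real system might hold thousands.
FULL_DEFINITIONS = {
    "crm.search_contacts": {
        "description": "Search CRM contacts by name or email address.",
        "parameters": {"query": {"type": "string"}, "limit": {"type": "integer"}},
    },
    "billing.list_invoices": {
        "description": "List invoices for a customer, newest first.",
        "parameters": {"customer_id": {"type": "string"}},
    },
    "wiki.get_page": {
        "description": "Fetch the text of a wiki page by its title.",
        "parameters": {"title": {"type": "string"}},
    },
}


def list_tool_names() -> list[str]:
    """The cheap index the model sees upfront: names only."""
    return sorted(FULL_DEFINITIONS)


def load_definition(name: str) -> dict:
    """The full definition, fetched only when the model asks for this tool."""
    return FULL_DEFINITIONS[name]


# Size of the context spent on tools, in characters (a rough proxy for tokens).
upfront = len(json.dumps(FULL_DEFINITIONS))
on_demand = len(json.dumps(list_tool_names())) + len(
    json.dumps(load_definition("billing.list_invoices"))
)
print(f"everything upfront: {upfront} chars, on demand: {on_demand} chars")
```

With three tools the saving is small. The gap widens with the size of the catalog, because the on-demand cost grows with the names rather than with every full definition.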
What to do instead
Three things, roughly in order.
Consolidate. Replace one-tool-per-endpoint with a few that match what the agent is actually trying to do. Block rebuilt its Linear integration from more than thirty tools, one per API call, down to two: run a read query, run a write query. A question that used to take four tool calls now takes one. Vercel went further. The tool became a shell.
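Here is roughly what the two-tool shape looks like. This is a sketch under stated assumptions, not Block's code: it uses an in-memory SQLite database as a stand-in backend, the issues table and both function names are hypothetical, and the read guard is a naive prefix check.

```python
import sqlite3

# In-memory SQLite stands in for the real backend; the schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE issues (id INTEGER PRIMARY KEY, title TEXT, status TEXT)")


def run_write_query(sql: str, params: tuple = ()) -> int:
    """Execute a statement that modifies data and return the rows affected."""
    with conn:  # commits on success, rolls back on error
        cursor = conn.execute(sql, params)
    return cursor.rowcount


def run_read_query(sql: str, params: tuple = ()) -> list[tuple]:
    """Execute a SELECT and return all rows; refuse anything else."""
    # Naive guard for illustration: it also rejects valid reads such as CTEs.
    if not sql.lstrip().lower().startswith("select"):
        raise ValueError("run_read_query only accepts SELECT statements")
    return conn.execute(sql, params).fetchall()


run_write_query("INSERT INTO issues (title, status) VALUES (?, ?)", ("Fix login", "open"))
run_write_query("INSERT INTO issues (title, status) VALUES (?, ?)", ("Update docs", "done"))

# One call answers a question that would otherwise need list, filter, and fetch tools.
open_issues = run_read_query("SELECT title FROM issues WHERE status = ?", ("open",))
print(open_issues)
```

The design choice is that the query language carries the variety, so the model learns two tool signatures instead of thirty.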
Scope per decision. Do not lay every tool on the table at once. Group them by service so the names stop colliding, retrieve the relevant few per task, or load definitions only when they are needed. The system can hold a lot. Each decision should see a little.
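A minimal sketch of scoping, assuming a hypothetical catalog whose names carry a service prefix and a deliberately crude selector that ranks tools by word overlap with the task. Real systems usually use embeddings or a search index for this step; the keyword match is swapped in only to keep the example self-contained.

```python
import re

# Hypothetical catalog. Prefixing each name with its service keeps "search" from colliding.
CATALOG = {
    "github.search_issues": "Search issues and pull requests in a repository.",
    "github.create_issue": "Open a new issue in a repository.",
    "slack.search_messages": "Search messages in Slack channels.",
    "slack.post_message": "Post a message to a Slack channel.",
    "calendar.create_event": "Create a calendar event with attendees.",
}

STOPWORDS = {"a", "an", "the", "to", "in", "and", "with"}


def _words(text: str) -> set[str]:
    """Lowercase alphabetic words in the text, minus common stopwords."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS


def select_tools(task: str, k: int = 2) -> list[str]:
    """Return up to k tool names whose name and description best overlap the task."""
    task_words = _words(task)
    scored = []
    for name, description in CATALOG.items():
        overlap = len(task_words & _words(name + " " + description))
        scored.append((overlap, name))
    # Highest overlap first; ties are broken alphabetically by name.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for overlap, name in scored[:k] if overlap > 0]


visible = select_tools("post a message to the team slack channel")
print(visible)
```

The catalog holds five tools, but the model choosing an action for this task would only be shown the ones in visible.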
Describe before you delete. Often the problem is not the number of tools but that the model cannot tell what they are for. Anthropic reached state of the art on a coding benchmark by refining tool descriptions alone. But refining descriptions is real work, and it can backfire: one study found that rewriting descriptions lifted success by about six points while adding two-thirds more steps and making one case in six worse. Curation is not free.
Closing
I have argued before that every tool you give an agent is an attack surface. It is also a place the agent can go wrong. Each tool is one more thing it has to understand, choose between, and not misfire. Agentic search beats a library of endpoints for the same reason a shell beats fifteen wrappers: the model does more with a few sharp things it can do than with a wall of options.
When an agent is failing, the next move is usually not another tool.