Most Agent Failures Are Boring

By Mathijs Boezer

Andon Labs put a Gemini 3.1 Pro agent named Mona in charge of a Stockholm cafeteria last month.

Mona's authority was real. The agent negotiated with suppliers, placed orders, managed permits, listed jobs on LinkedIn and Indeed, screened resumes, set up internet and utility contracts, and wrote to staff on Slack. It also impersonated a human Andon Labs employee on the alcohol license application. The cafe had a budget of over $21,000. Humans brewed coffee and served customers.

Four weeks in, Mona had spent most of the budget and generated $5,700 in sales.

The inventory: 6,000 napkins, 3,000 nitrile gloves, 22.5 kilograms of canned tomatoes (used in nothing on the menu), 120 eggs (the kitchen has no stove), nine liters of coconut milk, and four first-aid kits. Mona missed daily bakery deadlines often enough that sandwiches came off the menu. In one 48-hour window the agent placed ten separate disposables orders, paying $106 in fees that a single order would have avoided.

It also Slacked staff at midnight. Quoted messages include "You're a legend!" and "GOAT of inventory tracking." Sweden does not appreciate this.

The story is funny. The failures are not new.

The Pattern

Andon Labs has done this before. In March 2025, Anthropic and Andon ran Project Vend. A Claude Sonnet 3.7 agent named Claudius ran a mini-fridge shop in Anthropic's San Francisco office. Claudius sold tungsten cubes at a loss, gave away discounts to its entire customer base, hallucinated a non-existent Andon employee named Sarah, briefly claimed to be a human wearing a blue blazer, and lost roughly $200 of a $1,000 starting balance over a month.

Same lab. Same family of failures. New specifics.

Most agent deployments fail in private. Production agent failures rarely make the news because they are wrapped in NDAs, absorbed quietly, or fixed before they get loud. The Andon experiments are among the few being run publicly enough to surface what happens when frontier agents are given real authority over real operations. That visibility is what makes this article possible.

Most agent failures are boring. They are recognizable, predictable, and architecturally addressable. The dramatic ones get the press. The boring ones meet you in production.

Four Categories of Boring Failure

Mona's failures sort cleanly. Each one maps to a different layer of the stack: the state the agent reads before acting, the objective it optimizes, the priors it brings, and the consequences it models. Different layers, different fixes.

Ungrounded action. The agent acts before verifying reality. Six thousand napkins is a defensible order if you never ask how many napkins a small cafeteria actually uses in a week. Coconut milk is a defensible purchase if you never check whether anything on the menu calls for it. 120 eggs is a defensible purchase if you never notice the kitchen has no stove.

The KAMI benchmark (arxiv 2512.07497) calls this "premature action without grounding." The tool-call-hallucination literature (arxiv 2412.04141) calls it tool usage hallucination. The call is syntactically valid, the supplier is real, but the parameters do not match the world.

Fix: a pre-flight check. Before any procurement action, force the agent to read state. What was ordered last week, what is on the menu, what equipment exists. This is the knowledge-primitive argument applied to operations.
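A minimal sketch of such a pre-flight check, in Python. Everything here is hypothetical: the CafeState and OrderRequest types, the preflight function, and the four-week stock ceiling are illustrative names and numbers, not anything Andon Labs or Mona actually ran. The point is only that the order cannot proceed until the agent has read the menu, the equipment list, and a consumption baseline.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CafeState:
    """Hypothetical snapshot the agent must read before any procurement action."""
    menu_ingredients: set        # ingredients used by at least one menu item
    equipment: set               # e.g. {"espresso machine", "fridge"}
    weekly_usage: dict           # item -> observed units consumed per week
    on_hand: dict = field(default_factory=dict)  # item -> units in stock


@dataclass
class OrderRequest:
    item: str
    quantity: float
    requires_equipment: Optional[str] = None  # e.g. "stove" for raw eggs
    is_ingredient: bool = True                # False for napkins, gloves, etc.


def preflight(order: OrderRequest, state: CafeState,
              max_weeks_of_stock: float = 4.0) -> list:
    """Return grounding problems; an empty list means the order may proceed."""
    problems = []
    if order.is_ingredient and order.item not in state.menu_ingredients:
        problems.append(f"{order.item} is not used by anything on the menu")
    if order.requires_equipment and order.requires_equipment not in state.equipment:
        problems.append(
            f"{order.item} needs a {order.requires_equipment}, which the kitchen lacks")
    usage = state.weekly_usage.get(order.item)
    if usage is None or usage <= 0:
        problems.append(f"no weekly consumption baseline for {order.item}")
    else:
        weeks = (state.on_hand.get(order.item, 0.0) + order.quantity) / usage
        if weeks > max_weeks_of_stock:
            problems.append(
                f"{order.item}: order would leave {weeks:.1f} weeks of stock")
    return problems


# Illustrative numbers only; the real cafe's usage figures are not public.
state = CafeState(
    menu_ingredients={"coffee beans", "milk"},
    equipment={"espresso machine", "fridge"},
    weekly_usage={"napkins": 500, "milk": 40},
    on_hand={"napkins": 200},
)
napkin_problems = preflight(OrderRequest("napkins", 6000, is_ingredient=False), state)
egg_problems = preflight(OrderRequest("eggs", 120, requires_equipment="stove"), state)
milk_problems = preflight(OrderRequest("milk", 40), state)
```

With these made-up baselines, the napkin order is flagged for overstock, the egg order is flagged three times over, and the milk order passes. The check is deliberately dumb: it does not decide what to buy, it only refuses to let the agent act on parameters it never compared to the world.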

Unbounded optimization. The agent satisfies the letter of the goal while violating its spirit. "Avoid stockouts" produces 6,000 napkins. "Keep the menu profitable" produces sandwiches removed entirely. "Place an order now if any item is low" produces ten disposables orders in 48 hours.

This is Goodhart's law. The specification-gaming literature (arxiv 2502.13295) shows reasoning models exploiting graders by default when goals conflict with rules. The mechanism is the same: optimize a proxy until the proxy is satisfied. The actual goal is incidental.

Fix: a judge and a budget. Stripe Agent Toolkit and Bedrock AgentCore Payments default to $100/month spend caps per provider. Claude Code's auto mode pauses after three consecutive denials or twenty per session. The first time Mona ordered six thousand of anything, a $200/week per-SKU cap would have stopped it before the order shipped.
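The budget half of that fix is small enough to sketch. The class below is a hypothetical per-SKU weekly envelope, using the $200/week figure from the paragraph above as its default. SkuBudget, BudgetExceeded, and authorize are invented names for illustration; this is not the Stripe or Bedrock API, just the shape of the check.

```python
from collections import defaultdict
from datetime import date


class BudgetExceeded(Exception):
    """Raised before an order is placed, not after it ships."""


class SkuBudget:
    """Hypothetical per-SKU weekly spend cap."""

    def __init__(self, weekly_cap: float = 200.0):
        self.weekly_cap = weekly_cap
        # (sku, ISO year, ISO week) -> dollars already authorized
        self._spent = defaultdict(float)

    def authorize(self, sku: str, amount: float, on: date) -> float:
        """Record the spend and return remaining headroom, or raise."""
        iso = on.isocalendar()
        key = (sku, iso[0], iso[1])
        if self._spent[key] + amount > self.weekly_cap:
            raise BudgetExceeded(
                f"{sku}: ${amount:.2f} would exceed the "
                f"${self.weekly_cap:.2f} weekly cap")
        self._spent[key] += amount
        return self.weekly_cap - self._spent[key]


budget = SkuBudget()
remaining = budget.authorize("napkins", 150.0, date(2025, 3, 3))

try:
    budget.authorize("napkins", 60.0, date(2025, 3, 5))  # same ISO week
    second_order_blocked = False
except BudgetExceeded:
    second_order_blocked = True

# The envelope resets when the ISO week rolls over.
next_week_remaining = budget.authorize("napkins", 60.0, date(2025, 3, 10))
```

Keying the ledger on the ISO week is one design choice among several; a rolling seven-day window would be stricter. Either way the cap sits outside the model, so the agent cannot talk itself past it.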

Missing world model. The agent lacks priors humans take for granted. Humans do not Slack staff at midnight. Humans do not order eggs for a kitchen with no stove. Humans, running short on bread, know that bakeries close at five.

The "world alignment" literature (WALL-E, arxiv 2410.07484) names this gap explicitly: the agent's prior does not match the environment's dynamics. The Andon CEO attributed Mona's procurement misses to context-window exhaustion. The deeper issue is that nothing in Mona's training tells the agent that "ten disposables orders in two days" is a category error rather than ten correct decisions.

Fix: a domain-scoped operational primer, close to what Anthropic ships as Skills. Working hours, vendor cutoffs, weekly consumption baselines, what "normal" looks like in this specific cafeteria. Working hours are software, not a personality trait. The agent does not need to discover from scratch that humans dislike midnight Slacks.
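"Working hours are software" can be taken literally. A rough sketch, assuming an 08:00 to 17:00 weekday schedule (the real cafeteria's hours are not given) and naive local datetimes: instead of sending a staff message immediately, the agent asks when it is allowed to send. The constants and the next_send_time function are hypothetical.

```python
from datetime import datetime, time, timedelta

# Assumed schedule for illustration; a real primer would load this per site.
WORK_START = time(8, 0)
WORK_END = time(17, 0)


def next_send_time(now: datetime) -> datetime:
    """Return `now` if it falls in weekday working hours, else the next opening.

    Expects naive local datetimes; time zone handling is left out of the sketch.
    """
    candidate = now
    if candidate.time() >= WORK_END:
        # Too late today: defer to tomorrow morning.
        candidate = datetime.combine(candidate.date() + timedelta(days=1), WORK_START)
    elif candidate.time() < WORK_START:
        # Too early: defer to this morning's opening.
        candidate = datetime.combine(candidate.date(), WORK_START)
    while candidate.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        candidate = datetime.combine(candidate.date() + timedelta(days=1), WORK_START)
    return candidate


midnight_tuesday = next_send_time(datetime(2025, 3, 4, 0, 0))
late_friday = next_send_time(datetime(2025, 3, 7, 23, 30))
midday_wednesday = next_send_time(datetime(2025, 3, 5, 10, 15))
```

The midnight message is held until 08:00 the same day, the late Friday message until Monday morning, and the midday message goes out unchanged. Vendor cutoffs such as a bakery deadline could be encoded the same way, as data the agent consults rather than a norm it has to infer.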

Unmodeled consequences. The agent does not simulate the second-order effects of its actions. Removing sandwiches from the menu cascaded into customer expectations and reputational drift. Over-ordering tomatoes cascaded into storage cost and spoilage. Emergency bakery orders cascaded into fee compounding. Each individual action looked locally rational. The sequence destroyed the cafe.

The side-effects literature in RL (Krakovna et al., arxiv 2010.07877; Attainable Utility Preservation, arxiv 2006.06547) has been working on this for five years without producing a deployable pattern. Anthropic's Project Vend write-up explicitly does not propose a fix. This is the under-tooled category.

The closest architectural fix today is forced reflection: require the agent to enumerate downstream impacts before executing a high-cost action, and have a second model judge the enumeration. It is more expensive than the action itself. For fiduciary contexts, it is probably worth it. Real progress here would be a benchmark for sequence-level reasoning under fiduciary constraint, paired with a reference implementation of cascade-aware planning. Neither exists yet.
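The control flow of forced reflection can be sketched even though the hard part, a judge that actually understands cascades, cannot. In the Python below, ProposedAction, reflection_gate, the $500 threshold, and the three-impact minimum are all hypothetical, and the judge is a plain callable standing in for a second model call. This shows the gate only, not a solution to the open problem.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ProposedAction:
    description: str
    cost: float
    # Downstream effects the acting agent must enumerate before executing.
    impacts: List[str] = field(default_factory=list)


# Stand-in for a second model that reviews the enumeration.
Judge = Callable[[ProposedAction], bool]


def reflection_gate(action: ProposedAction, judge: Judge,
                    high_cost_threshold: float = 500.0,
                    min_impacts: int = 3) -> bool:
    """Return True if the action may execute."""
    if action.cost < high_cost_threshold:
        return True   # cheap actions skip the expensive review
    if len(action.impacts) < min_impacts:
        return False  # no enumeration, no execution
    return judge(action)


def approve_all(action: ProposedAction) -> bool:
    """Toy judge for the example; a real one would be a model call."""
    return True


def reject_all(action: ProposedAction) -> bool:
    return False


cheap = ProposedAction("restock milk", cost=40.0)
unreflected = ProposedAction("bulk tomato order", cost=900.0)
reflected = ProposedAction(
    "bulk tomato order", cost=900.0,
    impacts=["storage space consumed", "spoilage risk", "budget left for bread"],
)

cheap_ok = reflection_gate(cheap, approve_all)
unreflected_ok = reflection_gate(unreflected, approve_all)
reflected_ok = reflection_gate(reflected, approve_all)
reflected_rejected = reflection_gate(reflected, reject_all)
```

The gate guarantees only that an enumeration exists and that someone other than the acting agent looked at it. Whether the enumeration is any good is exactly the part the literature has not solved.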

Local rationality, global ruin.

What's Missing

Three of the four categories are shippable today. Pre-flight grounding, budget envelopes, judge models, working-hours guardrails, scoped authority. None of these are research problems. They are software.

This is not only an operations-agent pattern. The Cursor agent that deleted a Railway production volume in nine seconds last month and the Replit agent that wiped a customer database during an explicit code freeze last July are the same failure mode in a different domain. Coding agents, operations agents, customer support agents. Real authority. No architectural layer.

The honest question is why these patterns are not already shipped at scale. The answer is probably that they trade off against the demo. A pre-flight-checking, judge-gated, budget-throttled agent is slower and looks less magical. The architectural fixes are known. The product incentive to deploy them is not.

If you only ship one of these first, ship the budget envelope. It is the cheapest to implement and catches the failures with the largest dollar tail.

The fourth category, unmodeled consequences, is genuinely open. For agents that move money or affect operations, the gap between "competent at single actions" and "competent at sequences of actions in a real environment" is where the next round of research lives. It is also where most production agent deployments will discover, on contact, that they have a problem.

Closing

The dramatic failures get the press. A database wiped during a code freeze. A production volume deleted in nine seconds.

The boring failures are the ones in production. Six thousand napkins. Missed bakery deadlines. Midnight Slacks. Most of a $21,000 budget spent against $5,700 in sales.

Three of these have shipping patterns. The fourth is the next round of research. None of them need a smarter model.

Build for the boring failures.