Open Weights, Closed Door
I have a 5090 and can't run the best open model. The ones smart enough for serious work need a datacenter, and renting one rarely beats the API.
Notes on AI architecture, agent systems, latency, and building production AI products.
I have a 5090 and can't run the best open model. The ones smart enough for serious work need a datacenter, and renting one rarely beats the API.
Yesterday a product I work on went down in the Anthropic outage. The cheapest redundancy is the same model on a second host, and it costs almost nothing.
Z.ai shipped a million-token context window this week with zero benchmarks. A window is a claim, not a capability. How to measure the one that matters.
On Chatbot Arena your phone's AI looks like it kept up with the frontier. On a real reasoning test, the gap is widening. Here is why, and what it means.
A Vercel team deleted most of its agent's tools and it got more reliable, on the same model. The lever is how many tools the model sees when it picks.
Cursor's pricing change cost it more trust than money. Most SaaS pricing assumptions break under AI's variable cost-of-goods. A taxonomy of what's working.
An AI agent ran a real Stockholm cafeteria. It failed in mundane, recognizable ways. Each maps to architectural patterns we already know how to build.
Anthropic just paid for 220,000 of a competitor's GPUs because their own buildout can't be accelerated. Capacity, not capability, is the 2026 AI frontier.
RAG was robust because the retrieval decision was removed from the model. Memory, Skills, and agentic search hand it back, and demand more in return.
Putting safety rules in the system prompt fails in production. The pattern that works: a written constitution, a separate judge, and a retry loop.
A better model can make your system worse. Production prompts fossilize around old failure modes. A more literal model follows them too literally.
Every AI system has two performance curves. The peak is the demo. The floor is production. Most teams build for the peak and then spend six months firefighting. The alternative is simple, unglamorous, and what actually ships.
Building AI systems got dramatically easier. Evaluating them didn't. That asymmetry, and Goodhart's Law, explain why most agent deployments still fail.
Most browser agent benchmarks measure whether a task completes. The more useful question is whether it completes faster than a human would — and what it takes to get there.
Amazon's AI coding incidents weren't caused by bad tools. They were caused by development processes that weren't redesigned to match the new risk profile.
Voice AI sits at the extreme end of the latency spectrum. The pipeline, the UX, and the failure modes are fundamentally different from chat.
Most agent systems feel slow because they are architected to be slow. The model is rarely the bottleneck anymore, and measuring everything else is where the real gains are.
The gap between a compelling demo and a reliable production system is where most AI projects die. The problems are predictable, and the fixes are systematic.
Most teams asking for an agent really need a workflow. Autonomy has a cost, and the default should be simpler than you think.
Traditional testing gives you pass or fail. AI systems live in a probabilistic space where correctness is a spectrum and ground truth itself is unreliable.
RAG was a breakthrough, but its limitations are clear. When your knowledge base is complex, dynamic, or multi-modal, you need architectures that go further.