Open Weights, Closed Door

I have an RTX 5090 and 64 gigabytes of RAM. The best open-weights model released this month is free, MIT-licensed, and sitting on Hugging Face right now. I cannot run it. Neither can you.

That sounds like a hardware complaint. It is actually the whole argument.

GLM-5.2 is 753 billion parameters. The full weights are 1.51 terabytes. The smallest shrunk-down version anyone has produced, a quant, is 217 gigabytes, and it needs around 223 gigabytes of memory to run. Counting the 32 gigabytes on my card and the 64 of system memory, I have 96 in total, and for fast inference only the 32 really count. A 5090 is the top of the consumer line. So the model is open in every sense the license cares about, and closed in the only sense that matters to me: I can read it, modify it, redistribute it, and not run it.
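The arithmetic in that paragraph can be checked in a few lines. This is a minimal sketch using only the figures quoted above. The function names are illustrative, and the two-bytes-per-parameter figure is my assumption (16-bit weights), chosen because it reproduces the 1.51-terabyte size.

```python
# Figures quoted in the text. Function and constant names are illustrative.
PARAMS_BILLION = 753        # GLM-5.2 parameter count
QUANT_RUNTIME_GB = 223      # memory needed to run the smallest quant
GPU_VRAM_GB = 32            # RTX 5090
SYSTEM_RAM_GB = 64

# Assumption: full weights are stored at 16 bits (2 bytes) per parameter.
BYTES_PER_PARAM_FULL = 2.0


def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate weight size in decimal gigabytes."""
    # 1e9 parameters * bytes per parameter / 1e9 bytes per GB
    return params_billion * bytes_per_param


def shortfall_gb(required_gb: float, available_gb: float) -> float:
    """How many gigabytes short a machine is; zero if the model fits."""
    return max(0.0, required_gb - available_gb)


full_gb = weights_gb(PARAMS_BILLION, BYTES_PER_PARAM_FULL)
total_memory_gb = GPU_VRAM_GB + SYSTEM_RAM_GB

print(f"full weights: {full_gb / 1000:.2f} TB")
print(f"total memory: {total_memory_gb} GB")
print(f"short by (VRAM + RAM): {shortfall_gb(QUANT_RUNTIME_GB, total_memory_gb):.0f} GB")
print(f"short by (VRAM only):  {shortfall_gb(QUANT_RUNTIME_GB, GPU_VRAM_GB):.0f} GB")
```

Even counting slow system memory as if it were VRAM, the smallest quant is well over a hundred gigabytes out of reach.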

This is supposed to be the good-news story of 2026. Open weights caught up, and they did, genuinely. On Artificial Analysis's intelligence index, GLM-5.2 scores 51 against Claude Opus 4.8's 56, about five points off the closed frontier. A 27-billion-parameter model that fits on my card scores about 68 percent on SWE-bench Verified, a standard coding test, close enough to the headline numbers that it looks like the gap has closed. "Run the open model yourself, for control and for savings," has never sounded more plausible.

It is still wrong. Not for the old reason, that open models are too weak. For a sharper one: there is a gap between what you can run and what you can trust, and it is widening.

The two sides

One side is what you can run. On a 32-gigabyte card you can run a very good 27-to-35-billion-parameter model: GLM-4.7-Flash, Qwen3-30B. At four-bit they are about 18 gigabytes, they are fast, and on a single, well-scoped task they are close to the frontier. That is real, and it is new.

The other side is what you can trust. A model stopped being an assistant somewhere last year and became an implementation partner: you hand it a multi-step job, the kind you would once have given a junior engineer for an afternoon, and walk away. The bar for that is not "can it fix one bug." It is "can it run for an hour, recover from its own mistakes, and not quietly corrupt the work." That is exactly where the small models fall off a cliff.

The single-issue coding score hides this, because it is the one thing small models do well. The long-horizon scores do not. On Terminal-Bench 2.0, which measures multi-step terminal work, the closed frontier sits around 80 to 85; the 30-billion model you can actually run scores in the 20s to low 30s; the small ones bottom out near zero. METR measures the same thing as time: the length of task a model can finish unattended. The frontier is around five hours. The longest-horizon open model is about twenty-seven minutes, roughly a tenth of the frontier, and it is a 671-billion-parameter model you cannot host.

And no, you do not fix this by fine-tuning a small model on your own workflow. You can teach a 30-billion model your format, your codebase, your conventions. You cannot teach it to recover from a mistake forty steps into a task it has never seen. Long-horizon robustness scales with the model. It is not a skill you graft on.

This is the floor rising. I have argued that you should build for the floor: design around what the model does reliably, not what it does impressively. For serious agentic work the floor has risen to frontier-class, and it has risen above what you can run yourself.

The benchmark you are sold runs on hardware you do not have

A short detour, because it is where most "open model matches the frontier" claims come from. The same 27-billion weights that score 67.8 percent on a fair harness post 88 to 90 percent in the headline. The difference is not the model. It is a forty-seven-thousand-line scaffold, with retries, running on twelve RTX 4090s and spending compute on every task to take the best of many attempts.

That is not weights you download. It is test-time compute and months of engineering, and it bills you on every task in production. The model that fits on your card and the result in the press release are not the same system. When you self-host, you get the weights. You do not get the scaffold.

The models worth trusting need a datacenter

So you reach up a tier, to the open models that actually clear the bar, and the door closes the moment it meets the hardware.

DeepSeek-R1 is 671 billion parameters. It is a mixture-of-experts model, so only a fraction of it fires on any given token. That is cheap to compute and does not help here: every expert still has to sit in fast memory so the model can route to it. Its FP8 weights are about 671 gigabytes, which does not fit on an eight-GPU H100 node (640 gigabytes). You need eight H200s, 1.1 terabytes, a datacenter line item.
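The fit check is simple enough to write down. This is a rough sketch, not a deployment calculator: it compares the weights alone against total node memory, because every expert has to be resident no matter how few fire per token. The 671-gigabyte figure and the 640-gigabyte H100 node come from the text; the per-GPU sizes (80 GB for an H100, 141 GB for an H200) are the published specs for those cards, and the function names are illustrative.

```python
# Weights figure and the 640 GB H100 node total come from the text.
# Per-GPU memory (80 GB H100, 141 GB H200) is the published spec for those cards.
MODEL_WEIGHTS_GB = 671   # DeepSeek-R1 at FP8, about one byte per parameter
GPUS_PER_NODE = 8
GPU_MEMORY_GB = {"H100": 80, "H200": 141}


def node_capacity_gb(gpu: str, count: int = GPUS_PER_NODE) -> int:
    """Total accelerator memory on a node of `count` identical GPUs."""
    return GPU_MEMORY_GB[gpu] * count


def weights_fit(gpu: str, weights_gb: float = MODEL_WEIGHTS_GB) -> bool:
    """True if the weights alone fit.

    Simplification: ignores KV cache and activations, so a real
    deployment needs more headroom than this check implies.
    """
    return weights_gb <= node_capacity_gb(gpu)


for gpu in GPU_MEMORY_GB:
    verdict = "fits" if weights_fit(gpu) else "does not fit"
    print(f"8x {gpu}: {node_capacity_gb(gpu)} GB, weights {verdict}")
```

Because the check leaves out the KV cache, it flatters the smaller node, and the weights still do not fit on it.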

You can cheat it by parking the cold experts in system RAM or on an SSD and paging them in, which is how people "run" these models on one card. The one prosumer machine that could even load this tier was a 512-gigabyte Mac Studio, and it runs the model two ways, both bad. Squeeze it to one-bit and reliability craters: Aider's coding pass rate falls from 71.6 percent at full precision to 55.7 at one-bit. Keep the quality and you get nine tokens per second of prompt processing, which means an eight-thousand-token prompt takes almost fifteen minutes just to read, before the agent has done a thing, in a loop that will feed it far more than eight thousand tokens. Speculative decoding speeds up generation, not that first read, so the wall stays where it is.
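The fifteen-minute figure is just prompt length divided by prompt-processing speed. A minimal sketch, assuming the speed stays constant across the whole prompt (a simplification); the 8,000-token prompt and nine tokens per second come from the text, while the 50,000-token context is a hypothetical I picked to show how the wait scales in an agent loop.

```python
def prefill_seconds(prompt_tokens: int, tokens_per_second: float) -> float:
    """Time to read a prompt before the first output token is generated."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return prompt_tokens / tokens_per_second


PREFILL_TOKENS_PER_SECOND = 9.0   # from the text
PROMPT_TOKENS = 8_000             # from the text
AGENT_CONTEXT_TOKENS = 50_000     # hypothetical, for illustration only

prompt_minutes = prefill_seconds(PROMPT_TOKENS, PREFILL_TOKENS_PER_SECOND) / 60
agent_minutes = prefill_seconds(AGENT_CONTEXT_TOKENS, PREFILL_TOKENS_PER_SECOND) / 60

print(f"{PROMPT_TOKENS} tokens: {prompt_minutes:.1f} minutes before the first token")
print(f"{AGENT_CONTEXT_TOKENS} tokens: {agent_minutes:.1f} minutes before the first token")
```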

The door is not locked. It opens onto a room too slow to work in.

This is the same gap I traced on the phone, one tier up. There the question was on-device versus the cloud. Here it is the workstation versus the datacenter. The shape is identical: the capability you can hold in your hands stops just below the capability you would actually rely on.

Renting the datacenter does not rescue the economics

The obvious move is to rent the cluster instead of owning it. It does not save you money, and the reason is batching.

GLM-5 runs about $1.92 per million output tokens through a hosted provider. Self-hosting that same model on a rented eight-H200 node lands somewhere around $2 to $4 per million output tokens at full utilization, before you have paid a single ops engineer. Read that again: even saturated, the cluster you rent costs more than the endpoint someone else rents. A hosted provider fills every slot on the GPU across thousands of customers at once; your single-tenant node cannot, so it never reaches their price even at full tilt.

And your node is billed by the wall-clock hour whether it serves a billion tokens or zero, while agent traffic is bursty: it spikes against a node, it does not saturate it. That is the spiky, variable load profile that breaks flat cost assumptions. So utilization does not decide whether self-hosting wins on cost. It decides how badly it loses: roughly double the API price at moderate load, several times that when the node sits idle between spikes.
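One simple way to see why utilization only sets the size of the loss: the node's bill is fixed per hour, so the cost per delivered token scales inversely with the fraction of capacity you actually use. That inverse scaling is my simplifying assumption, not a measured curve. The $1.92 API price and the $2 to $4 full-utilization range come from the text; the utilization levels in the loop are illustrative.

```python
API_COST_PER_M = 1.92                    # hosted endpoint, $ per million output tokens
SELF_HOST_FULL_UTIL_PER_M = (2.0, 4.0)   # rented node at full utilization, low and high


def effective_cost_per_m(full_util_cost: float, utilization: float) -> float:
    """Cost per million tokens when only a fraction of capacity is used.

    Assumption: the hourly bill is fixed, so cost per token is the
    full-utilization cost divided by the utilization fraction.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return full_util_cost / utilization


# Illustrative utilization levels, from saturated to mostly idle.
for utilization in (1.0, 0.5, 0.25, 0.1):
    low, high = (effective_cost_per_m(c, utilization) for c in SELF_HOST_FULL_UTIL_PER_M)
    print(
        f"{utilization:>4.0%}: ${low:.2f} to ${high:.2f} per M tokens, "
        f"{low / API_COST_PER_M:.1f}x to {high / API_COST_PER_M:.1f}x the API"
    )
```

Under this model there is no utilization at which the self-hosted line crosses below the API price, because even the best case at 100 percent starts above it.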

So what does self-hosting actually buy you

Three honest answers.

Capability: no. The models smart enough to trust need a cluster you would rent, not own.

Cost: no. The same model is cheaper from someone else's endpoint, at every level of utilization.

Residency and control: yes, but less uniquely than the pitch implies. The reason most teams say they want to self-host is "our data cannot leave." For the large majority, the hosted endpoints already cover that. AWS Bedrock will pin inference to a region and contractually not train on your inputs; Google Vertex offers EU-resident endpoints; the open leaders run on both. That is regional residency and a no-training guarantee with no GPU in the building. What only self-hosting gives you is physical sovereignty: the data never leaves your hardware, no third party at all. That matters when a contract is not enough, because a region-pinned endpoint on a US provider is still legally reachable under the CLOUD Act. That is the real reason a defense or health or finance team air-gaps, and it is a far smaller set of teams than the number who assume they are in it.

So the real choice for serious work is between two hosted doors, not three. A hosted open-model endpoint when the task is well-scoped and residency and cost matter. A closed frontier API when the work is the long-horizon kind you actually have to trust, where five points of intelligence and ten times the task horizon are the difference between done and quietly broken. Self-hosting is the third door, and it is only worth opening when the data is legally forbidden from leaving your building.

The open weights are real, and the best of them are nearly frontier. They are just not, for the work that has started to matter, yours to run. For everyone outside the air-gap, self-hosting is a more expensive way to run a weaker model.

The weights are open. The door is not.
