A Bigger Window Is Not a Bigger Memory

On June 13, Z.ai shipped GLM-5.2 with a one-million-token context window and not a single benchmark.

No SWE-bench, no coding score, and most tellingly, no long-context test at all. The million-token window is real in the narrow sense that the API will accept that many tokens. You switch it on with a config flag. What nobody published is whether the model can still find the right line at token 600,000.

Its predecessor makes the gap sharper. GLM-5.1 launched in the spring with a full benchmark sheet: a SWE-Bench Pro number, a Terminal-Bench number. Even that sheet had no long-context score. Z.ai knows how to publish benchmarks. It has just never published one for the window it is selling.

A context window is a claim, not a capability. The spec sheet tells you how many tokens fit. It does not tell you how many the model can actually use before it starts missing things.

The window you advertise, the window you can use

Two benchmarks were built specifically to measure the second number, and they agree.

NoLiMa, from Adobe Research, strips the keyword overlap out of long-context tests, so the model has to reason about where the answer is instead of matching a word. It is built to be hard, which makes its numbers a floor, not an average. It defines a model's effective context as the longest input at which the model still holds 85 percent of its short-context accuracy. Past that point, the model has lost more than 15 percent of its ability to find things.
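That definition reduces to a threshold on an accuracy curve. Here is a minimal sketch of one reading of it, not NoLiMa's own code. The function name `effective_context` is hypothetical, the accuracy figures are invented for illustration, and stopping at the first length that misses the bar is my assumption.

```python
def effective_context(accuracy_by_length, base_accuracy, threshold=0.85):
    """Longest tested length that still holds `threshold` of the base accuracy.

    Walks the lengths in ascending order and stops at the first one that
    falls below the bar, so a later recovery does not count (an assumption).
    Returns None if even the shortest tested length fails.
    """
    bar = threshold * base_accuracy
    effective = None
    for length in sorted(accuracy_by_length):
        if accuracy_by_length[length] < bar:
            break
        effective = length
    return effective


# Invented accuracy curve, for illustration only: {input tokens: accuracy}.
curve = {1_000: 0.95, 2_000: 0.91, 4_000: 0.86,
         8_000: 0.80, 16_000: 0.66, 32_000: 0.48}

# The bar is 0.85 * 0.98 = 0.833, which the curve first misses at 8,000 tokens.
print(effective_context(curve, base_accuracy=0.98))  # 4000
```

Whatever window such a model advertised, its effective context under this rule would be 4,000 tokens.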

The effective numbers are nowhere near the advertised windows. Gemini 1.5 Pro advertises a two-million-token window and holds about two thousand. A thousand to one. GPT-4o: 128K claimed, 8K effective. Claude 3.5 Sonnet: 200K claimed, 4K effective. At 32,000 tokens, ten of the twelve models tested had already fallen to half their short-context accuracy or worse.

RULER, from NVIDIA, uses a different method and lands in the same place. GPT-4 and Llama 3.1 70B both advertise 128K and hold up to about 64K. Yi-34B advertises 200K and holds 32K. Across seventeen models, only half were still reliable at 32,000 tokens.

Notice what RULER could not test. It caps out at 128K. There is no rigorous public benchmark that resolves effective context at one million tokens, let alone the ten million Meta advertises for Llama 4 Scout or the hundred million Magic.dev claims. The biggest windows on the market are not measured. Nobody can prove the claim wrong, because nobody has built the test.

There is a plain reason the far end of the window is the weakest. Most long windows are reached by training on much shorter sequences and stretching the window after the fact. The last stretch of a million-token window is the part the model saw least.

Degradation is not uniform, and the favorite test hides it

The drop-off is not a clean line at the end of the window. It bends with three things.

Position. A fact in the middle of a long context is retrieved less reliably than the same fact at the start or the end. The curve is U-shaped, a result old enough to have a name: lost in the middle.

Distractors. Add one plausible-but-wrong passage and accuracy falls. Add more and it compounds. Chroma's context-rot study found this across all eighteen frontier models it tested, including Claude Opus 4 and Gemini 2.5. Chroma sells a vector database, so read them with that interest in mind. The independent academic benchmarks point the same way.

Similarity. When the question shares words with the answer, the model can shortcut to keyword matching. When it does not, it has to reason, and reasoning over long context is where it breaks first.

The test vendors love to quote is the one that tells you the least. Needle-in-a-haystack, hiding one sentence in a long document and asking the model to retrieve it, is keyword matching at the easiest setting. HELMET, a study across 59 models, states plainly that synthetic needle tests do not predict downstream performance. On LongBench v2, a set of realistic long-context tasks, the best model answers about 50 percent of the questions; human experts get 54. A perfect needle score and a coin flip on real work can be the same model.

Even when it works, the window is a bill

A full window costs you for reasons that have nothing to do with quality.

Google charges for it directly: Gemini's input price doubles past 200,000 tokens, with Gemini 3.1 Pro going from $2 to $4 per million. A single 900,000-token call is about $3.60 of input alone, and you pay it again on every call. Anthropic adds no surcharge, but the underlying cost is the same. Filling a context is compute, and the attention part of that compute is quadratic: double the tokens and you quadruple that work. Meta reported prefilling a one-million-token prompt in 77 seconds, and only by spreading one request across 128 GPUs. On a normal endpoint you feel that as latency.
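The arithmetic is easy to check. This sketch uses the $2 and $4 rates and the 200,000-token threshold quoted above. The function names are hypothetical, and billing the whole prompt at the higher rate once it crosses the threshold is an assumption that matches the $3.60 figure. Confirm it against the provider's current pricing page before budgeting on it.

```python
def input_cost_usd(tokens, base_rate=2.00, long_rate=4.00, threshold=200_000):
    """Input cost of one call, in dollars.

    Rates are dollars per million input tokens. Assumes the entire prompt is
    billed at `long_rate` once it exceeds `threshold` (an assumption).
    """
    rate = long_rate if tokens > threshold else base_rate
    return tokens / 1_000_000 * rate


def relative_attention_work(tokens, baseline_tokens):
    """Attention work relative to a baseline, assuming purely quadratic scaling."""
    return (tokens / baseline_tokens) ** 2


print(f"${input_cost_usd(900_000):.2f}")          # $3.60
print(f"${input_cost_usd(150_000):.2f}")          # $0.30
print(relative_attention_work(400_000, 200_000))  # 4.0
```

At a thousand such 900,000-token calls a day, the input line alone is about $3,600 daily, before any output tokens.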

GLM-5.2 ships a quiet admission of the gap. The million-token input window comes with a maximum output of 131,072 tokens, about 13 percent of the input. The window you read into and the window you write out of are different budgets.

So measure it

The number you actually want is on no spec sheet, because it depends on your data and your task. It is your effective context length: the input size at which your success rate drops below what you can tolerate. Measuring it is a task you can hand an engineer: half a day of work, not a project. Here is the recipe.

Test at your real sizes, not the model's maximum. Sweep the input lengths you actually use, 8K, 16K, 32K, 64K, and watch where accuracy falls off. The cliff is usually far below the advertised window.

Put the critical fact at different depths. Start, middle, end. If the middle is worse, you have a positioning problem you can design around.

Use realistic distractors and low-overlap questions. Draw the noise from your own corpus so the answer does not stand out, and phrase the question so the model cannot keyword-match its way to it. That is the difference between a needle test and a real one.
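Depth placement and distractor filling can be combined in one small helper. This is a sketch under stated assumptions, not a standard harness. `build_prompt` and its parameters are hypothetical, length is counted in characters as a rough stand-in for tokens, and the passages, fact and question are invented. In practice the distractors would come from your own corpus.

```python
import random


def build_prompt(fact, question, distractors, target_chars, depth, seed=0):
    """Assemble a test prompt with `fact` placed at a relative depth.

    depth: 0.0 puts the fact first, 0.5 in the middle, 1.0 last.
    distractors: passages drawn from your own corpus, sampled with repetition.
    target_chars: approximate size of the filler, in characters.
    """
    if not distractors:
        raise ValueError("need at least one distractor passage")
    rng = random.Random(seed)  # fixed seed so every run builds the same prompt
    filler, size = [], 0
    while size < target_chars:
        passage = rng.choice(distractors)
        filler.append(passage)
        size += len(passage) + 2  # +2 for the blank-line separator
    position = round(depth * len(filler))
    passages = filler[:position] + [fact] + filler[position:]
    return "\n\n".join(passages) + "\n\nQuestion: " + question


# Invented example: the question shares no key word with the fact.
prompt = build_prompt(
    fact="Mara's flat overlooks the Eiffel Tower.",
    question="Which character lives in Paris?",
    distractors=["Jonas moved to a farm outside Graz.",
                 "Priya rents a room near the harbour in Bergen."],
    target_chars=2_000,
    depth=0.5,
)
```

Calling it at depths 0.0, 0.5 and 1.0 for each input size gives the grid of cases the recipe describes.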

Then score end-to-end success, not survival. "It did not crash at a million tokens" is not a result. "It got the right answer 92 percent of the time at 64K and 60 percent at 256K" is.
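Scoring is a small aggregation once each run is recorded as pass or fail. In this sketch the function names and the sample results are hypothetical. Judging each input size by its worst depth is a deliberately conservative choice of mine, not something the benchmarks above prescribe.

```python
from collections import defaultdict


def success_table(results):
    """Turn (length, depth, passed) records into {length: {depth: success rate}}."""
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for length, depth, passed in results:
        cell = counts[length][depth]
        cell[0] += bool(passed)  # passes
        cell[1] += 1             # attempts
    return {length: {depth: p / n for depth, (p, n) in by_depth.items()}
            for length, by_depth in counts.items()}


def largest_tolerable_length(table, min_success):
    """Longest length whose worst depth still meets `min_success`.

    Stops at the first length that fails. Returns None if none qualifies.
    """
    best = None
    for length in sorted(table):
        if min(table[length].values()) < min_success:
            break
        best = length
    return best


# Invented records: (input tokens, fact depth, did the answer pass the check).
results = [
    (8_000, 0.0, True), (8_000, 0.5, True), (8_000, 1.0, True),
    (32_000, 0.0, True), (32_000, 0.5, False), (32_000, 1.0, True),
]
table = success_table(results)
print(largest_tolerable_length(table, min_success=0.9))  # 8000
```

A real run needs many cases per cell. One pass or fail per depth, as in this toy data, says almost nothing.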

The one question to put to your team is the one no vendor answers: what is our effective context length on our own task? Run the test once and you will trust a measured number over a marketing one from then on.

None of this means a big window is useless. When the whole input genuinely has to be read at once, a long contract, a codebase reviewed in a single pass, a wide window earns its cost. The point is to confirm that, not assume it. And when your effective length is 32K and your corpus is 500K, the answer is retrieval, not a bigger window. A curated index beats a stuffed window for the same reason a clean prompt beats a noisy one: fewer distractors. That is the same conclusion I reached from the other direction in Three New Knowledge Primitives.

GLM-5.2's million-token window may turn out to be excellent. The point is that nobody, including the people who shipped it, has shown that it is.

A window is a claim. Effective context is a measured number. They are rarely the same, and the only one that matters for your system is the one you measure yourself.
