
Your Prompts Are Not Portable

By Mathijs Boezer

A better model can make your system worse.

Not sometimes. Not in corner cases. Often enough that you cannot treat a clean upgrade as the default assumption.

OpenAI's GPT-5 prompting guide describes Cursor's `<maximize_context_understanding>` block, a prompt that pushed older models to be more thorough, causing GPT-5 "to overuse tools by calling search repetitively, when internal knowledge would have been sufficient." Anthropic's Claude Opus 4.7 migration guide, published this week, is blunter: scaffolding written for 4.6 now needs to be removed, not kept.

The prompts did not get worse. The models got better at following them.

I have been watching the same pattern in production migrations since GPT-4o. The signature is usually the same: evals green, tool calls creeping, response length creeping, until someone pings engineering about support tickets. Nothing in the code changed. The prompt did not change.

The prompt was calibrated to the failure modes of the old model. The new model has different failure modes. And it is following the old instructions more literally than the old model ever did.

Most teams will read that as a regression in the new model. It is a regression in the prompt.

Fossilized Workarounds

Production prompts accumulate like code. Each line is usually a fix for something that went wrong once. But unlike code, nobody writes a changelog for a prompt.

"Be extremely thorough." Because the model was skipping steps. Now it actually is EXTREMELY thorough. A yes-or-no question comes back as a literature review. "Be perfectly accurate." Because the model was confidently wrong. Now it is paralyzed. Three verification rounds before it will commit to a fact. "Never ask the user unless absolutely necessary." Because the model was asking too much. Now it invents answers it should have asked about.

I call these fossils. Each one is the cast of a failure mode that is no longer in the rock around it: the shape is preserved, the thing that made it is gone. Each one solved a real problem for the model it was written for. None of them describes a goal. They describe a workaround, and a workaround only works while the thing it works around is still there.

A model that reads instructions more literally reads the workarounds more literally too.

Both Labs Say the Same Thing

"GPT-5 follows prompt instructions with surgical precision." OpenAI's own framing. For clean prompts, that is an upgrade. For prompts full of contradictions and vague phrasing that older models half-ignored, it is the opposite.

The Cursor case study in the GPT-5 guide shows the pattern cleanly. Cursor's `<maximize_context_understanding>` block told the model to "be THOROUGH" and "make sure you have the FULL picture." On older models, that was a push the model needed. On GPT-5, which OpenAI describes as "already naturally introspective and proactive at gathering context," the same instruction became the problem. Cursor's fix was to drop the "maximize" framing and soften the thoroughness language.

Anthropic's Opus 4.7 migration guide names the behavior in plain language:

"Claude Opus 4.7 interprets prompts more literally and explicitly than Claude Opus 4.6, particularly at lower effort levels. It will not silently generalize an instruction from one item to another, and it will not infer requests you didn't make."

And it flags legacy scaffolding by name:

"If you've added scaffolding to force interim status messages (\"After every 3 tool calls, summarize progress\"), try removing it."

Claude Opus 4.7 also uses fewer tool calls by default than 4.6. If you wrote scaffolding that forced tool use to compensate for 4.6 being too eager to answer from memory, 4.7 inverts the problem. The old band-aid becomes the new bug.

The pattern shows up outside vendor documentation too. Chris Tyson, writing publicly about his migration to Claude Sonnet 4.5, counted 17 instances of "MUST" and 11 instances of "ALWAYS" in his own production prompt. On the older model, the imperatives got followed literally. On 4.5, Tyson writes, the model "evaluates whether following your literal commands actually serves the user's apparent goal. If those conflict, context wins." The prompts written as drill-sergeant imperatives silently degraded in the direction of the inferred intent, not the written rule.

Model Drifting Is Now a Measured Quantity

Prompt sensitivity is not a craft problem. It is structural, and by now it is documented.

What Prompts Don't Say, a 2025 Carnegie Mellon paper, studied prompt underspecification directly. LLMs correctly guess unspecified requirements only 41% of the time by default. Worse, "under-specified prompts are 2x as likely to regress across model or prompt changes," sometimes with accuracy drops over 20%. Stronger models infer more, not necessarily the same things. Formatting requirements get guessed correctly 70% of the time; conditional requirements ("only do X if Y") just 23% of the time.

Most migration pain lives here. The old model happened to guess the unstated requirement the way you wanted. The new one guesses a different way. Nothing in the prompt changed. The behavior did, because the prompt was never specifying the behavior in the first place. It was specifying a workaround.

"Just be more specific" is not a clean answer either. The same paper found that bundling 19 requirements into one prompt dropped accuracy from 98.7% on each rule individually to 85% on GPT-4o and 79.7% on Llama 3.3. Rule soup is its own failure mode. Adding more rules to stabilize a prompt can make the prompt worse.

I have usually described this to clients as drift: the prompt stays still, the model moves under it. A December 2025 paper, PromptBridge, puts a proper name on it. They call it "model drifting." When you replace the underlying model, the prompt that was best for the old one is usually not best for the new one. On coding benchmarks, direct transfer leaves double-digit relative performance on the table. The authors measure a 13.6% gap on HumanEval, and rebuilding the prompt for the target model recovers over 27% on SWE-Bench Verified and over 39% on Terminal-Bench compared with direct transfer.

Goals Survive Migration. Rules Don't.

Prompt compatibility has no clean solution. But both the research and the vendor guides point the same direction: prompts built around goals and output contracts are more portable than prompts built around accumulated rules.

A rule-oriented prompt tells the model what to do at each step. A goal-oriented prompt tells it what the output must look like, what counts as done, and what it is allowed to touch to get there.

One caveat, because the field is not uniform. In compliance, safety, and strict-format domains, the rule is the contract: verbatim disclaimers, hard refusal categories, schema-locked outputs. Those rules should stay explicit. What drifts there is not the rule but the scaffolding around it.

Anthropic's prompting guidance makes the related point: "Tell Claude what to do instead of what not to do." Positive examples of the target behavior outperform negative instructions. That is not a tone preference. It is a portability argument. A negative instruction ("don't do X") depends on the model reliably knowing what X is and being tempted to do it. The scope of X can shift across models in ways you never wrote down.

A migration-friendly prompt structure looks less like a list of seventeen rules and more like four sections.

  • Goal. The outcome the model is responsible for.
  • Output contract. Format, structure, what counts as complete.
  • Tool policy. When to use tools, when not to, stop conditions.
  • Ambiguity policy. A specific rule for what underspecified means in this domain. Not "ask when unsure." Something like: "if the user does not specify a date range, default to the last thirty days and say so."

One concrete before-and-after. The old prompt read like this:

"Be thorough. Always check all available context. Use tools whenever possible. Never ask the user unless absolutely necessary. Be concise but complete. Never hallucinate. Follow these rules in order."

The four-section rewrite of the same system:

Goal: Answer the user's question, citing the source when private data is used.

Output contract: Recommendation first, then reasoning, then open questions. Citations inline.

Tool policy: Use the search tool when the answer depends on private data or current events. Do not use it for stable background knowledge. After two empty searches, stop and surface what you could not verify.

Ambiguity policy: Ask one clarifying question only if the missing information would change the answer. Otherwise make a reasonable assumption and label it.

The first tells the model how to behave. The second tells it what to deliver. Any change in how literally the model reads instructions breaks the first. The second does not care.
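Rendered as an actual system prompt, the rewrite is a few lines of assembly. A minimal sketch, in Python for concreteness: the section wording is quoted from above, and the `build_system_prompt` helper and heading format are illustrative, not any vendor's required structure.

```python
# The four sections from the rewrite above, assembled into one
# system prompt. Wording comes from the article; the dict layout
# and heading delimiter are illustrative choices.
SECTIONS = {
    "Goal": (
        "Answer the user's question, citing the source when "
        "private data is used."
    ),
    "Output contract": (
        "Recommendation first, then reasoning, then open questions. "
        "Citations inline."
    ),
    "Tool policy": (
        "Use the search tool when the answer depends on private data "
        "or current events. Do not use it for stable background "
        "knowledge. After two empty searches, stop and surface what "
        "you could not verify."
    ),
    "Ambiguity policy": (
        "Ask one clarifying question only if the missing information "
        "would change the answer. Otherwise make a reasonable "
        "assumption and label it."
    ),
}

def build_system_prompt(sections: dict[str, str]) -> str:
    """Join named sections into a single headed prompt string."""
    return "\n\n".join(f"{name}:\n{text}" for name, text in sections.items())

print(build_system_prompt(SECTIONS))
```

Nothing in that template says how thorough to be, so there is nothing for a more literal model to over-obey.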

Rules age. Goals don't.

Prompt Operations

Prompts are not configuration. They are a behavioral contract between your product and a specific model's specific failure modes on a specific day. They have versions. They need regression tests. They need rollout plans.

In practice that is a few specific habits. Every prompt in production is checked into the repo and has an owner. The regression set is not a list of ideal cases. It is a sample of last month's real traffic, including the failures. A model upgrade is rolled out behind a flag to a slice of traffic, with the old and new prompts both running, and the diffs in behavior watched before the switch. Legacy scaffolding is stripped during the migration, not after.
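What the upgrade check can look like in practice, as a sketch: `run_prompt` is a hypothetical wrapper you would wire to your own model client, `traffic_sample.jsonl` stands in for last month's real requests, and the thresholds are illustrative. The point is to diff the two creeping signals named earlier, tool calls and response length, on the same inputs.

```python
import json
from dataclasses import dataclass

@dataclass
class Result:
    text: str
    tool_calls: int

def run_prompt(user_input: str, model: str, prompt_version: str) -> Result:
    """Hypothetical: wire this to your model client. It should return
    the response text and how many tool calls the model made."""
    raise NotImplementedError

OLD = {"model": "old-model-id", "prompt_version": "v41"}  # placeholders
NEW = {"model": "new-model-id", "prompt_version": "v42"}

def load_traffic(path: str = "traffic_sample.jsonl") -> list[dict]:
    """A sample of last month's real traffic, failures included."""
    with open(path) as f:
        return [json.loads(line) for line in f]

def diff_behavior(case: dict) -> dict:
    old = run_prompt(case["input"], **OLD)
    new = run_prompt(case["input"], **NEW)
    return {
        "id": case["id"],
        "tool_call_delta": new.tool_calls - old.tool_calls,
        "length_ratio": len(new.text) / max(len(old.text), 1),
    }

# Flag the creep that point-in-time evals miss: more tool calls and
# longer responses on identical inputs. Thresholds are illustrative.
for d in map(diff_behavior, load_traffic()):
    if d["tool_call_delta"] > 2 or d["length_ratio"] > 1.5:
        print(f"{d['id']}: tools {d['tool_call_delta']:+d}, "
              f"length x{d['length_ratio']:.1f}")
```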

The envelope around the prompt also has to be re-checked. Anthropic recommends re-benchmarking end-to-end cost and latency on every Opus 4.7 migration because even the tokenizer changed: the same text uses 1x to 1.35x as many tokens as on previous models. Datadog's 2026 State of AI Engineering report describes a production surface that has quietly grown into a fleet: 70% of organizations now run three or more models, and around 5% of requests fail in production.
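The tokenizer check, at least, is cheap to script. A sketch assuming the Anthropic Python SDK's token-counting endpoint, `client.messages.count_tokens`; the model IDs and file path are placeholders, and the 1.35x ceiling is the top of the range quoted above.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

OLD_MODEL = "claude-opus-4-6"  # placeholder model IDs
NEW_MODEL = "claude-opus-4-7"

def input_tokens(model: str, system: str, user: str) -> int:
    """Count input tokens for a request without running a completion."""
    count = client.messages.count_tokens(
        model=model,
        system=system,
        messages=[{"role": "user", "content": user}],
    )
    return count.input_tokens

system_prompt = open("prompts/support_agent.txt").read()  # your prompt
sample_user = "Where is my order?"

old = input_tokens(OLD_MODEL, system_prompt, sample_user)
new = input_tokens(NEW_MODEL, system_prompt, sample_user)

# The same text should land between 1x and 1.35x the old count;
# a ratio outside that range means your cost model needs more than
# a constant-factor adjustment.
print(f"old={old} new={new} ratio={new / old:.2f}")
```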

The teams that absorb the next model release without incident are the ones that already treat prompts like code: versioned, tested on real traffic, rolled out behind flags, and stripped of dead scaffolding before the new model has a chance to follow it.

A better model does not retire your prompt. It audits it. An honest audit asks three questions of every line. What failure mode was this written for? Does that failure mode still exist? What does the line do if it does not?
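One way to run that audit is to write the answers down next to each line and keep them with the prompt. A sketch, with examples drawn from the fossils above; the field names and verdicts are mine.

```python
# Per-line prompt audit: one entry per instruction, answering the
# three questions. Field names and verdicts are illustrative.
AUDIT = [
    {
        "line": "Be extremely thorough.",
        "failure_mode": "old model skipped steps",
        "still_exists": False,
        "if_gone": "yes-or-no questions come back as literature reviews",
        "verdict": "remove",
    },
    {
        "line": "Recommendation first, then reasoning, then open questions.",
        "failure_mode": "none; this is the output contract, not a workaround",
        "still_exists": None,  # not tied to any model's quirk
        "if_gone": "still describes the deliverable",
        "verdict": "keep",
    },
]

for entry in AUDIT:
    print(f"{entry['verdict']:>6}  {entry['line']}")
```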

What survives describes the goal. What doesn't describes a model that no longer exists.