When your AI vendor retires the model you rely on, your work can break overnight

The model your workflow runs on today has a retirement date, whether or not you have read it. When a provider deprecates a version or quietly updates one in place, the prompts you tuned, the evaluations you trust, and the costs you budgeted can all shift without a code change on your side. Treating that as a footnote in a vendor’s release notes is how a production agent breaks on a Tuesday with no one able to explain why.

This post lays out what major providers have actually done from 2024 to 2026, why “behavior drift” (the same prompt producing a different output after a model update) is hard to catch, and the playbook regulated teams use to stay in control of the model lifecycle instead of being subject to it.

Key takeaways

Model deprecation is scheduled, frequent, and documented. OpenAI commits to at least 6 months of notice for generally available models and as little as 2 weeks for previews; Anthropic commits to at least 60 days before retiring a model.
Behavior drift is the quieter risk: a model can change in place under the same name, so a prompt that worked yesterday returns different output today with no version bump.
You cannot detect drift by reading. You detect it with a regression eval set built from your own real traces, run on every model change.
The durable fix is controlling the lifecycle yourself: pin a dated model version, or run open weights inside your own environment so nobody retires your model for you.
A control tower that records every run and can replay it turns “something changed” from a mystery into a diff.

Is model deprecation actually a scheduled event, or an edge case?

It is scheduled, it is routine, and it is published. This is not a hypothetical risk you are insuring against. It is a recurring operations event with dates on it.

OpenAI’s deprecation policy commits to minimum notice periods: at least 6 months for generally available models, at least 3 months for specialized variants, and as little as 2 weeks for preview models (anything with “preview” in the name). OpenAI states plainly that preview models are not recommended for “business-critical production workloads unless you can migrate on short notice.” Real examples from the published list: gpt-4-32k and its dated snapshots were announced for deprecation on June 6, 2024 and shut down on June 6, 2025, a twelve-month window; gpt-4.5-preview was announced for deprecation on April 14, 2025 with a shutdown on July 14, 2025, a three-month window.

Anthropic publishes a parallel model deprecations page with a four-stage lifecycle (Active, Legacy, Deprecated, Retired) and a commitment to “at least 60 days notice before model retirement for publicly released models.” Claude Opus 3 (claude-3-opus-20240229) was deprecated on June 30, 2025 and retired on January 5, 2026. Claude Sonnet 4 and Opus 4 (the 20250514 snapshots) were deprecated on April 14, 2026 with a retirement date of June 15, 2026.

60 days: the minimum notice Anthropic commits to before retiring a publicly released model. OpenAI commits to at least 6 months for generally available models and as little as 2 weeks for previews. Source: Anthropic and OpenAI deprecation docs, 2026.

The point is not that these providers behave badly. Both publish their schedules, give notice, and recommend replacements. The point is that a retirement date is a property of every model you use, the clock is always running, and “we will deal with it when the email arrives” is not a plan when the email gives you 60 days to revalidate a regulated workflow.
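To make the running clock concrete, here is a minimal sketch that treats retirement dates as data. The dates are the ones quoted in this post; the function names, the `SCHEDULE` table, and the idea of a "revalidation lead time" are illustrative assumptions, not part of any provider's API.

```python
from datetime import date

# (announced, shutdown) dates as quoted in this post. In practice you would
# maintain this table by hand from the providers' public deprecation pages.
SCHEDULE = {
    "gpt-4-32k": (date(2024, 6, 6), date(2025, 6, 6)),
    "gpt-4.5-preview": (date(2025, 4, 14), date(2025, 7, 14)),
    "claude-3-opus-20240229": (date(2025, 6, 30), date(2026, 1, 5)),
}


def notice_days(model: str) -> int:
    """Days between the deprecation announcement and the shutdown."""
    announced, shutdown = SCHEDULE[model]
    return (shutdown - announced).days


def days_left(model: str, today: date) -> int:
    """Days until shutdown; negative once the model is already gone."""
    return (SCHEDULE[model][1] - today).days


def needs_attention(today: date, revalidation_days: int) -> list[str]:
    """Models not yet retired whose shutdown falls inside your lead time."""
    return sorted(
        m for m in SCHEDULE if 0 <= days_left(m, today) <= revalidation_days
    )
```

With these dates, `notice_days("gpt-4.5-preview")` is 91 and `notice_days("claude-3-opus-20240229")` is 189. The useful comparison is between `days_left` and however long your own revalidation actually takes, which only you can measure.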

What is behavior drift, and why is it harder than deprecation?

Deprecation is the loud risk. You get an email, a date, and a recommended replacement. Behavior drift is the quiet one: the model changes while keeping the same name, so nothing in your code or your vendor’s changelog tells you anything moved.

The clearest documented case predates the current generation but remains the canonical reference. In 2023, researchers Lingjiao Chen, Matei Zaharia, and James Zou published “How is ChatGPT’s behavior changing over time?” They ran the March 2023 and June 2023 versions of GPT-4 on the same tasks. On identifying prime versus composite numbers, accuracy fell from 84% in March to 51% in June. The paper also found that the share of directly executable code generations dropped sharply over the same window. The model was called “GPT-4” the whole time. The authors’ conclusion is the line every operations leader should internalize: “the behavior of the ‘same’ LLM service can change substantially in a relatively short amount of time.”

Why is this harder to manage than a retirement?

It can arrive unannounced. A deprecation has a date and a notification. Drift, when a model is updated in place behind a stable alias, may have neither.
Aliases hide it. OpenAI’s docs distinguish pinned snapshots (a dated version like gpt-4o-2024-05-13) from unpinned aliases (a dynamic pointer like gpt-4o that can move). If your code calls the alias, you inherit whatever the provider points it at next.
It is invisible to inspection. You cannot read a prompt and see that it will now behave differently. The change only shows up in outputs, against real inputs, at the distribution level.

How would you even detect that a model changed under you?

Not by reading release notes, and not by spot-checking a few prompts by hand. You detect drift the same way you detect a regression in any other software dependency: with a test suite that runs on every change. For models, that suite is a regression eval set.

A regression eval set is a fixed collection of representative inputs with known-good expected behavior, scored automatically, that you run whenever the model changes (a new version, a forced migration, or a routine recheck). The critical detail for regulated teams: build it from your own real traces, not from a generic public benchmark. A public benchmark tells you the model is good at someone else’s tasks. Your traces tell you whether it still handles your claims, your contract clauses, or your prior-authorization letters the way it did last month.
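As a rough sketch of what such a set can look like in code: each case pairs an input from a recorded trace with an automatic check, and the runner returns pass or fail per case. Everything here is hypothetical, including the `EvalCase` type, the sample claims, and `fake_model`, which stands in for a real provider call so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    prompt: str                    # input taken from a real, recorded trace
    check: Callable[[str], bool]   # True if the model output is acceptable


def run_eval(
    cases: list[EvalCase], call_model: Callable[[str], str]
) -> dict[str, bool]:
    """Run every case through the model and record pass/fail per case id."""
    return {c.case_id: c.check(call_model(c.prompt)) for c in cases}


def pass_rate(results: dict[str, bool]) -> float:
    """Fraction of cases that passed (0.0 for an empty result set)."""
    return sum(results.values()) / len(results) if results else 0.0


def fake_model(prompt: str) -> str:
    """Stand-in for a provider call; approves anything labelled routine."""
    return "APPROVED" if "routine" in prompt else "NEEDS REVIEW"


# Hypothetical cases; real ones would come from your own traces.
CASES = [
    EvalCase("claim-001", "routine dental claim, in network",
             lambda out: out == "APPROVED"),
    EvalCase("claim-002", "out-of-network surgery claim",
             lambda out: out == "NEEDS REVIEW"),
    EvalCase("claim-003", "routine vision claim, expired policy",
             lambda out: out == "NEEDS REVIEW"),
]

results = run_eval(CASES, fake_model)
```

Here the stand-in model wrongly approves `claim-003`, so two of three cases pass. Exact-match checks are the simplest option; checks on free-text output usually need something looser, such as required fields or a rubric-based scorer.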

Approach	What it catches	What it misses
Reading the provider changelog	Announced deprecations and major version bumps	In-place updates, silent alias moves, subtle behavior shifts
Manual spot-checks	Gross, obvious failures on a handful of cases	Distribution-level drift; the 5% of cases that now fail
Public benchmarks	General capability of the new model	Whether it still performs on your specific workflow
Regression eval set from your traces	Task-specific behavior change, before it reaches users	Cases outside the coverage of your trace sample

The discipline here is the same one we describe in our note on cost per completed task: the number that matters is measured against your real work, not a vendor’s demo. A regression suite is the quality counterpart to that cost discipline. Pair both and a model change becomes a measured event with a pass or fail, not a surprise your users report for you.

What is the enterprise playbook for staying in control?

The goal is to convert an external event you cannot schedule into an internal one you can. Three controls, in increasing order of how much they reduce your exposure.

Pin the version, never call a moving alias

Always call a dated snapshot, never a floating alias, in any workflow that has been validated. Pinning does not stop deprecation (the snapshot still has a retirement date), but it removes the silent-update class of surprise entirely. A pinned model changes when you change it. Treat the model identifier the way you treat a container image tag: a specific version your build references, recorded in source control, never latest.
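One way to enforce this is a check in your build that rejects any configured model identifier without a date suffix. The sketch below is a heuristic under a stated assumption: that a pinned snapshot ends in `-YYYY-MM-DD` or `-YYYYMMDD`, which matches the identifiers quoted in this post. Identifiers in other formats would need their own pattern, and the workflow names are made up.

```python
import re

# Assumed convention: a dated snapshot ends in -YYYY-MM-DD or -YYYYMMDD.
_DATED_SUFFIX = re.compile(r"-(\d{4}-\d{2}-\d{2}|\d{8})$")


def is_pinned(model_id: str) -> bool:
    """True if the identifier looks like a dated snapshot, not an alias."""
    return bool(_DATED_SUFFIX.search(model_id))


def unpinned(config: dict[str, str]) -> list[str]:
    """Names of workflows whose configured model is a floating alias."""
    return sorted(name for name, model in config.items() if not is_pinned(model))


# Hypothetical workflow-to-model mapping, as it might appear in source control.
CONFIG = {
    "claims-triage": "gpt-4o",
    "contract-review": "gpt-4o-2024-05-13",
    "prior-auth": "claude-3-opus-20240229",
}

offenders = unpinned(CONFIG)
```

In this example `offenders` contains only `claims-triage`. Failing the build when the list is non-empty keeps an alias from slipping into a validated workflow.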

Keep a regression eval set built from real traces

Maintain the eval set described above as a living asset. Refresh it as your traffic shifts so it keeps representing real work. When a deprecation notice lands or a new version ships, you run the suite against the candidate and read a diff, not a vibe. This is what turns the provider’s 60-day clock from a scramble into a checklist.
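"Read a diff" can be taken literally. Assuming each eval run yields a mapping from case id to pass or fail, a minimal comparison between the baseline and a candidate might look like the following. The function names and the `max_regressions` threshold are illustrative choices, not a standard.

```python
def diff_results(
    baseline: dict[str, bool], candidate: dict[str, bool]
) -> dict[str, list[str]]:
    """Classify cases by how they moved between baseline and candidate."""
    shared = baseline.keys() & candidate.keys()
    return {
        # passed before, fails now
        "regressed": sorted(c for c in shared if baseline[c] and not candidate[c]),
        # failed before, passes now
        "fixed": sorted(c for c in shared if not baseline[c] and candidate[c]),
        # present in the baseline but never run against the candidate
        "missing": sorted(baseline.keys() - candidate.keys()),
    }


def gate(
    baseline: dict[str, bool], candidate: dict[str, bool], max_regressions: int = 0
) -> bool:
    """Accept the candidate only if every case ran and regressions stay in budget."""
    d = diff_results(baseline, candidate)
    return not d["missing"] and len(d["regressed"]) <= max_regressions


# Hypothetical results for three cases.
baseline = {"a": True, "b": True, "c": False}
candidate = {"a": True, "b": False, "c": True}
report = diff_results(baseline, candidate)
```

The pass rate is identical in both runs here, yet case `b` regressed. That is why the gate looks at individual cases rather than the aggregate score.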

Prefer deployments where you own the lifecycle

The strongest control is to take the lifecycle off the vendor’s calendar. Two ways to do it:

Open weights in your own environment. A model whose weights you hold and run inside your VPC, on-premise estate, or sovereign cloud does not get retired by a third party. You upgrade when your evals say the new version is at least as good, not when an external date forces you. You also keep regulated data inside your trust boundary, which is the same reason regulated teams keep AI in-house in the first place.
Contractual version pinning. Where a hosted model is unavoidable, negotiate a defined support window and advance-notice terms in the contract, rather than relying on a public policy page that can change.

Where does the control tower fit?

A control tower is the operational layer that records every agent run (the inputs, the model and version, the tool calls, the cost, and the output) and lets you replay any specific run later. That capability is what makes the playbook above enforceable rather than aspirational.
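A minimal sketch of that record-and-replay idea, assuming an append-only JSON Lines log. The `RunRecord` fields mirror the list above, but the field names, the log format, and the `call_model(model, prompt)` signature are assumptions made for illustration and do not describe any particular product.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Callable


@dataclass
class RunRecord:
    run_id: str
    model: str            # the exact identifier that served the run
    prompt: str
    output: str
    cost_usd: float
    tool_calls: list[str] = field(default_factory=list)


def to_jsonl(records: list[RunRecord]) -> str:
    """Serialize records one per line, suitable for an append-only log."""
    return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in records)


def from_jsonl(text: str) -> list[RunRecord]:
    """Load records back from the log, skipping blank lines."""
    return [RunRecord(**json.loads(line)) for line in text.splitlines() if line.strip()]


def replay(
    record: RunRecord, call_model: Callable[[str, str], str], model: str
) -> dict:
    """Re-run a recorded prompt on another model and report what changed."""
    new_output = call_model(model, record.prompt)
    return {
        "run_id": record.run_id,
        "baseline_model": record.model,
        "candidate_model": model,
        "changed": new_output != record.output,
        "baseline_output": record.output,
        "candidate_output": new_output,
    }
```

String equality is the bluntest possible comparison. For outputs that legitimately vary in wording, the replay would score both outputs with the same checks used in the regression set.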

When a model version changes, replay is the difference between “something is off” and “here is the exact diff.” You re-run your regression set against the candidate version and compare, run by run, against the recorded baseline. You see which cases moved, by how much, and on what kind of input. Because every run already carries its model identifier, you can also prove which workflows are still calling a soon-to-be-retired snapshot, the same audit Anthropic recommends teams run against their own usage before a retirement date.
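The usage audit described above reduces to a group-by over recorded runs. In this sketch each run is a plain dictionary with hypothetical `workflow` and `model` keys, and the single retirement date is the one quoted earlier in this post; the table and function name are illustrative.

```python
from collections import Counter
from datetime import date

# Retirement date as quoted in this post; extend from the provider's page.
RETIREMENTS = {"claude-3-opus-20240229": date(2026, 1, 5)}


def exposed_workflows(
    runs: list[dict], today: date, horizon_days: int
) -> dict[str, Counter]:
    """Map each model retiring within the horizon (or already retired)
    to a count of recorded runs per workflow still calling it."""
    exposed: dict[str, Counter] = {}
    for run in runs:
        retire_on = RETIREMENTS.get(run["model"])
        if retire_on is None:
            continue  # no known retirement date for this model
        if (retire_on - today).days <= horizon_days:
            exposed.setdefault(run["model"], Counter())[run["workflow"]] += 1
    return exposed


# Hypothetical recorded runs.
RUNS = [
    {"workflow": "prior-auth", "model": "claude-3-opus-20240229"},
    {"workflow": "prior-auth", "model": "claude-3-opus-20240229"},
    {"workflow": "contracts", "model": "gpt-4o-2024-05-13"},
]

report = exposed_workflows(RUNS, today=date(2025, 12, 1), horizon_days=60)
```

On December 1, 2025 the retirement is 35 days away, so `prior-auth` is reported with two runs. The same query run on July 1, 2025 with the same 60-day horizon reports nothing.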

For regulated workflows, that recorded, replayable history is also the audit trail. When a regulator, an auditor, or general counsel asks “what model produced this decision, and would it produce the same one today,” the honest answer requires a pinned version and a trace you can replay. Proqtor’s platform is built to keep that evidence inside your environment, and our security posture documents how that boundary is maintained. For the broader pattern of keeping AI inside the boundary, see the trust boundary discussion across our other notes.

FAQ

How much notice do model providers actually give before retiring a model?

It varies by provider and by model tier, and it is documented. OpenAI commits to at least 6 months for generally available models, at least 3 months for specialized variants, and as little as 2 weeks for preview models. Anthropic commits to at least 60 days for publicly released models. Both notify affected customers by email and list dates on their public deprecation pages. The practical implication: read those pages on a schedule and treat preview models as unsuitable for workflows you cannot migrate quickly.

What is the difference between a deprecated model and behavior drift?

Deprecation is an announced retirement: the model is going away on a known date and you migrate to a replacement. Behavior drift is when a model changes while keeping the same name, so the same prompt returns a different output with no version bump and sometimes no notice. Deprecation is loud and scheduled; drift is quiet and only visible in outputs. You manage deprecation by tracking dates and pinning versions; you catch drift with a regression eval set run on every change.

Does pinning a model version fully protect my workflow?

No, and that is the honest limit. Pinning a dated snapshot removes the silent in-place-update risk, which is the most insidious one, but the snapshot still has a retirement date set by the provider. Full control over the lifecycle requires either open weights you run inside your own environment, where no third party can retire your model, or contractual terms that fix a support window. Pinning plus a regression eval set is the strong minimum; owning the deployment is the durable answer.