Why 40% of agent projects get cancelled, and the four controls that save them

Gartner projects that over 40% of agentic AI projects will be cancelled by the end of 2027. The stated reasons are escalating costs, unclear business value, and inadequate risk controls. None of those are claims about whether the model can reason. They are claims about whether the project can be funded, trusted, and approved over time.

This matters because the usual response to a stalled agent project is to swap the model. Teams move from one frontier model to the next, expecting the cancellation risk to drop. It rarely does. The model was almost never the binding constraint. The constraint is operational: where the data is allowed to go, who can read the bill, who can read the failure, and who signs off on letting the agent act without a human in the loop.

There is a second number worth putting next to Gartner’s. McKinsey’s 2025 State of AI survey reports that 88% of organisations now use AI in at least one function, but roughly two thirds have not moved it past pilots. The trait that separates the teams that scale from the teams that stall is workflow redesign, not model selection. The cancellations and the stalls are the same story told twice. The work that survives is operationally defended. The work that dies was a good demo nobody could put into production.

This is a field analysis of the four failure modes we see end a project, and the four controls that keep it alive. The pattern is consistent enough to plan against.

The failures are operational, not model quality

A useful test: if your agent demo works on synthetic data in a vendor sandbox and then dies on contact with the real organisation, the model is not your problem. The handoff from “interesting” to “in production” is where projects die, and they die on four predictable obstacles.

Failure 1: the data cannot leave, so adoption never starts

The first conversation with a frontier agent vendor usually ends at the same place. The data the agent needs is customer PII, contracts, claims, patient records, source code, or financial positions. Sending that data to a third-party endpoint is a transfer the security team will not sign off on, and under GDPR (and, for financial entities, DORA) it is right not to. The project does not fail in production. It fails before the first real query, because nobody can lawfully feed the agent anything worth automating.

Failure 2: no cost attribution, so finance kills it

The pilot runs on a shared API key and a flat monthly invoice. Six weeks in, the bill has tripled and nobody can say which workflow drove it. Was it the high-value claims triage, or the low-value internal FAQ bot somebody wired up on a Friday? When finance asks for unit economics and the answer is a single aggregate number, the project loses its budget defence. Gartner’s “escalating cost, unclear value” is usually this: not that the agent was expensive, but that nobody could attribute the expense to a result.

Failure 3: no debuggability, so one bad run erodes trust

An agent produces a wrong answer in front of a senior stakeholder. Someone asks the reasonable question: what happened on that run? If the answer is “we cannot reconstruct it,” trust does not degrade gradually. It collapses. One unexplained failure with no trace does more political damage than fifty quiet successes earn in credit. Without the ability to replay a specific run and show the inputs, the tool calls, and the decision points, every error becomes an argument to shut the project down.

Failure 4: no autonomy path, so risk and legal block it

The team wants the agent to act: send the email, update the record, move the ticket, release the payment. Risk and legal want to know what happens when it acts wrongly, and what the rollback is. With no graded path from “suggests” to “acts under approval” to “acts autonomously,” the choice collapses into a binary. Either the agent is a read-only toy nobody values, or it is an unbounded actor nobody will approve. Under the EU AI Act, deployers of higher-risk systems carry documentation and human-oversight obligations, so “we will let it act and see” is not a posture legal can accept.

The four controls that save a project

Each failure has a control. The controls are not exotic. They are the AgentOps equivalent of the things you already require of any system that touches production: a boundary, a meter, a log, and an approval gate.

Control 1: a trust boundary, deploy in the customer environment

Move the agent to the data, not the data to the agent. Deploy inside the customer’s VPC or on-premises estate, so PII and regulated records never cross the trust boundary. Keys live in the customer’s KMS. Where a model API is unavoidable, route it through controls the customer owns, and document the basis for any transfer (SCCs, regional endpoints) before the first production query. This is what turns “we cannot start” into “we can start on real data on Monday.”
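One way to make the boundary concrete is an egress check that runs before any payload leaves the agent. The sketch below is illustrative only: the `Endpoint` type, the data classes in `REGULATED`, and the rule that unregulated data may go anywhere are all assumptions, not a prescribed policy.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical data classifications that must stay inside the boundary
# unless a transfer basis has been documented in advance.
REGULATED = {"pii", "patient_record", "contract", "financial_position"}


@dataclass(frozen=True)
class Endpoint:
    name: str
    inside_boundary: bool                  # runs in the customer's VPC or on-premises estate
    transfer_basis: Optional[str] = None   # e.g. "SCCs" or "regional endpoint"


def may_send(data_class: str, endpoint: Endpoint) -> bool:
    """Return True if data of this class may be sent to this endpoint."""
    if endpoint.inside_boundary:
        return True                        # data never crosses the boundary
    if data_class not in REGULATED:
        return True                        # assumption: unregulated data is unrestricted
    # Regulated data leaving the boundary needs a documented basis.
    return endpoint.transfer_basis is not None
```

A check like this belongs in infrastructure the customer owns, so the decision is enforced and logged on their side rather than trusted to the agent.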

Control 2: cost-per-completed-task attribution

Meter at the workflow level, not the account level. Tag every token, tool call, and retry to a specific workflow and, where possible, a specific completed task. The number finance needs is not “spend last month.” It is “cost per resolved claim” or “cost per closed ticket.” When you can put unit cost next to unit value, the budget conversation changes from “this is getting expensive” to “this costs 3.40 EUR per completed task and replaces 25 EUR of manual handling.” That is a defensible number.
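A minimal sketch of workflow-level metering, assuming every model call, tool call, and retry is already logged as an event tagged with a workflow and a task ID. The `UsageEvent` type, the workflow names, and the figures are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageEvent:
    workflow: str     # e.g. "claims-triage"
    task_id: str
    cost_eur: float   # cost of one model call, tool call, or retry


def cost_per_completed_task(events, completed_task_ids):
    """Return {workflow: total spend / number of completed tasks}.

    Spend on tasks that never completed still counts toward the workflow
    total, so failed runs and retries raise the unit cost. A workflow with
    no completed tasks maps to None.
    """
    spend = defaultdict(float)
    done = defaultdict(set)
    for e in events:
        spend[e.workflow] += e.cost_eur
        if e.task_id in completed_task_ids:
            done[e.workflow].add(e.task_id)
    return {wf: (spend[wf] / len(done[wf]) if done[wf] else None) for wf in spend}


events = [
    UsageEvent("claims-triage", "t1", 2.00),
    UsageEvent("claims-triage", "t1", 1.00),   # retry on the same task
    UsageEvent("claims-triage", "t2", 3.00),
    UsageEvent("claims-triage", "t3", 0.80),   # task never completed
    UsageEvent("faq-bot", "q1", 0.10),         # no completed task yet
]
unit_costs = cost_per_completed_task(events, completed_task_ids={"t1", "t2"})
print(f"{unit_costs['claims-triage']:.2f} EUR per completed task")  # 3.40 EUR per completed task
```

Dividing total workflow spend by completed tasks, rather than averaging only the successful runs, is a deliberate choice here: it keeps the cost of abandoned work visible in the number finance sees.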

Control 3: traces, replay, and evals

Instrument every run so it can be reconstructed. A trace records the inputs, retrieval (RAG context), tool calls, and intermediate decisions, ideally under the OpenTelemetry GenAI semantic conventions so the format is not a vendor’s private dialect. Replay lets you re-run the exact failing case. Evals turn one-off debugging into a regression suite, so a fix for one bad run becomes a test that prevents the next one. They also let you change a prompt, a model, a retriever, or a tool and prove the change before it ships. The goal is that when a stakeholder points at a wrong answer, you respond with the run, not an apology. Debuggability is what converts a trust-eroding incident into a closed ticket.
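The relationship between a trace, a replay, and an eval can be sketched in a few lines. This is a simplified illustration, not the OpenTelemetry data model: the `Trace` fields, the agent signature, and the exact-match scoring are assumptions, and replay is only faithful if the agent is deterministic given the recorded inputs and context.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Trace:
    run_id: str
    inputs: dict
    retrieved: list = field(default_factory=list)    # RAG context shown to the model
    tool_calls: list = field(default_factory=list)   # (tool name, arguments, result)
    output: str = ""


def replay(trace: Trace, agent: Callable[[dict, list], str]) -> str:
    """Re-run the agent on the recorded inputs and recorded retrieval context."""
    return agent(trace.inputs, trace.retrieved)


def run_evals(cases, agent) -> float:
    """cases is a list of (Trace, expected_output). Returns the pass rate."""
    passed = sum(1 for trace, expected in cases if replay(trace, agent) == expected)
    return passed / len(cases)


# Hypothetical stand-in for the real agent, used only to exercise the harness.
def toy_agent(inputs: dict, retrieved: list) -> str:
    return "approve" if inputs["amount"] <= 100 else "escalate"


cases = [
    (Trace("r1", {"amount": 50}), "approve"),
    (Trace("r2", {"amount": 500}), "escalate"),
    (Trace("r3", {"amount": 100}), "escalate"),   # the bad run, kept as a regression case
]
pass_rate = run_evals(cases, toy_agent)   # 2 of 3 cases pass until the boundary bug is fixed
```

Each incident adds one case to the list, so the suite grows out of real failures instead of imagined ones.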

Control 4: a graded autonomy ladder with approvals

Replace the binary with a ladder, and make each rung an explicit decision.

Rule of thumb: an agent earns the next rung of autonomy by accumulating evidence at the current one, not by passing a demo.

A workable ladder runs from suggest-only (a human does everything), through draft-plus-approval, recommended action, execute-with-approval, execute-under-policy, to autonomous with humans on exceptions. Each rung maps cleanly to RBAC, to value and blast-radius limits, and to the human-oversight expectations under the EU AI Act. Risk and legal can approve the early rungs on day one because the rollback is a human saying no. The promotion from one rung to the next is read from the trace record, not promised in a slide: approval rate, failure rate, replay coverage, business-outcome quality, and a rollback that has actually been exercised.
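The ladder and its promotion rule can be written down as data plus one function. In the sketch below, the zero-based numbering is an assumption chosen so that execute-with-approval lands on 3, and the thresholds are placeholders to be set by risk and legal. Business-outcome quality is left out because it rarely reduces to a single number.

```python
from dataclasses import dataclass
from enum import IntEnum


class Rung(IntEnum):
    SUGGEST_ONLY = 0
    DRAFT_PLUS_APPROVAL = 1
    RECOMMENDED_ACTION = 2
    EXECUTE_WITH_APPROVAL = 3
    EXECUTE_UNDER_POLICY = 4
    AUTONOMOUS = 5


@dataclass(frozen=True)
class Evidence:
    approval_rate: float       # share of proposals a human approved unchanged
    failure_rate: float        # share of runs that failed or were rolled back
    replay_coverage: float     # share of runs that can be replayed from their trace
    rollback_exercised: bool   # the rollback has actually been run, not just documented


def next_rung(current: Rung, ev: Evidence) -> Rung:
    """Promote by at most one rung, and only when every threshold is met."""
    # Hypothetical thresholds, for illustration only.
    ready = (
        ev.approval_rate >= 0.95
        and ev.failure_rate <= 0.02
        and ev.replay_coverage >= 0.90
        and ev.rollback_exercised
    )
    if ready and current < Rung.AUTONOMOUS:
        return Rung(current + 1)
    return current
```

Because the function can only ever return the current rung or the one above it, skipping a rung is not expressible, which is the property risk and legal tend to care about most.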

Every agent needs an owner, not just a prompt

The four controls share a precondition that is easy to skip and expensive to skip: each agent has to be a named, governed object before it touches production, not an anonymous prompt in someone’s notebook. We write this down as a skill manifest, and the act of filling it in surfaces most of the four failures before they happen.

A manifest for one agent records its name and owner, the workflow it serves, the users and roles allowed to invoke it, the tools it may call, its knowledge sources, the approval requirements, its risk tier, a cost cap, its logging, the eval tests that gate any change, and the failure and rollback policy. A short one looks like this:

name:        invoice-exception-resolver
owner:       finance-ap@customer
workflow:    accounts-payable / exception handling
allowed:     ap-clerk, ap-manager
tools:       erp.read, erp.match, ticket.create   (no erp.post)
knowledge:   vendor-master, po-history, ap-policy  (permissions-aware)
risk_tier:   2 (financial, reversible)
cost_cap:    0.60 EUR / completed task
autonomy:    execute-with-approval (L3)
evals:       42 cases; promotion gate >= 0.95 match
rollback:    void ticket, requeue to human, alert owner

Note what the manifest forces. “No erp.post” is the blast-radius decision made in writing rather than discovered in an incident. The cost cap is the finance defence, set before the bill arrives. The permissions-aware knowledge sources are the answer to “can the agent read something the user cannot.” The eval gate is what makes promotion arithmetic instead of opinion. An agent without a manifest is an agent nobody can defend in a review, and a project full of them is a project waiting to be cancelled.
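A manifest only forces those decisions if something enforces it at call time. The following is a rough sketch of such a check, using a subset of the fields from the example manifest; the `Manifest` class, the `authorize` function, and its reason strings are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Manifest:
    name: str
    owner: str
    allowed_roles: frozenset
    tools: frozenset
    cost_cap_eur: float   # per completed task


MANIFEST = Manifest(
    name="invoice-exception-resolver",
    owner="finance-ap@customer",
    allowed_roles=frozenset({"ap-clerk", "ap-manager"}),
    tools=frozenset({"erp.read", "erp.match", "ticket.create"}),   # no erp.post
    cost_cap_eur=0.60,
)


def authorize(m: Manifest, role: str, tool: str, task_cost_so_far_eur: float) -> tuple[bool, str]:
    """Check one proposed tool call against the manifest before it runs."""
    if role not in m.allowed_roles:
        return False, "role not allowed"
    if tool not in m.tools:
        return False, "tool not in manifest"
    if task_cost_so_far_eur >= m.cost_cap_eur:
        return False, "cost cap reached"
    return True, "ok"
```

With this in front of the tool layer, a call to `erp.post` is refused with a reason that can be logged, whatever the prompt says.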

Sequencing: deploy privately, instrument, then automate with evidence

The order is the point. Most cancelled projects tried to do all of this at once, or in the wrong order, and ran out of trust or budget before the value landed.

The sequence that survives:

Deploy privately first. Stand the agent up inside the trust boundary on real data. This clears the security veto and gets you a system that is allowed to do useful work.
Instrument before you automate. Turn on cost attribution and traces while the agent is still only suggesting. You want the meter and the logs running before anyone proposes giving the agent the ability to act.
Automate with evidence. Climb the autonomy ladder one rung at a time, using the cost and trace data to make each promotion an evidence-backed decision rather than a leap of faith.

This is slower than a demo and faster than a re-launch after cancellation. The 40% Gartner expects to be scrapped are mostly projects that were technically sound and operationally undefended. The model was never going to save them, and a better model will not save the next attempt. A boundary, a meter, a log, and an approval gate will. Put those four in place, in that order, and the cancellation conversation does not start.