Why an AI agent costs more than the bill suggests: retries, lookups, and dead ends

A token is cheap. A finished business task built from hundreds of model calls is not always cheap, and the price tag is hidden. Per-token API pricing looks like the cost of running an agent, but in an agentic workflow the model is called many times per task: it plans, calls a tool, reads the result, retrieves documents, fails, and retries. The number that matters to finance is cost per completed task, not cost per token, and the gap between the two is where budgets quietly break.

Why does per-token pricing hide the real cost?

For a single prompt and a single answer, cost per token and cost per task are nearly the same thing. One call in, one answer out, one bill. Agentic workflows broke that equivalence, and they broke it quietly because the billing API still reports the same flat per-token rate.

An agent does not finish a job in one model call. It makes a plan, calls a tool (a search, a database query, an API), reads the tool result back into context, decides the result is incomplete, retrieves more documents, calls the model again, and sometimes loops. Several distinct cost drivers now sit between “task started” and “task done,” and each one inflates token spend without showing up in a naive per-call view:

Retries. A failed tool call, a malformed JSON output, or a timeout triggers another attempt. The retry consumes tokens and produces nothing sellable. The Anthropic tool-use docs note that enabling tools adds a system-prompt overhead to every request (for example, roughly 290 to 500 tokens just to enable tool use), and that overhead is paid again on every retry.
Tool calls. Each tool invocation typically forces the model to read the tool result and reason about it. A five-tool task can mean six or seven model round-trips, and the growing context is re-sent on each hop.
Multi-step plans. Planner-executor patterns spend tokens on the plan, then on each step, then often on a verification pass.
Retrieval. Stuffing retrieved documents into the prompt is frequently the single largest line item. A 2,000-token question can carry 18,000 tokens of context, and if the agent re-reads that context across several round-trips, you pay for it repeatedly.

The result: two workflows can post identical per-token rates and wildly different costs per finished task. The one with a 30 percent retry rate and sloppy retrieval can cost several times as much per resolved ticket while looking identical on a token dashboard. Per-token cost measures the meter. Cost per completed task measures the bill.
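A minimal sketch of why re-sent context dominates, under stated assumptions: the 20,000-token starting context is the 2,000-token question plus 18,000 tokens of retrieval from the list above, while the 500-token tool result and the six round-trips are hypothetical values chosen for illustration.

```python
def total_input_tokens(base_context: int, tool_result_tokens: int, round_trips: int) -> int:
    """Input tokens billed when the whole context is re-sent on every hop."""
    total = 0
    context = base_context
    for _ in range(round_trips):
        total += context               # the full context is billed as input on this hop
        context += tool_result_tokens  # the tool result is appended for the next hop
    return total

# Hypothetical run: 20,000 tokens of question plus retrieved context,
# 500-token tool results, one call versus six round-trips.
single_call = total_input_tokens(20_000, 500, 1)
agent_run = total_input_tokens(20_000, 500, 6)
print(single_call, agent_run)  # 20000 127500
```

Under these assumptions the six-hop run bills more than six times the input of the single call, at exactly the same per-token rate.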

What does an agentic task actually cost?

To make this concrete, here are the real, current list prices that anchor the worked example below. All figures are per million tokens (MTok) and come from the providers’ own pricing pages.

Model class	Input	Output	Cached input (read)
Large (Claude Sonnet 4.6)	$3.00	$15.00	$0.30 (0.1x input)
Small (Claude Haiku 4.5)	$1.00	$5.00	$0.10 (0.1x input)
Large (OpenAI GPT-4o)	$2.50	$10.00	$1.25 (0.5x input)
Small (OpenAI GPT-4o mini)	$0.15	$0.60	$0.075 (0.5x input)

Sources: Anthropic pricing and OpenAI API pricing, both current as of mid-2026. Output tokens cost roughly five times input on the Claude models and four times on the OpenAI models, which is why a chatty, retry-heavy agent is so much more expensive than its input footprint suggests. Two things stand out for cost control: output is the expensive side of the ledger, and a cache read costs a fraction of fresh input (one tenth on Claude, one half on OpenAI), which is the documented basis for the caching lever below.
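The table can be turned into a small cost function. This is a sketch, not a billing implementation: the dictionary keys are labels for this example rather than official API model identifiers, the function name is hypothetical, and it ignores any charge for writing to the cache.

```python
# USD per million tokens, copied from the table above: (input, output, cached read).
PRICES = {
    "claude-sonnet-4.6": (3.00, 15.00, 0.30),
    "claude-haiku-4.5": (1.00, 5.00, 0.10),
    "gpt-4o": (2.50, 10.00, 1.25),
    "gpt-4o-mini": (0.15, 0.60, 0.075),
}

def call_cost(model: str, fresh_input: int, output: int, cached_input: int = 0) -> float:
    """Cost in USD of one model call, given token counts by billing class."""
    input_rate, output_rate, cached_rate = PRICES[model]
    return (
        fresh_input * input_rate
        + output * output_rate
        + cached_input * cached_rate
    ) / 1_000_000

# Same 20,000-token prompt and 1,000-token answer, with and without
# 18,000 of the prompt tokens served from cache (hypothetical split).
uncached = call_cost("claude-sonnet-4.6", fresh_input=20_000, output=1_000)
cached = call_cost("claude-sonnet-4.6", fresh_input=2_000, output=1_000, cached_input=18_000)
print(f"uncached ${uncached:.4f}, cached ${cached:.4f}")
```

In this hypothetical split the output tokens are 5 percent of the volume but a fifth of the uncached cost, and more than half of the cached one.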

Anthropic publishes its own illustrative figure: processing 10,000 support-ticket conversations at roughly 3,700 tokens each on Haiku 4.5 costs about $37 in tokens. That is the floor, a single clean call per ticket. The worked example below shows what happens to that floor once you add the multi-step plans, tool calls, retrieval, and retries of a real agent, and then what disciplined AgentOps does to bring it back down.

A worked example: cost per resolved support ticket

The numbers below are illustrative, not a benchmark of any one customer. The shape is one we see repeatedly, and the arithmetic is shown so you can check it against your own traffic.

A support-triage agent handles 10,000 ticket attempts a month. Version 1 routes every ticket to the large model, retrieves fifteen knowledge-base passages per ticket and re-sends that context across several tool round-trips, and has a 30 percent failure rate: tickets that escalate to a human after the agent gives up, having already burned tokens. Version 2 applies four levers: a classifier routes the easy 80 percent of tickets to the small model and keeps the hard 20 percent on the large one; a reranker cuts retrieval from fifteen passages to three; prompt caching serves the repeated system prefix and knowledge base at the cached rate; and a step limit plus a strict output schema cut the failure rate from 30 percent to 8 percent.

	v1 (before)	v2 (after levers)
Model	Large only	Routed: 80% small / 20% large
Tokens per attempt (approx.)	~24,500	~7,000
Cost per attempt	$0.100	$0.015
Failure rate	30%	8%
Completed tickets / month	7,000	9,200
Total monthly token spend	$1,000	$150
Cost per resolved ticket	$0.143	$0.016

Figure: cost per resolved support ticket (illustrative, US cents). v1 (large model, heavy retrieval, 30% fail): 14.3¢. v2 (routing + caching + 8% fail): 1.6¢.

Read the two columns against each other and the levers are visible. Cost per attempt fell because a classifier sent the routine 80 percent to a small model (3x cheaper per token on the Claude pair in the price table, about 17x on the OpenAI pair), reranking cut retrieval from fifteen passages to three, and the repeated system prefix and knowledge base were served from cache at one tenth of the input price. The resolved-ticket count rose because a step limit and an output schema cut the failure rate from 30 percent to 8 percent, so fewer attempts burned tokens without producing a usable result. The numerator got lighter and the denominator got larger at the same time. The combined effect is an 89 percent drop in cost per resolved ticket.
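The bottom rows of the table can be reproduced in a few lines. This sketch uses the table's own inputs and makes the same simplifying assumption the table does: a failed attempt costs as much as a successful one. The function name is hypothetical.

```python
def cost_per_resolved(attempts: int, cost_per_attempt: float, failure_rate: float):
    """Return (total spend, resolved tickets, cost per resolved ticket)."""
    total_spend = attempts * cost_per_attempt   # every attempt is billed, failed or not
    resolved = attempts * (1 - failure_rate)    # only these count in the denominator
    return total_spend, resolved, total_spend / resolved

v1 = cost_per_resolved(10_000, 0.100, 0.30)
v2 = cost_per_resolved(10_000, 0.015, 0.08)
drop = 1 - v2[2] / v1[2]
print(f"v1: ${v1[2]:.3f}  v2: ${v2[2]:.3f}  drop: {drop:.0%}")
# v1: $0.143  v2: $0.016  drop: 89%
```

Swapping in your own attempt count, cost per attempt, and failure rate gives the same three numbers for your traffic.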

89%: illustrative drop in cost per resolved ticket from routing, caching, and cutting failed runs, with no change to the per-token list price. (Proqtor worked example, 2026)

Notice what did not change: the per-token list price. The large model still costs what it costs on the hard 20 percent. What moved was the mix, the waste, and the prompt weight. That is the point. You optimize the workflow, not the token. A team staring only at the per-token rate would have concluded there was nothing to do, because the rate never moved.
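Of the workflow changes in version 2, the step limit and strict output schema are the simplest to sketch. The example below is an assumption-laden illustration, not the implementation behind the numbers: the two-field schema, the function names, and the `call_model` callable (standing in for a real model call) are all hypothetical.

```python
import json

# Hypothetical output schema: exactly these keys, with these types.
REQUIRED_FIELDS = {"category": str, "reply": str}

def parse_strict(raw: str):
    """Return the parsed dict if it matches the schema exactly, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or set(data) != set(REQUIRED_FIELDS):
        return None
    if not all(isinstance(data[key], kind) for key, kind in REQUIRED_FIELDS.items()):
        return None
    return data

def run_with_step_limit(call_model, max_steps: int = 3):
    """Call the model until its output validates or the step budget is spent."""
    for step in range(1, max_steps + 1):
        result = parse_strict(call_model(step))
        if result is not None:
            return result, step
    return None, max_steps  # budget spent: escalate instead of looping further

# Stand-in model: malformed output first, valid output on the second step.
outputs = iter(["not json", '{"category": "billing", "reply": "Refund issued."}'])
result, steps_used = run_with_step_limit(lambda step: next(outputs))
print(result, steps_used)
```

The step limit caps what a doomed attempt can burn, and the strict parse means a malformed answer is caught at the step that produced it rather than downstream.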

Why this is also a governance and concentration question

For regulated teams, the hidden cost problem is not only a finance problem. The same opacity that hides the cost also hides the risk.

A workflow that quietly retries against a single vendor’s API ties both your spend and your continuity to that vendor’s roadmap. Vendor policies change, prices change, and models get deprecated on the provider’s schedule, not yours. Anthropic’s own docs maintain a model deprecations list, and a model your retry logic depends on can move to retired status. If you cannot see how many of your tokens are retries against one model, you also cannot see how exposed you are when that model’s terms or availability change. Cost visibility and supply-chain visibility are the same visibility.

This is why we argue for running agents inside your own boundary with pinned model versions, your own evals, and a full audit trail. When the meter, the traces, and the model choices all live in your environment, a price change or a deprecation becomes a routing decision you make, not an emergency the vendor hands you. The cost lever and the control lever are the same lever.

How a control tower makes the hidden cost visible

You cannot manage cost per completed task if every model call lands in one undifferentiated bucket. The fix is structural, and it lives at the gateway.

Route every model call through a single AI gateway so no team calls provider APIs directly. The gateway is the one policy point in the architecture, so it is also the natural place to meter. Tag each request with a workflow ID, the role or step within the workflow (planner, retriever, drafter, verifier), the owning team, and a correlation ID that ties every call in one task to a single trace. That correlation ID is what turns scattered calls, including the retries, into a single unit you can price.
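The tagging scheme can be sketched as a small record attached to every request. The field names, the allowed step values, and the helper functions are hypothetical; how the tags travel (headers, request metadata, span attributes) depends on the gateway you run.

```python
import uuid
from dataclasses import asdict, dataclass

# Step names taken from the roles listed above; extend as your workflows require.
ALLOWED_STEPS = {"planner", "retriever", "drafter", "verifier"}

@dataclass(frozen=True)
class CallTags:
    workflow_id: str
    step: str
    team: str
    correlation_id: str

def new_correlation_id() -> str:
    """One ID per task attempt, shared by every call and retry in that attempt."""
    return uuid.uuid4().hex

def tag_call(workflow_id: str, step: str, team: str, correlation_id: str) -> dict:
    """Build the tag set the gateway would attach to a single model call."""
    if step not in ALLOWED_STEPS:
        raise ValueError(f"unknown step: {step}")
    return asdict(CallTags(workflow_id, step, team, correlation_id))

task_id = new_correlation_id()
plan_tags = tag_call("support-triage", "planner", "support-ops", task_id)
retry_tags = tag_call("support-triage", "drafter", "support-ops", task_id)
print(plan_tags["correlation_id"] == retry_tags["correlation_id"])  # True
```

Generating the correlation ID once per task attempt, rather than once per call, is what lets retries roll up into the same trace.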

The traces land in a trace store inside your own environment, written to open telemetry conventions so your cost analysis is not hostage to one vendor’s export format. A trace collects every model call, tool call, retrieval, and retry for one task attempt under one identifier. Sum the token cost across the trace, divide by the tasks that actually completed correctly, and you have cost per completed task per workflow. The control tower is what makes the retries, the dead ends, and the heavy retrieval show up as line items instead of disappearing into a flat per-token average.
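The "sum across the trace, divide by completed tasks" step might look like the sketch below. It assumes each span has already been priced and exported as a dict with `workflow_id`, `correlation_id`, and `cost_usd` keys, and that correctness is decided elsewhere and passed in as a set of correlation IDs; all of these names are hypothetical.

```python
from collections import defaultdict

def cost_per_completed_task(spans, completed_ids):
    """Map each workflow to total spend divided by correctly completed tasks.

    spans: iterable of dicts with workflow_id, correlation_id, cost_usd.
    completed_ids: set of correlation IDs whose task finished correctly.
    """
    spend = defaultdict(float)
    completed = defaultdict(set)
    for span in spans:
        workflow = span["workflow_id"]
        spend[workflow] += span["cost_usd"]  # failed attempts and retries still count
        if span["correlation_id"] in completed_ids:
            completed[workflow].add(span["correlation_id"])
    return {
        workflow: (spend[workflow] / len(completed[workflow]) if completed[workflow] else None)
        for workflow in spend
    }

# Hypothetical spans: task-a completes after three calls, task-b fails after two.
spans = [
    {"workflow_id": "support-triage", "correlation_id": "task-a", "cost_usd": 0.02},
    {"workflow_id": "support-triage", "correlation_id": "task-a", "cost_usd": 0.03},
    {"workflow_id": "support-triage", "correlation_id": "task-a", "cost_usd": 0.01},
    {"workflow_id": "support-triage", "correlation_id": "task-b", "cost_usd": 0.04},
    {"workflow_id": "support-triage", "correlation_id": "task-b", "cost_usd": 0.02},
]
report = cost_per_completed_task(spans, completed_ids={"task-a"})
print(report)
```

The completed task cost $0.06 in its own calls, but the workflow's cost per completed task is $0.12 because the failed attempt's spend lands in the numerator too.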

Rule of thumb: if you cannot point to the trace that produced a cost number, the number is an estimate, not a measurement. And an estimate is exactly where hidden cost lives.

From there the optimization is just reading the traces and acting: which workflows have a retry tail, which carry too much retrieval context, which send easy work to an expensive model. That feeds the same workflow analytics that finance and operations read (cost per workflow, success rate, escalation reasons), and it is the same evidence base described in our piece on cost per completed task. The control tower turns a hidden, per-token bill into a visible, per-workflow one you can actually drive down.

FAQ

Is cost per token ever the right metric?

For a single, one-shot call (a classification, a short summary), per-token cost and per-task cost are close enough to treat as the same number. The metric breaks down precisely when a task spans multiple model calls, tool round-trips, retrieval, or retries, which is the definition of an agentic workflow. For agents, measure cost per completed task and keep per-token rates only as an input to that calculation.

Do retries really add that much cost?

They add more than the raw token count suggests. A retry re-sends the growing context (including any retrieved documents), pays the tool-use system-prompt overhead again, and produces nothing if it also fails. A workflow with a 30 percent failure rate is paying full token cost for three in ten attempts that the business throws away, which inflates the cost per successful task even when each individual call looks cheap. Cutting the failure rate is often the single largest lever on cost per completed task.

Can I get this visibility without sending data to a third-party observability vendor?

Yes, and for regulated workloads you should. The metering, traces, and analytics can run entirely inside your own cloud, VPC, or on-prem environment, written to open telemetry conventions so they remain portable. That keeps sensitive prompt and tool-result content inside your boundary while still giving finance and engineering the cost-per-completed-task numbers they need. See our security posture and the platform overview for how this is set up. More on the broader pattern in our insights.