Cost per finished job: the one AI number your CFO will actually trust

Most AI dashboards report cost per token or cost per 1,000 calls. Those numbers are easy to pull from a billing API and easy to chart. They are also close to useless for deciding whether a workflow is worth running. A CFO does not buy tokens. A CFO buys resolved tickets, cleared invoices, and drafted memos. The unit that maps to the business is the completed task, and that is the unit you should be measuring.

Why per-token cost lies in an agentic workflow

For a single prompt and a single response, cost per token and cost per task are roughly the same thing. Agentic workflows broke that equivalence, and they broke it quietly.

A modern agent does not make one model call to finish a job. It makes a plan, calls a tool, reads the result, decides the result is incomplete, retrieves more context, calls the model again, and sometimes loops. Four things now sit between “task started” and “task done,” and each one inflates token spend without appearing in a naive per-call view:

Retries. A failed tool call, a malformed JSON output, or a timeout triggers another attempt. The retries consume tokens and produce nothing the business can sell.
Tool calls. Each tool invocation typically requires the model to read the tool result and reason about it. A five-tool task can mean six or seven model round-trips.
Multi-step plans. Planner-executor patterns spend tokens on the plan itself, then on each step, then on a verification pass.
Retrieval. Stuffing retrieved documents into the prompt is often the single largest line item. A 2,000-token question can carry 18,000 tokens of context.

The result: two workflows can have identical per-token rates and wildly different costs per finished task. The one with a 30% retry rate and sloppy retrieval can cost three times as much per resolved ticket while looking identical on a token dashboard. Per-token cost measures the meter. Cost per completed task measures the bill.
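The gap can be made concrete with a few lines of arithmetic. The sketch below is illustrative only: the price, token counts, and completion counts are invented for the comparison, and `cost_per_completed_task` is a hypothetical helper, not part of any billing API.

```python
# Hypothetical figures for illustration; none come from a real billing export.
PRICE_PER_1K_TOKENS = 0.01  # assumed blended rate, identical for both workflows


def cost_per_completed_task(tokens_per_attempt: int, attempts: int, completed: int) -> float:
    """Total token spend across all attempts, divided by the tasks that finished."""
    total_spend = attempts * tokens_per_attempt / 1000 * PRICE_PER_1K_TOKENS
    return total_spend / completed


# Workflow A: tight retrieval, few retries, most attempts finish.
lean = cost_per_completed_task(tokens_per_attempt=6_000, attempts=1_000, completed=950)

# Workflow B: heavy retrieval and many burned attempts, same per-token price.
sloppy = cost_per_completed_task(tokens_per_attempt=14_000, attempts=1_000, completed=700)

print(f"lean:   ${lean:.3f} per completed task")
print(f"sloppy: ${sloppy:.3f} per completed task")
print(f"ratio:  {sloppy / lean:.1f}x at the same per-token rate")
```

Both workflows would show the same rate on a token dashboard; only the division by completed tasks exposes the difference.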

Attributing cost to a workflow, a role, and a team

You cannot compute cost per completed task if every model call lands in one undifferentiated bucket. Attribution is the prerequisite, and the place to do it is the gateway.

Route every model call through a single AI gateway so that no team is calling provider APIs directly. The gateway is the one policy point in the architecture, so it is also the natural place to meter. At the gateway, tag each request with three things at minimum:

Workflow ID (support-triage, invoice-clearing, memo-draft).
Role or step within the workflow (planner, retriever, drafter, verifier).
Owning team and a correlation ID that ties every call in one task to a single trace.
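As a minimal sketch of what that tagging might look like, the snippet below carries the three tags as a small record and flattens them into span attributes. The `RequestTags` class, the helper functions, and the `app.*` attribute keys are hypothetical names chosen for illustration. The `gen_ai.*` keys follow the OpenTelemetry GenAI semantic conventions as we understand them at the time of writing; check the current spec before relying on them.

```python
import uuid
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RequestTags:
    workflow_id: str     # e.g. "support-triage"
    role: str            # e.g. "planner", "retriever", "drafter", "verifier"
    team: str            # owning team, used for the finance roll-up
    correlation_id: str  # shared by every call in one task attempt


def new_task_tags(workflow_id: str, team: str) -> RequestTags:
    """Start a task attempt: mint one correlation ID for the whole trace."""
    return RequestTags(workflow_id, role="planner", team=team,
                       correlation_id=uuid.uuid4().hex)


def for_step(tags: RequestTags, role: str) -> RequestTags:
    """Same task, different step: only the role changes."""
    return replace(tags, role=role)


def to_span_attributes(tags: RequestTags, model: str,
                       input_tokens: int, output_tokens: int) -> dict:
    """Flatten the tags plus token usage into one attribute dict per call."""
    return {
        # Hypothetical custom keys; pick whatever namespace your gateway uses.
        "app.workflow.id": tags.workflow_id,
        "app.workflow.role": tags.role,
        "app.team": tags.team,
        "app.correlation_id": tags.correlation_id,
        # Names from the OpenTelemetry GenAI semantic conventions.
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }


tags = new_task_tags("support-triage", team="support-ops")
verifier_call = to_span_attributes(for_step(tags, "verifier"), "small-model", 1200, 150)
```

Minting the correlation ID once per task attempt, and only varying the role per step, is the design choice that lets every later call be summed under one identifier.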

The correlation ID is what turns scattered calls into a unit. The traces land in a trace lake inside your own environment, written to the OpenTelemetry GenAI semantic conventions so cost analysis is not hostage to one vendor’s export format. A trace collects every model call, tool call, retrieval, and retry for one task attempt under one identifier. Sum the token cost across that trace, divide by the number of tasks that actually completed (not started), and you have cost per completed task for that workflow. Roll it up by team for the finance view; drill into the trace for the engineering view. The gateway gives you the spend; the trace gives you the structure; the completion signal gives you the denominator.
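The roll-up described above can be sketched in a few lines. The record layout (`workflow`, `correlation_id`, `cost_usd`) is an assumed flattened export from the trace lake, not a real schema, and the set of completed correlation IDs is assumed to come from whatever system records the acceptance criterion.

```python
from collections import defaultdict

# Hypothetical flattened trace rows: one per model call, tool call, retrieval, or retry.
calls = [
    {"workflow": "support-triage", "correlation_id": "t1", "cost_usd": 0.40},
    {"workflow": "support-triage", "correlation_id": "t1", "cost_usd": 0.25},
    {"workflow": "support-triage", "correlation_id": "t2", "cost_usd": 0.90},  # attempt that failed
    {"workflow": "invoice-clearing", "correlation_id": "t3", "cost_usd": 0.30},
]

# Correlation IDs of task attempts that met their acceptance criterion.
completed = {"t1", "t3"}


def cost_per_completed_task(calls, completed):
    """Sum all spend per workflow, then divide by completed (not started) tasks."""
    spend = defaultdict(float)
    finished = defaultdict(set)
    for call in calls:
        wf = call["workflow"]
        spend[wf] += call["cost_usd"]  # failed attempts still count toward spend
        if call["correlation_id"] in completed:
            finished[wf].add(call["correlation_id"])
    # None signals "spent money, finished nothing" rather than dividing by zero.
    return {
        wf: (spend[wf] / len(finished[wf]) if finished[wf] else None)
        for wf in spend
    }


result = cost_per_completed_task(calls, completed)
```

Note that the failed attempt `t2` adds to the numerator but not to the denominator, which is exactly the effect a per-call view hides.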

Rule of thumb: if you cannot point to the trace that produced a number, the number is an estimate, not a measurement.

The completion signal matters as much as the cost. A task is “completed” only when it met its acceptance criterion: the ticket was resolved and not reopened within 48 hours, the invoice cleared reconciliation, the memo passed review. Counting started tasks instead of completed ones is the most common way teams flatter their own economics.
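For the ticket case, the acceptance criterion can be written down as a predicate. This is a sketch under assumptions: the function name and its arguments are hypothetical, and it treats a ticket as not yet countable until the 48-hour window has fully elapsed, which is one reasonable reading of the rule rather than the only one.

```python
from datetime import datetime, timedelta
from typing import Optional

REOPEN_WINDOW = timedelta(hours=48)


def ticket_completed(resolved_at: Optional[datetime],
                     reopened_at: Optional[datetime],
                     now: datetime) -> bool:
    """True only if the ticket was resolved and survived the reopen window."""
    if resolved_at is None:
        return False  # started, never resolved
    if reopened_at is not None and reopened_at - resolved_at < REOPEN_WINDOW:
        return False  # resolved, then reopened inside the window
    if now - resolved_at < REOPEN_WINDOW:
        return False  # window still open; too early to count it
    return True


resolved = datetime(2024, 1, 1, 9, 0)
print(ticket_completed(resolved, None, resolved + timedelta(hours=49)))
```

Holding tickets out of the count until the window closes means the metric lags by two days, but it keeps reopened tickets from ever appearing in the denominator.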

The five levers that actually move the number

Once you can measure cost per completed task, you can move it. In our experience the same five levers account for most of the improvement, listed here roughly in order of impact.

1. Model routing

Most workflows have an easy majority and a hard minority. Roughly 80% of support tickets are password resets, status checks, and known issues. Sending all of them to your largest model is the most expensive habit in production AI. Route the easy 80% to a small, fast model and reserve the large model for the cases a classifier flags as ambiguous or high-risk. This single change often halves cost per completed task on its own.
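A routing rule of this kind might look like the sketch below. Everything in it is an assumption for illustration: the model names are placeholders, the `Classification` record stands in for whatever your classifier returns, and the 0.85 confidence threshold is arbitrary.

```python
from dataclasses import dataclass

SMALL_MODEL = "small-fast-model"  # placeholder name
LARGE_MODEL = "large-model"       # placeholder name

ROUTINE_LABELS = {"password_reset", "status_check", "known_issue"}


@dataclass
class Classification:
    label: str         # predicted ticket category
    confidence: float  # classifier confidence in [0, 1]
    high_risk: bool    # e.g. billing dispute, legal, account takeover


def choose_model(c: Classification, min_confidence: float = 0.85) -> str:
    """Send confident, routine, low-risk tickets to the small model; everything else escalates."""
    routine = c.label in ROUTINE_LABELS
    if routine and c.confidence >= min_confidence and not c.high_risk:
        return SMALL_MODEL
    return LARGE_MODEL


print(choose_model(Classification("password_reset", 0.97, high_risk=False)))
print(choose_model(Classification("password_reset", 0.55, high_risk=False)))
```

Defaulting to the large model whenever any condition fails keeps the rule conservative: a misrouted easy ticket costs a little extra, while a misrouted hard ticket costs a failed run.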

2. Caching

Prompt prefixes, system instructions, and frequently retrieved documents repeat across thousands of tasks. Prompt caching at the gateway or provider level lets you pay for those tokens once rather than once per call. Caching pays off most where the repeated share of the prompt is largest, which is often the retrieval-heavy workflows that keep pulling the same documents.
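The effect is easy to estimate. The sketch below is a simplified model, not any provider's billing formula: `cached_rate` is an assumed fraction of the normal price charged for a cache hit (0.0 matches the "pay once" framing above), and it ignores cache expiry and any surcharge for writing to the cache.

```python
def billed_prompt_tokens(prefix_tokens: int, variable_tokens: int,
                         calls: int, cached_rate: float = 0.0) -> float:
    """Full-price token equivalents billed across `calls` (calls >= 1) with a cached prefix."""
    first_call = prefix_tokens + variable_tokens  # cache miss: pay for everything
    later_calls = (calls - 1) * (prefix_tokens * cached_rate + variable_tokens)
    return first_call + later_calls


# Hypothetical workload: 3,000-token shared prefix, 500 variable tokens, 1,000 calls.
uncached = billed_prompt_tokens(3_000, 500, 1_000, cached_rate=1.0)
cached = billed_prompt_tokens(3_000, 500, 1_000, cached_rate=0.0)
print(f"uncached: {uncached:,.0f}  cached: {cached:,.0f}")
```

The larger the shared prefix is relative to the variable part of the prompt, the larger the gap between the two figures.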

3. Retrieval quality

Bad retrieval costs twice: once for the wasted tokens you stuffed into the prompt, and again for the retries when the model could not find the answer in the noise. Tightening retrieval (better chunking, reranking, returning three relevant passages instead of fifteen mediocre ones) cuts prompt size and failure rate at the same time.
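The "three relevant passages instead of fifteen mediocre ones" step can be sketched as a filter on reranker output. The function name, the `(text, score)` input shape, and the score threshold are all assumptions; the reranker itself is out of scope here and is assumed to have already scored each passage.

```python
from typing import List, Tuple


def select_passages(scored: List[Tuple[str, float]], k: int = 3,
                    min_score: float = 0.5) -> List[str]:
    """Keep at most k passages, best first, dropping anything below min_score."""
    relevant = [p for p in scored if p[1] >= min_score]
    relevant.sort(key=lambda p: p[1], reverse=True)
    return [text for text, _ in relevant[:k]]


# Hypothetical reranker output: (passage, relevance score).
candidates = [
    ("reset flow for SSO accounts", 0.91),
    ("holiday support hours", 0.12),
    ("password policy", 0.78),
    ("reset email not arriving", 0.84),
    ("billing FAQ", 0.20),
    ("legacy reset flow", 0.55),
]
print(select_passages(candidates))
```

The threshold matters as much as `k`: returning two strong passages is cheaper and usually safer than padding the prompt to three with a weak one.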

4. Prompt size

Long, defensive system prompts and verbose few-shot examples are paid for on every call. Trim them to what the model demonstrably needs. Measure the quality before and after; do not trim blind.

5. Cutting failed runs

A run that fails after consuming tokens is pure cost with zero output. Early validation, schema-constrained outputs, and circuit breakers that stop a looping agent after N steps remove that waste. Cutting the failure rate from 30% to 8% directly cuts cost per completed task, because the same spend is divided across more completed tasks. If a failed attempt costs as much as a successful one, that change alone lowers cost per completed task by about 24% (1/0.70 attempts per completion versus 1/0.92).
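A step-limit circuit breaker is only a few lines. In this sketch, `run_agent`, `StepLimitExceeded`, and the `step` callback are hypothetical names: `step` stands in for one plan, tool, and model iteration and returns a final answer or `None` if the agent wants to continue.

```python
from typing import Callable, Optional


class StepLimitExceeded(Exception):
    """Raised when the agent has not finished within its step budget."""


def run_agent(step: Callable[[int], Optional[str]], max_steps: int = 8) -> str:
    """Run the agent loop, stopping hard after max_steps iterations."""
    for i in range(max_steps):
        result = step(i)
        if result is not None:
            return result  # finished within budget
    # Stop spending tokens; the caller can escalate to a human instead.
    raise StepLimitExceeded(f"no result after {max_steps} steps")


# A stand-in agent that finishes on its third iteration.
print(run_agent(lambda i: "resolved" if i == 2 else None))
```

Raising instead of returning a partial answer forces the caller to handle the failure explicitly, so a stopped run is recorded as a failed attempt rather than counted as a completion.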

A worked example: support triage

The numbers below are illustrative, but the shape is one we see repeatedly. The point is the arithmetic, so the arithmetic is shown.

A support triage agent handles 10,000 ticket attempts a month. The first version routes everything to a large model, retrieves fifteen knowledge-base passages per ticket, and has a 28% failure rate: tickets that escalate to a human after the agent gives up, having already spent tokens.

Metric	v1 (before)	v2 (after levers)
Ticket attempts / month	10,000	10,000
Tokens per attempt	~20,000 (heavy retrieval)	~4,000 (routed + reranked)
Blended cost per attempt (rounded)	$1.01	$0.35
Failure rate	28%	9%
Completed tickets / month	7,200	9,100
Total monthly spend	$10,080	$3,486
Cost per completed ticket	$1.40	$0.38

Read the two columns against each other and the levers are visible. Cost per attempt fell because a classifier sent the routine 80% to a small model and reranking cut retrieval from fifteen passages to three, with cached system prefixes on top. The completed-ticket count rose because a step limit and an output schema cut the failure rate from 28% to 9%, so fewer attempts burned tokens without producing a resolved ticket. The denominator (completed tickets) grew as the waste came out, and the numerator (total spend) shrank at the same time.
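The table's arithmetic can be checked directly. The sketch below uses only the illustrative figures from the example (10,000 attempts, the two failure rates, and the two monthly totals); `summarize` is a hypothetical helper.

```python
def summarize(attempts: int, failure_rate: float, monthly_spend: float) -> dict:
    """Derive the per-attempt and per-completed figures from the raw totals."""
    completed = round(attempts * (1 - failure_rate))
    return {
        "completed": completed,
        "cost_per_attempt": monthly_spend / attempts,
        "cost_per_completed": monthly_spend / completed,
    }


v1 = summarize(attempts=10_000, failure_rate=0.28, monthly_spend=10_080)
v2 = summarize(attempts=10_000, failure_rate=0.09, monthly_spend=3_486)

for name, s in (("v1", v1), ("v2", v2)):
    print(f"{name}: {s['completed']:,} completed, "
          f"${s['cost_per_attempt']:.2f} per attempt, "
          f"${s['cost_per_completed']:.2f} per completed ticket")
```

The per-attempt figures in the table are the monthly totals divided by attempts and rounded to the cent, which is why multiplying them back out does not reproduce the totals exactly.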

The per-token rate barely moved on the hard 20%; the large model still costs what it costs. What moved was the mix, the waste, and the prompt weight. That is the point. You optimize the workflow, not the token. The same arithmetic feeds the workflow analytics layer that finance and operations actually read: cost per workflow, success rate, escalation reasons, time saved, and which workflows have earned the right to more autonomy.

The autonomy decision rides on this metric

Cost per completed task is not only a finance number. It is the gate for increasing autonomy.

The temptation, once an agent works, is to widen its scope and reduce human review. Do that on quality data alone and you will eventually ship a workflow that is accurate but unaffordable, or affordable but quietly degrading. The discipline is to require both:

Quality proven: the workflow meets its acceptance criterion at the target rate, measured on completed tasks, over a representative period.
Cost proven: cost per completed task is stable and within the budget finance signed off on, with no upward drift hidden behind a flat per-token rate.

Only when both hold do you remove a checkpoint or expand the agent’s mandate. Each step toward more autonomy is a decision you should be able to defend with two numbers from the same traces: what it costs to finish the task, and how often it finishes it correctly. If you can produce only one of those, you are not ready to automate further.
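The two-number gate can be expressed as a single check. This is one possible encoding, under stated assumptions: `WorkflowStats`, the field names, and the 5% drift tolerance are hypothetical, and "stable" is approximated here as the latest period's cost not exceeding the earliest by more than that tolerance.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WorkflowStats:
    success_rate: float              # share of tasks meeting the acceptance criterion
    cost_per_completed: List[float]  # one value per period, oldest first


def ready_for_more_autonomy(stats: WorkflowStats, target_success_rate: float,
                            budget_per_task: float, max_drift: float = 0.05) -> bool:
    """Require quality proven AND cost proven before removing a checkpoint."""
    costs = stats.cost_per_completed
    if len(costs) < 2:
        return False  # not enough history to call the cost stable
    quality_proven = stats.success_rate >= target_success_rate
    within_budget = all(c <= budget_per_task for c in costs)
    drift = (costs[-1] - costs[0]) / costs[0]
    cost_proven = within_budget and drift <= max_drift
    return quality_proven and cost_proven


steady = WorkflowStats(success_rate=0.93, cost_per_completed=[0.38, 0.39, 0.38])
print(ready_for_more_autonomy(steady, target_success_rate=0.90, budget_per_task=0.45))
```

Both inputs should be computed from the same traces, so that the success rate and the cost describe the same set of completed tasks.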

A token dashboard cannot answer that question. Cost per completed task can, which is exactly why it is the metric your CFO will trust and the one your engineers should be instrumenting first.