How to know when an AI agent is safe to let run on its own

The question we hear most often is phrased as a binary: should this agent be an assistant or should it be autonomous? Stated that way, the answer is always “assistant”, because nobody can defend “autonomous” without evidence, and the evidence does not exist yet. So the agent stays a glorified autocomplete forever, and the value case never closes.

The binary is the wrong frame. Autonomy is not a switch. It is a ladder, and you climb it one rung at a time, per workflow, with each promotion earned from traces rather than asserted in a roadmap. This is a playbook for the rungs, what changes at each, and the evidence you need to move up.

Why a ladder, not a switch

A single agent rarely does one thing. It drafts, it recommends, it acts, it escalates. Each of those behaviors carries a different blast radius. Drafting an email is reversible. Issuing a refund is not. Treating the whole agent as one autonomy setting forces you to govern the refund at the speed of the email, or the email at the cost of the refund. Neither is correct.

So the unit of autonomy is the workflow, not the agent. Different workflows sit at different rungs at the same time, and that is the normal, healthy state of a deployed system. The agent that drafts and sends customer replies at L4 can still be at L1 for anything touching billing.
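A minimal sketch of what "per workflow, not per agent" can look like in configuration, using the rung names from the next section. The workflow names and the default-to-L0 rule are illustrative assumptions, not a prescribed schema.

```python
from enum import IntEnum


class Rung(IntEnum):
    L0_ASSISTANT = 0
    L1_DRAFT_PLUS_APPROVAL = 1
    L2_RECOMMENDED_ACTION = 2
    L3_EXECUTE_WITH_APPROVAL = 3
    L4_EXECUTE_UNDER_POLICY = 4
    L5_AUTONOMOUS = 5


# Hypothetical workflow names: one agent, several rungs at the same time.
WORKFLOW_RUNGS = {
    "customer_reply": Rung.L4_EXECUTE_UNDER_POLICY,
    "billing_change": Rung.L1_DRAFT_PLUS_APPROVAL,
}


def rung_for(workflow: str) -> Rung:
    # Assumption: an unlisted workflow starts at the bottom rung.
    return WORKFLOW_RUNGS.get(workflow, Rung.L0_ASSISTANT)


def needs_human_per_run(workflow: str) -> bool:
    # Up to and including L3, a human acts on or approves every run.
    return rung_for(workflow) <= Rung.L3_EXECUTE_WITH_APPROVAL
```

The lookup is keyed by workflow, so promoting customer replies never changes what the same agent may do to billing.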

The other reason for a ladder is that promotion has to be falsifiable. “We made it autonomous” is a claim. “Over 4,200 traced runs the recommended action matched the approved action 98.1 percent of the time, with a documented rollback for the 1.9 percent” is a finding. The ladder turns autonomy into something an examiner can check.

The six rungs

L0 - Assistant

The agent answers questions and produces content inside a human’s working session. It does not touch systems of record. There is no action to approve because there is no action. This is where every workflow starts.

What changes to leave L0: nothing automatic. You leave L0 only when a specific workflow has a defined action, a defined owner, and a place to record what happened. If you cannot name the trace store, you are not ready to climb.

L1 - Draft plus approval

The agent proposes a concrete artifact (a reply, a ticket, a config change) and a human edits and submits it. The human is still the actor of record. The value here is speed of drafting, not delegation.

Evidence to promote from L1: an approval rate you can read off traces. If reviewers accept the draft with light edits most of the time, the workflow is a candidate for L2. If they rewrite it wholesale, stay at L1 and fix the drafting first. Promotion is not a calendar event.

L2 - Recommended action

The agent names the action it would take and the parameters it would use, but still does not execute. A human picks from recommendations. The difference from L1 is that the agent is now reasoning about effects on systems, not just producing text.

Evidence to promote from L2: recommendation quality measured against what humans actually chose, plus a defined eval set for the action space. You want the match rate between recommended and chosen action, broken down by action type, because the aggregate can hide a dangerous tail.
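To see how an aggregate can hide a tail, here is a small sketch that computes the match rate per recommended action type. The action names and counts are made up for illustration; only the arithmetic is the point.

```python
from collections import defaultdict


def match_rate_by_action(runs):
    """runs: iterable of (recommended_action, chosen_action) pairs."""
    totals = defaultdict(int)
    matches = defaultdict(int)
    for recommended, chosen in runs:
        totals[recommended] += 1
        if recommended == chosen:
            matches[recommended] += 1
    return {action: matches[action] / totals[action] for action in totals}


# Illustrative data only: 95 replies that matched, 5 refunds of which 2 matched.
runs = (
    [("send_reply", "send_reply")] * 95
    + [("issue_refund", "issue_refund")] * 2
    + [("issue_refund", "escalate")] * 3
)

overall = sum(rec == chosen for rec, chosen in runs) / len(runs)
per_action = match_rate_by_action(runs)

print(overall)     # 0.97
print(per_action)  # {'send_reply': 1.0, 'issue_refund': 0.4}
```

A 97 percent aggregate looks promotable. The 40 percent match rate on refunds, the only irreversible action in the set, says otherwise.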

L3 - Execute with approval

The agent executes the action itself, but only after a human approves that specific instance. This is the first rung where the agent is the actor. The approval is per run, in line, with a record of who approved and what they saw.

Evidence to promote from L3:

Approval rate above a threshold you set per workflow, sustained over a meaningful volume, not a demo week.
Failure rate of executed actions, including silent failures where the action ran but produced the wrong outcome.
Replay coverage: the share of executed runs you can reconstruct from traces, inputs, tool calls, and outputs included. If you cannot replay it, you cannot defend it.
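The three measures above can be read off run records with a few lines of arithmetic. This is a sketch under assumed field names (`approved`, `executed`, `outcome_ok`, `replayable`); your trace schema will differ.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    approved: bool    # a human approved this specific instance
    executed: bool    # the agent actually ran the action
    outcome_ok: bool  # False covers hard errors and silent wrong outcomes
    replayable: bool  # inputs, tool calls, and outputs are all in the trace


def _rate(hits, total):
    # None rather than 0.0: "no data" must not read as "no failures".
    return hits / total if total else None


def l3_evidence(records):
    executed = [r for r in records if r.executed]
    return {
        "approval_rate": _rate(sum(r.approved for r in records), len(records)),
        "failure_rate": _rate(sum(not r.outcome_ok for r in executed), len(executed)),
        "replay_coverage": _rate(sum(r.replayable for r in executed), len(executed)),
    }
```

Failure rate and replay coverage are computed over executed runs only, since a rejected proposal has no outcome to fail and nothing to replay.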

L4 - Execute under policy

The agent executes without per-run approval, inside an explicit policy envelope: allowed actions, value limits, rate limits, data scopes, and the conditions that force an escalation. Humans no longer see every run. They see the ones policy flags and a sample of the rest.
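One way to picture the envelope is as a pure function that returns either "execute" or "escalate". The sketch below covers allowed actions, a value cap, a rate limit, and one escalation condition; data scopes are left out for brevity. The field names and the specific limits are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicyEnvelope:
    allowed_actions: frozenset
    value_cap_eur: float
    max_runs_per_hour: int


def decide(policy, action, value_eur, runs_this_hour, fraud_flag=False):
    """Return 'execute' or 'escalate'. Anything outside the envelope escalates."""
    if action not in policy.allowed_actions:
        return "escalate"
    if value_eur > policy.value_cap_eur:
        return "escalate"
    if runs_this_hour >= policy.max_runs_per_hour:
        return "escalate"
    if fraud_flag:
        return "escalate"
    return "execute"


# Hypothetical refund policy; the numbers are placeholders, not recommendations.
refund_policy = PolicyEnvelope(
    allowed_actions=frozenset({"issue_refund"}),
    value_cap_eur=200.0,
    max_runs_per_hour=100,
)
```

The default path is escalation: the agent executes only when every check passes, so a missing rule fails toward a human rather than toward autonomy.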

Evidence to promote from L4:

A written policy that a non-engineer can read, with each rule traceable to a control (RBAC, value caps, KMS-scoped credentials, allowed tools).
Business-outcome quality, not just task success. Did the refunds reduce contact volume without inflating fraud? Task-level pass rates lie when the business effect is bad.
A rollback plan that has been exercised, not just written. You should have actually reverted a batch of agent actions in a drill and measured how long it took.

L5 - Autonomous, humans on exceptions

The agent runs the workflow end to end. Humans are pulled in only on exceptions the policy raises, plus scheduled audits. This is the top rung, and most workflows never reach it, nor should they.

Evidence to hold L5: everything from L4, sustained, plus continuous evals that run against live traffic and alert on drift. L5 is not a destination you arrive at and forget. It is a state you keep earning, because models change, data changes, and the policy that was safe last quarter may not be safe now.

The evidence gates, made explicit

“Promote on evidence” is only useful if the evidence has numbers attached. The thresholds below are starting points, set per workflow against its blast radius, but the shape is what travels. A low-risk drafting workflow can run looser; a workflow that moves money runs tighter.

Gate	L1 to L2	L3 to L4	L4 to L5
Approval / match rate	drafts accepted with light edits most of the time	>= 95% over a meaningful volume	>= 98%, sustained
Failure rate (incl. silent)	n/a, no execution	below the human baseline for the task	below baseline, with drift alerts
Replay coverage	partial is fine	100% of executed runs	100%, continuous
Outcome quality	reviewer judgement	business metric moves the right way	business metric holds over time
Rollback	not required	tested in a drill, timed	exercised under real load

A worked promotion for a support-refund workflow makes the table concrete. At L3 the agent has issued refunds under per-run approval for six weeks across 4,200 runs. The approval rate is 98.1%, the executed-action failure rate is below the human team’s own error baseline, and every run replays from its trace. Operations runs a rollback drill: a batch of 50 refunds is reverted in eleven minutes. That clears the L3-to-L4 gate, so the workflow moves to a narrow L4 with a 200 EUR value cap and automatic escalation on anything above it or flagged as fraud-adjacent. L5 is not on the table for this workflow, because a wrongful refund touches money and the residual tail is not one the team wants to run unattended.
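The same gate can be written as a check that returns the list of reasons a promotion is blocked. The run count, approval rate, replay coverage, and drill time below come from the worked example; the two failure rates, the outcome flag, and the `min_runs` floor are placeholder assumptions, since the example gives them only qualitatively.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class Evidence:
    runs: int
    approval_rate: float
    failure_rate: float
    human_baseline_failure_rate: float
    replay_coverage: float
    outcome_metric_improved: bool
    rollback_drill_minutes: Optional[float]  # None means never drilled


def l3_to_l4_blockers(e, min_runs=1000, min_approval=0.95):
    """Return the reasons promotion is blocked; an empty list clears the gate."""
    blockers = []
    if e.runs < min_runs:
        blockers.append("volume too low")
    if e.approval_rate < min_approval:
        blockers.append("approval rate below threshold")
    if e.failure_rate >= e.human_baseline_failure_rate:
        blockers.append("failure rate not below human baseline")
    if e.replay_coverage < 1.0:
        blockers.append("replay coverage below 100%")
    if not e.outcome_metric_improved:
        blockers.append("business metric did not improve")
    if e.rollback_drill_minutes is None:
        blockers.append("rollback never drilled")
    return blockers


refund_workflow = Evidence(
    runs=4200,
    approval_rate=0.981,
    failure_rate=0.004,                 # placeholder value
    human_baseline_failure_rate=0.010,  # placeholder value
    replay_coverage=1.0,
    outcome_metric_improved=True,       # placeholder value
    rollback_drill_minutes=11,
)

print(l3_to_l4_blockers(refund_workflow))  # []

# Same workflow with patchy traces and no drill: the gate names what is missing.
weak = replace(refund_workflow, replay_coverage=0.97, rollback_drill_minutes=None)
print(l3_to_l4_blockers(weak))
```

Returning reasons rather than a boolean keeps the decision auditable: a blocked promotion says what is missing.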

Promotion is read from traces, not promised in a plan

The recurring failure we audit is a roadmap that says “L5 by Q3” with no instrumentation to support any rung above L1. The dates are real, the agent ships, and the first time anyone looks at a trace is during the incident review.

Reverse it. Instrument first. Every run, at every rung, should emit a trace you can replay: the inputs, the tool calls, the outputs, the approver if any, and the policy decision if any. Promotion then becomes arithmetic over that record. You do not argue about whether the agent is ready. You read the approval rate, the failure rate, the replay coverage, and the outcome quality, and the numbers either clear the bar or they do not.
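A trace record that carries those five elements might look like the sketch below. The field names, the `args`/`result` keys on tool calls, and the replayability check are assumptions; real replay also needs model and prompt versions, which are omitted here.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class Trace:
    run_id: str
    workflow: str
    rung: int
    inputs: dict
    tool_calls: list = field(default_factory=list)  # dicts with name, args, result
    outputs: Optional[dict] = None
    approver: Optional[str] = None         # ideally a pseudonymous ID, if any
    policy_decision: Optional[str] = None  # e.g. "execute" or "escalate", if any


def is_replayable(trace: Trace) -> bool:
    # Minimal check: inputs and outputs exist, and every tool call
    # recorded both what it was given and what it returned.
    if not trace.inputs or trace.outputs is None:
        return False
    return all("args" in call and "result" in call for call in trace.tool_calls)


def to_json_line(trace: Trace) -> str:
    # One JSON object per line is easy to append to and to scan later.
    return json.dumps(asdict(trace), sort_keys=True)
```

Replay coverage is then the share of executed runs for which `is_replayable` returns true.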

Rule of thumb: if you cannot replay a run from its trace, that workflow cannot be promoted past L3. No replay, no autonomy.

This also makes demotion possible, which matters more than it sounds. A workflow at L4 that starts failing should fall back to L3 automatically, the same way a circuit breaker trips. Autonomy that can only go up is not governance. It is a ratchet pointed at your liability.
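The circuit-breaker idea can be sketched as a rolling window over recent outcomes. The window size, failure threshold, and minimum sample count are assumed tuning knobs, and the class name is hypothetical.

```python
from collections import deque


class DemotionBreaker:
    """Drops a workflow from L4 to L3 when recent failures exceed a threshold."""

    def __init__(self, window=200, max_failure_rate=0.02, min_samples=50):
        self.results = deque(maxlen=window)  # True = run succeeded
        self.max_failure_rate = max_failure_rate
        self.min_samples = min_samples
        self.rung = 4

    def record(self, ok: bool) -> int:
        self.results.append(ok)
        if self.rung == 4 and len(self.results) >= self.min_samples:
            failure_rate = self.results.count(False) / len(self.results)
            if failure_rate > self.max_failure_rate:
                self.rung = 3  # trip: back to per-run approval
        return self.rung
```

Demotion is automatic, but nothing in the class promotes the workflow back. Returning to L4 goes through the evidence gate again.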

A note on what you are instrumenting

The traces that justify promotion record the agent’s behavior, not the employee’s. That distinction is a design constraint, not a nicety. The principle is to trace agents deeply and trace humans carefully, and to optimize the workflow rather than the individual working it. Under the EU AI Act, Annex III places AI systems used to monitor and evaluate workers in the high-risk category, and Article 5 prohibits emotion recognition in the workplace except for medical or safety reasons. A trace lake that quietly turned into a productivity surveillance system would therefore be both a compliance problem and a trust problem. Keep the evidence about the workflow. Redact personal data at the gateway before it reaches the trace. The ladder is a way to govern machines, not to grade people.
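As a rough illustration of redaction at the gateway, the sketch below masks email addresses in free text and replaces a reviewer's identity with a keyed pseudonym before anything is written to the trace. The function names are hypothetical, and a single regex is nowhere near sufficient for real personal-data detection; treat it as the shape of the step, not the implementation.

```python
import hashlib
import hmac
import re

# Deliberately simple pattern: catches common email shapes only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_text(text: str) -> str:
    return EMAIL_RE.sub("[EMAIL]", text)


def pseudonymize(user_id: str, key: bytes) -> str:
    # Keyed hash: stable per person for audit, not reversible without the key.
    digest = hmac.new(key, user_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


print(redact_text("Customer jane.doe@example.com asked for a refund"))
# Customer [EMAIL] asked for a refund
```

A pseudonym keeps the approval record auditable (the same reviewer always maps to the same token) without turning the trace store into a per-person activity log.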

Do not skip rungs on consequential workflows

The pull toward L4 and L5 is strongest exactly where it is most dangerous: high-volume, high-value workflows where the labor savings look enormous. Payments, account changes, anything that touches regulated data under GDPR, anything inside a financial entity’s scope under DORA, anything an EU AI Act risk classification would flag.

For those, the rule is simple and unpopular: you do not get to L4 until traceability, evals, approvals, and a tested rollback all exist and have been exercised under load. Not designed. Exercised. The replay has to work on a real failed run, the eval set has to cover the tail cases, the approval queue has to have processed real volume, and someone has to have actually rolled back a batch and timed it.

A reasonable shape for a consequential workflow looks like this:

Weeks at L1 and L2 while the eval set and trace store mature.
A measured stretch at L3 where humans approve every execution and you accumulate approval and failure rates.
A narrow, scoped L4 with tight value caps and aggressive escalation, widened only as the numbers hold.
L5 reserved for the slices where the data has earned it, with continuous evals as a condition of staying there.

The ladder is slower than a roadmap that promises autonomy by a date. It is also the only version of the story you can defend in an audit, an incident review, or a regulator’s questionnaire. Climb it one rung at a time, per workflow, on evidence you can replay.