Running AI agents inside your own walls, where your data is allowed to live

Most failed agent deployments do not fail on model quality. They fail at the security review, when someone asks where the data goes and the answer is a diagram with a dotted line pointing at a third party. The fix is not a better pitch. It is a clear boundary, drawn before the first prompt leaves the building.

This is the trust boundary: the line that separates what you own and control from what a vendor merely supports. Draw it well and the security review becomes a list of one-line answers. Draw it badly and every question becomes a renegotiation.

Two planes, one line

The cleanest way to draw the boundary is to split the system into two planes.

The data plane is customer-owned. Prompts, source documents, embeddings, retrieval indexes, execution traces, and logs stay inside infrastructure the customer controls. This is the regulated payload. Under GDPR it is where personal data lives; under the EU AI Act it is where the records that prove conformity are generated. If any of it crosses the boundary, you have a transfer to account for, a processor to contract, and a residency claim to defend.

The control plane is vendor-supported and carries metadata only: deployment orchestration, observability dashboards, version manifests, policy definitions, health checks, and aggregate usage counters. The control plane knows that a workflow ran, which prompt version it used, whether it passed policy, and how long it took. It does not know what was in the document or what the user typed.

The discipline is in the word “only.” A control plane that ships full prompts back to the vendor for “debugging” has quietly moved the boundary. The test is concrete: if you severed the vendor connection at the network layer, the data plane should keep serving requests. Governance might pause. Service should not.

Rule of thumb: if losing the vendor link stops you serving customers, your data plane is not actually yours.
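
One way to keep the word "only" honest is an explicit allow-list at the point where events leave the data plane. The sketch below is a minimal illustration, not a real export API: the RunRecord shape, the field names, and to_control_plane_event are all hypothetical.

```python
from dataclasses import dataclass, asdict

# Hypothetical allow-list: the only fields permitted to cross the boundary.
CONTROL_PLANE_FIELDS = ("workflow_id", "prompt_version", "policy_passed", "latency_ms")


@dataclass(frozen=True)
class RunRecord:
    """Full record of one run. Illustrative shape; it stays in the data plane."""
    workflow_id: str
    prompt_version: str
    policy_passed: bool
    latency_ms: int
    prompt_text: str      # payload: never exported
    document_ids: tuple   # reveals what was read: never exported


def to_control_plane_event(record: RunRecord) -> dict:
    """Copy only allow-listed metadata; every other field is dropped."""
    full = asdict(record)
    return {name: full[name] for name in CONTROL_PLANE_FIELDS}


record = RunRecord(
    workflow_id="invoice-triage",
    prompt_version="v12",
    policy_passed=True,
    latency_ms=840,
    prompt_text="Summarise the attached invoice",
    document_ids=("doc-1",),
)
event = to_control_plane_event(record)
print(event)
```

An allow-list fails closed. A field added to the record later stays inside the boundary until someone deliberately adds it to the export list, which is the direction you want the default to point.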

The seven components inside the boundary

A data plane that holds the line is not one thing. It is seven components that fit together, and naming them is what lets a security reviewer trace a request end to end without leaving the boundary.

A private model layer. Inference runs inside the boundary on open-weight models served by vLLM, TGI, or Ollama, or on an approved frontier model reached through controls the customer owns. The weights and the runtime are in the customer’s account, not a vendor’s.
The AI gateway. The single policy point, covered in detail below. Every call passes through it.
A knowledge layer with permissions-aware retrieval. Agents retrieve only what the calling user or workflow is already allowed to see.
An agent runtime. Each agent is a governed object with an owner, a purpose, allowed tools, a budget, a risk tier, and an eval suite, not a loose prompt.
A trace lake. Every run is recorded in the customer’s environment under the OpenTelemetry GenAI semantic conventions, so the evidence is portable and not a vendor’s private format.
Evaluation and replay. Real traces become regression tests, so you can change a prompt, a model, a retriever, or a tool and prove the change is safe before it ships.
Workflow analytics. Cost per workflow, success rate, approval rate, automation rate, time saved, and escalation reasons, computed over the trace lake, feeding the decision to promote a workflow up the autonomy ladder.
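
To make "a governed object, not a loose prompt" concrete, the agent runtime entry above could be modelled roughly as follows. This is a sketch under assumptions: AgentSpec, the risk tier names, and the example values are hypothetical, and a real runtime would attach far more than this.

```python
from dataclasses import dataclass

RISK_TIERS = ("low", "medium", "high")  # hypothetical tier names


@dataclass(frozen=True)
class AgentSpec:
    """One agent as a governed object: owner, purpose, tools, budget, tier, evals."""
    name: str
    owner: str
    purpose: str
    allowed_tools: frozenset
    budget_usd: float
    risk_tier: str
    eval_suite: str

    def __post_init__(self):
        # Reject definitions that would leave the agent ungoverned.
        if not self.owner:
            raise ValueError("every agent needs a named owner")
        if self.risk_tier not in RISK_TIERS:
            raise ValueError(f"unknown risk tier: {self.risk_tier}")
        if self.budget_usd <= 0:
            raise ValueError("budget must be positive")

    def may_use(self, tool: str) -> bool:
        """A tool call is allowed only if the tool is on the agent's list."""
        return tool in self.allowed_tools


triage = AgentSpec(
    name="invoice-triage",
    owner="finance-ops",
    purpose="Classify incoming invoices",
    allowed_tools=frozenset({"search_invoices"}),
    budget_usd=50.0,
    risk_tier="low",
    eval_suite="evals/invoice_triage",
)
print(triage.may_use("search_invoices"), triage.may_use("send_email"))
```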

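The first six components, from the private model layer through evaluation and replay, live entirely in the data plane. Workflow analytics, the seventh, is computed there too. Only its aggregates, stripped of payload, are the kind of metadata the control plane is allowed to see.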
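
As a rough illustration of aggregates stripped of payload, the sketch below rolls per-run trace rows up into per-workflow numbers. The row shape and field names are assumptions made for the example, not a trace lake schema.

```python
from collections import defaultdict

# Hypothetical trace rows as they might sit in the trace lake.
traces = [
    {"workflow": "invoice-triage", "cost_usd": 0.25, "succeeded": True,
     "escalated": False, "prompt": "Summarise invoice 1"},
    {"workflow": "invoice-triage", "cost_usd": 0.75, "succeeded": False,
     "escalated": True, "prompt": "Summarise invoice 2"},
    {"workflow": "contract-review", "cost_usd": 0.5, "succeeded": True,
     "escalated": False, "prompt": "Flag unusual clauses"},
]


def aggregate(rows):
    """Roll runs up per workflow; prompts and other payload are never copied."""
    totals = defaultdict(lambda: {"runs": 0, "cost": 0.0, "ok": 0, "escalated": 0})
    for row in rows:
        t = totals[row["workflow"]]
        t["runs"] += 1
        t["cost"] += row["cost_usd"]
        t["ok"] += int(row["succeeded"])
        t["escalated"] += int(row["escalated"])
    return {
        workflow: {
            "runs": t["runs"],
            "cost_per_run_usd": t["cost"] / t["runs"],
            "success_rate": t["ok"] / t["runs"],
            "escalation_rate": t["escalated"] / t["runs"],
        }
        for workflow, t in totals.items()
    }


summary = aggregate(traces)
print(summary["invoice-triage"])
```

The output dictionary is the only thing that would be eligible to leave the boundary. Small cohorts can still leak information through aggregates, so a real deployment would likely add a minimum run count before exporting a row.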

The AI gateway as the single policy point

Inside the data plane, the architecture needs one place where policy is decided rather than scattered across application code. That place is the AI gateway. Every agent call, retrieval, and tool invocation passes through it, which makes it the one component a security reviewer can read to understand what is enforced.

A gateway worth deploying handles, at minimum:

Routing. Which model serves which request, by task, tenant, or sensitivity class. Open weights for bulk classification, an approved frontier model for the cases that need it.
Auth and RBAC. Identity-aware authorization on every call, so a request inherits the caller’s permissions rather than the agent’s.
Budgets. Per-tenant and per-workflow spend and rate limits, enforced before the call, not reconciled on the invoice.
PII redaction. Detection and masking on the way in and out, so sensitive fields never reach a model that does not need them.
Prompt versioning. Every prompt pinned to a version, logged with the response, so a change in behavior maps to a change you can name.
Caching. Repeated prefixes and frequently retrieved context paid for once, not once per call.
Logging. Every call written to the trace lake, so the gateway is also the point where the evidence is generated.
Fallback. Defined behavior when a model is slow, down, or returns a policy violation: retry, downgrade, or refuse. Never silent failure.

Centralizing these turns governance into something you configure and audit in one file, not a property you hope holds across forty call sites.
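
The sketch below shows how several of these responsibilities can sit in one call path: budget check, redaction, routing with fallback, and logging. It is a toy under stated assumptions. The Gateway class, its fields, and the stand-in model functions are hypothetical, the email regex is deliberately naive, and auth, RBAC, and caching are left out to keep it short.

```python
import re
from dataclasses import dataclass, field

# Deliberately simple email pattern; real PII detection needs far more than this.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


class BudgetExceeded(Exception):
    """Raised before any model call when a tenant has no budget left."""


@dataclass
class Gateway:
    routes: dict    # sensitivity class -> model names, in fallback order
    models: dict    # model name -> callable(prompt) -> str (stand-ins for real clients)
    budgets: dict   # tenant -> remaining calls
    trace_log: list = field(default_factory=list)

    def call(self, tenant: str, sensitivity: str, prompt: str, prompt_version: str) -> dict:
        # Budgets: enforced before the call, not reconciled afterwards.
        if self.budgets.get(tenant, 0) <= 0:
            raise BudgetExceeded(tenant)
        self.budgets[tenant] -= 1

        # PII redaction on the way in.
        redacted = EMAIL.sub("[EMAIL]", prompt)

        # Routing and fallback: try each routed model in order, refuse if all fail.
        result = {"outcome": "refused", "model": None, "text": None}
        for name in self.routes[sensitivity]:
            try:
                text = self.models[name](redacted)
            except RuntimeError:
                continue  # downgrade to the next model in the route
            result = {"outcome": "ok", "model": name, "text": text}
            break

        # Logging: the outcome and prompt version land in the local trace log.
        self.trace_log.append({"tenant": tenant, "prompt_version": prompt_version, **result})
        return result


def flaky_primary(prompt: str) -> str:
    raise RuntimeError("model timed out")


def local_backup(prompt: str) -> str:
    return f"echo: {prompt}"


gateway = Gateway(
    routes={"internal": ["primary", "backup"]},
    models={"primary": flaky_primary, "backup": local_backup},
    budgets={"acme": 1},
)
result = gateway.call("acme", "internal", "Contact jane@example.com about the claim", "v3")
print(result)
```

Every path through call ends in either an exception or a logged result with an explicit outcome, which is what "never silent failure" looks like in code.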

Permissions-aware retrieval

Retrieval is where most data leaks hide. A naive RAG setup indexes everything and lets the agent fetch the nearest match, which means an agent can surface a document the user was never allowed to read. The index becomes a side channel around your access controls.

Permissions-aware retrieval closes it. The retrieval step filters on the caller’s identity and the workflow’s scope before similarity ranking, so the candidate set only ever contains documents the user may already see. The access control list travels with the query. An agent acting for a junior analyst cannot retrieve the board pack, because the board pack is filtered out before the model is ever asked to reason about it. This is the difference between an agent that respects your existing permissions and one that launders around them.
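
The ordering is the whole point, so it is worth seeing in code. The following is a minimal in-memory sketch, assuming group-based access lists and toy two-dimensional embeddings. Doc, retrieve, and the group names are hypothetical, and a production system would push the filter into the vector store query rather than filter in Python.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    allowed_groups: frozenset   # the access control list stored with the document
    embedding: tuple


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_embedding, caller_groups, index, k=3):
    # Filter on the caller's permissions first, so ranking never sees the rest.
    visible = [d for d in index if d.allowed_groups & caller_groups]
    ranked = sorted(visible, key=lambda d: cosine(query_embedding, d.embedding), reverse=True)
    return ranked[:k]


index = [
    Doc("board-pack", frozenset({"board"}), (1.0, 0.0)),
    Doc("analyst-handbook", frozenset({"analysts", "board"}), (0.8, 0.6)),
    Doc("holiday-policy", frozenset({"all-staff"}), (0.0, 1.0)),
]

# The query embedding matches the board pack exactly, but the caller is a junior analyst.
hits = retrieve((1.0, 0.0), frozenset({"analysts", "all-staff"}), index)
print([d.doc_id for d in hits])  # ['analyst-handbook', 'holiday-policy']
```

The board pack is the nearest match by similarity and still never appears, because it was removed from the candidate set before ranking ran.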

Choosing a deployment target

Where the data plane physically runs depends on the regulatory and threat profile of the team. There is no single right answer, only a defensible one.

VPC (virtual private cloud). The default for most regulated teams in finance, healthcare, and insurance. The data plane runs inside the customer’s own cloud account, under their KMS keys, their network policy, and their audit logging. The vendor reaches in through a scoped control-plane connection that carries metadata only. This addresses most of the operational-control expectations in DORA and NIS2 without giving up the economics of managed cloud.

On-prem or air-gapped. For defense, critical infrastructure, and the most sensitive intelligence and industrial work. The data plane runs on hardware the customer owns, often with no outbound internet path. The control plane, if present at all, syncs through a one-way data diode or by manual export. The private model layer matters most here, because an air-gapped system cannot call a hosted frontier API and depends on open weights served locally.

Sovereign region. For GDPR data residency where the requirement is that personal data stay within a specific jurisdiction. The data plane runs in an in-region deployment so that no Standard Contractual Clauses are needed for an EU-to-elsewhere transfer, because there is no transfer. The residency claim is structural, not contractual.

Edge or on-device. For computer vision, robotics, and field work where latency, intermittent connectivity, or bandwidth rule out a round trip to a central cluster. Inference runs on the device or a nearby edge node. The data plane is the device itself; the control plane collects metadata when connectivity allows. Smaller open-weight models earn their place here on size and latency grounds.

The boundary holds across all four. What changes is how far the vendor can reach and how much of the control plane survives. The data plane is always yours.

A model-agnostic stance

None of this assumes a single model vendor, and it should not. The gateway routes to whatever serves the request best: open weights where they fit on quality, cost, or deployability, and an approved frontier model where the task genuinely demands it. The trust boundary makes this practical. Because the gateway is the policy point and the data plane is yours, swapping a model is a routing change behind the boundary, not a new data-transfer assessment. The eval and replay component is what makes the swap safe: you run the candidate model against real traced cases and read the regression before you cut over. Lock-in to one model is a governance liability before it is a commercial one.
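
A replay check before a model swap can be as small as the sketch below. It assumes traced cases carry an input and an expected label, which is a simplification; the case shape, candidate_model, and passes are hypothetical stand-ins for a real client and a real eval.

```python
def replay(traced_cases, candidate_model, passes):
    """Run the candidate on each traced input and collect the cases it fails."""
    regressions = []
    for case in traced_cases:
        output = candidate_model(case["input"])
        if not passes(case, output):
            regressions.append({"trace_id": case["trace_id"], "got": output})
    return regressions


# Hypothetical cases drawn from real traces, with the outcome that was accepted.
traced_cases = [
    {"trace_id": "t-1", "input": "Refund request, order arrived damaged", "expected_label": "refund"},
    {"trace_id": "t-2", "input": "Where is my parcel?", "expected_label": "tracking"},
]


def candidate_model(text: str) -> str:
    # Stand-in for the model under evaluation.
    return "refund" if "refund" in text.lower() else "other"


def passes(case: dict, output: str) -> bool:
    return output == case["expected_label"]


regressions = replay(traced_cases, candidate_model, passes)
safe_to_cut_over = not regressions
print(regressions, safe_to_cut_over)
```

Here the candidate fails one traced case, so the routing change does not ship. Exact-match grading only suits classification-style outputs; free-text workflows would need a different passes function.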

What the security review actually asks

A real security review is not abstract. It is a list of pointed questions, and an in-boundary architecture answers each in one line.

Where does our data live? In your VPC, on-prem, or sovereign region. The data plane never leaves it.
What does the vendor see? Metadata only: versions, timings, policy outcomes, aggregate counts. No prompts, no documents.
Can an agent read data the user cannot? No. Retrieval filters on the caller’s permissions before ranking.
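What happens if a model is down? The gateway falls back per a defined policy: retry, downgrade, or refuse. Every request gets an explicit outcome, never a silent failure.
How do we handle cross-border transfers under GDPR? We do not make any. In a sovereign-region deployment personal data never leaves the jurisdiction, so no Standard Contractual Clauses (SCCs) are needed.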
Can you prove what a prompt did last Tuesday? Yes. Every prompt is versioned and logged with its response in your trace lake.
What is the blast radius if the vendor is breached? The control plane. Your data is not in it.
Who holds the encryption keys? You do, in your KMS. The vendor cannot decrypt your data plane.
Are you monitoring our employees? No. The traces instrument agents and workflows. Personal data in them is redacted at the gateway, and Annex III of the EU AI Act lists AI systems used to monitor and evaluate workers among its high-risk use cases, so the design optimizes workflows, not individuals.

If your architecture cannot answer these in single sentences, the review will surface that, usually late, after the build. The trust boundary is the design that makes the answers short. Draw the line first, run governance through one gateway, keep the data where it is allowed to live, and the rest of the deployment stops being an argument and starts being a checklist.