What you can and cannot prove about the AI models you rent from an outside lab

An honest account of what you cannot check about the leading AI models run by outside labs, why they keep it private, and what you can still control: your own tests, a record of every step, and locking the exact version you trust.

A frontier model is a large, general-purpose AI model trained and operated by an outside lab, and reached through that lab’s API. When a regulated team plans to put one inside a workflow, the honest answer to “can we audit it?” is: partly. You can verify what the model does on your inputs, here and now. You cannot verify how it was built, what changed last week, or why it refused a request. This post separates the two, fairly, so your compliance owner knows exactly where the evidence ends and where your own controls have to begin.

The point of this piece is not that frontier labs behave badly. There is no evidence that any lab deliberately weakens a model to harm customers or competitors, and we will not imply it. The point is narrower: there is a gap between what a closed model lets you see and what an auditor, a regulator, or your own general counsel needs to see. Naming that gap precisely is the first step to engineering around it.

What can you actually verify about a closed frontier model?

Quite a lot, as long as you stay on the outside of the model and observe its behavior.

You can verify outputs against your own inputs. You can run a fixed set of test cases and record what came back. You can measure accuracy, refusal rate, latency, and cost on tasks that matter to you. You can capture the full request and response, and you can compare two versions side by side on the same inputs. None of this requires the lab’s cooperation, because it is all observable from where you sit.

This is the verifiable surface, and it is the foundation of everything that follows. If you build your own evaluation set (a fixed, labeled collection of test cases that represent your real work) and run it on every model you consider, you have hard evidence about behavior. That evidence is yours, it is reproducible, and it does not depend on any vendor disclosure.
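
As a minimal sketch of such an evaluation run, the Python below scores a model on a fixed, labeled set and records accuracy, refusal rate, and latency. Everything named here is an assumption for illustration: `EvalCase`, `run_eval`, the prefix-based refusal check, and `stub_model`, which stands in for a call to whatever API you actually use.

```python
import time
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    prompt: str
    expected: str  # the label your own reviewers agreed on


# Assumption: a crude prefix check stands in for real refusal detection.
REFUSAL_PREFIXES = ("i can't", "i cannot", "i'm unable")


def run_eval(cases: Sequence[EvalCase], call_model: Callable[[str], str]) -> dict:
    """Run every case once and summarize observable behavior."""
    if not cases:
        raise ValueError("evaluation set is empty")
    correct = refused = 0
    latencies = []
    for case in cases:
        start = time.perf_counter()
        output = call_model(case.prompt)
        latencies.append(time.perf_counter() - start)
        text = output.strip().lower()
        if text.startswith(REFUSAL_PREFIXES):
            refused += 1
        elif text == case.expected.strip().lower():
            correct += 1
    n = len(cases)
    return {
        "n": n,
        "accuracy": correct / n,
        "refusal_rate": refused / n,
        "mean_latency_s": sum(latencies) / n,
    }


def stub_model(prompt: str) -> str:
    """Hypothetical stand-in for a real API call; returns canned answers."""
    canned = {
        "Is 7 prime?": "yes",
        "Is 9 prime?": "yes",  # wrong on purpose
        "Is 11 prime?": "Yes",
        "Is 15 prime?": "I cannot help with that.",
    }
    return canned[prompt]


CASES = [
    EvalCase("c1", "Is 7 prime?", "yes"),
    EvalCase("c2", "Is 9 prime?", "no"),
    EvalCase("c3", "Is 11 prime?", "yes"),
    EvalCase("c4", "Is 15 prime?", "no"),
]

report = run_eval(CASES, stub_model)
print(report["accuracy"], report["refusal_rate"])
```

In practice the exact-match scoring would be replaced by whatever grading your task needs, and the report would be stored with a date and the model identifier it was run against.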

What can you not verify, and why does it matter?

Five things sit on the far side of the gap. Each one is a place where your auditor’s questions run out of answers.

| What you cannot verify | Why it matters to a regulated team |
| --- | --- |
| Exact training data | You cannot confirm what the model learned from, whether your sector’s data was included, or whether copyrighted or regulated content shaped it. |
| The hidden system prompt | A system prompt is the standing instruction the lab attaches to every request. It can change refusal behavior and tone without notice, and most labs do not publish it. |
| Silent mid-version updates | A model behind the same name can be retuned. Your validation from last quarter may describe a model that no longer exists. |
| The real effect of safety tuning | Refusals and hedging are deliberate, but their effect on a specific task (a legal redline, a clinical summary) is hard to measure from the outside and harder to predict. |
| Benchmark and eval reproducibility | Public scores can drift, leak, or be optimized against. A leaderboard number is not evidence that the model will perform on your work. |

Two of these deserve a closer look, because they are the ones that surprise non-technical leaders.

Models change while keeping the same name

The cleanest documented example is a 2023 study from Stanford and UC Berkeley, “How Is ChatGPT’s Behavior Changing over Time?”. The researchers compared the March 2023 and June 2023 versions of GPT-4 and GPT-3.5 on identical tasks. On one math task (identifying prime numbers), the March version of GPT-4 scored 97.6 percent and the June version of GPT-4 scored 2.4 percent. Other tasks moved the other way. The study does not prove the model got “worse” overall, and the authors are careful about that. It shows something simpler and more important for governance: the same product name can route to materially different behavior over time, and you will not be told the day it changes.

97.6% → 2.4%: GPT-4’s accuracy on one prime-number task between March and June 2023, under the same product name. Source: Chen, Zaharia, Zou (2023), arXiv:2307.09009.

This is why a one-time validation is not a control. It is a snapshot of a moving target.
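
One way to turn the snapshot into a recurring check is to rerun the same evaluation on a schedule and compare per-task scores against a stored baseline. The sketch below is illustrative only: `find_drift`, the task names, and the 0.05 tolerance are assumptions. The prime-number figures are the ones reported in the study above, and the summary-task figures are made up for contrast.

```python
def find_drift(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.05,
) -> dict[str, float]:
    """Return the score change for every shared task that moved more than `tolerance`."""
    drifted = {}
    for task, old_score in baseline.items():
        if task not in current:
            continue  # tasks missing from the new run are not compared here
        delta = current[task] - old_score
        if abs(delta) > tolerance:
            drifted[task] = round(delta, 4)
    return drifted


# Scores are fractions between 0 and 1, keyed by task name.
march_scores = {"prime_identification": 0.976, "summary_quality": 0.80}
june_scores = {"prime_identification": 0.024, "summary_quality": 0.82}

drift = find_drift(march_scores, june_scores)
print(drift)
```

A non-empty result is a signal to revalidate before the next production run, whichever direction the score moved.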

Benchmarks are not evidence about your work

Public benchmarks are useful for comparing models in general, but they have known limits when openness meets competition. A 2025 analysis, “Pitfalls of Evaluating Language Models with Open Benchmarks”, shows how open test sets can leak into training and how models can be optimized against published tasks, which inflates scores without improving real-world capability. Even Stanford’s HELM project, which is built specifically for transparent and reproducible evaluation, exists because reproducibility is hard and worth engineering for. The takeaway for a buyer is plain: a benchmark tells you how a model did on someone else’s questions, not how it will do on yours.

Why do labs keep this private? The fair case

It would be easy to read the list above as a list of grievances. It is fairer to steelman the lab’s reasoning first.

Safety. Publishing the full system prompt and the exact tuning recipe makes jailbreaks (deliberate attempts to bypass safety controls) easier to engineer. A lab withholding details is, in part, protecting the controls that keep the model from producing harmful output.

Intellectual property. Training data composition, data-cleaning methods, and tuning techniques are the lab’s core competitive asset. Few commercial software companies publish their full source code and build pipeline, and a frontier model is among the most expensive artifacts ever built. Demanding full disclosure is not realistic and would slow the field.

Abuse prevention. Detailed disclosure of how a model can be steered is also a manual for steering it badly. Some opacity is a security feature, not a cover-up.

These are real reasons, and a serious buyer should grant them. The mistake is to conclude that because the lab’s secrecy is reasonable, the enterprise’s audit problem is solved. It is not. Two things can be true at once: the lab is acting in good faith, and you still cannot show your regulator the evidence they require.

Doesn’t regulation close the gap?

Partly, and it helps, but not enough to rely on alone.

The EU AI Act now requires providers of general-purpose AI models to publish a summary of the content used to train the model, using a mandatory template the European Commission released in July 2025, with obligations that began applying on 2 August 2025. This is genuine progress. It is also, by design, a summary. It tells you the categories and main sources of training data. It does not give an auditor the dataset, the tuning record, or a guarantee that the model you call today matches the one described. Models already on the market before 2 August 2025 also have until 2 August 2027 to comply, so the disclosure landscape will be uneven for a while.

Some labs go further voluntarily. Anthropic, for example, publishes the system prompts for its Claude models and updates them with releases, which is more than most labs do. That is worth crediting. But voluntary transparency is exactly that: voluntary, partial, and revocable. A control you cannot enforce is not a control you can put in front of an auditor.

So what can you control? Move the evidence to your side

The transparency gap is real and it is not going to close from the lab’s side soon. The practical response is not to win an argument about disclosure. It is to relocate the evidence you need to your own boundary, where you can produce it on demand. Four controls do most of the work, and each is something a regulated team can hold in its own hands.

  1. Your own evals on your own data. Build a fixed, labeled evaluation set from your real tasks and run it against every model and every version before and after you put it into production. This converts “we trust the vendor” into “here is the score on our work, dated and reproducible.” This is the heart of what we describe in the cost per completed task: quality has to be measured on the job, not on a leaderboard.

  2. Your own execution traces. Record every request, response, tool call, and decision, and keep the records inside your environment. A trace is the difference between “the agent did something” and “here is exactly what it did, with what input, under which policy version.” This is the audit trail an auditor actually asks for, and it lives on your side of the line. See how we structure this in the control tower.

  3. A pinned or owned model. Pin to a specific dated model version so a silent update cannot move your target without your knowledge, and prefer open-weight models you can run inside your own boundary when the workload demands it. An owned model does not change underneath you. A pinned one changes only when you decide to move. Our view on keeping inference inside your perimeter is in “Inside the trust boundary”.

  4. An audit trail you hold. Keep the eval results, the traces, the version manifests, and the policy decisions in storage you control, with retention you set. When the regulator, the auditor, or your own general counsel asks for evidence, you produce it from your own systems, not from a vendor support ticket.
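
The trace and audit-trail controls above can be sketched as an append-only log in which each entry records the model version, policy version, request, and response. The hash chaining shown here is an added assumption, not something the controls require: it is one common way to make later edits to a stored record detectable. All names (`append_trace`, `verify_chain`, the field names, the version strings) are hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(body: dict) -> str:
    """Hash a trace body in a stable key order."""
    payload = json.dumps(body, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append_trace(log: list, *, model_version: str, policy_version: str,
                 request: str, response: str) -> dict:
    """Append one trace entry that commits to the entry before it."""
    body = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,
        "policy_version": policy_version,
        "request": request,
        "response": response,
        "prev_hash": log[-1]["entry_hash"] if log else GENESIS_HASH,
    }
    entry = {**body, "entry_hash": _digest(body)}
    log.append(entry)
    return entry


def verify_chain(log: list) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS_HASH
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "entry_hash"}
        if entry["prev_hash"] != prev_hash or _digest(body) != entry["entry_hash"]:
            return False
        prev_hash = entry["entry_hash"]
    return True


trace_log: list = []
append_trace(trace_log, model_version="example-model-2025-01-15",
             policy_version="policy-v3", request="Summarize clause 4.",
             response="Clause 4 limits liability to direct damages.")
append_trace(trace_log, model_version="example-model-2025-01-15",
             policy_version="policy-v3", request="Redline clause 7.",
             response="Proposed edit: replace 'shall' with 'may'.")
print(verify_chain(trace_log))
```

A real deployment would write these entries to storage you control, for example one JSON object per line, and would apply your own retention and redaction rules to the request and response fields.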

None of this requires the lab to open up. That is the point. You cannot audit what you cannot see, so you stop depending on seeing it. You measure behavior on your data, you record what happened, you pin what you can, and you keep the record. The model stays a partly-closed box. Your evidence does not.

The chart is a rough scale of how much evidence you can hold yourself, from none (0) to full (3). The first two stay near zero because they live inside the lab. Version stability moves up the moment you pin. Behavior and traces are fully yours. The strategy is to lean your compliance posture on the columns you can fill, not the ones you cannot.

This is the same reasoning that runs through our work on the trust boundary and our security posture: keep the regulated payload and the evidence of conformity inside infrastructure you own, and treat the frontier model as a capable but partly-opaque component you wrap in controls rather than trust on faith. For how this fits a graded rollout, see the autonomy ladder and the rest of our insights.

FAQ

Can I ever fully audit a closed frontier model?

No, not from the inside. You cannot verify its training data, its full system prompt, or the exact internal reason for a given output. You can fully audit its behavior on your inputs, and you can hold complete records of what it did inside your workflows. The realistic goal is a verifiable behavioral record under controls you own, not a full inspection of the model’s construction.

Is it fair to say labs hide things to hurt customers?

No. There is no evidence that frontier labs deliberately degrade models to harm customers or competitors, and asserting it would be both unfounded and defamatory. Labs withhold details for defensible reasons: safety, intellectual property, and abuse prevention. The legitimate concern is not bad faith. It is that reasonable secrecy still leaves an enterprise unable to produce the evidence its regulators require, which is a problem the enterprise has to solve on its own side.

Does pinning a model version remove the transparency gap?

It removes one specific risk: the silent mid-version update that changes behavior without notice. Pinning to a dated version means the model does not move underneath you until you choose to move it. It does not reveal the training data or the system prompt. To close more of the gap, combine pinning with your own evals, your own traces, and, where the workload justifies it, an open-weight model you run inside your own boundary.
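
A small guard can keep a floating alias out of production configuration. The sketch below assumes that a pinned identifier ends in a date, either `YYYY-MM-DD` or `YYYYMMDD`. Naming conventions differ between labs, so the pattern is an assumption to adapt, and `is_pinned`, `require_pinned`, and the example identifiers are hypothetical.

```python
import re

# Assumption: dated identifiers end in -YYYY-MM-DD or -YYYYMMDD.
DATED_SUFFIX = re.compile(r"-(\d{4}-\d{2}-\d{2}|\d{8})$")


def is_pinned(model_id: str) -> bool:
    """True when the identifier carries a date suffix rather than a moving alias."""
    return DATED_SUFFIX.search(model_id) is not None


def require_pinned(model_id: str) -> str:
    """Return the identifier unchanged, or refuse to proceed with an alias."""
    if not is_pinned(model_id):
        raise ValueError(f"model id {model_id!r} is not pinned to a dated version")
    return model_id


print(is_pinned("example-model-2025-01-15"))
print(is_pinned("example-model-latest"))
```

Calling `require_pinned` when configuration is loaded makes an unpinned alias fail at startup instead of surfacing later as unexplained drift.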