Map a workflow
Blog

AI models you can download and run yourself: when they are the right call

A sourced guide to running AI models you can download and run yourself (Llama, Mistral, Qwen) inside your own environment, whether your work is regulated or not: what that really means, how to host them privately, and a clear rule for when to choose them.

Abstract line-art of a balanced scale weighing an open padlock against a closed vault, on a limestone background with a verdigris accent.

For a regulated team, the question is rarely “which model is smartest.” It is “which model can we run where our data is allowed to live, prove what it did, and keep running when a vendor changes a policy.” On that test, open-weight models earn a place in the stack: you hold the weights, you pin the version, you serve them inside your own boundary. They are not the right call for every workflow. This is a practical guide to where they fit, how to run them, and the decision rule that keeps you honest.

What does “open-weight” actually mean, and how is it different from open-source?

The two terms are used interchangeably in marketing and they should not be. The distinction is not pedantic; it changes your legal review.

Open-weight means the provider publishes the trained model parameters (the weights). You can download them, run inference on your own hardware, and usually fine-tune them. What you typically do not get is the full training dataset, the data-processing pipeline, or the complete training code. You can use the model, but you cannot fully reproduce it.

Open-source, in the strict sense, is a higher bar. In October 2024 the Open Source Initiative published the Open Source AI Definition (OSAID 1.0), which requires that an AI system grant four freedoms (use, study, modify, share) “for any purpose and without having to ask for permission.” To qualify, a provider must release not just the weights but also “sufficiently detailed information about the data used to train the system so that a skilled person can build a substantially equivalent system” and “the complete source code used to train and run the system.” Most popular “open” models do not meet that bar.

The clearest example is Llama. The OSI and the Free Software Foundation have both declined to classify it as open source. The Llama 3.1 Community License is a custom commercial license, not an OSI-approved one: it carries an Acceptable Use Policy, an attribution requirement (“Built with Meta Llama”), and a field-of-use restriction (companies with over 700 million monthly active users must request a separate license from Meta). Those restrictions are commercially reasonable, but they violate the OSI’s “any purpose, without permission” freedom, which is exactly why Llama is open-weight and not open-source.

Other releases sit in different places. Mistral and Alibaba’s Qwen mix permissive Apache 2.0 releases with custom-licensed ones, so you have to read each model card. And in August 2025, OpenAI released gpt-oss-120b and gpt-oss-20b under Apache 2.0, its first open-weight models since GPT-2. The lesson for a regulated team is simple: do not assume the license from the label. Read it per model, because the license dictates what your legal team has to sign off on.

Why would a regulated team prefer open weights at all?

The case for open weights in a regulated environment is not about cost or ideology. It is about control, and it maps directly to the concerns a security review raises.

  • Residency and the boundary. When you serve the weights yourself, the prompt, the document, and the response never leave your environment. There is no transfer to a third party to account for under GDPR, and no external API in your data path. This is the foundation of keeping AI inside the trust boundary.
  • Version pinning. A hosted frontier model can be deprecated or silently updated. An open-weight checkpoint you have downloaded does not change underneath you. You decide when to upgrade, and you re-run your evals before you do. For a workflow that has to behave identically across an audit period, a pinned local checkpoint is a real control, not a preference.
  • No retention or training-use ambiguity. When inference runs on your hardware, there is no question about whether prompts are logged, retained, or used to train a future model. The answer is structural: they are not, because they never left.
  • Air-gapped and sovereign deployment. Some environments (defense, certain critical infrastructure) have no outbound internet path at all. There, a hosted API is not an option and an open-weight model served locally is the only way to get capable AI inside the perimeter.

There is a regulatory tailwind too. Under the EU AI Act, providers of models released under a genuinely free and open-source license get a narrower set of transparency obligations than closed-model providers (Article 53(2)), though the copyright policy and training-content summary obligations still apply, and the exemption falls away for models classified as posing systemic risk. That is a provider-side nuance, but it signals that the framework treats genuinely open models differently.

The steelman for the frontier vendor is fair: a managed API gives you the newest capability, safety tuning, and scaling without an MLOps team. The enterprise’s legitimate concern is equally fair: that same managed relationship is the thing you do not control. Both are true. The point is to choose per workflow, not per company.

How do you serve an open-weight model privately?

This is the part that used to be hard and is now routine. By early 2026, a small number of inference engines account for the large majority of production open-model serving, and they are all open-source and self-hostable.

Serving stackOptimizes forGood fit whenNotes
vLLMFlexibility and throughputMost teams, most modelsBroad model support, runtime adapter swapping, large community. The usual default.
TensorRT-LLM (NVIDIA)Maximum throughput on NVIDIA GPUsExtreme scale, dedicated GPU teamRequires a per-model compilation step and specific GPU architectures. Highest absolute throughput, highest complexity.
TGI (Hugging Face)SimplicityQuick stand-upsNow in maintenance mode; Hugging Face itself points users toward vLLM or SGLang.
Ollama / llama.cppLocal and edgeLaptops, small edge nodes, prototypingLowest barrier to entry, lower throughput. Good for development and on-device work.

Independent 2025 benchmarks give a feel for the throughput differences. One comparison on an 8xH100 cluster reported TensorRT-LLM, vLLM, and TGI in that order on large-batch throughput, with TensorRT-LLM highest and TGI lowest. The practical takeaway is not the exact ranking, which shifts with model, precision, and batch size. It is that vLLM gives you most of the performance with the least operational pain, which is why it is the recommended starting point for most enterprises. Reach for TensorRT-LLM only when you have measured a throughput need that justifies the compilation and hardware constraints.

How big is the quality gap, and is it really narrowing?

This is where careful sourcing matters, because the honest answer is “smaller than most executives assume, but not zero, and it moves.”

The cleanest measurement comes from Epoch AI, which tracks model capability on its Epoch Capabilities Index (ECI). Their October 2025 analysis found that, on average between January 2023 and that date, open-weight models lagged the closed state of the art by about 3.5 months (a 90% confidence interval of roughly 1 to 5 months), an average gap of around 7 ECI points. A subsequent update put the figure closer to four months. To anchor that: Epoch notes the ECI gap is similar in size to the difference between two recent versions of a single frontier model line, not a generational chasm.

~3 to 4 months how far the best open-weight models lag the closed state of the art, by capability index Epoch AI, 2025

The longer trend is the more useful story. The Stanford AI Index has tracked the top closed model’s lead over the top open model shrinking from double digits on knowledge benchmarks in 2023 toward low single digits by 2026. The inflection point most analysts cite is the January 2025 release of DeepSeek-R1, an open-weight reasoning model that performed near the closed frontier at a fraction of the cost, followed by strong releases from Qwen, Llama, and OpenAI’s gpt-oss.

Two caveats keep this honest. First, these are aggregate benchmarks; your workflow is not a benchmark. A model that tops MMLU may still trail on your specific legal-clause extraction or claims-adjudication task. The only number that matters is how a model scores on your evals, run against your real traced cases. Second, the gap is uneven: it tends to be smaller on knowledge and reasoning and larger on long-horizon agentic tasks, tool use, and the newest capabilities, where frontier labs ship first. So the right framing is not “open weights have caught up.” It is “open weights are close enough that, for many regulated workflows, the control you gain outweighs the few points you give up. For others, it does not, and you should pay for the frontier model.”

So what is the decision rule?

Keep it concrete and apply it per workflow, not per organization. Run the candidate model (open or closed) against your own eval suite of real, traced cases, and then weigh capability against control.

Choose open weights when…Choose an approved frontier model when…
Data residency or an air-gapped environment rules out an external APIThe task demands the top of the capability frontier and your evals confirm open weights fall short
You need a pinned, unchanging version across an audit periodYou lack the MLOps capacity to serve and maintain models, and the workflow is low-sensitivity enough to allow an API
The workflow is high-volume and cost-sensitive (classification, extraction, routing, drafting)The workflow is low-volume but high-stakes, where a few points of accuracy justify the premium and the data sensitivity is manageable
Your evals show the open model clears your quality bar for this taskYou need the newest agentic or tool-use capability that frontier labs ship first

The two columns are not a permanent split. A workflow can start on a frontier model to ship quickly, then move to an open-weight model once you have traces to evaluate against and the volume to justify self-serving. The reverse happens too: an open-weight workflow that hits a capability ceiling gets routed to a frontier model for the hard cases only. This is why the model-agnostic gateway matters. It turns “which model” from a one-time architectural bet into a routing decision you can revise with evidence.

What does not change is the boundary. Whichever model serves a given call, the prompts, documents, embeddings, traces, and evals stay in your environment. That is what makes the choice reversible and the audit trail intact. For the controls that hold this together, see our trust and security posture and the broader control tower.

The bottom line

Open-weight models are not a budget compromise and they are not a silver bullet. They are a control mechanism. They let a regulated team hold the version, keep the data inside the boundary, and remove a third party from the data path, at the cost of a few points of capability on the hardest tasks and the responsibility of serving them yourself. Run the license review first, run your own evals second, route everything through one gateway, and let the evidence (not the marketing) decide which model serves which workflow. For more on choosing deployment targets and where the data is allowed to live, see Inside the trust boundary and our other insights.

FAQ

Is an open-weight model the same as an open-source model?

No. Open-weight means the provider gives you the trained parameters to run and usually fine-tune, but typically not the full training data or training code. Open-source, per the OSI’s 2024 definition, requires use for any purpose without permission plus detailed data information and complete training code. Most popular models (including Llama) are open-weight under custom licenses, not open-source. Always read the specific model’s license.

Are open-weight models good enough to replace frontier models?

For many regulated workflows, yes; for the hardest ones, not yet. As of late 2025, Epoch AI measured the best open-weight models lagging the closed state of the art by roughly three to four months, with single-digit gaps on knowledge and reasoning. The honest answer depends on the task, so evaluate candidates against your own traced cases rather than relying on public benchmarks.

How do we run an open-weight model privately?

Serve it with an inference engine inside your own environment. vLLM is the usual starting point for its balance of performance and operational simplicity; TensorRT-LLM offers higher throughput on NVIDIA hardware at the cost of more complexity; Ollama and llama.cpp suit local and edge use. Put whichever engine you pick behind a single AI gateway so identity, budgets, redaction, and logging are enforced in one place.