Five agreements. Zero enforcement authority. That’s the architecture the US federal government has built for AI safety evaluation, and with Google DeepMind, Microsoft, and xAI now signed on alongside OpenAI and Anthropic, it’s as complete as it’s ever going to be under its current design.
NIST’s Center for AI Standards and Innovation (CAISI) confirmed the new agreements in an official announcement, adding the three labs to an evaluation framework that OpenAI and Anthropic joined in August 2024 under the Biden administration, per Politico’s reporting on the original agreements. The Biden-era agreements were the pilot. These are the expansion. Together, they cover the companies responsible for every frontier model currently in wide commercial deployment.
The question isn’t whether CAISI now has access. It does. The question is what that access actually means in practice, and what it structurally cannot do.
What the Agreements Cover
Per NIST’s language, the agreements provide for government evaluation of AI models before they’re publicly available, along with post-deployment information-sharing. CAISI can review models prior to release. It can receive data after deployment. What the NIST text describes is a framework built on collaboration and voluntary product improvements, not on regulatory authority to approve, delay, or block.
BBC’s reporting on the Google DeepMind, Microsoft, and xAI agreements confirmed the voluntary submission structure: labs agree to share their models for testing through CAISI. The testing is real. The pre-release access is real. The ability to act on what CAISI finds is not, at least not through any statutory mechanism currently in place.
This distinction matters. A lot.
The Voluntary Architecture: What It Can and Can’t Do
Voluntary frameworks operate on reputation, relationships, and shared interest. They work when labs want them to work. The five agreements exist because all five companies see value in demonstrating safety engagement with the federal government: it helps with procurement access and regulatory goodwill, and the alternative (a statutory regime designed without their input) would likely be less favorable.
That’s a genuine alignment of incentives, and it produces real outcomes. Pre-release evaluation happens. Findings get shared. The government builds institutional knowledge about frontier model capabilities that it otherwise wouldn’t have. None of that is nothing.
But voluntary frameworks have a structural ceiling. CAISI cannot tell Google DeepMind to delay a release. It cannot publish its evaluation findings without the lab’s cooperation. It cannot impose remediation requirements if it finds something concerning. When the agreement says “voluntary product improvements,” the word “voluntary” carries the full weight of the architecture.
The Hill reported that the agreements involve federal evaluation of models before release. What neither The Hill nor any other verified source confirms is what happens when that evaluation surfaces a problem. The agreements offer no public answer to that question, which tells you something about what the answer is.
The EU AI Act Contrast
The architecture looks different when placed next to what the EU has built. EU AI Act high-risk system requirements, which take effect August 2, 2026 for Annex III categories, mandate third-party conformity assessment for certain use cases, require Quality Management Systems and technical documentation, and establish enforceable penalties for non-compliance. The EU framework doesn’t ask for voluntary cooperation. It imposes legal obligation on any provider whose AI system affects persons within the EU, regardless of where that provider operates.
This isn’t an argument that EU-style regulation is the right answer for the US. It’s an observation about what “coverage” means in the two systems. CAISI’s all-five-labs achievement is significant in the context of the US voluntary architecture. But the US framework covers a much smaller fraction of what the EU’s mandatory framework covers by default, because coverage in the EU system comes from law rather than from negotiation.
The hub’s EU AI Act compliance coverage addresses what high-risk deployers must complete before August 2. The comparison is worth making explicitly for compliance teams operating across both jurisdictions: the EU framework creates legal obligations that don’t depend on whether your AI provider has signed an agreement with a government body.
What the Voluntary Gap Means for Enterprise Buyers
For organizations that deploy frontier AI in regulated industries (finance, healthcare, critical infrastructure), the CAISI architecture has a specific implication that tends to get lost in the headline coverage. The agreements govern what happens before a model is released. They don’t govern what deployers must do after they’ve adopted the model.
Enterprise AI buyers are not parties to the CAISI agreements. They don’t get access to CAISI’s evaluation findings. They don’t know what concerns, if any, were raised during pre-release review. The information-sharing that happens between CAISI and the labs doesn’t pass through to the organizations deploying those models at scale.
This creates a structural asymmetry: the federal government has evaluation access that enterprise buyers don’t. Whether that access produces information that those buyers would want, and whether any mechanism exists to get it to them, is a question the current architecture doesn’t answer.
What This Architecture Doesn’t Cover
Three gaps are worth naming precisely.
First, there’s no public disclosure requirement. CAISI’s evaluation results aren’t published. The labs aren’t required to disclose what evaluations found, what was changed as a result, or whether any findings were contested. The framework creates a private information channel between labs and government, not a public accountability mechanism.
Second, the agreements don’t address international coordination. CAISI’s evaluations happen on American frontier models. Models released by non-US labs, which include some of the most capable systems now in deployment, operate entirely outside this framework. The voluntary architecture is, by design, a relationship between the US government and US-based frontier labs.
Third, there’s no defined scope for what “pre-release evaluation” means in terms of what gets tested, by what methodology, or against what standards. NIST’s announcement describes the mechanism without specifying the rigor. What CAISI can evaluate is bounded by what CAISI has the staff, tools, and time to actually do.
What to Watch
The White House was reportedly weighing an executive order formalizing AI model review, as noted in this hub’s prior coverage, which would represent a shift from purely voluntary architecture toward something with more structural authority. Whether that EO materializes, and whether it addresses the enforcement gap or simply codifies the current voluntary model in more formal language, is the legislative and executive development most worth watching in this space.
The other signal to track is whether any of the five labs publicly discloses an evaluation finding, even a favorable one. Voluntary transparency from the labs themselves would tell the market something about what the CAISI process actually produces. Its absence is equally informative.
TJS Synthesis
CAISI’s full frontier coverage is a genuine milestone on the voluntary architecture’s own terms. It demonstrates that the US government can build evaluation relationships with all major domestic AI developers without statutory compulsion, which has value as a proof of concept for international AI governance discussions where mandatory frameworks face political resistance.
But the milestone also clarifies the ceiling. Five agreements without enforcement authority produce comprehensive access and no mechanism to act on what that access reveals. For compliance teams building AI governance programs, the policy implication worth considering is this: the federal voluntary framework doesn’t transfer any of its oversight to you. Your organization’s AI governance program has to account for the fact that the most detailed evaluations of the models you’re deploying aren’t available to you, and may not be available to anyone outside the government and the labs themselves.