The voluntary era of US AI safety commitments may be ending. What replaces it is still forming.
Reporting from The Hindu and Tom’s Hardware describes Google DeepMind, Microsoft, and xAI as having agreed to allow US government pre-release testing of their AI models. The framework reportedly builds on the CAISI pre-deployment review program announced and covered here on May 5. The structure involves Model Access Agreements, formal frameworks that reportedly govern what government testers can do with pre-release model access, what findings trigger holds on public release, and what obligations participating labs incur.
None of this has been confirmed through official lab announcements or published government policy text. Every characterization in this brief that goes beyond established prior coverage rests on journalism-tier sourcing and must be treated accordingly. That caveat isn’t boilerplate; it’s the single most important thing compliance teams need to understand before using this information in their planning.
What the policy architecture reportedly requires
Model Access Agreements, as described in secondary reporting, appear to be the operational mechanism through which pre-release testing works. The structure, as reported, involves three components: access protocols (what government testers receive and under what conditions), evaluation criteria (what testing assesses and what findings constitute a “hold” on public release), and obligation triggers (what happens if testing surfaces a safety concern).
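For compliance teams that want to track these agreements as terms are confirmed, the reported three-part structure can be modeled as a simple internal record. The sketch below is illustrative only: the class, field names, and status values are assumptions for internal tracking, not terms drawn from any published Model Access Agreement.

```python
from dataclasses import dataclass, field
from enum import Enum


class HoldStatus(Enum):
    """Hypothetical release-hold states; no official terminology is confirmed."""
    NO_HOLD = "no_hold"
    UNDER_REVIEW = "under_review"
    HOLD_ON_RELEASE = "hold_on_release"


@dataclass
class ModelAccessAgreement:
    """Illustrative record mirroring the three reported MAA components.

    Field names are assumptions for internal tracking, not official MAA terms.
    """
    lab: str
    model: str
    access_protocols: list[str] = field(default_factory=list)        # what testers receive, under what conditions
    evaluation_criteria: list[str] = field(default_factory=list)     # what testing assesses; what constitutes a hold
    obligation_triggers: dict[str, str] = field(default_factory=dict)  # finding -> required lab response
    hold_status: HoldStatus = HoldStatus.UNDER_REVIEW
```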
The legal authority basis for any mandatory version of this framework has not been confirmed. The White House was reported on May 8 as drafting a mandatory pre-release AI review order. Whether that order has been issued, what authority it invokes, and what legal obligations it creates for non-participating labs are all unconfirmed as of this writing. Voluntary CAISI participation is categorically different from a mandatory legal obligation, and that distinction determines whether the divergence in lab postures has compliance consequences or is simply a business decision.
Lab posture map
Google DeepMind: Reportedly agreed to government pre-release testing per CAISI framework participation. Google DeepMind’s participation follows the pattern of its broader US government AI cooperation posture; the company has been among the more engaged frontier labs on policy dialogue. No official confirmation of specific MAA terms is available.
Microsoft: Reportedly agreed to government pre-release testing per CAISI framework participation. Microsoft’s posture is consistent with its established position as a major federal AI contractor through its Azure Government infrastructure and OpenAI partnership. No specific MAA terms confirmed.
xAI: Reportedly agreed to government pre-release testing per CAISI framework participation. xAI’s participation is notable given Grok’s more limited federal deployment history compared to the other two labs. No specific MAA terms confirmed.
Anthropic: The reported posture here is different, and the sourcing is weaker.
According to a single report from indiatimes.com that could not be independently corroborated, Anthropic has declined to unlock certain model capabilities for federal use. That report characterizes the consequence as a Pentagon supply chain risk designation, a formal classification that, if accurate, carries procurement implications for any enterprise deployer whose AI stack includes Anthropic models in federally adjacent contexts.
The Anthropic-Pentagon dispute has extensive prior coverage here, including analysis of what it means when the Pentagon calls an AI company a security risk. Anthropic’s own legal action against the Pentagon was covered here. The Wyden legislative effort to restore Anthropic’s federal access, and the White House executive order attempting the same, have both been reported. What appears new this cycle is the reported characterization that the dispute’s current status involves Anthropic declining to unlock capabilities, not just declining to sign a specific contract.
That characterization requires independent confirmation before any compliance or procurement team acts on it.
The Mythos context: what the reporting says and doesn’t say
Secondary reporting has characterized Anthropic’s Mythos model, reportedly withheld from public release due to cybersecurity concerns, as a catalyst for the administration’s increased focus on pre-release vetting. This characterization has not been confirmed by official sources.
Prior coverage established that Mythos has been the subject of restricted access architecture and NSA involvement. The Mythos access and breach investigation coverage from April 26 remains the authoritative reference for what is confirmed about the model’s status. The “thousands of software vulnerabilities” capability attributed to Mythos in some reporting is a Wire inference; there is no confirmed technical specification or independent capability evaluation supporting that specific claim. Treat it as reported, not verified.
What the Mythos coverage does establish, to the extent that prior coverage is accurate, is that frontier models capable of offensive cybersecurity applications exist and are being actively restricted from public release. That is the policy driver, regardless of whether the specific capability claim about Mythos is precisely accurate.
The compute context: scaling faster than the framework
Epoch AI’s compute tracking as of May 8 shows more than 30 models now exceed the 10^25 FLOP threshold used in EU regulations to define systemic risk for general-purpose AI. US policy uses different criteria, but the compute acceleration dynamic is the same. A vetting framework designed around a handful of frontier models, which is what the CAISI architecture appears to contemplate, will face pressure as the population of models that could plausibly meet any threshold-based vetting trigger grows rapidly.
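For teams screening their own models against threshold-style triggers, the EU-style line is expressed as total training compute in FLOP, and a rough rule of thumb puts training compute at roughly 6 × parameters × training tokens. The sketch below applies that approximation and checks the result against 10^25 FLOP; the figures are hypothetical, not estimates for any real model.

```python
# Rough screening check against a 1e25 FLOP threshold.
# Uses the common approximation: training compute ≈ 6 * parameters * training tokens.
# All numbers below are hypothetical, not estimates for any named model.

EU_SYSTEMIC_RISK_THRESHOLD_FLOP = 1e25


def estimated_training_flop(parameters: float, training_tokens: float) -> float:
    """Approximate total training compute using the 6ND rule of thumb."""
    return 6.0 * parameters * training_tokens


def exceeds_threshold(parameters: float, training_tokens: float) -> bool:
    """True if the estimated training compute meets or exceeds the threshold."""
    return estimated_training_flop(parameters, training_tokens) >= EU_SYSTEMIC_RISK_THRESHOLD_FLOP


# Example: a hypothetical 400B-parameter model trained on 10T tokens (~2.4e25 FLOP).
flop = estimated_training_flop(4e11, 1e13)
print(f"{flop:.2e} FLOP, exceeds threshold: {exceeds_threshold(4e11, 1e13)}")
```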
What this means if voluntary becomes mandatory
The compliance and governance posture shift between voluntary CAISI participation and mandatory vetting is significant in three dimensions.
Disclosure obligations. A mandatory framework would almost certainly require labs to proactively notify government testers before public release, not just offer access voluntarily. That changes the development timeline calculus. Release dates become contingent on testing completion.
Capability documentation. If government testing is triggered by capability thresholds, labs will need to maintain formal documentation of model capability profiles, the kind of documentation that doesn’t currently exist in standardized form. Expect that documentation requirement to be a contested element of any mandatory framework design.
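Nothing standardized exists today, so any example is necessarily speculative. The sketch below shows one way a lab-side capability profile might be recorded; the schema and field names are assumptions for illustration, not a reported or proposed standard.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class CapabilityProfile:
    """Hypothetical capability-profile record a lab might maintain per model version.

    The schema is an assumption for illustration; no standardized format
    has been published or reported.
    """
    model_name: str
    version: str
    evaluation_date: date
    estimated_training_flop: float                                      # for threshold-based vetting triggers
    evaluated_domains: dict[str, str] = field(default_factory=dict)     # e.g. {"cyber": "internal red-team, Q2"}
    known_limitations: list[str] = field(default_factory=list)
    release_gating_findings: list[str] = field(default_factory=list)    # anything that could trigger a hold
```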
Supply chain implications for deployers. Enterprise deployers whose AI stack includes frontier models from labs with different participation postures face a supply chain risk question. If one lab’s models become unavailable during a testing hold, or if a lab is formally designated a supply chain risk, deployers who have built workflows around those models need contingency architecture. That’s not a theoretical risk if a supply chain designation is confirmed.
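What contingency architecture means in practice varies by deployment, but the core pattern is an abstraction layer that can route around an unavailable provider. A minimal sketch follows, assuming hypothetical provider client objects that share a generate() interface; it illustrates the pattern, not any specific vendor mix.

```python
class ProviderUnavailableError(Exception):
    """Raised by a (hypothetical) provider client when its models cannot be used."""


def generate_with_fallback(prompt: str, providers: list) -> str:
    """Try each configured provider in priority order.

    `providers` is a list of hypothetical client objects exposing generate(prompt).
    If one provider's models are held or designated unavailable, the next is tried.
    """
    errors = []
    for provider in providers:
        try:
            return provider.generate(prompt)
        except ProviderUnavailableError as exc:
            errors.append((provider, exc))
            continue
    raise RuntimeError(f"All configured providers failed: {errors}")
```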
What to watch
Three events will clarify the policy picture significantly. First: whether the White House issues a formal legal instrument converting the CAISI framework into a mandatory requirement, and what legal authority that instrument invokes. Second: whether Anthropic’s reported Pentagon dispute moves toward resolution or further escalation, and what the federal court proceedings produce. Third: whether any other frontier lab, OpenAI being the most obvious candidate, takes a public position on the vetting framework that signals broader industry alignment or resistance.
TJS synthesis
The frontier AI field has always had different relationships with the US government. What’s new is that those differences are being formalized into a policy framework that may soon have legal teeth. Three labs reportedly in, one reportedly out: that divergence is a preview of the compliance landscape that emerges when voluntary becomes mandatory. For enterprise deployers, the useful question now isn’t which lab is right and which is wrong. It’s: if the lab whose models power your most critical workflows becomes unavailable in a federal context, how long would it take you to replatform? That answer, not the policy outcome, is the one worth knowing today.