This is a follow-up to our May 5 coverage of the GPT-5.5 Instant launch. That brief covered the release itself. This one covers what the accompanying documentation actually says.
What the System Card confirms
OpenAI released a System Card alongside GPT-5.5 Instant documenting the safety evaluation methodology and mitigation measures applied before deployment. System Cards are OpenAI’s formal mechanism for disclosing what the company evaluated, what it found, and what it did about it. The launch materials confirm that GPT-5.5 Instant became the default model in ChatGPT upon release, a deployment decision that, under OpenAI’s own process, requires an accompanying System Card.
The System Card documents safety mitigations applied to the model. It exists because GPT-5.5 Instant is in production, at scale, as the default experience for ChatGPT’s user base.
What remains pending
The MMLU benchmark figure is where enterprise teams need to slow down. According to OpenAI’s internal evaluation, GPT-5.5 Instant scores 88.2% on MMLU. That figure is self-reported. Epoch AI’s independent evaluation has not been published.
The distinction matters for procurement decisions. Self-reported benchmarks reflect the vendor’s own testing conditions: sample selection, prompt formatting, and evaluation methodology all affect scores in ways that aren’t always disclosed. Independent evaluation from Epoch AI applies a standardized methodology and lets buyers compare across models on a consistent basis. Until that evaluation is published, the 88.2% figure is a vendor claim, not a verified score.
Our prior coverage of Epoch AI’s evaluation of GPT-5.5 Pro, which confirmed an ECI score of 159, illustrates what independent verification adds. The Pro evaluation is done. The Instant evaluation is pending with no published timeline.
The comparison that matters
| Element | Status |
|---|---|
| GPT-5.5 Instant launched as ChatGPT default | Confirmed |
| System Card published with safety mitigations | Confirmed |
| MMLU score: 88.2% | Self-reported (OpenAI internal evaluation only) |
| Epoch AI independent evaluation | Pending, no timeline published |
| OpenAI describes model as “smarter, clearer, more personalized” | Vendor characterization, not independently assessed |
The practitioner consideration
System Cards are useful; they’re among the more structured transparency mechanisms frontier labs produce. But a System Card documents what the vendor chose to evaluate and how. For compliance teams assessing model deployment risk, the System Card is the starting point, not the endpoint. The practical gap here: most enterprise procurement frameworks require vendor documentation (which the System Card provides) but don’t yet mandate independent evaluation before deployment approval. Until Epoch AI publishes its findings, buyers are working with a single data source.
What to watch
Epoch AI’s evaluation of GPT-5.5 Instant, when published, will either corroborate the 88.2% MMLU self-report or reveal a gap. Either outcome is informative. A confirmed score validates OpenAI’s evaluation methodology for this model family. A discrepancy would raise questions about the self-reporting process that buyers should factor into future procurement cycles.
TJS synthesis
The System Card is not the same as independent verification, and the fact that GPT-5.5 Instant is already the default ChatGPT model means this evaluation gap exists at production scale. Enterprise teams deploying GPT-5.5 Instant aren’t waiting for Epoch AI’s report; they’re making decisions now. The System Card gives them the documentation layer. The benchmark gap is the risk layer they should account for separately.