Two days ago, the GPT-5.5 Pro story had a gap.
The super app architecture and the $30 pricing tier were reported. The benchmark numbers were pending. Per Epoch AI’s independent evaluation, that gap is now closed for one key metric: GPT-5.5 Pro scores 159 on the Epoch Capabilities Index (ECI), a new record on that measure.
The ECI matters because Epoch AI runs its evaluations independently: it doesn’t take vendor submissions and publish them; it runs models against its own evaluation suite. That’s the distinction that makes an ECI score more reliable than a benchmark table in a model release blog post. ECI=159 is a confirmed, independently generated finding.
The FrontierMath results need to be read differently. According to the source data, GPT-5.5 Pro scored 52% on FrontierMath Tiers 1–3 and 40% on Tier 4, up from 50% and 38% respectively. FrontierMath is an Epoch AI benchmark, but the key question is whether Epoch ran GPT-5.5 Pro through it independently or whether these numbers reflect OpenAI’s own evaluation, submitted for publication. That distinction hasn’t been confirmed in the available source material, so the results are presented here with attribution (“according to the evaluation data”) rather than as independently verified findings on par with the ECI score.
Why does this matter for practical decision-making? Enterprise buyers use benchmark tables to compare models across vendors. A Tier 4 FrontierMath score of 40% is a notable capability claim: it describes performance on advanced mathematics problems at difficulty levels that have challenged frontier models. But the value of that number in a vendor comparison depends on whether it was produced under controlled third-party conditions. Treat the FrontierMath figures as directionally informative until the evaluation methodology is confirmed.
Epoch AI’s database now tracks more than 3,200 models, per data updated April 27. That context matters for what ECI=159 actually represents: it’s a record set against the largest independently tracked model dataset currently available, not against a curated vendor selection.
What to watch: Epoch AI typically publishes full evaluation methodology alongside its model assessments. The benchmark ceiling debate in AI evaluation isn’t resolved by this result; it’s a data point in an ongoing conversation about whether standard evals can still meaningfully differentiate frontier models. In the meantime, checking the Epoch AI leaderboard directly will confirm whether the FrontierMath scores were independently run, and that verification step is worth doing before using the Tier 4 figure in any model selection decision.
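If you’d rather script that check than eyeball a leaderboard page, the logic is simple: pull Epoch’s published results, filter to the model and benchmark in question, and read the provenance field. Here is a minimal Python sketch; the download URL and the column names (model, benchmark, score, source) are assumptions about the export schema, so confirm both against Epoch AI’s actual data downloads before relying on it.

```python
import pandas as pd

# Placeholder URL: Epoch AI publishes downloadable benchmark data,
# but the exact export path and schema should be confirmed on epoch.ai.
DATA_URL = "https://epoch.ai/data/benchmark_results.csv"  # hypothetical

df = pd.read_csv(DATA_URL)

# Filter to the model and benchmark in question. Column names here
# ("model", "benchmark", "score", "source") are assumed, not confirmed.
rows = df[
    df["model"].str.contains("GPT-5.5 Pro", case=False, na=False)
    & df["benchmark"].str.contains("FrontierMath", na=False)
]

# The provenance column is what answers the question: did Epoch run
# the evaluation itself, or republish a vendor-reported number?
print(rows[["model", "benchmark", "score", "source"]])
```

Whatever the real schema turns out to be, the design point holds: the answer lives in a provenance field, not in the score column.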
The ECI record is real and independently verified. The FrontierMath numbers are plausible and consistent with the model’s positioning. The distinction between those two sentences is the difference between a confirmed capability and a vendor narrative that deserves scrutiny.