Standard AI benchmarks are losing their ability to separate frontier models. That’s the practical implication of Epoch AI’s latest benchmark update, which arrives as multiple major model releases have pushed top scores on MMLU-Pro above 89 percent. Third-party leaderboard tracking confirms the clustering: Claude Opus 4.7 sits at 89.87 percent and Gemini 3 Pro at 90.10 percent on MMLU-Pro. Epoch AI’s benchmark data reportedly shows top scores on the broader MMLU benchmark approaching 92 percent, which would indicate a ceiling effect. That specific figure could not be confirmed from Epoch AI’s public benchmark page during verification and should be treated as reported, not confirmed.
What the saturation pattern means for teams selecting models: scores at this level are no longer useful for comparing top-tier systems against each other. Two models at 89.87 percent and 90.10 percent are not meaningfully different in the ways that matter for production deployment. The benchmark was designed when a jump from 75 to 85 percent reflected a real capability shift. It wasn’t designed for a world where six to eight frontier models score within one percentage point of each other.
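To make the "not meaningfully different" point concrete, here is a rough back-of-the-envelope sketch in Python. It treats each score as a simple proportion over a fixed question set and asks whether the gap between the two MMLU-Pro scores cited above exceeds sampling noise. The question count (12,000) is an illustrative assumption, not a figure from this brief, and the function name is hypothetical. The test also assumes the two runs are independent samples, which is a simplification: both models answer the same questions, so a paired test on per-question results would be more appropriate where that data is available.

```python
import math


def two_proportion_z(p1: float, p2: float, n1: int, n2: int) -> float:
    """z statistic for the gap between two accuracy scores.

    Assumes the two scores come from independent samples (a simplification).
    """
    standard_error = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (p2 - p1) / standard_error


# Assumption: about 12,000 scored questions per run (illustrative only).
N_QUESTIONS = 12_000

# Scores quoted in this brief for the two models on MMLU-Pro.
score_a = 0.8987
score_b = 0.9010

z = two_proportion_z(score_a, score_b, N_QUESTIONS, N_QUESTIONS)

# Approximate 95 percent margin of error for a single score near 90 percent.
margin = 1.96 * math.sqrt(0.90 * 0.10 / N_QUESTIONS)

print(f"z statistic for the gap: {z:.2f}")
print(f"95% margin on one score: +/- {margin * 100:.2f} percentage points")
print("gap exceeds sampling noise" if abs(z) > 1.96 else "gap is within sampling noise")
```

Under these assumptions the z statistic comes out at roughly 0.6, well short of the conventional 1.96 cutoff, and the margin of error on a single score is already larger than the 0.23-point gap between the two models. This only covers sampling noise; prompt format and run-to-run variation would widen the uncertainty further.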
The 10²⁶ FLOP compute threshold is a separate and more consequential data point from this update. Multiple independent sources, including Import AI’s analysis of compute thresholds, confirm this level as the current frontier boundary for leading models. That number carries regulatory weight: under the EU AI Act, systems trained above this threshold face systemic risk classification and the compliance obligations that come with it. The threshold is compute-based, not benchmark-based, a distinction that becomes more significant as benchmark saturation accelerates. The 10²⁶ FLOP figure conflicts with a prior TJS brief from April 20 and requires source resolution before publication. Readers tracking the threshold should treat the figure in the April 20 Epoch AI systemic risk brief as the confirmed one pending clarification.
This update is the fourth time in seven days that Epoch AI data has been the primary subject of a TJS Technology pillar brief. That frequency reflects a real shift in how the industry uses independent benchmark data: not as a product selection tool but as infrastructure for regulatory and governance decisions. The 10²⁶ FLOP threshold is cited in compliance discussions. The Epoch Capabilities Index is referenced in enterprise procurement decisions. The benchmark saturation story isn’t just a technical footnote; it’s a signal that the evaluation layer for frontier AI is being rebuilt from the ground up.