Benchmarks don’t audit themselves. Epoch AI’s FrontierMath did, and the results are worth sitting with. According to Epoch AI’s FrontierMath: Tiers 1-4 benchmark page, Version 2 of the dataset is now live following a systematic AI-assisted review of the original problem set. Epoch AI, arguably the most credible independent AI evaluation organization operating today, found that approximately 42% of the original benchmark problems contained errors serious enough to require correction or removal, per the organization’s release documentation.
The numbers, according to Epoch AI’s release: roughly 123 problems corrected across Tiers 1-3, 12 corrected in Tier 4, and approximately 12 problems dropped from the dataset entirely. Approximately 338 problems remain in the revised v2 dataset: 295 in Tiers 1-3 and 43 in Tier 4. These figures are attributed to Epoch AI as the primary authority; the specific counts weren’t extractable from the retrieved page content and should be verified directly against Epoch AI’s v2 release documentation before being cited in formal reports.
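As a quick arithmetic check, the approximate counts quoted above are consistent with the roughly 42% figure. The sketch below assumes the original problem count equals the remaining problems plus the dropped ones (that is, no new problems were added in v2), which the figures quoted here do not confirm. The variable names are illustrative only.

```python
# Approximate counts as quoted in the article (unverified; see the caveat above).
corrected_tiers_1_3 = 123
corrected_tier_4 = 12
dropped = 12
remaining_tiers_1_3 = 295
remaining_tier_4 = 43

remaining = remaining_tiers_1_3 + remaining_tier_4

# ASSUMPTION: v1 size = problems remaining in v2 + problems dropped.
original_total = remaining + dropped

# Problems that were either corrected or removed.
affected = corrected_tiers_1_3 + corrected_tier_4 + dropped

share = affected / original_total

print(f"remaining={remaining} original={original_total} "
      f"affected={affected} share={share:.0%}")
```

Under that assumption, 147 of 350 problems were affected, which matches the reported share of about 42%.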
The full dataset remains gated, per Epoch AI’s standard benchmark policy to prevent contamination. Twelve sample problems are publicly available.
The leaderboard for v2, including updated model scores, is live at the Epoch AI benchmark page. Specific percentage scores for individual models require direct human verification against the live leaderboard before publication; that data was sourced to a URL that couldn’t be confirmed during verification. Check the Epoch AI leaderboard directly for current rankings.
The part nobody mentions when a benchmark update lands: historical comparisons break. A model that scored X on FrontierMath v1 is being scored against a different problem set in v2. Critics have noted that v1-to-v2 score comparisons are difficult to interpret: a jump in a model’s score may reflect genuine capability improvement, the removal of flawed problems that the model was getting wrong for the wrong reasons, or both. That interpretive uncertainty is a feature of honest evaluation, not a failure of it.
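A toy illustration of why the comparison breaks, using entirely hypothetical numbers (ten made-up problem IDs, not FrontierMath data): the same set of solved problems yields a higher score once two flawed problems the model had missed are removed, with no change in capability.

```python
def accuracy(solved: set[str], problems: set[str]) -> float:
    """Fraction of `problems` that appear in `solved`."""
    return len(solved & problems) / len(problems)

# HYPOTHETICAL data: 10 problems in "v1", two of them flawed.
v1 = {f"p{i}" for i in range(1, 11)}
flawed = {"p9", "p10"}        # dropped in the hypothetical "v2"
v2 = v1 - flawed

# The same model with the same answers; it missed both flawed problems.
solved = {"p1", "p2", "p3"}

score_v1 = accuracy(solved, v1)   # 3 of 10
score_v2 = accuracy(solved, v2)   # 3 of 8

print(score_v1, score_v2)
```

The score moves from 0.3 to 0.375 purely because the denominator changed, which is why a v1 number and a v2 number cannot be read as a trend.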
What this means for teams using benchmark scores to make decisions: Epoch AI’s willingness to run a rigorous self-audit and publish the results should be the baseline standard, not the exception. Before using any benchmark score in a vendor evaluation or procurement process, confirm whether the benchmark has undergone independent integrity review, who runs the evaluation, and when the dataset was last audited. FrontierMath v2 just reset its own baseline, so verify that any scores you’re comparing come from the same version.
The timing is significant. This update drops the same week as VibeThinker-3B’s self-reported claims of frontier-level verifiable reasoning performance. The FrontierMath v2 audit is a reminder that even carefully constructed, expert-curated benchmarks require active maintenance. Self-reported benchmarks, run by the model’s own developers on datasets they control, carry additional uncertainty that this audit makes visible by contrast.
Watch for updated model rankings on the FrontierMath v2 leaderboard, and watch for independent researchers stress-testing the corrected problem set. The corrected benchmark is now more trustworthy than v1. The question is whether the teams building on benchmark scores have updated their reference data.
Don’t use v1 FrontierMath scores in current evaluations. Pull the v2 leaderboard directly from Epoch AI, and treat any vendor-cited benchmark number that predates June 12, 2026 as requiring a version check.
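One way to apply that rule to a list of vendor-cited numbers is sketched below. The `CitedScore` record and the `triage` function are hypothetical names invented for illustration, not part of any Epoch AI tooling, and the choice to flag scores with no stated version is an added assumption beyond the rule above.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

V2_CUTOFF = date(2026, 6, 12)  # cutoff date given in the article

@dataclass
class CitedScore:
    """Hypothetical record for a benchmark number cited by a vendor."""
    vendor: str
    reported_on: date
    version: Optional[str] = None  # "v1", "v2", or None if unstated

def triage(score: CitedScore) -> str:
    """Classify a cited score as 'exclude', 'check version', or 'usable'."""
    if score.version == "v1":
        return "exclude"            # v1 scores are not comparable to v2
    if score.reported_on < V2_CUTOFF:
        return "check version"      # predates the cutoff: verify the version
    if score.version is None:
        return "check version"      # ASSUMPTION: unstated version is also flagged
    return "usable"

examples = [
    CitedScore("vendor-a", date(2026, 5, 1), "v1"),
    CitedScore("vendor-b", date(2026, 6, 1)),
    CitedScore("vendor-c", date(2026, 7, 1), "v2"),
]
for s in examples:
    print(s.vendor, triage(s))
```

Scores that come back as "usable" still need to be confirmed against the live v2 leaderboard; the check only filters out numbers that are known or likely to be stale.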