The EvalEval Coalition published a cost-analysis report on Hugging Face’s blog around April 29, 2026. No source URL was provided with this item; all figures below are attributed to the EvalEval Coalition’s report and carry that qualification. The organization is a named entity and Hugging Face is a known publication venue, and the figures are specific enough to be verifiable once the URL is confirmed.
The numbers: according to the EvalEval Coalition’s report, a single run of the Holistic Agent Leaderboard (HAL) costs approximately $40,000, covering 21,730 agent rollouts across nine models. The same report states that frontier model runs on GAIA exceed $2,800 per instance. The Coalition identified a 33-fold cost spread depending on scaffold choice for otherwise identical tasks. That last figure carries the most structural weight: two teams running the same evaluation with different scaffolds can produce results that cost 33 times as much on one side as on the other, and yet both results appear on the same leaderboard with no cost-of-production label attached.
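To make the headline figures easier to compare, here is a back-of-the-envelope sketch in Python. The $40,000 run cost, 21,730-rollout count, and 33x spread are the figures quoted from the report above; the derived per-rollout average and the implied low-end run cost are our own arithmetic, not numbers the Coalition published.

```python
# Back-of-the-envelope arithmetic on the figures quoted from the EvalEval report.
# Derived values (per-rollout average, implied low-end run cost) are illustrative
# calculations for this piece, not numbers published by the Coalition.

hal_run_cost_usd = 40_000   # reported cost of a single Holistic Agent Leaderboard run
hal_rollouts = 21_730       # reported agent rollouts across nine models
scaffold_spread = 33        # reported cost ratio between scaffolds on identical tasks

per_rollout = hal_run_cost_usd / hal_rollouts
implied_cheap_run = hal_run_cost_usd / scaffold_spread

print(f"Average cost per rollout: ${per_rollout:.2f}")                 # ~ $1.84
print(f"Implied low-end run under a 33x spread: ${implied_cheap_run:,.0f}")  # ~ $1,212
```

The second figure is why the $1,200 scenario discussed below is plausible: a 33x spread on a $40,000 run puts the cheap end of the same benchmark at roughly that level.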
This is not primarily a budget story. It’s a comparability story. The Coalition’s report surfaces something the hub’s benchmark series has been approaching from a different angle: the infrastructure of AI evaluation is failing to produce results that can be meaningfully compared across organizations with different resources. The prior coverage of “The Benchmark Ceiling” examined why standard evals are failing frontier models on capability grounds. The EvalEval Coalition’s report adds a second dimension: standard evals are also failing on economic grounds.
Why it matters for practitioners: if you’re a developer or enterprise team relying on public leaderboard rankings to make model selection decisions, the cost data should shift how you read those rankings. A team that ran HAL at $40,000 with one scaffold may be ranked above or below a team that ran it at $1,200 with a different scaffold, producing a result that’s not apples-to-apples even when the benchmark name is identical. The leaderboard shows the score. It doesn’t show the scaffold. It doesn’t show the cost. It doesn’t show the organizational resources behind the run.
The access inequality dimension is worth stating plainly. Academic labs, open-source maintainers, and smaller AI companies cannot routinely spend $40,000 per evaluation run. The teams that can are the large frontier labs and the best-funded enterprises. This means the leaderboards that compliance teams and investors use to compare models are increasingly populated by results that only a subset of market participants can generate. That’s a structural problem for benchmark credibility, and it compounds the vendor-reporting problem that prior hub coverage has documented in the medical AI context.
What to watch: whether the EvalEval Coalition proposes evaluation cost standardization or scaffold transparency requirements alongside this report. The diagnosis is useful, but the value lies in any remediation framework they attach to it. Also worth tracking: whether evaluation platforms begin publishing cost-of-run metadata alongside scores, which would at least make the scaffold variable visible to readers of leaderboard results.
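As a sketch of what cost-of-run metadata could look like if a platform adopted it, here is a hypothetical schema in Python. The field names and structure are invented for this piece and do not correspond to any existing leaderboard API; they simply show that the scaffold, the cost, and the submitting organization would all become visible alongside the score.

```python
# Hypothetical sketch of cost-of-run metadata a leaderboard could attach to each
# submitted score. Field names are invented for illustration; no platform
# currently publishes this schema.

from dataclasses import dataclass


@dataclass
class EvalRunMetadata:
    benchmark: str         # e.g. "HAL" or "GAIA"
    model: str             # model identifier as submitted
    scaffold: str          # agent scaffold / harness used for the run
    total_cost_usd: float  # end-to-end cost of producing the score
    rollouts: int          # number of agent rollouts in the run
    submitting_org: str    # organization that funded and ran the evaluation

    def cost_per_rollout(self) -> float:
        """Derived figure that makes runs of different scale comparable."""
        return self.total_cost_usd / self.rollouts


# Usage: publish alongside the score so readers can see scaffold and cost.
run = EvalRunMetadata(
    benchmark="HAL",
    model="example-model-v1",
    scaffold="example-scaffold",
    total_cost_usd=40_000.0,
    rollouts=21_730,
    submitting_org="example-lab",
)
print(f"{run.benchmark}: ${run.cost_per_rollout():.2f} per rollout via {run.scaffold}")
```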
The $40,000 evaluation run is a data point. The 33x scaffold spread is the structural signal. Independent AI assessment is becoming a resource that shapes who can credibly compete on public benchmarks, and that shapes everything downstream, from model selection to regulatory compliance to investor due diligence.