GLM-5.2 Tops the Artificial Analysis Intelligence Index for Open-Weights, But at 43,000 Tokens Per Task

June 18, 2026 3 min read Artificial Analysis Partial Moderate

Tech Jacks Solutions AI News Coverage

Independent evaluator Artificial Analysis has placed Z.ai's GLM-5.2 at the top of its Intelligence Index v4.1 open-weights category, the first time an open-weights model has led this index. The token consumption profile that earned it the ranking also sets a cost ceiling that changes the deployment math for most teams.


Avg output tokens per task, 43,000

Key Takeaways

Per Artificial Analysis's Intelligence Index v4.1, GLM-5.2 leads the open-weights category with a score of 51, the first open-weights model to reach this position on the index.
Artificial Analysis reports an average of 43,000 output tokens per task; at Z.ai's published rate of $4.40/M output tokens, that's approximately $0.19 per task.
SWE-bench Pro (62.1%), FrontierSWE (74.4%), AIME 2026 (99.2%), and GPQA-Diamond (91.2%) are Z.ai self-reported figures; no independent confirmation is available at time of publication.
Epoch AI's independent evaluation is pending. Its database was updated June 17, and that evaluation remains the meaningful checkpoint before deployment decisions.

Model Release

GLM-5.2

Organization: Z.ai

Type: Open Source LLM

Parameters: 744B total / ~40B active (MoE)

Benchmark: [SELF-REPORTED] SWE-bench Pro: 62.1% per Z.ai technical report

Availability: Z.ai API, OpenRouter, Featherless

Verification

Partial: Artificial Analysis (independent evaluator) + Z.ai model card; Epoch AI evaluation pending. Benchmark scores for SWE-bench Pro, FrontierSWE, AIME 2026, and GPQA-Diamond are Z.ai self-reported. No independent confirmation available at publication.

This is a follow-up to TJS’s June 17 coverage of the GLM-5.2 release. The weights are out. Now the independent numbers are in.

Artificial Analysis published its Intelligence Index v4.1 evaluation on June 16, placing GLM-5.2 first among open-weights models with a score of 51 on an index that also includes closed-source frontier models. The evaluation incorporated GDPval-AA v2, τ³-Banking, and Terminal-Bench v2.1 benchmarks. Per Artificial Analysis’s methodology, this makes GLM-5.2 the leading open-weights model on a multi-task evaluation suite, not just a coding leaderboard.

The catch is what it costs to run. Artificial Analysis reports an average of 43,000 output tokens per task on the Intelligence Index evaluation. At Z.ai’s published pricing of $4.40 per million output tokens, that’s roughly $0.19 per task under benchmark conditions. That figure adds up fast at production scale. A workflow running 1,000 agent tasks per day would spend approximately $190 daily on output tokens alone, before factoring in input, caching, or infrastructure.
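The arithmetic behind those two figures is simple enough to check by hand, but a short sketch makes it easy to swap in your own numbers. The token average and output price below are the ones quoted above; the function names are illustrative, and real workloads may average more or fewer tokens than the benchmark did.

```python
# Figures quoted in the article; everything else here is illustrative.
OUTPUT_PRICE_PER_MILLION_USD = 4.40   # Z.ai published output-token rate
AVG_OUTPUT_TOKENS_PER_TASK = 43_000   # Artificial Analysis benchmark average


def output_cost_per_task(
    output_tokens: int = AVG_OUTPUT_TOKENS_PER_TASK,
    price_per_million: float = OUTPUT_PRICE_PER_MILLION_USD,
) -> float:
    """Return the output-token cost in USD for a single task."""
    return output_tokens / 1_000_000 * price_per_million


def daily_output_cost(tasks_per_day: int) -> float:
    """Return the output-token cost in USD for a day's worth of tasks."""
    return tasks_per_day * output_cost_per_task()


if __name__ == "__main__":
    print(f"Per task: ${output_cost_per_task():.4f}")      # Per task: $0.1892
    print(f"1,000 tasks/day: ${daily_output_cost(1000):.2f}")  # 1,000 tasks/day: $189.20
```

The unrounded results ($0.1892 and $189.20) are where the article's "roughly $0.19" and "approximately $190" come from.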

Z.ai’s model card specifies a 1-million-token context window and publishes pricing at $1.40 per million input tokens, $4.40 per million output tokens, and $0.26 per million cached tokens. The model uses a Mixture of Experts (MoE) architecture with approximately 744 billion total parameters and approximately 40 billion active, per architecture data corroborated by independent sources. It’s available via Z.ai’s API, OpenRouter, and Featherless.
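Output tokens are only one of the three published rates. As a rough sketch of how the full per-task bill might be estimated, the snippet below combines all three. It assumes cached tokens are billed at the cached rate in place of the input rate, which the article does not confirm, and the input and cached token counts in the example are hypothetical placeholders, not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD per million tokens, as published on Z.ai's model card."""
    input_per_million: float = 1.40
    output_per_million: float = 4.40
    cached_per_million: float = 0.26


def task_cost(
    input_tokens: int,
    cached_tokens: int,
    output_tokens: int,
    pricing: Pricing = Pricing(),
) -> float:
    """Estimate the USD cost of one task.

    Assumption: `input_tokens` counts only uncached input, and
    `cached_tokens` are billed at the cached rate instead of the input rate.
    """
    total = (
        input_tokens * pricing.input_per_million
        + cached_tokens * pricing.cached_per_million
        + output_tokens * pricing.output_per_million
    )
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical agent task: 20k fresh input, 100k cached context,
    # and the 43k output-token average reported by Artificial Analysis.
    cost = task_cost(input_tokens=20_000, cached_tokens=100_000, output_tokens=43_000)
    print(f"Estimated cost per task: ${cost:.4f}")  # Estimated cost per task: $0.2432
```

Even with a generous amount of cached context in this made-up example, output tokens account for most of the total, which is why the 43,000-token average is the figure to watch.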

GLM-5.2 Benchmark Claims: What's Confirmed vs. Vendor-Reported

Benchmark	Score	Source	Status
Intelligence Index v4.1 (open-weights)	51	Artificial Analysis	Independent, confirmed
GDPval-AA v2	1524	Artificial Analysis	Independent, confirmed
Code Arena WebDev rank	2nd	Artificial Analysis	Independent, attributed
SWE-bench Pro	62.1%	Z.ai technical report	Self-reported, unverified
FrontierSWE	74.4%	Z.ai technical report	Self-reported, unverified
AIME 2026	99.2%	Z.ai technical report	Self-reported, unverified
GPQA-Diamond	91.2%	Z.ai technical report	Self-reported, unverified

The vendor-reported benchmarks deserve separate treatment. Z.ai reports 62.1% on SWE-bench Pro, which compares favorably to GPT-5.5 at 58.6% and Gemini 3.1 Pro at 54.2%, but these figures come from Z.ai’s own technical report and had not been independently confirmed at time of publication. Z.ai also reports 74.4% on FrontierSWE, 99.2% on AIME 2026, and 91.2% on GPQA-Diamond. Treat all of these as Z.ai claims, not settled benchmarks. Epoch AI’s independent evaluation is pending.

Per Artificial Analysis’s evaluation, GLM-5.2 ranks second on the Code Arena WebDev leaderboard, with Claude Fable 5 holding the top position. The GDPval-AA v2 score of 1524 comes from an Artificial Analysis proprietary benchmark: it is an independent evaluator’s metric rather than a universal standard, but it is useful for comparison within the AA ecosystem.

The Z.ai technical report describes an “IndexShare” architecture designed to reduce per-token compute at 1-million-token context lengths. The 2.9x reduction figure cited is Z.ai’s internal claim and hasn’t been independently verified.

What to Watch

Epoch AI independent evaluation of GLM-5.2 (check epoch.ai/data/ai-models): days to weeks

Independent SWE-bench Pro confirmation from third-party evaluators: 2-4 weeks

What to watch:

Epoch AI’s evaluation of GLM-5.2 will be the meaningful independent checkpoint. Until that’s published, the SWE-bench Pro and reasoning benchmark figures remain self-reported. Check epoch.ai/data/ai-models: the database was updated June 17, and the GLM-5.2 evaluation may appear there as evaluations complete.

TJS synthesis:

The Artificial Analysis Intelligence Index result is genuinely significant: an open-weights model leading a multi-task evaluation suite that includes closed frontier models is a structural development, not just a leaderboard footnote. But the 43,000-token average consumption is the number that actually governs deployment decisions. Teams evaluating GLM-5.2 for agentic coding workflows shouldn’t start with the benchmark table. Start with your task volume and multiply by $0.19. If that number clears your budget, the independent evaluation is strong enough to warrant a serious pilot. If it doesn’t, wait for Epoch AI’s assessment before committing to migration; the vendor-reported gains on SWE-bench Pro may or may not hold under independent evaluation conditions.
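That budget test can be written down as a quick check. The sketch below uses the article's rounded $0.19 per-task figure; the function names and the example budgets are hypothetical, and the estimate covers output tokens only.

```python
import math

COST_PER_TASK_USD = 0.19  # rounded output-token cost per task from the article


def fits_budget(
    tasks_per_day: int,
    daily_budget_usd: float,
    cost_per_task: float = COST_PER_TASK_USD,
) -> bool:
    """Return True if the estimated daily output spend stays within budget."""
    return tasks_per_day * cost_per_task <= daily_budget_usd


def max_tasks_within_budget(
    daily_budget_usd: float,
    cost_per_task: float = COST_PER_TASK_USD,
) -> int:
    """Return the largest whole number of tasks the daily budget covers."""
    return math.floor(daily_budget_usd / cost_per_task)


if __name__ == "__main__":
    # Hypothetical budgets, for illustration only.
    print(fits_budget(tasks_per_day=1000, daily_budget_usd=200))  # True
    print(fits_budget(tasks_per_day=1000, daily_budget_usd=150))  # False
    print(max_tasks_within_budget(daily_budget_usd=100))          # 526
```

Because input tokens, caching, and infrastructure are left out, a result that only narrowly fits the budget should be read as a likely miss.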