This is a follow-up to TJS’s June 17 coverage of the GLM-5.2 release. The weights are out. Now the independent numbers are in.
Artificial Analysis published its Intelligence Index v4.1 evaluation on June 16, placing GLM-5.2 first among open-weights models with a score of 51 on an index that also includes closed-source frontier models. The evaluation incorporated GDPval-AA v2, τ³-Banking, and Terminal-Bench v2.1 benchmarks. Per Artificial Analysis’s methodology, this makes GLM-5.2 the leading open-weights model on a multi-task evaluation suite, not just a coding leaderboard.
The catch is what it costs to run. Artificial Analysis reports an average of 43,000 output tokens per task on the Intelligence Index evaluation. At Z.ai’s published pricing of $4.40 per million output tokens, that’s roughly $0.19 per task under benchmark conditions. That figure compounds fast at production scale. A workflow running 1,000 agent tasks per day would spend approximately $190 daily on output tokens alone, before factoring in input, caching, or infrastructure.
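The per-task and daily figures can be reproduced with a few lines of Python. This is a minimal sketch: the function and constant names are illustrative, and the 43,000-token average is a benchmark figure, not a measurement of any particular production workload.

```python
# Figures taken from the text: Z.ai's published output price and the
# Artificial Analysis average output tokens per task on the Intelligence Index.
OUTPUT_PRICE_PER_MILLION_USD = 4.40
AVG_OUTPUT_TOKENS_PER_TASK = 43_000


def output_cost_per_task(
    output_tokens: int = AVG_OUTPUT_TOKENS_PER_TASK,
    price_per_million: float = OUTPUT_PRICE_PER_MILLION_USD,
) -> float:
    """Output-token cost in USD for a single task."""
    return output_tokens / 1_000_000 * price_per_million


def daily_output_cost(tasks_per_day: int) -> float:
    """Output-token cost in USD for a day's worth of tasks (output only)."""
    return tasks_per_day * output_cost_per_task()


if __name__ == "__main__":
    print(f"per task:        ${output_cost_per_task():.2f}")
    print(f"1,000 tasks/day: ${daily_output_cost(1000):.2f}")
```

Unrounded, the daily figure comes to about $189; the $190 estimate results from multiplying the rounded $0.19 by 1,000.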
Z.ai’s model card specifies a 1-million-token context window and publishes pricing at $1.40 per million input tokens, $4.40 per million output tokens, and $0.26 per million cached tokens. The model is a Mixture of Experts architecture with approximately 744 billion total parameters and approximately 40 billion active, per architecture data corroborated by independent sources. It’s available via Z.ai’s API, OpenRouter, and Featherless.
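For teams that want to fold input and cached tokens into an estimate, the published rates can be combined as in the sketch below. Only the three prices come from Z.ai's model card. The `Pricing` class, the `task_cost` function, and the billing rule (cached tokens treated as a subset of input tokens, charged at the cached rate instead of the input rate) are assumptions made for illustration, so check them against the provider's actual billing terms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD per million tokens, as published on Z.ai's model card."""
    input_per_million: float = 1.40
    output_per_million: float = 4.40
    cached_per_million: float = 0.26


def task_cost(
    input_tokens: int,
    output_tokens: int,
    cached_tokens: int = 0,
    pricing: Pricing = Pricing(),
) -> float:
    """Estimated USD cost of one task.

    ASSUMPTION: cached_tokens is the portion of input_tokens served from
    cache and is billed at the cached rate in place of the input rate.
    """
    if cached_tokens > input_tokens:
        raise ValueError("cached_tokens cannot exceed input_tokens")
    fresh_input_tokens = input_tokens - cached_tokens
    total = (
        fresh_input_tokens * pricing.input_per_million
        + cached_tokens * pricing.cached_per_million
        + output_tokens * pricing.output_per_million
    )
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: 100k input tokens (60k of them cached), 43k output.
    print(f"${task_cost(100_000, 43_000, cached_tokens=60_000):.4f}")
```

The input and cached token counts in the usage line are placeholders; substitute measured values from your own workload.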
GLM-5.2 Benchmark Claims: What's Confirmed vs. Vendor-Reported
| Benchmark | Score | Source | Status |
|---|---|---|---|
| Intelligence Index v4.1 (open-weights) | 51 | Artificial Analysis | Independent, confirmed |
| GDPval-AA v2 | 1524 | Artificial Analysis | Independent, confirmed |
| Code Arena WebDev rank | 2nd | Artificial Analysis | Independent, attributed |
| SWE-bench Pro | 62.1% | Z.ai technical report | Self-reported, unverified |
| FrontierSWE | 74.4% | Z.ai technical report | Self-reported, unverified |
| AIME 2026 | 99.2% | Z.ai technical report | Self-reported, unverified |
| GPQA-Diamond | 91.2% | Z.ai technical report | Self-reported, unverified |
The vendor-reported benchmarks deserve separate treatment. Z.ai reports 62.1% on SWE-bench Pro, comparing favorably to GPT-5.5 at 58.6% and Gemini 3.1 Pro at 54.2%, but these figures come from Z.ai’s own technical report and had not been independently confirmed at time of publication. Z.ai also reports 74.4% on FrontierSWE, 99.2% on AIME 2026, and 91.2% on GPQA-Diamond. Treat all of these as Z.ai claims, not settled benchmarks. Epoch AI’s independent evaluation is pending.
Per Artificial Analysis’s evaluation, GLM-5.2 ranks second on the Code Arena WebDev leaderboard, with Claude Fable 5 holding the top position. The GDPval-AA v2 score of 1524 comes from an Artificial Analysis proprietary benchmark. It is an independent evaluator’s metric rather than a universal standard, but it is useful for comparison within the AA ecosystem.
The Z.ai technical report describes an “IndexShare” architecture designed to reduce per-token compute at 1-million-token context lengths. The 2.9x reduction figure cited is Z.ai’s internal claim and hasn’t been independently verified.
What to Watch
Epoch AI’s evaluation of GLM-5.2 will be the meaningful independent checkpoint. Until that’s published, the SWE-bench Pro and reasoning benchmark figures remain self-reported. Check epoch.ai/data/ai-models: the database was updated June 17, and the GLM-5.2 evaluation may appear there as evaluations complete.
TJS synthesis:
The Artificial Analysis Intelligence Index result is genuinely significant: an open-weights model scoring 51 and leading its class on a multi-task evaluation suite that also includes closed frontier models is a structural development, not just a leaderboard footnote. But the 43,000-token average consumption is the number that actually governs deployment decisions. Teams evaluating GLM-5.2 for agentic coding workflows shouldn’t start with the benchmark table. Start with your task volume and multiply by $0.19. If that number clears your budget, the independent evaluation is strong enough to warrant a serious pilot. If it doesn’t, wait for Epoch AI’s assessment before committing to migration, because the vendor-reported gains on SWE-bench Pro may or may not hold under independent evaluation conditions.