Self-reported benchmarks. Read carefully.
JetBrains released Mellum2 as an open-weight model under the Apache 2.0 license on June 1, targeting developers who want a local inference option for coding, debugging, multi-step reasoning, tool use, and agentic workflows. The architecture is a 12-billion-parameter Mixture-of-Experts model. That number understates what actually runs per inference: Mellum2 activates only 2.5 billion parameters per token, routing each token to 8 of its 64 expert subnetworks. Less compute per forward pass, lower latency per token, that’s the MoE argument for developer tooling, where you’re generating dozens of code completions per session rather than one long essay.
The architecture is confirmed. JetBrains’ technical paper on arXiv corroborates the 12B/2.5B active, 64-expert/8-active structure, along with Grouped-Query Attention using 4 KV heads and Sliding Window Attention. JetBrains reports a 128K context window and approximately 10.6 trillion tokens in the training dataset, both figures from the vendor’s technical report, not independently verified.
AIME 2025+2026 Score (per JetBrains technical report)
Disputed Claim
The benchmark profile is where this gets interesting. According to JetBrains’ technical report, Mellum2 scores 69.9% on LiveCodeBench v6. For a coding-specialized model, that’s the number that matters most to its target audience. JetBrains also reports a “Thinking” variant trained via RLVR, which scores 58.4% on AIME 2025 and 2026. The catch is that Qwen3.5-4B, a general-purpose dense model less than a third of Mellum2’s total parameter count, reportedly scores 68.3% on the same AIME evaluation, per the same technical report. A smaller dense model outperforming a larger MoE on mathematical reasoning. JetBrains disclosed this comparison themselves. That’s worth noting.
JetBrains claims up to 2x faster inference compared to dense models of similar parameter count. This figure appears in the vendor’s announcement but wasn’t confirmed in the arXiv abstract content available for review, treat it as a vendor claim until independent benchmarks emerge.
Deployment is functional but not frictionless. vLLM supports the model natively. Standard Transformers-based pipelines work but may carry overhead from the custom architecture. Early community reports suggest compatibility challenges with Ollama due to that custom MoE structure, if you’re running local inference on Ollama, verify compatibility before building around it. The Hugging Face model card would normally clarify this, but that source was unavailable at the time of this brief’s production.
What to Watch
What to watch: Independent benchmark evaluation of the LiveCodeBench 69.9% figure is the key trigger. JetBrains published this via their own technical report, which is appropriate and transparent, but the performance claim hasn’t cleared an independent reproduction. Epoch AI coverage or a community reproduction on a standardized harness would move this from vendor-qualified to confirmed. Watch for those in the weeks following release.
TJS synthesis: Don’t deploy this model for mathematical reasoning tasks where Qwen3.5-4B is already in your stack. For coding-specific agentic pipelines, code generation, editing, debugging, tool-use sequences, Mellum2’s MoE architecture makes a credible case on latency grounds, and Apache 2.0 removes the license friction. Wait for independent LiveCodeBench confirmation before replacing your current coding model. If vLLM is already your inference layer, this is worth a controlled evaluation this month.