MoE vs. Dense for Developer Tools: What Mellum2's Benchmark Split Reveals About Choosing Your Coding Model Stack

June 3, 2026 7 min read JetBrains Qualified Moderate

Tech Jacks Solutions AI News Coverage

JetBrains' Mellum2 scores 69.9% on LiveCodeBench and trails a four-billion-parameter dense model on mathematical reasoning, and JetBrains disclosed both numbers in the same technical report. That's not a contradiction. It's a precise description of what MoE architecture actually trades, and understanding the tradeoff is more useful than the headline benchmark.

jetbrains mellum2 open-source-ai mixture-of-experts moe-architecture coding-models agentic-ai llm-benchmarks vllm livecodesbench

LiveCodeBench v6, 69.9% (vendor-reported)

Key Takeaways

Mellum2's MoE architecture activates only 2.5B of 12B parameters per token, the engineering rationale for its coding-task latency advantage, confirmed via JetBrains' arXiv technical paper
According to JetBrains' technical report, Mellum2 scores 69.9% on LiveCodeBench v6 but the Thinking variant trails Qwen3.5-4B (68.3%) on AIME at 58.4%, a deliberate architectural tradeoff, not a defect
Apache 2.0 licensing and native vLLM support make Mellum2 production-eligible without license friction; Ollama compatibility issues require resolution before individual developer adoption can scale
All benchmark figures are vendor-reported, independent evaluation is the key signal to wait for before replacing existing coding models in production

Model Release

Mellum2 (+ Thinking variant)

OrganizationJetBrains

TypeOpen Source LLM

Parameters12B total / 2.5B active per token (64 experts, 8 active)

Benchmark[SELF-REPORTED] LiveCodeBench v6: 69.9% | AIME (Thinking): 58.4%, per arXiv:2605.31268

AvailabilityOpen-weight, Apache 2.0, vLLM native

AIME 2025+2026 Score (per JetBrains technical report, vendor-reported)

Mellum2-Thinking (12B MoE, 2.5B active)

58.4%

Qwen3.5-4B (dense, general-purpose)

68.3%

Verification

Qualified JetBrains arXiv technical report (2605.31268); JetBrains blog (page content not retrieved at publication) All benchmark figures are self-reported by JetBrains. No independent reproduction or Epoch AI evaluation available. Hugging Face model card source was unavailable.

A model that loses to a smaller competitor on one benchmark while leading on another isn’t a failed release. It’s a legible engineering decision. Mellum2’s benchmark profile, strong on LiveCodeBench, behind Qwen3.5-4B on AIME, is exactly what you’d expect from a Mixture-of-Experts architecture purpose-built for software development tasks. The problem isn’t the architecture. The problem is that most evaluation frameworks don’t separate those two things clearly, which means practitioners end up comparing numbers that weren’t designed to compete.

This deep-dive answers one question: what does Mellum2’s benchmark split tell practitioners about when to use MoE architecture for developer tooling, and when not to?

What JetBrains Released

Mellum2 launched June 1 as an open-weight model under the Apache 2.0 license. The architecture is a 12-billion-parameter Mixture-of-Experts model. Per JetBrains’ technical paper on arXiv (2605.31268), it activates 2.5 billion parameters per token by routing each token to 8 of its 64 expert subnetworks. The model uses Grouped-Query Attention with 4 KV heads and Sliding Window Attention. JetBrains reports a 128K context window and approximately 10.6 trillion tokens in the training dataset, both figures from the vendor’s technical report, not independently verified. A separate “Thinking” variant trained via RLVR is reported as available.

The target use cases are specific: code generation, code editing, debugging, multi-step reasoning, tool use, function calling, and agentic coding workflows. Not general-purpose instruction following. Not mathematical reasoning. Software development, end to end.

Deployment options at launch: native vLLM support, Transformers-based pipelines (with architecture-related overhead), and reported compatibility issues with Ollama due to the custom MoE structure. Apache 2.0 means no commercial restrictions. If you run vLLM, this model is production-eligible without a license conversation.

The MoE Architecture Tradeoff, What 2.5B Active Parameters Actually Means

Total parameter count is a marketing number. Active parameters per forward pass is the engineering number.

When Mellum2 processes a token, it routes that token through 8 of its 64 expert subnetworks. Only those 8 subnetworks, representing roughly 2.5 billion parameters, perform computation. The other 56 subnetworks sit idle for that token. This means Mellum2’s per-token compute cost is much closer to a 2-3B dense model than to a 12B dense model, despite the full parameter count being 12B.

For developer tooling, this matters in two ways. First, latency: an agentic coding pipeline that runs hundreds of completions per session accumulates per-token latency differences quickly. Faster per-token throughput compresses that accumulation. Second, memory: the full weight of all experts loads into memory, but only a fraction activates per inference. On constrained hardware, a developer workstation with 24GB VRAM rather than a multi-GPU server, the active compute footprint of MoE can be more forgiving than a dense model of comparable quality.

JetBrains claims up to 2x faster inference versus dense models of similar capability. This figure comes from the vendor’s announcement and wasn’t confirmed in the arXiv abstract content available at publication time. Take it as a directional claim, not a tested number. The architectural logic supports the direction; the magnitude is unverified.

The catch is that MoE expert routing introduces overhead of its own. Expert selection adds a routing computation step. Memory access patterns are less cache-friendly than dense models. In short-context or single-completion use cases, a one-shot code explanation, a single function docstring, the routing overhead can erode the per-token speed advantage. MoE gains compound over long sessions with many completions. That’s precisely the agentic coding workflow JetBrains is targeting.

The Benchmark Split, Reading the Numbers Honestly

According to JetBrains’ technical report, Mellum2 scores 69.9% on LiveCodeBench v6. LiveCodeBench evaluates models on realistic coding tasks sourced from competitive programming contests, problems that require understanding code structure, fixing bugs, generating working implementations. This is the benchmark closest to what the model was designed to do.

The Thinking variant scores 58.4% on AIME 2025 and 2026, combined. AIME (American Invitational Mathematics Examination) tests multi-step mathematical reasoning, competition-level math, not software engineering. The same technical report shows Qwen3.5-4B, a general-purpose dense model at less than a third of Mellum2’s total parameter count, scoring 68.3% on the same AIME evaluation.

What Changes With MoE Architecture for Developer Tooling

Dense model (e.g., 12B parameters)

All 12B parameters activate per forward pass, predictable compute cost, straightforward deployment, strong on general reasoning tasks

→

Mellum2 (12B MoE, 2.5B active)

2.5B parameters activate per token via expert routing, lower per-token compute, latency advantage compounds over long coding sessions, trades general math reasoning for coding-task specialization

Disputed Claim

Mellum2 delivers up to 2x faster inference than comparable dense models

Vendor-stated in JetBrains announcement; not confirmed in available arXiv abstract or independent benchmark

Architecture supports the directional claim. Magnitude unverified. Run inference benchmarks on your target hardware before citing this figure in architecture decisions.

A 4B dense model beating a 12B MoE on math. This is the comparison JetBrains included in their own report.

It’s not a surprise if you understand what’s happening. Qwen3.5-4B is trained as a general reasoning model with strong mathematical instruction-following. Mellum2 is trained on a coding-specific corpus with coding-specific reward signals. Mathematical olympiad problems require a different kind of generalization than code debugging. The model that activates only 2.5B parameters per token and routes through coding-specialized experts is not the tool you want for AIME, even if it’s substantially larger by parameter count.

What this reveals: benchmark selection reveals use case fit. A model that scores well on LiveCodeBench and poorly on AIME is telling you exactly where to deploy it. Treat Mellum2’s benchmark split as a specification sheet, not a report card.

All benchmark figures cited here are vendor-reported per JetBrains’ technical report. No independent evaluation of Mellum2’s performance was available at the time of production. The LiveCodeBench 69.9% figure is plausible and specific, but practitioners should treat it as qualified until a third-party reproduction confirms it.

Deployment Reality, Who Should Deploy This, and on What Stack

vLLM is the clear path. Native support is confirmed and the inference server handles MoE routing without custom configuration. If vLLM is already your inference layer for other models, Mellum2 integrates within your existing stack.

Transformers-based pipelines work but carry overhead. The custom MoE architecture doesn’t map cleanly to the generic Transformers inference path. Expect some friction in configuration, and don’t expect out-of-the-box speed parity with vLLM on the same hardware.

Ollama is a problem. Early community reports flag compatibility issues with the custom architecture. JetBrains hasn’t issued official guidance on Ollama support as of this brief. If your development environment runs on Ollama for local inference, a common setup for individual developers, hold off on Mellum2 until official compatibility confirmation arrives or community workarounds stabilize.

The hardware picture: JetBrains hasn’t disclosed the recommended hardware specification for self-hosted deployment. The 12B parameter count requires loading all expert weights into memory even when only 2.5B activate per forward pass. Consumer-grade workstations with 24GB VRAM may handle this in lower-precision formats (Q4 or similar quantization), but there’s no confirmed guidance from JetBrains on minimum viable hardware at publication time. Plan for this unknown before committing to deployment.

Where Mellum2 Sits in the Open-Source Coding Model Field

The open-source coding model space has become crowded fast. Qwen3-Coder, Codex (now on Bedrock), various Llama derivatives, and purpose-built coding specialists are all competing for the same developer mindshare. What Mellum2 adds is a specific combination: MoE architecture with coding-task specialization, from a company with direct IDE distribution through JetBrains’ own tooling ecosystem.

That last point matters more than the benchmark. JetBrains ships to millions of developers through IntelliJ IDEA, PyCharm, GoLand, and related IDEs. An open-source Mellum2 available through standard model hosting gives JetBrains a pathway to developer-hosted inference that complements their cloud-hosted AI features. Open-sourcing the model isn’t just goodwill, it builds familiarity with the model family among developers who might otherwise use a competitor’s coding assistant.

What to Watch

Independent LiveCodeBench v6 reproduction of 69.9% claim4-6 weeks

Epoch AI evaluation listing for Mellum26-8 weeks

JetBrains official Ollama compatibility guidance2-3 weeks

JetBrains IDE product announcements integrating Mellum2Q3 2026

Analysis

Mellum2's benchmark split is a deployment specification: strong on coding tasks (LiveCodeBench 69.9%), behind a smaller dense model on math (AIME 58.4% vs. Qwen3.5-4B 68.3%). The model is doing exactly what it was designed to do, which means teams need to match deployment context to capability profile, not treat total parameter count as a quality signal.

The broader agentic coding tool pattern is worth tracking alongside this release. Multiple vendors are converging on similar architectural decisions: smaller active-parameter footprints, tool-use specialization, long-context handling for multi-file codebases. Mellum2 fits that pattern.

What to Watch

Three signals will clarify Mellum2’s real-world position in the coming weeks:

Independent LiveCodeBench evaluation is the most important. JetBrains’ 69.9% figure is specific enough to be testable, and the community has the tools to reproduce it. An independent result within 2-3 percentage points upgrades the claim from vendor-qualified to confirmed. A significant gap in either direction tells you something important about the technical report’s conditions.

Ollama compatibility resolution, either official support or a stable community workaround, will determine whether individual developer adoption accelerates or stalls. The developer who can run Mellum2 locally on their laptop is a different deployment context than the team running vLLM on a GPU server.

JetBrains IDE integration announcements are the longer-term signal. An open-source release that feeds back into JetBrains’ own tooling as the recommended local model changes the competitive dynamic for coding assistants. Watch for product announcements in the next two to three months.

TJS Synthesis

Deploy Mellum2 where it was designed to work: code generation, debugging, tool-use sequences, agentic coding pipelines on vLLM. Don’t deploy it where Qwen3.5-4B already performs better at a fraction of the memory cost, mathematical reasoning tasks, general instruction following, or contexts where AIME-style multi-step logic is the bottleneck.

The benchmark split isn’t a warning sign. It’s a deployment specification. JetBrains built a coding specialist and published the evidence that it’s a coding specialist. That’s more useful than a model claiming top scores on every benchmark. Wait for independent LiveCodeBench confirmation before replacing your current coding model in production, but start the evaluation now. Apache 2.0 licensing means no friction on the business side.

View Source

More Technology intelligence

View all Technology

Gallery

Contacts