Mixture of Experts: many experts, only a few wake up
How can a model hold hundreds of billions of parameters yet stay affordable to run? The trick is sparsity: instead of one big feed-forward block, a layer holds many expert sub-networks, and a small router sends each token to only a couple of them. Total capacity grows; the compute spent per token barely moves. See the router decide — right here on the page.
01 What a mixture of experts replaces
In a standard Transformer block, every token passes through the same single feed-forward sub-network. A mixture of experts (MoE) swaps that one block for several parallel "expert" sub-networks plus a small gating network — usually called the router — that decides which experts handle a given token. The original idea goes back to Jacobs, Jordan, Nowlan & Hinton (1991), where multiple expert networks each learned a subset of cases and a gating network combined them; Shazeer et al. (2017) introduced the modern sparsely-gated MoE layer that made this practical inside deep networks.
One caution worth internalising up front: an "expert" here is a sub-network inside one model, not a separately trained standalone model. There is no "law expert" or "medical expert" you can point to — the experts specialise in subtle, learned ways, and which token goes where is decided automatically by the router.
- MoE replaces one dense feed-forward block with many expert sub-networks + a router that picks which experts run.
- The concept dates to Jacobs et al. (1991); the modern sparsely-gated layer is from Shazeer et al. (2017).
- "Experts" are sub-networks within one model — not full, independent models.
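To make the "experts plus a router" picture concrete, here is a minimal sketch of one MoE layer processing a single token vector, in plain Python. The sizes, the random weights, and the helper names (`make_expert`, `moe_layer`) are hypothetical and chosen only for illustration. Normalising the gate weights with a softmax over the selected logits is one common choice, not the only one.

```python
import math
import random


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    # Multiply a (rows x cols) matrix by a length-cols vector.
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def make_expert(rng, d_model, d_hidden):
    # Hypothetical expert: a tiny two-layer feed-forward network with ReLU.
    w_in = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_hidden)]
    w_out = [[rng.gauss(0, 0.1) for _ in range(d_hidden)] for _ in range(d_model)]

    def expert(x):
        hidden = [max(0.0, h) for h in matvec(w_in, x)]
        return matvec(w_out, hidden)

    return expert


def moe_layer(x, router_weights, experts, k):
    """Route one token vector to its top-k experts and mix their outputs."""
    logits = matvec(router_weights, x)  # one score per expert
    top = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:k]
    gates = softmax([logits[i] for i in top])  # weights over the chosen experts only
    out = [0.0] * len(x)
    for gate, idx in zip(gates, top):
        y = experts[idx](x)  # only the selected experts are evaluated
        out = [o + gate * v for o, v in zip(out, y)]
    return out, top


rng = random.Random(0)
D_MODEL, D_HIDDEN, N_EXPERTS, TOP_K = 4, 8, 8, 2  # illustrative sizes
experts = [make_expert(rng, D_MODEL, D_HIDDEN) for _ in range(N_EXPERTS)]
router_weights = [[rng.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(N_EXPERTS)]
token = [rng.gauss(0, 1) for _ in range(D_MODEL)]

output, chosen = moe_layer(token, router_weights, experts, TOP_K)
print("experts used for this token:", chosen)
```

The six experts that were not chosen are never called for this token, and that skipped work is where the compute saving comes from.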
02 The key idea: sparse activation
Here is the single most important distinction in this whole lesson, and the one people most often get wrong: total parameters are not the same as active parameters. The router selects only a small number of experts — the top-k — for each token, so only a fraction of the model's parameters actually run for any given token. This is conditional computation: the network decides, per token, which parts of itself to use.
The payoff is that MoE decouples a model's total parameter count from its per-token compute cost. Total capacity can grow very large while the FLOPs spent on each token stay roughly constant. As a concrete reference point, the Mixtral paper (Jiang et al., 2024) describes a model with 8 experts per layer where the router selects 2 per token — so only part of the network runs for any one token even though all eight experts must be held in memory.
- Total parameters = the sum across all experts. Active parameters = only the experts the router picked for this token.
- Selecting top-k experts per token is sparse, conditional computation — the model uses different parts of itself for different tokens.
- The result: capacity scales up; per-token compute stays roughly flat (per Shazeer 2017; GLaM, Du et al. 2021).
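The total-versus-active distinction is easy to check with arithmetic. The sketch below assumes equally sized experts and lumps everything that always runs (attention, embeddings, and so on) into one `shared_params` number. Every figure is made up for illustration and does not describe any real model.

```python
def moe_param_counts(n_layers, n_experts, top_k, params_per_expert, shared_params):
    """Return (total, active-per-token) parameter counts for a simplified MoE model."""
    total = shared_params + n_layers * n_experts * params_per_expert
    active = shared_params + n_layers * top_k * params_per_expert
    return total, active


# Hypothetical model: 24 layers, 16 experts per layer, 2 experts chosen per token.
total, active = moe_param_counts(
    n_layers=24,
    n_experts=16,
    top_k=2,
    params_per_expert=50_000_000,
    shared_params=500_000_000,
)
print(f"total parameters:  {total:,}")
print(f"active per token:  {active:,}")
print(f"active fraction:   {active / total:.1%}")
```

Raising `n_experts` in this sketch increases `total` but leaves `active` unchanged, which is the decoupling described above.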
03 See the router decide
A row of tokens flows into the gating network on the left. For each token, the router scores all N experts and lights up only its top-k: those experts process the token; the rest stay dark. Change k and the routing scheme, then press "Route the tokens" and watch which experts wake up. The counters show how active parameters per token stay small even as total parameters (all experts) grow. The figures here are illustrative, not measured from any specific model; they exist to make the sparsity idea tangible.
Illustrative only. Bars show the expert load tally across the routed tokens, not a measured benchmark. "Active params / token" assumes experts of roughly equal size, so k of N experts ≈ k/N of the expert parameters.
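The same idea can be reproduced as a small text-only simulation. This is a sketch, not the code behind the visualizer: the router scores are random numbers standing in for a trained gating network, and the function name `route_tokens` is hypothetical.

```python
import random
from collections import Counter


def route_tokens(n_tokens, n_experts, k, seed=0):
    """Pick the top-k experts for each token and tally how often each expert is used."""
    rng = random.Random(seed)
    load = Counter({e: 0 for e in range(n_experts)})
    for _ in range(n_tokens):
        # Stand-in for router logits; a real router computes these from the token.
        scores = [rng.gauss(0, 1) for _ in range(n_experts)]
        top = sorted(range(n_experts), key=lambda e: scores[e], reverse=True)[:k]
        load.update(top)
    return load


N_TOKENS, N_EXPERTS, K = 64, 8, 2
load = route_tokens(N_TOKENS, N_EXPERTS, K)

for e in range(N_EXPERTS):
    print(f"expert {e}: {'#' * load[e]} ({load[e]})")

# With equally sized experts, k of N experts is about k/N of the expert parameters.
active_fraction = K / N_EXPERTS
print(f"active expert params per token: {active_fraction:.0%}")
```

Each token contributes exactly k entries to the tally, so the bars always sum to `n_tokens * k` however unevenly they are spread.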
04 How routing works, and how it broke and got fixed
Routing is where most of the engineering lives, because a naive router has a nasty failure mode: it can collapse onto a few popular experts, leaving the rest barely trained and wasting the model's capacity. Different papers attack this differently, and it helps to state which scheme a claim refers to rather than treating MoE routing as one monolithic thing.
- GShard (Lepikhin et al., 2020) — used top-2 routing with automatic sharding to scale sparsely-gated MoE Transformers across many accelerators.
- Switch Transformer (Fedus et al., 2021) — simplified routing all the way to top-1 (one expert per token) to cut communication and complexity.
- Load-balancing loss — an auxiliary term added during training to push the router toward even expert utilisation, the standard remedy for collapse.
- BASE Layers (Lewis et al., 2021) — framed token-to-expert assignment as a balanced linear assignment problem, removing the auxiliary loss and its extra hyperparameters.
- Expert Choice routing (Zhou et al., 2022) — inverted the selection so experts pick their top-k tokens, guaranteeing balanced load and allowing a variable number of experts per token.
- DeepSeek-V3 (2024) — introduced an auxiliary-loss-free load-balancing strategy, balancing experts without the usual extra loss term.
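To see what a load-balancing loss actually measures, here is a sketch of one widely used form, the one described in the Switch Transformer paper for top-1 routing: the number of experts times the sum, over experts, of (fraction of tokens dispatched to that expert) times (mean router probability for that expert), scaled by a small coefficient. The toy logits and the `alpha` default are illustrative assumptions; other papers use different formulations.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def load_balancing_loss(router_logits, alpha=0.01):
    """Switch-style auxiliary loss for top-1 routing over a batch of tokens."""
    n_tokens = len(router_logits)
    n_experts = len(router_logits[0])
    probs = [softmax(row) for row in router_logits]

    # f_i: fraction of tokens whose top-1 expert is i.
    dispatch = [0.0] * n_experts
    for p in probs:
        winner = max(range(n_experts), key=lambda i: p[i])
        dispatch[winner] += 1.0 / n_tokens

    # P_i: mean router probability assigned to expert i.
    mean_prob = [sum(p[i] for p in probs) / n_tokens for i in range(n_experts)]

    return alpha * n_experts * sum(f * q for f, q in zip(dispatch, mean_prob))


# Toy batches of 8 tokens and 4 experts (hypothetical logits).
balanced = [[5.0 if i == t % 4 else 0.0 for i in range(4)] for t in range(8)]
collapsed = [[5.0, 0.0, 0.0, 0.0] for _ in range(8)]

balanced_loss = load_balancing_loss(balanced)
collapsed_loss = load_balancing_loss(collapsed)
print(f"balanced routing:  {balanced_loss:.4f}")
print(f"collapsed routing: {collapsed_loss:.4f}")
```

The loss is smallest when tokens are spread evenly and grows as traffic piles onto one expert, so adding it to the training objective nudges the router away from collapse.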
Two more design ideas are worth knowing. DeepSeekMoE (Dai et al., 2024) improves expert specialisation through fine-grained expert segmentation plus shared (always-on) experts that capture common knowledge, so the routed experts can specialise without redundancy. And because sparse models can be unstable to train and deliver uncertain quality when fine-tuned, ST-MoE (Zoph et al., 2022) contributed stabilising design choices (such as a router z-loss) aimed at making sparse expert models stable and transferable.
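The router z-loss mentioned above can be sketched in a few lines. As formulated in ST-MoE, it is the mean over tokens of the squared log-sum-exp of each token's router logits, which penalises logits that grow large. The toy logits below are hypothetical, and in training the term is multiplied by a small coefficient that is omitted here.

```python
import math


def logsumexp(xs):
    # Stable log(sum(exp(x))) for a list of floats.
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def router_z_loss(router_logits):
    """Mean squared log-sum-exp of the router logits, one row per token."""
    return sum(logsumexp(row) ** 2 for row in router_logits) / len(router_logits)


# Same routing preferences, but the second batch has logits ten times larger.
small_logits = [[0.1, -0.2, 0.0, 0.3] for _ in range(4)]
large_logits = [[10 * x for x in row] for row in small_logits]

z_small = router_z_loss(small_logits)
z_large = router_z_loss(large_logits)
print(f"z-loss with small logits: {z_small:.3f}")
print(f"z-loss with large logits: {z_large:.3f}")
```

Both batches pick the same top expert, yet the larger logits cost more, so the penalty discourages magnitude rather than any particular routing decision.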
05 Why teams use MoE, and what it costs
The appeal is straightforward: for a given quality target, an MoE model can pretrain faster and infer faster than a dense model of the same total size, because each token only touches a slice of the parameters. GLaM (Du et al., 2021) reported a sparsely-activated MoE language model reaching comparable quality to dense models at far lower training and inference compute and energy. But sparsity is not free.
- Memory: all experts must be held in memory even though only a few run per token — so serving needs enough RAM/VRAM for the total parameters, not just the active ones (per Hugging Face's explainer).
- Routing & communication complexity: sending tokens to the right experts — often across devices — adds engineering and communication overhead.
- Training stability & fine-tuning: sparse models can be harder to train stably and to fine-tune well (ST-MoE, 2022, exists precisely to address this).
- The recurring trap: never quote total parameters as if they were the compute cost. State total vs active explicitly — conflating them is the most common MoE misconception.
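A back-of-the-envelope sketch shows why memory follows total parameters while compute follows active ones. It assumes 2 bytes per parameter (16-bit weights) and the common rule of thumb of roughly 2 FLOPs per active parameter per token for a forward pass. It ignores activations, the KV-cache, and communication, and the model sizes are hypothetical.

```python
def serving_estimate(total_params, active_params, bytes_per_param=2):
    """Rough weight memory (GB) and forward FLOPs per token for an MoE model."""
    weight_memory_gb = total_params * bytes_per_param / 1e9  # every expert is resident
    flops_per_token = 2 * active_params  # only routed experts do work
    return weight_memory_gb, flops_per_token


# Hypothetical MoE: 40B total parameters, 10B active per token.
moe_memory_gb, moe_flops = serving_estimate(40_000_000_000, 10_000_000_000)

# A dense model of the same total size runs every parameter for every token.
dense_memory_gb, dense_flops = serving_estimate(40_000_000_000, 40_000_000_000)

print(f"MoE:   {moe_memory_gb:.0f} GB of weights, {moe_flops:.1e} FLOPs/token")
print(f"Dense: {dense_memory_gb:.0f} GB of weights, {dense_flops:.1e} FLOPs/token")
```

Under these assumptions the two models need the same weight memory, but the MoE spends a quarter of the per-token compute, which is why both numbers need to be stated.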
Production systems lean on these tradeoffs deliberately. DeepSeek-V2 (2024) combined DeepSeekMoE with Multi-head Latent Attention (MLA) for an economical, efficient model, and DeepSeek-V3 (2024) scaled to a large MoE where most parameters stay inactive per token: the whole point of the architecture, turned into a serving strategy.
Concept map
The whole lesson at a glance.
What a mixture of experts replaces
- MoE swaps one dense feed-forward block for many expert sub-networks plus a router that picks which experts run.
- The concept dates to Jacobs et al. (1991); the modern sparsely-gated layer is from Shazeer et al. (2017).
- "Experts" are sub-networks within one model — not full, independent models, and not human-named domains.
The key idea: sparse activation
- Total parameters ≠ active parameters: the router selects only the top-k experts per token.
- This conditional computation decouples total capacity from per-token compute cost.
- Mixtral (Jiang et al., 2024) describes 8 experts per layer with the router selecting 2 per token.
The router and how it decides
- A small gating network scores all N experts per token and lights up only its top-k.
- Active params per token stay small even as total params (all experts) grow.
- Naive token-choice routing can pile traffic onto a few experts — "expert collapse" — which balancing addresses.
Routing schemes & milestones
- GShard (2020): top-2 routing with automatic sharding. Switch Transformer (2021): simplified to top-1.
- Load-balancing loss is the standard remedy for collapse; BASE Layers (2021) framed assignment as balanced linear assignment.
- Expert Choice (2022): experts pick their top-k tokens. DeepSeek-V3 (2024): auxiliary-loss-free balancing.
- DeepSeekMoE (2024): fine-grained experts + shared always-on experts; ST-MoE (2022): stabilising design (router z-loss).
Why use MoE — and what it costs
- Upside: faster pretraining and inference for a quality target, since each token touches only a slice (GLaM, Du et al. 2021).
- Memory: all experts must be held in memory even though only a few run per token.
- Complexity: routing/communication overhead and harder training stability and fine-tuning (ST-MoE).
- The recurring trap: never quote total parameters as if they were the compute cost — state total vs active.
Published by Tech Jacks Solutions · Reviewed June 2026. This lesson explains established concepts and is grounded in the references below; figures shown in the interactive are illustrative and labelled as such. Quantitative claims are attributed to the specific paper that reported them.
- Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer — Shazeer et al. (2017)
- GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding — Lepikhin et al. (2020)
- Switch Transformers: Scaling to Trillion Parameter Models — Fedus, Zoph & Shazeer (2021)
- BASE Layers: Simplifying Training of Large, Sparse Models — Lewis et al. (2021)
- GLaM: Efficient Scaling of Language Models with Mixture-of-Experts — Du et al. (2021)
- Mixture-of-Experts with Expert Choice Routing — Zhou et al. (2022)
- ST-MoE: Designing Stable and Transferable Sparse Expert Models — Zoph et al. (2022)
- Mixtral of Experts — Jiang et al., Mistral AI (2024)
- DeepSeekMoE: Towards Ultimate Expert Specialization — Dai et al., DeepSeek-AI (2024)
- DeepSeek-V3 Technical Report — DeepSeek-AI (2024)
- A Survey on Mixture of Experts in Large Language Models — Cai et al. (2024)
- Mixture of Experts Explained — Hugging Face
- What is mixture of experts? — IBM
- What is Mixture of Experts (MoE)? — Red Hat
This is an educational explainer about a model architecture. The router visualizer uses illustrative numbers to convey the idea of sparse activation; it does not measure or benchmark any specific model. Parameter and speed figures mentioned in the text are claims reported by the cited papers and apply to those specific models — do not generalise them across systems.
"Experts" are sub-networks within a single model, not independent or human experts. When evaluating any AI model for real use, verify capabilities, costs, and limitations against the vendor's current documentation rather than architecture intuition alone.
Mixture of Experts (MoE) — in 8 minutes
Tech Jacks Solutions · AI Knowledge Hub · educational summary
What it replaces
MoE swaps one dense feed-forward block for many expert sub-networks plus a small router that picks which experts handle each token. Concept: Jacobs et al. (1991); modern sparsely-gated layer: Shazeer et al. (2017). Experts are sub-networks inside one model, not standalone models.
Sparse activation: total vs active
The router picks only the top-k experts per token, so only a fraction of parameters run per token (conditional computation). This decouples total parameters from per-token compute: capacity grows while per-token FLOPs stay roughly flat. Mixtral (2024): 8 experts/layer, 2 selected per token. Never quote total parameters as the per-token cost.
Routing & load balancing
GShard (2020): top-2 + sharding. Switch (2021): top-1. Naive routing can collapse onto a few experts; fixes include an auxiliary load-balancing loss, BASE Layers' balanced assignment (2021), Expert Choice routing (2022), and DeepSeek-V3's auxiliary-loss-free balancing (2024). DeepSeekMoE (2024) adds shared always-on experts; ST-MoE (2022) stabilises training.
Why use it — and what it costs
Benefit: faster pretraining and inference than a same-size dense model (e.g., GLaM, 2021). Costs: all experts must sit in memory, routing/communication adds complexity, and sparse models can be harder to train stably and fine-tune.