DeepSeek V4 Architecture Decoded: Hybrid Attention, MoE, and mHC (2026)
On April 24, 2026, DeepSeek released a preview of V4, its first model family built around a hybrid attention design rather than the Multi-head Latent Attention that defined V2 and V3. The headline number is a 1 million token context window that runs at a fraction of the compute a conventional transformer would need. s1
This breakdown decodes the V4 architecture: the two model sizes, the interleaved attention scheme that makes long context affordable, the manifold-constrained connections that replace the residual stream, and the training recipe. It also separates what DeepSeek has disclosed from what third parties have analyzed and from what remains unknown.
DeepSeek V4 at a Glance
V4 is a preview release, not a final general-availability model. DeepSeek published it with open weights under the MIT License, alongside a technical report and Hugging Face model cards for two variants: V4-Pro and V4-Flash. s1
Both variants are Mixture-of-Experts (MoE) transformers that activate only a small fraction of their total parameters on each token. Both support a 1 million token context natively, with output of up to 384K tokens. The defining change from prior DeepSeek generations is the attention stack, which the company built specifically so that a million-token window stays tractable to serve. s1
From MLA to V4: The DeepSeek Lineage
V4 did not appear in isolation. Each prior DeepSeek generation contributed a building block that V4 carries forward or replaces. The clearest way to read the architecture is as the latest step in a multi-year line of efficiency-focused designs.
Two Models: V4-Pro and V4-Flash
V4 ships in two sizes that share the same architectural ideas but differ in scale and in how their early layers are arranged. Both are MoE models, so the active parameter count, not the total, governs per-token inference cost. s1
| Specification | V4-Pro | V4-Flash |
|---|---|---|
| Total parameters | 1.6T | 284B |
| Active parameters per token | 49B | 13B |
| Layers | 61 | 43 |
| Pretraining tokens | 33T+ | 32T+ |
| Native context window | 1M tokens | 1M tokens |
| Max output | up to 384K tokens | up to 384K tokens |
| First two layers | HCA, then alternate | Sliding-window, then alternate |
| License | MIT (open weights) | MIT (open weights) |
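
To make the active-versus-total distinction concrete, the short sketch below computes the share of parameters that participate in each token's forward pass, using only the figures from the table. The dictionary layout and function name are illustrative choices, not anything DeepSeek publishes.

```python
# Parameter counts from the specification table, in billions.
MODELS = {
    "V4-Pro": {"total_b": 1600.0, "active_b": 49.0},
    "V4-Flash": {"total_b": 284.0, "active_b": 13.0},
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the model's parameters used for a single token."""
    return active_b / total_b


for name, spec in MODELS.items():
    frac = active_fraction(spec["total_b"], spec["active_b"])
    print(f"{name}: {frac:.1%} of parameters active per token")
# V4-Pro: 3.1% of parameters active per token
# V4-Flash: 4.6% of parameters active per token
```

Per-token compute scales with the active count, while memory needed to hold the weights scales with the total.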
Hybrid Attention: CSA Interleaved with HCA
The core of V4 is a two-branch attention design. Rather than apply one attention mechanism uniformly, V4 interleaves two complementary mechanisms across its layers, and adds a short uncompressed window for recent tokens. s1
CSA: Compressed Sparse Attention
Compressed Sparse Attention (CSA) inherits the sparse-selection idea from V3.2's DeepSeek Sparse Attention. As analyzed by Hugging Face and independent reviewers, CSA applies roughly 4x compression through softmax-gated pooling, then uses an FP4 "lightning indexer" to select the top-k most relevant blocks for each query. Attention is computed only over that selected subset rather than the full sequence. s3
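
The compress-then-select flow described above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not DeepSeek's implementation: the gate vector, the helper names, and the reuse of the pooled keys as indexer keys are all hypothetical, and everything runs in ordinary floating point rather than FP4.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gated_pool(vectors, gate, ratio=4):
    """Compress every `ratio` consecutive vectors into one weighted average.

    The weights come from a softmax over per-token gate scores (hypothetical gate).
    """
    pooled = []
    for start in range(0, len(vectors), ratio):
        block = vectors[start:start + ratio]
        weights = softmax([dot(v, gate) for v in block])
        dim = len(block[0])
        pooled.append([sum(w * v[d] for w, v in zip(weights, block)) for d in range(dim)])
    return pooled


def select_top_k(query, pooled_keys, k):
    """Stand-in for the indexer: rank compressed blocks by query relevance."""
    scores = [dot(query, key) for key in pooled_keys]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def sparse_attention(query, pooled_keys, pooled_values, k):
    """Attend only over the k selected compressed blocks."""
    chosen = select_top_k(query, pooled_keys, k)
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, pooled_keys[i]) * scale for i in chosen])
    dim = len(pooled_values[0])
    out = [sum(w * pooled_values[i][d] for w, i in zip(weights, chosen)) for d in range(dim)]
    return chosen, out


# 16 toy tokens compress to 4 blocks; the query then attends to only 2 of them.
keys = [[float(i), 1.0] for i in range(16)]
pooled = gated_pool(keys, gate=[0.0, 0.0])  # a zero gate reduces to mean pooling
chosen, output = sparse_attention([1.0, 0.0], pooled, pooled, k=2)
print(len(pooled), chosen)  # 4 [2, 3]
```

The cost of the final attention step depends on `k`, not on the sequence length, which is the property that matters at long context.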
HCA: Heavily Compressed Attention
Heavily Compressed Attention (HCA) takes compression much further, roughly 128x, then runs dense attention over the resulting compressed blocks. As analyzed by independent deep-dives, this gives the model a cheap, global view of the entire context, complementing CSA's sharper, selective view. s5
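
A quick calculation shows what those compression ratios mean at full context. The sketch below uses the approximate 4x and 128x figures quoted above; the rounding-up behavior and the function name are assumptions made for illustration.

```python
import math

CONTEXT_TOKENS = 1_000_000
CSA_RATIO = 4    # approximate ratio from the third-party analysis quoted above
HCA_RATIO = 128  # approximate ratio from the third-party analysis quoted above


def compressed_entries(tokens: int, ratio: int) -> int:
    """Number of compressed blocks left after pooling `ratio` tokens into one."""
    return math.ceil(tokens / ratio)


csa_entries = compressed_entries(CONTEXT_TOKENS, CSA_RATIO)
hca_entries = compressed_entries(CONTEXT_TOKENS, HCA_RATIO)
print(csa_entries)  # 250000 candidate blocks, of which CSA attends to only the top-k
print(hca_entries)  # 7813 blocks, all of which HCA attends to densely
```

Dense attention over roughly 7,800 blocks is cheap enough to run at every HCA layer, which is why the heavier compression can afford to skip the selection step.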
Sliding Window and Attention Sinks
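Both branches keep a 128-token sliding window of recent, uncompressed tokens so the model never loses fidelity on the most recent text. V4 also uses learnable attention sinks: a learned sink logit takes part in the softmax normalization, so the weights assigned to real tokens can sum to less than one. A head can therefore attend to nothing in particular when no token is relevant, which stabilizes attention over very long sequences. s5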
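
The effect of a sink logit is easy to see in a minimal sketch. The function below is a generic softmax-with-sink, assuming one scalar sink per head; how V4 parameterizes and learns its sinks is not described here, so treat the names and shapes as hypothetical.

```python
import math


def softmax_with_sink(scores, sink_logit):
    """Softmax over token scores with one extra sink logit in the denominator.

    The sink absorbs probability mass but contributes no value vector,
    so the returned token weights sum to less than one.
    """
    m = max(max(scores), sink_logit)
    exps = [math.exp(s - m) for s in scores]
    sink = math.exp(sink_logit - m)
    denom = sum(exps) + sink
    return [e / denom for e in exps]


weights = softmax_with_sink([0.0, 0.0], sink_logit=0.0)
print(sum(weights))  # about 0.667: one third of the mass went to the sink
```

With a very negative sink logit the function reduces to an ordinary softmax, so the model can learn how much mass each head is allowed to discard.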
How the Layers Interleave
The two model sizes arrange their opening layers differently, then settle into the same alternating pattern: s3
- V4-Flash: the first two layers use the sliding-window mechanism, then the remaining layers alternate between CSA and HCA.
- V4-Pro: the first two layers use HCA, then the remaining layers alternate between CSA and HCA.
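
The two layouts can be written out as a small schedule generator. The description above does not say whether CSA or HCA comes first once alternation begins, so the sketch takes that as a parameter and defaults to CSA purely as an assumption; the `"SWA"` label for sliding-window layers is also a hypothetical shorthand.

```python
def layer_schedule(variant: str, num_layers: int, first_alternating: str = "CSA"):
    """Return the attention type of every layer for a given variant.

    variant: "pro" (opens with two HCA layers) or "flash" (opens with two
    sliding-window layers). The order of alternation is an assumption.
    """
    opening = {"pro": "HCA", "flash": "SWA"}[variant]
    other = "HCA" if first_alternating == "CSA" else "CSA"
    layers = [opening, opening]
    for i in range(num_layers - 2):
        layers.append(first_alternating if i % 2 == 0 else other)
    return layers


pro = layer_schedule("pro", 61)
flash = layer_schedule("flash", 43)
print(pro[:6])    # ['HCA', 'HCA', 'CSA', 'HCA', 'CSA', 'HCA']
print(flash[:6])  # ['SWA', 'SWA', 'CSA', 'HCA', 'CSA', 'HCA']
```

Layer counts of 61 and 43 come from the specification table; only the opening two layers differ between the variants.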
mHC: Replacing the Residual Stream
Most transformers move information between layers through a residual stream: each layer adds its output back to a running sum. V4 replaces that with Manifold-Constrained Hyper-Connections (mHC), a learned connection scheme designed to keep very deep stacks stable. s5
Two ideas define mHC. First, it widens the inter-layer pathway by a factor of n_hc=4, giving the network more than a single channel to carry information between layers. Second, the matrix that mixes those channels is constrained to the Birkhoff polytope, meaning it is kept doubly stochastic (every row and column sums to one). s5
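The practical payoff of that constraint is numerical. A doubly stochastic matrix has a spectral norm of exactly one, so mixing the channels can never amplify the signal, and because every column sums to one, the total carried across the channels is preserved. Together these properties keep signals from exploding or draining away as they pass through dozens of layers. For a 61-layer model like V4-Pro, that bound is what keeps a much deeper, wider network trainable. s5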
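
One standard way to push a matrix onto the Birkhoff polytope is Sinkhorn normalization: exponentiate, then alternately rescale rows and columns. The sketch below uses it to build a 4x4 mixing matrix and checks the two properties that matter. It is an illustration under assumptions, not V4's code: the iteration count, the toy logits, and the use of scalar streams in place of full hidden vectors are all simplifications.

```python
import math


def sinkhorn(logits, iterations=50):
    """Approximate a doubly stochastic matrix from unconstrained logits."""
    m = [[math.exp(x) for x in row] for row in logits]
    n = len(m)
    for _ in range(iterations):
        # Normalize each row to sum to one.
        m = [[x / sum(row) for x in row] for row in m]
        # Normalize each column to sum to one.
        col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return m


def mix(matrix, streams):
    """Mix the parallel streams: out[i] = sum_j matrix[i][j] * streams[j]."""
    return [sum(w * s for w, s in zip(row, streams)) for row in matrix]


N_HC = 4  # number of parallel streams, matching n_hc=4 in the text
logits = [[0.3 * ((i * j) % 4) for j in range(N_HC)] for i in range(N_HC)]
mixing = sinkhorn(logits)

streams = [4.0, -2.0, 1.0, 0.5]
mixed = mix(mixing, streams)
print(sum(streams), round(sum(mixed), 6))  # the total is preserved: 3.5 3.5
print(max(abs(x) for x in mixed) <= max(abs(x) for x in streams))  # True
```

Each output is a weighted average of the inputs, so no stream can grow beyond the largest input, and the column constraint keeps the total from shrinking.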
Optimizer and Precision: Muon and FP4 QAT
V4's training recipe is as much a part of its efficiency story as its attention. Two choices stand out: the optimizer and the numerical precision used for the experts. s5
Muon for Most Parameters, AdamW for Embeddings
DeepSeek trains most of V4 with the Muon optimizer, which applies Newton-Schulz orthogonalization to the weight updates so that gradient steps stay well-conditioned. Embeddings, which behave differently, are trained with the more conventional AdamW. Using Muon for the bulk of the network and AdamW for embeddings is a deliberate split rather than a single global choice. s5
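
Newton-Schulz orthogonalization can be shown on a tiny matrix. The sketch below uses the classical cubic iteration, which drives every singular value of the update toward one. It is a simplified stand-in: Muon implementations typically use a tuned higher-order polynomial, a handful of steps, and momentum, none of which are modeled here, and the helper names are hypothetical.

```python
import math


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def newton_schulz_orthogonalize(update, steps=30):
    """Approximate the nearest orthogonal matrix to `update`.

    Iterates X <- 1.5 * X - 0.5 * (X X^T) X after scaling X so that all
    singular values start at or below one, where the iteration converges.
    """
    fro = math.sqrt(sum(v * v for row in update for v in row))
    x = [[v / fro for v in row] for row in update]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * x[i][j] - 0.5 * xxt_x[i][j] for j in range(len(x[0]))]
             for i in range(len(x))]
    return x


raw_update = [[2.0, 0.0], [1.0, 1.0]]  # toy gradient-like matrix
ortho = newton_schulz_orthogonalize(raw_update)
gram = matmul(ortho, transpose(ortho))
print([[round(v, 6) for v in row] for row in gram])  # close to the identity matrix
```

Because the orthogonalized update has equal strength in every direction, no single dominant direction in the gradient can swamp the step, which is the sense in which the update stays well-conditioned.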
FP4 Quantization-Aware Training
V4 applies FP4 Quantization-Aware Training (QAT) to its MoE expert weights and to the CSA indexer, with FP8 used elsewhere. Training the experts with 4-bit quantization in the loop, rather than quantizing only after training, keeps the very large expert pool affordable to store and serve while limiting the accuracy loss that naive post-training quantization would cause. s5
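
The core operation in quantization-aware training is "fake quantization": weights are rounded to the low-precision grid in the forward pass while a full-precision copy keeps receiving updates. The sketch below rounds a block of weights to the E2M1 4-bit float grid with one shared scale. The choice of E2M1, the per-block max scaling, and the function name are assumptions for illustration; the exact FP4 format and scaling scheme V4 uses are not specified here.

```python
import math

# Magnitudes representable by a 4-bit E2M1 float (sign bit handled separately).
FP4_E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def fake_quantize_fp4(weights):
    """Round a block of weights to the FP4 grid, then map back to real scale.

    The scale is chosen so the largest magnitude in the block lands on 6.0,
    the top of the grid (an assumed scaling rule).
    """
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    scale = max_abs / FP4_E2M1_MAGNITUDES[-1]
    quantized = []
    for w in weights:
        target = abs(w) / scale
        nearest = min(FP4_E2M1_MAGNITUDES, key=lambda g: abs(g - target))
        quantized.append(math.copysign(nearest * scale, w))
    return quantized


block = [3.0, -1.1, 0.4, 0.0]
print(fake_quantize_fp4(block))  # [3.0, -1.0, 0.5, 0.0]
```

In a training loop the forward pass would use the rounded values while gradients flow to the unrounded weights, commonly through a straight-through estimator, so the network learns weights that survive the rounding.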
Efficiency at 1M Context
The point of the hybrid attention stack is measurable: at a 1 million token context, V4 uses a small fraction of the compute and memory that DeepSeek's own V3.2 would need for the same window. DeepSeek reports the following comparisons against V3.2. s1
Set against a conventional grouped-query attention baseline in bf16, the gap is even wider: DeepSeek reports that V4's KV cache is roughly 2 percent of the size a GQA-bf16 model would carry at the same context. That reduction is what turns a 1 million token window from a research demonstration into something practical to serve. s1
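
To give the 2 percent figure a sense of scale, the sketch below sizes the KV cache of a grouped-query attention model in bf16 at 1 million tokens and then takes 2 percent of it. The baseline shape is hypothetical: 61 layers matches V4-Pro's depth, but 8 KV heads and a head dimension of 128 are assumed values chosen only to make the arithmetic concrete, not a configuration DeepSeek reports.

```python
def gqa_kv_cache_bytes(layers, kv_heads, head_dim, tokens, bytes_per_value=2):
    """KV cache size: keys and values, per layer, per KV head, per token.

    bytes_per_value=2 corresponds to bf16.
    """
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value


# Hypothetical GQA baseline at the same depth as V4-Pro.
baseline = gqa_kv_cache_bytes(layers=61, kv_heads=8, head_dim=128, tokens=1_000_000)
v4_estimate = 0.02 * baseline  # "roughly 2 percent", as reported

print(f"GQA bf16 baseline: {baseline / 1e9:.1f} GB")   # 249.9 GB
print(f"2 percent of that: {v4_estimate / 1e9:.1f} GB")  # 5.0 GB
```

Under these assumed dimensions the baseline cache for a single sequence would exceed the memory of any single accelerator, while 2 percent of it fits comfortably on one.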
What Is Not Disclosed
A responsible reading of the V4 architecture means being explicit about its gaps. Two figures that readers often expect are not available, and the preview status itself is a caveat.
DeepSeek and the DeepSeek logo are trademarks of their respective owner. This article is independent editorial content from Tech Jacks Solutions and is not affiliated with, endorsed by, or sponsored by DeepSeek.