DeepSeek V4 Architecture Decoded: Hybrid Attention, MoE, and mHC (2026)

1.6T

V4-Pro total parameters, with 49B active per token

1M

Native context window in tokens on both V4-Pro and V4-Flash, with up to 384K output tokens

10%

V4-Pro KV cache at 1M context vs V3.2, as reported by DeepSeek

MIT

Open-weight license on the V4 preview, released April 24, 2026

On April 24, 2026, DeepSeek released a preview of V4, its first model family built around a hybrid attention design rather than the Multi-head Latent Attention that defined V2 and V3. The headline feature is a 1 million token context window that, by DeepSeek's own figures, needs only a fraction of the compute and KV-cache memory that earlier attention designs require at that length. s1

This breakdown decodes the V4 architecture: the two model sizes, the interleaved attention scheme that makes long context affordable, the manifold-constrained connections that replace the residual stream, and the training recipe. It also separates what DeepSeek has disclosed from what third parties have reconstructed, and flags what remains unknown.

DeepSeek V4 at a Glance

V4 is a preview release, not a final general-availability model. DeepSeek published it with open weights under the MIT License, alongside a technical report and Hugging Face model cards for two variants: V4-Pro and V4-Flash. s1

Both variants are Mixture-of-Experts (MoE) transformers that activate only a small fraction of their total parameters on each token. Both support a 1 million token context natively, with output of up to 384K tokens. The defining change from prior DeepSeek generations is the attention stack, which the company built specifically so that a million-token window stays tractable to serve. s1
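To make "activates only a small fraction of parameters" concrete, here is a minimal sketch of generic top-k expert routing in NumPy. The expert count (8), the top-k value (2), and the layer sizes are hypothetical toy values. DeepSeek has not published V4's routing topology, so this illustrates the general MoE idea rather than V4's router.

```python
import numpy as np

def moe_forward(x, router_w, experts, top_k=2):
    """Route one token through its top_k experts (generic MoE, toy sizes)."""
    logits = x @ router_w                      # one score per expert
    chosen = np.argsort(logits)[-top_k:]       # indices of the top_k experts
    gates = np.exp(logits[chosen] - logits[chosen].max())
    gates /= gates.sum()                       # softmax over the chosen experts only
    out = np.zeros_like(x)
    for g, idx in zip(gates, chosen):
        w_in, w_out = experts[idx]
        out += g * (np.maximum(x @ w_in, 0.0) @ w_out)   # small ReLU MLP expert
    return out, chosen

rng = np.random.default_rng(0)
d_model, d_ff, n_experts = 16, 32, 8           # hypothetical toy sizes
router_w = rng.normal(size=(d_model, n_experts))
experts = [(rng.normal(size=(d_model, d_ff)), rng.normal(size=(d_ff, d_model)))
           for _ in range(n_experts)]

token = rng.normal(size=d_model)
y, used = moe_forward(token, router_w, experts)
active_fraction = len(used) / n_experts        # 2 of 8 experts touched
print(f"experts used: {sorted(used.tolist())}, active fraction: {active_fraction:.2f}")
```

Only the chosen experts' weights take part in the computation for this token, which is why the active parameter count, not the total, drives per-token cost.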

Preview status: V4 is a preview. DeepSeek states the architecture may be refined before a final release, so specific figures here should be read as preview-stage values rather than locked specifications.

From MLA to V4: The DeepSeek Lineage

V4 did not appear in isolation. Each prior DeepSeek generation contributed a building block that V4 carries forward or replaces. The clearest way to read the architecture is as the latest step in a multi-year line of efficiency-focused designs.

DeepSeek Architecture Lineage

MAY 2024

DeepSeek V2: Multi-head Latent Attention

V2 introduced Multi-head Latent Attention (MLA), compressing the key-value cache into a low-rank latent to cut memory at long context. MLA became the signature efficiency lever of the V-series.

DECEMBER 2024

DeepSeek V3: Scaled MoE

V3 paired MLA with a large Mixture-of-Experts backbone and was trained for a disclosed $5.576M across 2.788M H800 GPU-hours, demonstrating frontier results at low training cost.

JANUARY 2025

DeepSeek R1: Reasoning

R1 added reinforcement-learning reasoning on top of the V3 base, producing a visible chain-of-thought and bringing wide public attention to the lineage.

DECEMBER 2025

DeepSeek V3.2: Sparse Attention

V3.2 introduced DeepSeek Sparse Attention (DSA), the immediate predecessor to V4's Compressed Sparse Attention and the baseline against which V4's efficiency gains are measured.

APRIL 24, 2026

DeepSeek V4 Preview: Hybrid Attention

V4-Pro (1.6T total, 49B active) and V4-Flash (284B total, 13B active) launch as a preview. Hybrid attention plus manifold-constrained connections replace MLA and the residual stream. MIT License.

Two Models: V4-Pro and V4-Flash

V4 ships in two sizes that share the same architectural ideas but differ in scale and in how their early layers are arranged. Both are MoE models, so the active parameter count, not the total, governs per-token inference cost. s1

Specification	V4-Pro	V4-Flash
Total parameters	1.6T	284B
Active parameters per token	49B	13B
Layers	61	43
Pretraining tokens	33T+	32T+
Native context window	1M tokens	1M tokens
Max output	up to 384K tokens	up to 384K tokens
First two layers	HCA, then alternating CSA and HCA	Sliding-window, then alternating CSA and HCA
License	MIT (open weights)	MIT (open weights)

Expert count not disclosed: DeepSeek has not published the exact number of MoE experts for either V4 variant in the available sources. The active parameter figures (49B for Pro, 13B for Flash) are disclosed; the expert routing topology that produces them is not fully specified.
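The gap between total and active parameters is easy to quantify from the disclosed figures. The short sketch below computes the active share for each variant. The FLOPs estimate uses the common rule of thumb of about two FLOPs per active weight per token, which is an assumption added here and ignores attention over the context.

```python
# Figures from the specification table above (preview-stage values).
MODELS = {
    "V4-Pro":   {"total": 1.6e12, "active": 49e9},
    "V4-Flash": {"total": 284e9,  "active": 13e9},
}

def active_fraction(total, active):
    """Share of the model's weights that take part in one token's forward pass."""
    return active / total

def approx_forward_flops_per_token(active):
    # Rule of thumb (an assumption here): ~2 FLOPs per active weight per token,
    # counting matmul multiply-adds only and ignoring attention over the context.
    return 2 * active

for name, p in MODELS.items():
    frac = active_fraction(p["total"], p["active"])
    flops = approx_forward_flops_per_token(p["active"])
    print(f"{name}: {frac:.2%} of weights active, ~{flops:.2e} FLOPs/token")
```

On these numbers, V4-Pro touches about 3.1 percent of its weights per token and V4-Flash about 4.6 percent.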

Hybrid Attention: CSA Interleaved with HCA

The core of V4 is a two-branch attention design. Rather than apply one attention mechanism uniformly, V4 interleaves two complementary mechanisms across its layers, and adds a short uncompressed window for recent tokens. s1

CSA: Compressed Sparse Attention

Compressed Sparse Attention (CSA) inherits the sparse-selection idea from V3.2's DeepSeek Sparse Attention. As analyzed by Hugging Face and independent reviewers, CSA applies roughly 4x compression through softmax-gated pooling, then uses an FP4 "lightning indexer" to select the top-k most relevant blocks for each query. Attention is computed only over that selected subset rather than the full sequence. s3
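The compress-then-select flow described above can be sketched in NumPy for a single query. Everything here is a simplified assumption: `gate_w`, the toy sizes, and the plain dot-product block score standing in for the FP4 lightning indexer are illustrative choices, and causal masking is omitted. It shows the shape of the idea, not DeepSeek's implementation.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def gated_pool(k, v, gate_w, ratio):
    """Compress (T, d) keys/values to (T // ratio, d) with a softmax gate per block."""
    T, d = k.shape
    n = T - T % ratio                                   # drop a ragged tail, if any
    kb = k[:n].reshape(-1, ratio, d)                    # (n_blocks, ratio, d)
    vb = v[:n].reshape(-1, ratio, d)
    gates = softmax(kb @ gate_w, axis=1)[..., None]     # weights inside each block
    return (gates * kb).sum(axis=1), (gates * vb).sum(axis=1)

def csa_like_attention(q, k, v, gate_w, ratio=4, top_k=8):
    k_c, v_c = gated_pool(k, v, gate_w, ratio)
    # Stand-in for the indexer: one cheap relevance score per compressed block.
    index_scores = k_c @ q
    keep = np.argsort(index_scores)[-top_k:]            # top-k blocks for this query
    weights = softmax(k_c[keep] @ q / np.sqrt(q.size))
    return weights @ v_c[keep], keep

rng = np.random.default_rng(0)
T, d = 64, 16                                           # toy sequence length and width
q = rng.normal(size=d)
k = rng.normal(size=(T, d))
v = rng.normal(size=(T, d))
gate_w = rng.normal(size=d)                             # hypothetical gate parameters

out, kept = csa_like_attention(q, k, v, gate_w)
print(out.shape, np.sort(kept))
```

With 64 tokens and 4x compression there are 16 compressed blocks, and the query attends to only 8 of them.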

HCA: Heavily Compressed Attention

Heavily Compressed Attention (HCA) takes compression much further, roughly 128x, then runs dense attention over the resulting compressed blocks. As analyzed by independent deep-dives, this gives the model a cheap, global view of the entire context, complementing CSA's sharper, selective view. s5
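A quick calculation shows why dense attention becomes affordable after heavy compression. The sketch below counts how many compressed entries a query would see at each ratio. It reads "1M" as 10**6 tokens and assumes simple fixed-size blocks, both of which are assumptions made for illustration.

```python
import math

CONTEXT_TOKENS = 1_000_000      # "1M" read as 10**6 here; an assumption

def compressed_entries(context, ratio):
    """Number of compressed blocks when every `ratio` tokens collapse into one."""
    return math.ceil(context / ratio)

hca_entries = compressed_entries(CONTEXT_TOKENS, ratio=128)   # dense attention runs over these
csa_entries = compressed_entries(CONTEXT_TOKENS, ratio=4)     # CSA then keeps only top-k of these

print(f"HCA-style (128x): {hca_entries:,} entries per query")
print(f"CSA-style (4x):   {csa_entries:,} entries before top-k selection")
```

Under these assumptions a dense pass over the 128x-compressed context scores 7,813 entries per query instead of one million, which is what makes a global view cheap.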

Sliding Window and Attention Sinks

Both branches keep a 128-token sliding window of recent, uncompressed tokens so the model never loses fidelity on the most recent text. V4 also uses learnable attention sinks: extra sink logits that take part in the softmax normalization, so the weights assigned to real tokens can sum to less than one. That slack stabilizes attention over very long sequences. s5
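The effect of a sink logit is easy to see in a few lines. This is a minimal sketch with a single scalar sink and toy scores. How V4 parameterizes its sinks (per head, per layer, or otherwise) is not specified in this article, so treat the details as assumptions.

```python
import numpy as np

def softmax_with_sink(scores, sink_logit):
    """Softmax over [scores, sink]; return only the weights on real tokens."""
    z = np.append(scores, sink_logit)
    z = z - z.max()                      # numerical stability
    w = np.exp(z) / np.exp(z).sum()
    return w[:-1]                        # the sink's share is dropped, not attended

scores = np.array([2.0, 1.0, 0.5, -1.0])   # toy query-key logits
weights = softmax_with_sink(scores, sink_logit=1.5)
print(weights, weights.sum())              # the sum is strictly below 1
```

When no token is a good match, the sink absorbs the probability mass that a plain softmax would be forced to spread across irrelevant tokens.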

How the Layers Interleave

The two model sizes arrange their opening layers differently, then settle into the same alternating pattern: s3

V4-Flash: the first two layers use the sliding-window mechanism, then the remaining layers alternate between CSA and HCA.
V4-Pro: the first two layers use HCA, then the remaining layers alternate between CSA and HCA.
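The two layouts can be written out as a small schedule generator. The layer counts come from the specification table. Which mechanism leads the alternation after the opening layers is not stated in this article, so starting with CSA is an assumption, exposed as the `first` parameter.

```python
from collections import Counter

def layer_schedule(n_layers, opening, first="CSA"):
    """Two opening layers of `opening`, then alternate CSA and HCA."""
    second = "HCA" if first == "CSA" else "CSA"
    rest = [first if i % 2 == 0 else second for i in range(n_layers - 2)]
    return [opening, opening] + rest

pro = layer_schedule(61, opening="HCA")      # V4-Pro: 61 layers
flash = layer_schedule(43, opening="SWA")    # V4-Flash: 43 layers (SWA = sliding window)

print("V4-Pro  ", pro[:6], Counter(pro))
print("V4-Flash", flash[:6], Counter(flash))
```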

Vendor report vs analysis: The model sizes, parameter counts, layer counts, and context window come from DeepSeek's own technical report and model cards. The detailed mechanics of CSA and HCA, including the compression ratios and the indexer behavior, are partly reconstructed by third-party analysts such as Hugging Face, NVIDIA, and independent reviewers, and are labeled as analysis above.

mHC: Replacing the Residual Stream

Most transformers move information between layers through a residual stream: each layer adds its output back to a running sum. V4 replaces that with Manifold-Constrained Hyper-Connections (mHC), a learned connection scheme designed to keep very deep stacks stable. s5

Two ideas define mHC. First, it widens the inter-layer pathway by a factor of n_hc=4, giving the network more than a single channel to carry information between layers. Second, the matrix that mixes those channels is constrained to the Birkhoff polytope, meaning it is kept doubly stochastic (non-negative entries, with every row and every column summing to one). s5

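The practical payoff of that constraint is numerical. A doubly stochastic matrix has a spectral norm of exactly one, so the mixing step cannot amplify a signal, and because its columns sum to one, the total signal across channels is conserved rather than decaying. Together these properties keep activations from exploding or vanishing as they pass through dozens of layers. For a 61-layer model like V4-Pro, that bound is what keeps a deep, widened network trainable. s5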
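The spectral-norm argument can be checked numerically. The sketch below uses Sinkhorn-Knopp normalization, a standard way to push a positive matrix toward the Birkhoff polytope. Whether V4 enforces the constraint this way is not stated in this article, so the projection method, the random logits, and the iteration count are all assumptions. It then multiplies 61 such 4x4 mixing matrices and compares against the same weights without the constraint.

```python
import numpy as np

def sinkhorn(logits, n_iters=500):
    """Approximate projection onto the Birkhoff polytope via Sinkhorn-Knopp."""
    m = np.exp(logits)                         # strictly positive entries
    for _ in range(n_iters):
        m = m / m.sum(axis=1, keepdims=True)   # make rows sum to one
        m = m / m.sum(axis=0, keepdims=True)   # make columns sum to one
    return m

rng = np.random.default_rng(0)
n_hc, depth = 4, 61                            # widening factor and V4-Pro layer count

constrained = np.eye(n_hc)
unconstrained = np.eye(n_hc)
for _ in range(depth):
    logits = rng.normal(size=(n_hc, n_hc))
    constrained = sinkhorn(logits) @ constrained
    unconstrained = np.exp(logits) @ unconstrained   # same weights, no constraint

norm_c = np.linalg.norm(constrained, 2)        # spectral norm (largest singular value)
norm_u = np.linalg.norm(unconstrained, 2)
print(f"constrained: {norm_c:.6f}   unconstrained: {norm_u:.3e}")
```

Because a product of doubly stochastic matrices is itself doubly stochastic, the constrained product keeps a spectral norm of one at any depth, while the unconstrained product grows by many orders of magnitude.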

Optimizer and Precision: Muon and FP4 QAT

V4's training recipe is as much a part of its efficiency story as its attention. Two choices stand out: the optimizer and the numerical precision used for the experts. s5

Muon for Most Parameters, AdamW for Embeddings

DeepSeek trains most of V4 with the Muon optimizer, which applies Newton-Schulz orthogonalization to each weight matrix's update so that steps stay well-conditioned. Embeddings, which are lookup tables rather than ordinary matrix multiplies, are trained with the more conventional AdamW. The split is deliberate: Muon for the bulk of the network, AdamW for embeddings. s5
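Newton-Schulz orthogonalization can be illustrated with the classic cubic iteration, which pushes every singular value of a matrix toward one without computing an SVD. This is a sketch of the underlying idea only. Practical Muon implementations typically use a tuned higher-order polynomial and far fewer steps, and the exact variant used for V4 is not given in this article. The random `update` matrix is a stand-in for a real momentum or gradient matrix.

```python
import numpy as np

def newton_schulz_orthogonalize(g, n_iters=30):
    """Push all singular values of g toward 1 using the cubic Newton-Schulz step."""
    x = g / np.linalg.norm(g)          # Frobenius norm, so singular values are <= 1
    for _ in range(n_iters):
        x = 1.5 * x - 0.5 * x @ (x.T @ x)
    return x

rng = np.random.default_rng(0)
update = rng.normal(size=(16, 8))      # stand-in for a momentum/gradient matrix
ortho = newton_schulz_orthogonalize(update)

print("singular values before:", np.round(np.linalg.svd(update, compute_uv=False), 2))
print("singular values after: ", np.round(np.linalg.svd(ortho, compute_uv=False), 4))
```

After the iteration the columns of `ortho` are orthonormal, so the update moves every direction of the weight matrix by a comparable amount instead of being dominated by a few large singular values.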

FP4 Quantization-Aware Training

V4 applies FP4 Quantization-Aware Training (QAT) to its MoE expert weights and to the CSA indexer, with FP8 used elsewhere. Training the experts with 4-bit quantization in the loop, rather than quantizing only after training, keeps the very large expert pool affordable to store and serve while limiting the accuracy loss that naive post-training quantization would cause. s5
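The forward half of quantization-aware training is a "fake quantization" step that rounds weights to the low-precision grid while keeping them in floating point. The sketch below assumes the E2M1 flavor of FP4 and one scale per block of 32 values. Both are illustrative choices, and the article does not state which FP4 format or scaling scheme V4 uses.

```python
import numpy as np

# Magnitudes representable in the E2M1 FP4 format (an assumed format choice).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quant_fp4(w, block=32):
    """Round weights to a scaled FP4 grid, one scale per block of `block` values."""
    flat = w.reshape(-1, block)
    scale = np.abs(flat).max(axis=1, keepdims=True) / FP4_GRID[-1]
    scale = np.where(scale == 0, 1.0, scale)             # avoid divide-by-zero
    scaled = np.abs(flat) / scale                        # now within [0, 6]
    nearest = np.abs(scaled[..., None] - FP4_GRID).argmin(axis=-1)
    return (np.sign(flat) * FP4_GRID[nearest] * scale).reshape(w.shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 64))             # toy "expert" weight matrix
w_q = fake_quant_fp4(w)

# In QAT the forward pass uses w_q while gradients update the full-precision w
# (a straight-through estimator); that backward step is not shown here.
print("mean abs rounding error:", np.abs(w - w_q).mean())
```

Because the model sees the rounded weights during training, it can adapt to the rounding error instead of meeting it for the first time at deployment.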

Efficiency at 1M Context

The point of the hybrid attention stack is measurable: at a 1 million token context, V4 uses a small fraction of the compute and memory that DeepSeek's own V3.2 would need for the same window. DeepSeek reports the following comparisons against V3.2. s1

V4 Efficiency vs V3.2 at 1M Context

Lower is better. Figures reported by DeepSeek, relative to V3.2 at the same context length. Source: s1

Metric (share of V3.2)	V4-Pro	V4-Flash
Single-token inference FLOPs	27%	10%
KV cache	10%	7%

Set against a conventional grouped-query attention baseline in bf16, the gap is even wider: DeepSeek reports that V4's KV cache is roughly 2 percent of the size a GQA-bf16 model would carry at the same context. That reduction is what turns a 1 million token window from a research demonstration into something practical to serve. s1
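To get a feel for what "roughly 2 percent" means in bytes, the sketch below sizes a grouped-query attention KV cache in bf16 and takes 2 percent of it. The baseline shape (61 layers, 8 KV heads, head dimension 128) is hypothetical, chosen only to make the arithmetic concrete. It is not the configuration DeepSeek used for its comparison.

```python
def gqa_kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """KV cache size for grouped-query attention: keys and values for every layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens

# Hypothetical baseline shape, chosen only to make the arithmetic concrete.
baseline = gqa_kv_cache_bytes(tokens=1_000_000, layers=61, kv_heads=8, head_dim=128)
v4_estimate = 0.02 * baseline            # "roughly 2 percent", as reported by DeepSeek

GB = 1e9
print(f"GQA bf16 baseline: {baseline / GB:.1f} GB")
print(f"~2% of baseline:   {v4_estimate / GB:.1f} GB")
```

Under these made-up dimensions, the baseline cache is about 250 GB for a single 1M-token sequence, while 2 percent of it is about 5 GB. That is the difference between needing several accelerators for the cache alone and fitting it alongside the weights.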

What Is Not Disclosed

A responsible reading of the V4 architecture means being explicit about its gaps. Two figures that readers often expect are not available, and the preview status itself is a caveat.

Disclosure Gaps and Caveats

PREVIEW

V4 Is a Preview, Not GA

DeepSeek released V4 as a preview on April 24, 2026, and notes the architecture may be refined. Treat parameter counts, layer counts, and efficiency figures as preview-stage values that could change before a final release.

UNDISCLOSED

Exact Expert Count Unknown

The precise number of MoE experts for V4-Pro and V4-Flash is not stated in the available sources. Active parameter counts are disclosed, but the full expert routing topology is not.

UNDISCLOSED

V4 Training Cost Not Released

DeepSeek has not disclosed V4's total training cost or compute. Only the prior V3 figure of roughly $5.576M across 2.788M H800 GPU-hours is public. Any V4 training-cost number would be an estimate, not a disclosed fact.

ANALYSIS

Some Mechanics Are Third-Party

CSA and HCA internal mechanics, including compression ratios and indexer behavior, are partly reconstructed by Hugging Face, NVIDIA, and independent analysts. Where this guide cites those details, it labels them as analysis rather than official specification.

Frequently Asked Questions

DeepSeek V4 Architecture FAQ

How do V4-Pro and V4-Flash differ?

V4-Pro has 1.6 trillion total parameters with 49 billion active per token across 61 layers, trained on 33 trillion-plus tokens. V4-Flash has 284 billion total parameters with 13 billion active per token across 43 layers, trained on 32 trillion-plus tokens. Both support a 1 million token native context and up to 384K output tokens. Pro begins with two HCA layers; Flash begins with two sliding-window layers. After the opening layers, both alternate CSA and HCA.

How does V4's hybrid attention work?

V4 interleaves two attention mechanisms. Compressed Sparse Attention (CSA) applies roughly 4x compression via softmax-gated pooling, then an FP4 lightning indexer selects the top-k most relevant blocks per query. Heavily Compressed Attention (HCA) applies roughly 128x compression and runs dense attention over the compressed blocks for a cheap global view. A 128-token sliding window preserves recent tokens, and learnable attention sinks stabilize long-sequence attention. The CSA and HCA mechanics are partly described by third-party analysis rather than only the official report.

What is mHC?

Manifold-Constrained Hyper-Connections (mHC) replace the standard residual stream that carries information between layers. mHC widens the inter-layer pathway by a factor of four and constrains the channel-mixing matrix to the Birkhoff polytope, keeping it doubly stochastic. A doubly stochastic matrix has a spectral norm of exactly one, which prevents signals from exploding or vanishing across a deep stack. This is what keeps a deep, widened network like V4-Pro stable to train.

How much did V4 cost to train?

DeepSeek has not disclosed V4's total training cost or compute in the available sources. The only public DeepSeek training-cost figure is for the prior V3 model, which DeepSeek reported at roughly $5.576 million across 2.788 million H800 GPU-hours. Any specific dollar figure for V4 training would be an estimate rather than a disclosed fact.

How efficient is V4 at a 1 million token context?

At a 1 million token context, DeepSeek reports that V4-Pro uses about 27 percent of the single-token inference FLOPs and 10 percent of the KV cache of V3.2, while V4-Flash uses about 10 percent of the FLOPs and 7 percent of the KV cache. Against a grouped-query attention baseline in bf16, DeepSeek reports V4's KV cache is roughly 2 percent of that baseline's size.

Is V4 a final release?

No. V4 was released on April 24, 2026, as a preview with open weights under the MIT License. DeepSeek indicates the architecture may be refined before a final release, so the specifications described here, including parameter counts and efficiency figures, should be read as preview-stage values that could change.
