MiniMax Goes Sparse: Decoding M3's Attention from a Single Diagram

On May 26, MiniMax R&D lead Skyler Miao posted a diagram on X — restrained palette, but a lot of information packed in. The title reads MiniMax Sparse Attention, and the two curves on the right give an eye-catching pair of numbers: 9.7× prefill and 15.6× decode speedup at 1M tokens.

The community has near-unanimously read this as the M3 teaser. But the significance reaches well beyond "yet another long-context model."

Back in October, MiniMax published a blog post titled Why Did M2 End Up as a Full Attention Model? The post was unusually direct: M2 didn't inherit M1's Lightning Attention because "efficient attention wasn't production-ready yet." Roughly seven months later, M3 surfaces, and the subtext is essentially one sentence: this time it is.

So what exactly does "this time it is" look like? This piece walks through the diagram, then compares it against the three lines DeepSeek has staked out — NSA, DSA, and CSA — to figure out which path MiniMax has chosen.

1. What the diagram actually shows: two stages, pick before you compute

The diagram is essentially an expanded view of a single attention block. The move worth paying attention to is that it splits "which KV to look at" and "how to compute attention" into two clearly separated steps.

Step 1: Index Branch — score everything cheaply

The top half is the index branch. It runs independently of the main path and has exactly one job: telling the sparse branch downstream which KV blocks to look at.

Each GQA group shares one index query (in the diagram, six query heads are paired with two Idx Q's, one per GQA group). The KV side of the index branch is deliberately reduced in dimension.

Note that K_idx has only one head: all heads share the same index key. As a result, Q_idx · K_idxᵀ costs very little to compute compared with the main attention.

Block Max Pool then compresses the token-level scores into block-level scores: each KV block is scored by the maximum token score inside it.

Finally, TopK decides which KV blocks to keep for this layer and this GQA group; the result is I₁, I₂.
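
The three stages above (score, pool, select) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not MiniMax's implementation: the function names, the toy index dimension, the block size of 2, and the omission of causal masking are all choices made here for readability, and plain lists stand in for tensors.

```python
def index_scores(q_idx, k_idx):
    """Token-level scores: one index query against the single shared index-key head."""
    return [sum(qi * ki for qi, ki in zip(q_idx, k)) for k in k_idx]

def block_max_pool(scores, block_size):
    """Compress token-level scores to one score per block by taking the max."""
    return [max(scores[i:i + block_size]) for i in range(0, len(scores), block_size)]

def top_k_blocks(block_scores, k):
    """Indices of the k highest-scoring blocks, returned in position order."""
    ranked = sorted(range(len(block_scores)), key=lambda b: block_scores[b], reverse=True)
    return sorted(ranked[:k])

# Toy setup (all values hypothetical): 8 tokens, index dim 2, block size 2, keep 2 blocks.
k_idx = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.0, 0.0],
         [2.0, 0.0], [0.0, 0.1], [0.0, 0.0], [1.5, 1.5]]
q_idx_group1 = [1.0, 0.0]  # index query shared by GQA group 1
q_idx_group2 = [0.0, 1.0]  # index query shared by GQA group 2

I1 = top_k_blocks(block_max_pool(index_scores(q_idx_group1, k_idx), 2), 2)
I2 = top_k_blocks(block_max_pool(index_scores(q_idx_group2, k_idx), 2), 2)
print(I1, I2)  # [2, 3] [0, 3]
```

The two groups pick different blocks from the same shared index keys, which is the point of giving each GQA group its own index query.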

Step 2: Sparse Branch — where attention actually runs

The bottom half is where the real attention computation happens. Q ∈ ℝⁿˣᴴˣᵈ, K, V ∈ ℝⁿˣʰˣᵈ, still in standard GQA form (H query heads, h shared KV heads). Using I₁, I₂ from Step 1 as indices, we pull the corresponding block subsets out of the original K/V and run standard softmax attention over just those blocks.

One key design choice: query heads within the same GQA group share a single top-k selection. In the diagram, Q1/Q2/Q3 all use I₁, and Q4/Q5/Q6 all use I₂. This is the hardware-aligned principle the NSA paper hammers on: one group of queries loads one set of KV blocks into SRAM in a single pass, so FlashAttention-style kernels can be reused with little modification.
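
A rough sketch of the sparse branch for one GQA group, again in plain Python. The helper names `gather_blocks` and `sparse_group_attention` are hypothetical, the 1/sqrt(d) scaling is the usual softmax-attention convention rather than something the diagram confirms, and causal masking is left out to keep the example short.

```python
import math

def gather_blocks(rows, block_ids, block_size):
    """Pull the rows belonging to the selected blocks out of the full K or V."""
    out = []
    for b in block_ids:
        out.extend(rows[b * block_size:(b + 1) * block_size])
    return out

def softmax_attention(q, keys, values):
    """Standard scaled dot-product attention for a single query vector."""
    scale = 1.0 / math.sqrt(len(q))
    logits = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - m) for x in logits]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]

def sparse_group_attention(group_queries, K, V, block_ids, block_size):
    """Every query head in the group attends over the same gathered K/V blocks."""
    k_sel = gather_blocks(K, block_ids, block_size)  # gathered once per group
    v_sel = gather_blocks(V, block_ids, block_size)
    return [softmax_attention(q, k_sel, v_sel) for q in group_queries]

# Toy setup (hypothetical values): 8 tokens, head dim 2, one KV head for the group.
K = [[float(i), 0.0] for i in range(8)]
V = [[float(i), float(-i)] for i in range(8)]
group_queries = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # stand-ins for Q1, Q2, Q3
outputs = sparse_group_attention(group_queries, K, V, block_ids=[2, 3], block_size=2)
```

Because the gather happens once per group rather than once per head, the selected K/V rows form a small dense tile, which is what lets an ordinary dense attention kernel run on them.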

2. Three deliberate subtractions relative to the DeepSeek family

The community immediately put this design side by side with DeepSeek's NSA / DSA / CSA. @eliebakouch's summary fits in one line: "GQA not MLA, block-level selection like CSA but attention is computed on the real K/V." Expanded into a table:

Dimension	DeepSeek V3.2 DSA	DeepSeek NSA	DeepSeek V4 CSA	MiniMax M3 (inferred)
KV substrate	MLA (latent)	GQA	MLA	GQA
Selection granularity	token-level	block-level	block-level	block-level
Parallel branches	1 (indexer + select)	3 (compress + select + sliding)	1	1 (select only)
Where attention runs	real K/V	three-way fusion	compressed KV	real K/V
Indexer mechanism	Lightning indexer	compression branch	block summaries	single-head K + Block Max Pool
Gating	none	learned gate	none	none

Three trade-offs come into focus:

First subtraction: GQA as the substrate, not MLA. This means vLLM, SGLang, and FlashAttention kernels can be reused with little to no modification — none of the engineering required to work around MLA's latent KV. For a lab aiming at "production-ready," it is the lowest-risk path.

Second subtraction: block-level selection, but attention computed on the real K/V. Unlike CSA, which runs attention on compressed KV, M3 keeps the full expressive power of softmax attention. The cost is that the KV cache doesn't shrink along with attention sparsification: sparsity reduces how much of the cache each query reads, not how much has to be stored. Trading that memory for quality is a sensible bargain.
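
To make the storage-versus-read distinction concrete, here is a back-of-the-envelope calculation. The configuration below (layer count, KV heads, head dimension, 2-byte elements, 64k selected tokens) is entirely hypothetical and is not M3's published configuration; the read estimate also ignores the index branch's own, smaller read of the index keys.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Resident K and V for a GQA model: two tensors per layer, one row per token."""
    return 2 * n_layers * n_tokens * n_kv_heads * head_dim * bytes_per_elem

def kv_bytes_read_per_decode_step(n_tokens, selected_tokens, **cfg):
    """Bytes of K/V the main attention touches when generating one new token."""
    return kv_cache_bytes(min(selected_tokens, n_tokens), **cfg)

# Hypothetical configuration, chosen only for illustration.
cfg = dict(n_layers=60, n_kv_heads=8, head_dim=128, bytes_per_elem=2)
n = 1_000_000

resident = kv_cache_bytes(n, **cfg)  # identical with or without sparsity
dense_read = kv_bytes_read_per_decode_step(n, n, **cfg)
sparse_read = kv_bytes_read_per_decode_step(n, 64_000, **cfg)

print(f"resident cache:        {resident / 1e9:.1f} GB")
print(f"read per step, dense:  {dense_read / 1e9:.1f} GB")
print(f"read per step, sparse: {sparse_read / 1e9:.1f} GB")
```

Under these assumptions the resident cache is the same size in both cases; only the per-step read shrinks.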

Third subtraction: NSA's other two branches are gone. NSA originally has three parallel paths (compress + select + sliding window) plus a learned gate. M3 keeps only selection. @teortaxesTex described it succinctly — streamlined, simplified NSA. In a sentence: engineering first.

Of the two branches that got cut, the sliding window is most likely replaced by RoPE + attention sink, or simply by dense attention as a per-layer fallback (Gemma 3 and Qwen3-Next both do this). The compression branch is absorbed into the minimal "single-head K + Block Max Pool."

3. How to read the numbers

Stage	Speedup @ 1M	What it means
Prefill	9.7×	Process 1M tokens of input in one pass
Decode	15.6×	Generate token by token

Decode speedup exceeding prefill is reasonable. During prefill, the index branch still has to score every query against the full sequence, so the saving comes only from the main attention. During decode, the index branch still scans its single-head keys, but the main attention for each new token reads only the selected KV blocks. Since decode is typically memory-bandwidth-bound, the pressure on the KV cache drops by roughly an order of magnitude.

Backing out the selection ratio: assume block size = 64, so 1M tokens corresponds to ~16k blocks. If the 15.6× decode speedup comes almost entirely from reading fewer KV blocks, each query touches only about 6–7% of the blocks, giving an effective receptive field around 60k–70k tokens. That ratio sits almost exactly on top of the sparsity rate the NSA paper reports (6–10%), which looks less like a coincidence than like the sweet spot of this kind of design at the 1M scale.
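
The arithmetic behind that estimate, written out. It rests on two assumptions: the block size of 64 used above, and the simplification that speedup is roughly the inverse of the fraction of KV read, which ignores the indexer and all non-attention costs. The function name is hypothetical.

```python
def implied_selection(n_tokens, block_size, speedup):
    """Back out blocks touched per query, assuming speedup ~= 1 / (fraction of KV read)."""
    n_blocks = n_tokens // block_size
    fraction = 1.0 / speedup
    return {
        "n_blocks": n_blocks,
        "fraction": fraction,
        "blocks_touched": round(n_blocks * fraction),
        "tokens_touched": round(n_tokens * fraction),
    }

est = implied_selection(n_tokens=1_000_000, block_size=64, speedup=15.6)
print(est["n_blocks"])            # 15625
print(f"{est['fraction']:.1%}")   # 6.4%
print(est["blocks_touched"])      # 1002
```

Since real overheads eat into the speedup, the true selected fraction would have to be somewhat below this figure to reach 15.6× end to end, so 6.4% is better read as an upper bound.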

4. Inferring the rest of M3

Extrapolating from this attention block to the full model:

The MoE backbone likely stays. M2 shipped as 230B total / ~10B active / Top-2 routing / hidden dim ~4096; M2.7 has already pushed expert count to 256. There is no reason for M3 to abandon this, so the most likely change is going deeper and wider.

The full attention stack gets replaced with block-sparse GQA. M1's Lightning Attention is unlikely to make a return. M3 is not betting on linear attention again; it is taking the "softmax expressiveness + top-k block selection" route, which makes the main attention cost grow roughly linearly with context for a fixed top-k budget (the cheap index scan is the only part that still covers every token) while preserving quality.

Most likely natively trained sparsity. This is the central message of the NSA paper — the sparse pattern must enter gradients during pretraining, or retrieval heads get scrambled. MiniMax has its own research line on retrieval heads, so they shouldn't fall into this trap.

The battleground is 1M+ context. M1 was trained at 1M and extrapolates to 4M at inference; M3 is locking that in and slashing inference cost — a very natural product cadence.

5. Placing M3 in the 2026 design space

Across 2025–2026, sparse-attention designs have diverged quickly:

DeepSeek V3.2 DSA: MLA + token-level top-k, very lightweight indexer, most stable quality but heavy kernel engineering
DeepSeek NSA: GQA, three branches + gate, highest quality ceiling but complex implementation
Qwen3-Next: layer-wise mix, dense / linear alternation, robust but relatively conservative
MiniMax M3: GQA + single-branch block selection, minimal, riding the hardware tailwind

The subtext of M3's design is unambiguous — "don't chase the theoretically optimal attention; chase the one that runs immediately, runs fast, and lets existing kernels be reused." It's of a piece with their decision to fall back to full attention in M2: stabilize quality with mainstream methods first, then replace cleanly once the technology is genuinely mature.

Closing thoughts

Plenty of detail can't be confirmed from a single diagram: whether the sparse pattern is layer-wise mixed, whether there is a dense fallback, whether the index branch shares embeddings with the main network, whether training-time top-k is hard or soft, how the index branch's loss is formulated… All of this has to wait for the official paper or the weight release.

But one thing is already settled: following DeepSeek, another Chinese lab has put together "sparse attention + long context + open weights" as a working stack. In the second half of 2026, 1M context in the open-source space is likely to shift from a selling point to a baseline — and that, on its own, matters more than any single benchmark.

References

Skyler Miao (MiniMax R&D lead), original tweet: Something BIG is coming
Community roundup: MiniMax details its M3 sparse attention architecture
MiniMax blog: Why Did M2 End Up as a Full Attention Model?
DeepSeek NSA paper: Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
DeepSeek V3.2 DSA write-up: Architectural Efficiency in LLMs: DeepSeek-V3.2-Exp and DSA
Sebastian Raschka: A Technical Tour of the DeepSeek Models from V3 to V3.2
MiniMax-01 tech report: Scaling Foundation Models with Lightning Attention