How attention actually works
Transformers are built from one core operation: attention. Forget the full architecture for a moment and zoom in on the calculation itself — how a model turns words into queries, keys, and values, scores them against each other, and decides which words to "look at." Then watch a query token attend to a sentence, live, on this page.
01 The problem attention was built to solve
Before attention, sequence models read a sentence one word at a time and squeezed everything they had read into a single fixed-length summary vector. For a long sentence, that bottleneck threw away detail — the model had to remember the whole input through one cramped representation. Bahdanau, Cho & Bengio (2014) introduced attention to fix exactly this: instead of relying on one summary, the model is allowed to look back at every input word and decide, for each step, which words matter most right now. They called the learned weighting a soft alignment between output and input.
The Transformer (Vaswani et al., 2017) took this idea to its logical extreme. Its paper title — "Attention Is All You Need" — says it plainly: it threw out recurrence and convolution entirely and built the whole model out of attention. To understand a Transformer, you really only need to understand this one operation well.
- The bottleneck: older models compressed a whole sentence into one fixed vector, losing detail on long inputs.
- The fix: attention lets the model look back at all words and weight them per step — a learned "soft alignment" (Bahdanau et al., 2014).
- The leap: the Transformer is built almost entirely from attention, dispensing with recurrence and convolution (Vaswani et al., 2017).
02 Queries, keys & values: the three roles of a word
Attention reframes each word as playing three roles at once. A useful mental model is a library lookup. Every word produces a query ("what am I looking for?"), a key ("what do I offer as a match?"), and a value ("what information do I carry?"). All three are just linear projections of the word's representation — the same vector pushed through three different learned weight matrices (Vaswani et al., 2017; Jurafsky & Martin, SLP3 ch. 9).
To compute attention for one word, you take its query and compare it against every word's key. A close match means "this word is relevant to me." Those match scores become weights, and the output is a weighted sum of the value vectors — you pull in more of the values from words you matched strongly, and less from the rest. That weighted blend, not any single word, is what attention produces.
- Query (Q): what this word is looking for in the others.
- Key (K): what each word advertises, to be matched against queries.
- Value (V): the actual information a word contributes once it is selected.
- Q, K, and V are three learned linear projections of the same input — and the output is a weighted sum of values, weighted by query-key match.
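The three projections can be sketched in a few lines. This is a minimal illustration, not a trained model: the 4-dimensional word vector `x` and the weight matrices `W_q`, `W_k`, `W_v` are made-up numbers, and the helper `matvec` is a hypothetical name. Plain Python lists are used to keep the sketch dependency-free; the matrices here multiply a column vector, whereas papers often write the same thing as a row vector times a matrix.

```python
def matvec(matrix, vector):
    """Multiply a matrix (stored as a list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

# Hypothetical 4-dimensional representation of one word.
x = [1.0, 0.0, 2.0, -1.0]

# Hypothetical weights; a real model learns these during training.
W_q = [[1.0, 0.0, 0.0, 0.0],
       [0.0, 1.0, 0.0, 0.0]]
W_k = [[0.0, 0.0, 1.0, 0.0],
       [0.0, 0.0, 0.0, 1.0]]
W_v = [[0.5, 0.5, 0.0, 0.0],
       [0.0, 0.0, 0.5, 0.5]]

# The same input vector, pushed through three different matrices.
query = matvec(W_q, x)   # [1.0, 0.0]
key = matvec(W_k, x)     # [2.0, -1.0]
value = matvec(W_v, x)   # [0.5, 0.5]
```

The point to notice is that `query`, `key`, and `value` all come from the same `x`; only the matrix differs.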
03 Scaled dot-product attention, step by step
Here is the whole operation in one line, exactly as the Transformer defines it (Vaswani et al., 2017):
Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V
Read it from the inside out. QKᵀ takes the dot product of each query with every key, giving a raw compatibility score: bigger means more aligned. Those scores are then divided by √dₖ (the square root of the key dimension). This scaling matters: with large key dimensions the dot products grow large in magnitude, which pushes the softmax into regions where its gradients are tiny; dividing by √dₖ keeps them in a healthy range. softmax turns the scaled scores into a set of positive weights that sum to one for each query: the attention distribution. Finally, multiplying by V produces the weighted sum of value vectors.
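The effect of the scaling step is easy to see numerically. If the components of a query and a key are independent with mean 0 and variance 1, their dot product has variance dₖ, so raw scores spread out as dₖ grows. The sketch below uses made-up raw scores for a key dimension of 64 (both are assumptions chosen for illustration) and compares the softmax with and without the division by √dₖ.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 64
# Hypothetical raw dot products of one query against three keys.
raw_scores = [16.0, 8.0, 0.0]
scaled_scores = [s / math.sqrt(d_k) for s in raw_scores]  # [2.0, 1.0, 0.0]

unscaled_weights = softmax(raw_scores)    # almost all weight on the first key
scaled_weights = softmax(scaled_scores)   # weight is spread across the keys

print([round(w, 4) for w in unscaled_weights])
print([round(w, 4) for w in scaled_weights])
```

Without scaling, the first key takes more than 99.9% of the weight, which is the saturated regime where softmax gradients become tiny. With scaling, the same ranking is preserved but the distribution stays soft.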
Earlier work explored other ways to score query-key compatibility — Bahdanau et al. (2014) used an additive (concatenation) score, and Luong et al. (2015) compared multiplicative (dot-product) scoring plus global vs. local attention. The Transformer settled on scaled dot-product attention because it is fast and parallelizable as plain matrix multiplication.
- Score: dot product of each query with every key (QKᵀ).
- Scale: divide by √dₖ to keep softmax gradients stable at large dimensions.
- Normalize: softmax turns scores into weights summing to 1.
- Blend: multiply weights by the value vectors to get the output.
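The four steps above can be sketched as a small function. This is a plain-Python illustration on lists of vectors, not an optimized implementation (real code would use batched matrix multiplication); the function names and the toy Q, K, V values are assumptions made for the example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention on lists of vectors.

    Returns (outputs, weights); weights[i][j] is how much query i attends to key j.
    """
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score and scale: dot product with every key, divided by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        # Normalize: weights are positive and sum to 1.
        weights = softmax(scores)
        # Blend: weighted sum of the value vectors.
        blended = [sum(w * v[c] for w, v in zip(weights, values))
                   for c in range(len(values[0]))]
        outputs.append(blended)
        all_weights.append(weights)
    return outputs, all_weights

# Toy inputs: one query that matches the first and third keys equally.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
out, w = attention(Q, K, V)
```

Here the first and third keys receive equal weight and the second receives less, so the output is pulled mostly toward the values of the matching keys.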
04 Self-attention, masking & why position has to be added
Self-attention is the case where Q, K, and V all come from the same sequence: every position attends to every position in that sequence (itself included), letting a word gather context from its neighbours and beyond. Cross-attention instead lets one sequence attend to another, e.g. a decoder attending to an encoder's output during translation (Vaswani et al., 2017; Hugging Face docs). When a model generates text left to right, it uses masked (causal) attention so a position can only attend to itself and earlier positions, never to words it has not produced yet.
One subtle but important consequence: attention has no built-in notion of word order. Self-attention is permutation-equivariant: shuffle the input tokens and each token still gets the same output, just at its new position. (For a single query, the result is invariant to the order of the keys and values.) So position must be added explicitly. The original Transformer used fixed sinusoidal positional encodings; later work added relative position representations (Shaw et al., 2018) and rotary embeddings / RoPE (Su et al., 2021), which encode relative position by rotating the query and key vectors.
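The order-blindness described above can be checked directly. The sketch below is a simplification: it uses identity projections (Q = K = V = the inputs) and made-up token vectors, and `self_attention` is a hypothetical helper name. It shuffles the tokens and confirms that each token ends up with the same output vector, only at a different index.

```python
import math

def self_attention(xs):
    """Self-attention with identity projections: Q = K = V = xs (a simplification)."""
    d_k = len(xs[0])
    outputs = []
    for q in xs:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in xs]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([sum(w * v[c] for w, v in zip(weights, xs)) for c in range(d_k)])
    return outputs

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
order = [2, 0, 1]                       # new position i holds old token order[i]
shuffled = [tokens[i] for i in order]

original_out = self_attention(tokens)
shuffled_out = self_attention(shuffled)

# Each token keeps its output vector; only its position in the list moves.
same = all(
    math.isclose(a, b, abs_tol=1e-12)
    for i, src in enumerate(order)
    for a, b in zip(shuffled_out[i], original_out[src])
)
print(same)
```

Because nothing in the computation depends on where a token sits, any sense of order has to come from positional information mixed into the inputs or the scores.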
- Self vs cross: self-attention stays within one sequence; cross-attention bridges two (e.g. decoder → encoder).
- Causal masking: in generation, a token may only attend to itself and earlier tokens.
- Position is bolted on: attention ignores order, so positional information is added — sinusoidal, relative, or rotary (RoPE).
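Causal masking is commonly implemented by setting the scores of future positions to negative infinity before the softmax, so their weights come out as exactly zero. The sketch below assumes that approach and assumes a position may attend to itself as well as to earlier positions; the function name and the all-zero toy scores are illustrative only.

```python
import math

def causal_attention_weights(scores):
    """Apply a causal mask to a square score matrix, then softmax each row.

    scores[i][j] is the scaled score of query position i against key position j.
    Entries with j > i (future positions) are masked to -inf.
    """
    n = len(scores)
    weights = []
    for i in range(n):
        masked = [scores[i][j] if j <= i else -math.inf for j in range(n)]
        peak = max(masked)
        exps = [math.exp(s - peak) for s in masked]  # exp(-inf) is 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# With equal scores everywhere, each row spreads its weight evenly
# over the positions it is allowed to see.
scores = [[0.0] * 3 for _ in range(3)]
w = causal_attention_weights(scores)
# Row 0 sees 1 position, row 1 sees 2, row 2 sees all 3.
```

The resulting weight matrix is lower-triangular: everything above the diagonal is zero.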
05 Multi-head attention, and a live look inside
A single attention head produces just one set of weights per query, so it has to average over every kind of relationship at once. Multi-head attention runs several attention operations in parallel, each on its own lower-dimensional projection of Q, K, and V, then concatenates and re-projects the results (Vaswani et al., 2017). The point is specialization: different heads can learn to attend to different kinds of relations. One might track which verb a subject belongs to, another which noun a pronoun refers to. No head is told what to look for; the roles emerge from training.
The interactive below makes this concrete. Pick a query word in the sentence and the chart shows the attention weights that word places on every other word — computed live from toy query and key vectors with the real scaled-dot-product-then-softmax pipeline. Switch heads to see how different heads concentrate on different words. The numbers here are illustrative, designed to show the shape of attention; they are not from a trained model.
- Many heads, one layer: several attention operations run in parallel on separate projections, then are concatenated and re-projected.
- Specialization emerges: different heads learn to attend to different relations — this is learned, not hand-designed.
- Each head still uses the same softmax(QKᵀ/√dₖ)V math; multi-head just runs it several times in different subspaces.
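The run-in-parallel-then-concatenate structure can be sketched as follows. Two loud simplifications: a real Transformer gives each head its own learned projection matrices and applies a learned output projection after concatenation, whereas this sketch simply slices each vector into equal chunks (one chunk per head) and skips the output projection. The function names and toy vectors are hypothetical.

```python
import math

def attend(queries, keys, values):
    """Scaled dot-product attention on lists of vectors; returns the outputs."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        outputs.append([sum(e * v[c] for e, v in zip(exps, values)) / total
                        for c in range(len(values[0]))])
    return outputs

def split_heads(vectors, num_heads):
    """Slice each vector into num_heads equal chunks; one list of vectors per head.

    Stand-in for the learned per-head projections of a real model.
    """
    head_dim = len(vectors[0]) // num_heads
    return [[v[h * head_dim:(h + 1) * head_dim] for v in vectors]
            for h in range(num_heads)]

def multi_head_attention(queries, keys, values, num_heads):
    """Run attention independently per head, then concatenate the head outputs."""
    per_head = [
        attend(q, k, v)
        for q, k, v in zip(split_heads(queries, num_heads),
                           split_heads(keys, num_heads),
                           split_heads(values, num_heads))
    ]
    # Concatenate back to full width, position by position (no output projection here).
    return [sum((head[i] for head in per_head), []) for i in range(len(queries))]

x = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
two_heads = multi_head_attention(x, x, x, num_heads=2)
one_head = multi_head_attention(x, x, x, num_heads=1)
```

With `num_heads=1` this reduces to ordinary attention; with more heads, each chunk gets its own attention weights, which is what lets heads focus on different relations.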
06 The catch: attention scales quadratically
Because every token attends to every other token, standard self-attention's time and memory cost grows quadratically with sequence length — double the context and you roughly quadruple the work. This is the main bottleneck for long inputs (Dao et al., 2022; Beltagy et al., 2020), and a great deal of research targets it. The variants split into two camps that you should not conflate:
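A quick back-of-the-envelope calculation shows the doubling-to-quadrupling effect. The sketch counts the query-key scores for a single head in a single layer and assumes, purely for illustration, that each score is stored as a 16-bit float and that the full matrix is materialized.

```python
def score_matrix_entries(seq_len):
    """Number of query-key scores for one head in one layer of full self-attention."""
    return seq_len * seq_len

for n in (1_000, 2_000, 4_000):
    entries = score_matrix_entries(n)
    # Assumes 2 bytes per score (16-bit floats) and a fully stored matrix.
    megabytes = entries * 2 / 1e6
    print(f"{n:>5} tokens -> {entries:>12,} scores, about {megabytes:,.0f} MB")
```

Each doubling of the sequence length multiplies the number of scores by four, and the figure is per head and per layer.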
Exact, just faster — FlashAttention
FlashAttention is an IO-aware exact attention algorithm: it computes the same result but uses tiling to cut the number of memory reads and writes between the GPU's high-bandwidth memory and its on-chip SRAM. FlashAttention-2 improves parallelism and work partitioning for further speedups (Dao et al., 2022; Dao, 2023). Crucially, the output is mathematically the same — only the memory access pattern changes.
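The reason tiling can stay exact is that softmax can be computed in a streaming fashion, keeping a running maximum and a running sum and rescaling earlier partial results as new blocks arrive. The sketch below shows only that idea for a single query in plain Python. It is not the FlashAttention kernel and says nothing about GPU memory; the function names, block size, and toy data are assumptions for illustration.

```python
import math

def attention_full(q, keys, values):
    """Reference: one query against all keys at once."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [sum(e * v[c] for e, v in zip(exps, values)) / total
            for c in range(len(values[0]))]

def attention_tiled(q, keys, values, block_size):
    """Same result, visiting keys and values one block at a time."""
    scale = math.sqrt(len(q))
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(values[0])  # unnormalized weighted sum of values so far
    for start in range(0, len(keys), block_size):
        block_k = keys[start:start + block_size]
        block_v = values[start:start + block_size]
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in block_k]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new reference maximum.
        correction = math.exp(running_max - new_max)
        exps = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * correction + sum(exps)
        acc = [a * correction + sum(e * v[c] for e, v in zip(exps, block_v))
               for c, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]

q = [1.0, 2.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [0.0, 3.0]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0], [3.0, 0.0]]
full = attention_full(q, keys, values)
tiled = attention_tiled(q, keys, values, block_size=2)
```

The two results agree up to floating-point rounding, even though the tiled version never holds all the scores at once.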
Share key/value heads — MQA & GQA
During generation, the main cost is moving the cached keys and values for each head. Multi-query attention (MQA) shares a single key/value head across all query heads to cut that bandwidth (Shazeer, 2019); grouped-query attention (GQA) uses an intermediate number of key/value groups, trading off between full multi-head quality and MQA speed (Ainslie et al., 2023).
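The saving is easiest to see as arithmetic on the size of the key/value cache. The model shape below (32 layers, 32 query heads, head dimension 128, 4,096 cached tokens, 16-bit values) is hypothetical and chosen only to make the numbers round; `kv_cache_bytes` is an illustrative helper, not a library function.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence.

    The leading factor 2 counts keys plus values; bytes_per_value=2 assumes 16-bit floats.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical model shape, for illustration only.
layers, query_heads, head_dim, seq_len = 32, 32, 128, 4096

mha = kv_cache_bytes(layers, query_heads, head_dim, seq_len)  # one KV head per query head
gqa = kv_cache_bytes(layers, 8, head_dim, seq_len)            # 8 KV groups
mqa = kv_cache_bytes(layers, 1, head_dim, seq_len)            # a single shared KV head

for name, size in (("MHA", mha), ("GQA", gqa), ("MQA", mqa)):
    print(f"{name}: {size / 2**20:,.0f} MiB")
```

Under these assumptions the cache shrinks by the ratio of query heads to key/value heads: 4x for GQA with 8 groups and 32x for MQA.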
Approximate the attention matrix
These methods change the math to dodge the quadratic cost, accepting some approximation. For sequence length n: Longformer uses local windowed plus global sparse attention (O(n)); Reformer uses locality-sensitive hashing (O(n log n)) and reversible layers; Linformer projects keys and values to a low rank (O(n)); Performer approximates softmax attention with FAVOR+ random features (O(n)). Unlike FlashAttention, these are approximate and may trade model quality for speed.
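To see why a local window gives linear scaling, it helps to count how many query-key pairs are actually scored. The sketch below models only a symmetric sliding window (each token sees neighbours within a fixed distance); it leaves out Longformer's global tokens and is not any library's implementation. The function names and the window size are assumptions.

```python
def attended_pairs_full(n):
    """Query-key pairs scored by full self-attention."""
    return n * n

def attended_pairs_windowed(n, half_width):
    """Pairs scored when each token sees only neighbours within half_width positions."""
    return sum(min(n - 1, i + half_width) - max(0, i - half_width) + 1
               for i in range(n))

for n in (1_000, 2_000, 4_000):
    full = attended_pairs_full(n)
    local = attended_pairs_windowed(n, half_width=64)
    print(f"{n:>5} tokens: full {full:>12,} pairs, windowed {local:>9,} pairs")
```

Doubling the sequence length roughly doubles the windowed count, while the full count quadruples. The price is that tokens outside the window are not attended to directly.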
08 Take it with you & go deeper
Concept map
The whole lesson at a glance — expand a branch to see the grounded points underneath it.
The problem attention solves
- Older models compressed a whole sentence into one fixed-length vector, losing detail on long inputs.
- Attention lets the model look back at all words and weight them per step — a learned "soft alignment" (Bahdanau et al., 2014).
- The Transformer (Vaswani et al., 2017) built the whole model from attention, dropping recurrence and convolution.
Queries, keys & values
- Query: what a word is looking for. Key: what each word advertises. Value: the information it contributes.
- Q, K, and V are three learned linear projections of the same input representation.
- The output is a weighted sum of value vectors, weighted by query-key match.
Scaled dot-product attention
- Score: dot product of each query with every key (QKᵀ).
- Scale: divide by √dₖ to keep softmax gradients stable at large key dimensions.
- Normalize then blend: softmax turns scores into weights summing to 1, then multiply by V.
Self-attention, masking & position
- Self vs cross: self-attention stays within one sequence; cross-attention bridges two (decoder → encoder).
- Causal masking: in generation a token may only attend to itself and earlier tokens.
- Self-attention is permutation-equivariant (it has no built-in notion of order), so position is added explicitly: sinusoidal, relative, or rotary (RoPE).
Multi-head attention
- Several attention operations run in parallel on separate projections, then are concatenated and re-projected.
- Different heads learn to attend to different relations — this specialization is learned, not hand-designed.
- Each head still uses the same softmax(QKᵀ/√dₖ)V math, just in a different subspace.
The catch: quadratic scaling
- Cost grows quadratically with sequence length — the main bottleneck for long inputs.
- Exact, just faster: FlashAttention / FlashAttention-2 reorganize memory access for identical math.
- Fewer KV heads: MQA (Shazeer 2019) and GQA (Ainslie et al. 2023) cut decode-time memory traffic.
- Approximate: Longformer, Reformer, Linformer, Performer trade some quality for sub-quadratic cost.
Published by Tech Jacks Solutions · Reviewed June 2026. This lesson explains established concepts and is grounded in the references below; the numbers shown in the interactive are illustrative and labelled as such.
- Attention Is All You Need — Vaswani et al. (2017)
- Neural Machine Translation by Jointly Learning to Align and Translate — Bahdanau, Cho & Bengio (2014)
- Effective Approaches to Attention-based Neural Machine Translation — Luong, Pham & Manning (2015)
- Speech and Language Processing (3rd ed.), Ch. 9 — The Transformer — Jurafsky & Martin
- FlashAttention: Fast and Memory-Efficient Exact Attention — Dao et al. (2022)
- FlashAttention-2: Faster Attention with Better Parallelism — Dao (2023)
- Fast Transformer Decoding: One Write-Head is All You Need (MQA) — Shazeer (2019)
- GQA: Training Generalized Multi-Query Transformer Models — Ainslie et al. (2023)
- Self-Attention with Relative Position Representations — Shaw, Uszkoreit & Vaswani (2018)
- RoFormer: Enhanced Transformer with Rotary Position Embedding (RoPE) — Su et al. (2021)
- The Annotated Transformer — Harvard NLP
- The Illustrated Transformer — Jay Alammar
This is an educational explainer about how the attention mechanism works. The query/key/value vectors and attention weights in the interactive are illustrative toy numbers chosen to show the shape of attention — they are not from a trained model, and real attention patterns depend on the model, data, and task.
The attention mechanism — in one page
Tech Jacks Solutions · AI Knowledge Hub · educational summary
Why attention exists
Earlier models squeezed a whole sentence into one fixed vector, losing detail. Attention (Bahdanau et al., 2014) lets the model look back at every input word and weight them per step. The Transformer (Vaswani et al., 2017) built the whole model from attention, dropping recurrence and convolution.
Queries, keys & values
Each word becomes three learned linear projections: a query (what I want), a key (what I offer), and a value (what I carry). The output is a weighted sum of values, weighted by how well each query matches each key.
Scaled dot-product attention
Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V. Score with the dot product, scale by √dₖ to keep softmax gradients stable, softmax to weights summing to 1, then blend the values.
Self-attention, masking & position
Self-attention: Q/K/V from one sequence; cross-attention bridges two. Causal masking limits a token to itself and earlier positions. Self-attention is permutation-equivariant (it has no built-in notion of order), so position is added explicitly (sinusoidal, relative, or rotary/RoPE).
Multi-head & scaling
Multi-head runs several attention heads in parallel on projected subspaces, then concatenates and re-projects; heads can specialize on different relations. Standard attention costs grow quadratically with length; FlashAttention is exact-but-faster, MQA/GQA share key/value heads, and Longformer/Reformer/Linformer/Performer are approximate variants.