Language lesson

Track 02 · Language Advanced ~9 min

How attention actually works

Transformers are built from one core operation: attention. Forget the full architecture for a moment and zoom in on the calculation itself — how a model turns words into queries, keys, and values, scores them against each other, and decides which words to "look at." Then watch a query token attend to a sentence, live, on this page.

01 The problem attention was built to solve

Before attention, sequence models read a sentence one word at a time and squeezed everything they had read into a single fixed-length summary vector. For a long sentence, that bottleneck threw away detail — the model had to remember the whole input through one cramped representation. Bahdanau, Cho & Bengio (2014) introduced attention to fix exactly this: instead of relying on one summary, the model is allowed to look back at every input word and decide, for each step, which words matter most right now. They called the learned weighting a soft alignment between output and input.

The Transformer (Vaswani et al., 2017) took this idea to its logical extreme. Its paper title, "Attention Is All You Need", says it plainly: it threw out recurrence and convolution entirely and built the model around attention, with only simple position-wise feed-forward layers in between. To understand a Transformer, the one operation you most need to understand well is attention.

The bottleneck: older models compressed a whole sentence into one fixed vector, losing detail on long inputs.
The fix: attention lets the model look back at all words and weight them per step — a learned "soft alignment" (Bahdanau et al., 2014).
The leap: the Transformer is built almost entirely from attention, dispensing with recurrence and convolution (Vaswani et al., 2017).

02 Queries, keys & values: the three roles of a word

Attention reframes each word as playing three roles at once. A useful mental model is a library lookup. Every word produces a query ("what am I looking for?"), a key ("what do I offer as a match?"), and a value ("what information do I carry?"). All three are just linear projections of the word's representation — the same vector pushed through three different learned weight matrices (Vaswani et al., 2017; Jurafsky & Martin, SLP3 ch. 9).

To compute attention for one word, you take its query and compare it against every word's key. A close match means "this word is relevant to me." Those match scores become weights, and the output is a weighted sum of the value vectors — you pull in more of the values from words you matched strongly, and less from the rest. That weighted blend, not any single word, is what attention produces.

Query (Q): what this word is looking for in the others.
Key (K): what each word advertises, to be matched against queries.
Value (V): the actual information a word contributes once it is selected.
Q, K, and V are three learned linear projections of the same input — and the output is a weighted sum of values, weighted by query-key match.
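The three roles can be sketched in a few lines of plain Python. This is a minimal illustration, not code from any library: the embedding `x` and the matrices `W_q`, `W_k`, `W_v` are hand-picked toy values (in a real model they are learned), and the helper `matvec` is a hypothetical name introduced here. It uses the column-vector convention (matrix times vector); the Transformer paper writes the same projection with row vectors.

```python
def matvec(matrix, vector):
    """Multiply a matrix (a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

# Hypothetical 2-dimensional representation of one word.
x = [1.0, 2.0]

# Three separate weight matrices. Learned in a real model; toy values here.
W_q = [[1.0, 0.0], [0.0, 1.0]]
W_k = [[0.0, 1.0], [1.0, 0.0]]
W_v = [[2.0, 0.0], [0.0, 3.0]]

query = matvec(W_q, x)  # what this word is looking for
key = matvec(W_k, x)    # what this word advertises
value = matvec(W_v, x)  # what this word contributes once selected

print(query, key, value)  # [1.0, 2.0] [2.0, 1.0] [2.0, 6.0]
```

The same input vector yields three different vectors only because it passes through three different matrices.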

03 Scaled dot-product attention, step by step

Here is the whole operation in one line, exactly as the Transformer defines it (Vaswani et al., 2017):

Attention(Q, K, V) = softmax(Q Kᵀ / √dₖ) V

scores = how well each query matches each key · softmax turns them into weights that sum to 1 · the weights blend the values

Read it left to right. Q Kᵀ takes the dot product of each query with every key, giving a raw compatibility score: bigger means more aligned. Those scores are then divided by √dₖ (the square root of the key dimension). This scaling matters: with large key dimensions the dot products grow large in magnitude, which pushes the softmax into regions where its gradients are tiny; dividing by √dₖ keeps them in a healthy range. softmax then turns each query's scaled scores into a set of positive weights that sum to one, the attention distribution. Finally, multiplying by V produces the weighted sum of value vectors.
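To see why the scaling step matters, here is a small sketch with made-up numbers. The query and the three keys are toy vectors chosen so the raw dot products come out to 64, 48, and 32 at dₖ = 64; nothing here comes from a trained model.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

d_k = 64
query = [1.0] * d_k
keys = [[1.0] * d_k, [0.75] * d_k, [0.5] * d_k]  # toy keys

raw_scores = [dot(query, k) for k in keys]                # 64, 48, 32
scaled_scores = [s / math.sqrt(d_k) for s in raw_scores]  # 8, 6, 4

raw_weights = softmax(raw_scores)        # almost all weight on the first key
scaled_weights = softmax(scaled_scores)  # noticeably softer distribution

print([round(w, 4) for w in raw_weights])
print([round(w, 4) for w in scaled_weights])
```

Without scaling, the softmax is nearly one-hot, which is the saturated regime where gradients become very small. After dividing by √dₖ the other keys keep a visible share of the weight.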

Earlier work explored other ways to score query-key compatibility — Bahdanau et al. (2014) used an additive (concatenation) score, and Luong et al. (2015) compared multiplicative (dot-product) scoring plus global vs. local attention. The Transformer settled on scaled dot-product attention because it is fast and parallelizable as plain matrix multiplication.

Score: dot-product of each query with every key (Q Kᵀ).
Scale: divide by √dₖ to keep softmax gradients stable at large dimensions.
Normalize: softmax turns scores into weights summing to 1.
Blend: multiply weights by the value vectors to get the output.
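The four steps above can be sketched as one short function. This is a plain-Python illustration that favors readability over speed (real implementations use batched matrix multiplication); the function name `attention` and the toy inputs are assumptions made for this example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Score and scale: dot product with every key, divided by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        # Normalize: weights are positive and sum to 1.
        weights = softmax(scores)
        # Blend: weighted sum of the value vectors.
        blended = [sum(w * v[j] for w, v in zip(weights, values))
                   for j in range(len(values[0]))]
        outputs.append(blended)
    return outputs

# Toy example: one query that matches the first key better than the second.
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
output = attention([[1.0, 0.0]], K, V)
print(output)  # more of the first value than the second
```

A query that matches every key equally (for example the zero vector) gets uniform weights, so its output is just the average of the values.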

04 Self-attention, masking & why position has to be added

Self-attention is the case where Q, K, and V all come from the same sequence: every position attends to every position in that sequence, letting a word gather context from its neighbours and beyond. Cross-attention instead lets one sequence attend to another, e.g. a decoder attending to an encoder's output during translation (Vaswani et al., 2017; Hugging Face docs). When a model generates text left to right, it uses masked (causal) attention so a position can attend only to itself and earlier positions, never to words it has not produced yet.
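Causal masking is commonly implemented by setting the scores of future positions to negative infinity before the softmax, so their weights become exactly zero. The sketch below assumes that approach; the function name `causal_attention_weights` and the toy token vectors are illustrative, and the tokens are used directly as both queries and keys.

```python
import math

def softmax(scores):
    """Numerically stable softmax; -inf scores end up with weight 0."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention_weights(queries, keys):
    """Attention weights where position i sees only positions 0..i."""
    d_k = len(keys[0])
    rows = []
    for i, q in enumerate(queries):
        scores = []
        for j, k in enumerate(keys):
            if j > i:
                scores.append(float("-inf"))  # future position: masked out
            else:
                raw = sum(qi * ki for qi, ki in zip(q, k))
                scores.append(raw / math.sqrt(d_k))
        rows.append(softmax(scores))
    return rows

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy vectors
weights = causal_attention_weights(tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

The printed matrix is lower-triangular: the first token can only attend to itself, and the last token can attend to everything.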

One subtle but important consequence: without positional information, self-attention is permutation-equivariant (often loosely called permutation-invariant). The operation itself has no notion of word order: shuffle the input tokens and each token's output is unchanged, merely reordered along with the inputs. So position must be added explicitly. The original Transformer used fixed sinusoidal positional encodings; later work added relative position representations (Shaw et al., 2018) and rotary embeddings / RoPE (Su et al., 2021), which encode relative position by rotating the query and key vectors.
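The order-blindness is easy to check numerically. In this sketch the tokens serve directly as queries, keys, and values (no learned projections, which is an assumption made to keep the example short), and the input is reversed to stand in for an arbitrary shuffle.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Unmasked self-attention with the tokens used as Q, K, and V."""
    d = len(tokens[0])
    outputs = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in tokens]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, tokens))
                        for j in range(d)])
    return outputs

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy vectors
original = self_attention(tokens)
shuffled = self_attention(tokens[::-1])  # same tokens, reversed order

# Each token gets the same output vector; only the row order changes.
same_up_to_order = all(
    math.isclose(a, b, abs_tol=1e-12)
    for row_a, row_b in zip(original[::-1], shuffled)
    for a, b in zip(row_a, row_b)
)
print(same_up_to_order)  # True
```

Since no output vector changes when the order does, nothing in the result can tell "dog bites man" from "man bites dog" unless position is injected separately.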

Self vs cross: self-attention stays within one sequence; cross-attention bridges two (e.g. decoder → encoder).
Causal masking: in generation, a token may only attend to itself and earlier tokens.
Position is bolted on: attention ignores order, so positional information is added — sinusoidal, relative, or rotary (RoPE).
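As one concrete case, the fixed sinusoidal encoding from the original Transformer paper pairs a sine and a cosine at each frequency: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). The sketch below follows that formula; the function names and the toy embedding are illustrative assumptions.

```python
import math

def sinusoidal_encoding(position, d_model):
    """Sinusoidal positional encoding for one position."""
    encoding = []
    for even_index in range(0, d_model, 2):
        # even_index plays the role of 2i in the formula.
        angle = position / (10000 ** (even_index / d_model))
        encoding.append(math.sin(angle))  # dimension 2i
        encoding.append(math.cos(angle))  # dimension 2i + 1
    return encoding[:d_model]  # trims the extra entry if d_model is odd

def add_position(embedding, position):
    """Inject order by adding the encoding to a token embedding."""
    encoding = sinusoidal_encoding(position, len(embedding))
    return [e + p for e, p in zip(embedding, encoding)]

same_word = [0.5, 0.5, 0.5, 0.5]  # toy embedding
print(add_position(same_word, 0))
print(add_position(same_word, 1))  # same word, different position, different vector
```

Relative and rotary schemes pursue the same goal by different means, modifying the query-key interaction rather than adding a vector to the input.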

05 Multi-head attention, and a live look inside

A single attention calculation can only capture one kind of relationship at a time. Multi-head attention runs several attention operations in parallel, each on its own lower-dimensional projection of Q, K, and V, then concatenates and re-projects the results (Vaswani et al., 2017). The point is specialization: different heads can learn to attend to different kinds of relations — one might track which verb a subject belongs to, another which noun a pronoun refers to. No head is told what to look for; the roles emerge from training.

The interactive below makes this concrete. Pick a query word in the sentence and the chart shows the attention weights that word places on every other word — computed live from toy query and key vectors with the real scaled-dot-product-then-softmax pipeline. Switch heads to see how different heads concentrate on different words. The numbers here are illustrative, designed to show the shape of attention; they are not from a trained model.

Interactive: pick a query word, switch heads

Many heads, one layer: several attention operations run in parallel on separate projections, then are concatenated and re-projected.
Specialization emerges: different heads learn to attend to different relations — this is learned, not hand-designed.
Each head still uses the same softmax(QKᵀ/√dₖ)V math; multi-head just runs it several times in different subspaces.
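A simplified sketch of the split-attend-concatenate pattern follows. It makes two loud simplifications: the per-head projections are replaced by plain slicing of already-projected vectors, and the final learned re-projection is omitted. The function names and toy tokens are assumptions for illustration.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Single-head scaled dot-product attention."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

def multi_head_attention(queries, keys, values, num_heads):
    """Run attention per head on a slice of each vector, then concatenate."""
    d_model = len(queries[0])
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads
    per_head = []
    for h in range(num_heads):
        lo, hi = h * d_head, (h + 1) * d_head
        per_head.append(attention([q[lo:hi] for q in queries],
                                  [k[lo:hi] for k in keys],
                                  [v[lo:hi] for v in values]))
    # Concatenate the heads back into one d_model-sized vector per position.
    # (A real layer would apply a learned output projection here.)
    return [[x for head in per_head for x in head[t]]
            for t in range(len(queries))]

tokens = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
two_heads = multi_head_attention(tokens, tokens, tokens, num_heads=2)
print(two_heads)
```

Each head computes its own attention weights from its own slice, which is what leaves room for heads to focus on different relations.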

06 The catch: attention scales quadratically

Because every token attends to every other token, standard self-attention's time and memory cost grows quadratically with sequence length: double the context and you roughly quadruple the work. This is the main bottleneck for long inputs (Dao et al., 2022; Beltagy et al., 2020), and a great deal of research targets it. The variants fall into three groups that you should not conflate:

Exact, just faster — FlashAttention

FlashAttention is an IO-aware exact attention algorithm: it computes the same result but uses tiling to cut the number of memory reads and writes between the GPU's high-bandwidth memory and its on-chip SRAM. FlashAttention-2 improves parallelism and work partitioning for further speedups (Dao et al., 2022; Dao, 2023). Crucially, the output is mathematically the same — only the memory access pattern changes.

key idea: reorganize the computation for the hardware; identical math, far fewer slow memory trips
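The reason tiling can stay exact is that softmax can be computed in a streaming fashion: keep a running maximum, a running denominator, and a running weighted sum, and rescale them whenever a new block raises the maximum. The sketch below illustrates only that idea for a single query in plain Python. It is not the FlashAttention kernel, which also tiles over queries and manages GPU memory explicitly, and all names and numbers are illustrative.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def full_attention(query, keys, values):
    """Reference: materialize every score, then softmax, then blend."""
    scale = math.sqrt(len(query))
    scores = [dot(query, k) / scale for k in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [sum(e * v[j] for e, v in zip(exps, values)) / total
            for j in range(len(values[0]))]

def tiled_attention(query, keys, values, block_size):
    """Same result, visiting keys and values one block at a time."""
    scale = math.sqrt(len(query))
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        block_keys = keys[start:start + block_size]
        block_values = values[start:start + block_size]
        scores = [dot(query, k) / scale for k in block_keys]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old maximum.
        correction = math.exp(running_max - new_max)
        exps = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * correction + sum(exps)
        acc = [a * correction + sum(e * v[j] for e, v in zip(exps, block_values))
               for j, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]

query = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0], [0.5, 0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0], [3.0, 0.0]]
reference = full_attention(query, keys, values)
tiled = tiled_attention(query, keys, values, block_size=2)
print(reference)
print(tiled)  # agrees with the reference up to floating-point rounding
```

The tiled version never holds more than one block of scores at a time, which is the property that lets the real algorithm avoid writing the full attention matrix to slow memory.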

Share key/value heads — MQA & GQA

During generation, the main cost is moving the cached keys and values for each head. Multi-query attention (MQA) shares a single key/value head across all query heads to cut that bandwidth (Shazeer, 2019); grouped-query attention (GQA) uses an intermediate number of key/value groups, trading off between full multi-head quality and MQA speed (Ainslie et al., 2023).

key idea: keep many query heads but fewer key/value heads — less memory traffic when decoding
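A back-of-the-envelope calculation shows how much the key/value cache shrinks. The configuration below (32 layers, head dimension 128, 4096 cached tokens, 2 bytes per value) is hypothetical and not taken from any particular model; the function name is likewise an assumption.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    keys_and_values = 2  # one cached tensor for K, one for V
    return (keys_and_values * num_layers * num_kv_heads
            * head_dim * seq_len * bytes_per_value)

# Hypothetical model shape, chosen only to make the arithmetic easy to follow.
config = dict(num_layers=32, head_dim=128, seq_len=4096)

mha_bytes = kv_cache_bytes(num_kv_heads=32, **config)  # one KV head per query head
gqa_bytes = kv_cache_bytes(num_kv_heads=8, **config)   # 8 shared KV groups
mqa_bytes = kv_cache_bytes(num_kv_heads=1, **config)   # a single shared KV head

for name, size in [("MHA", mha_bytes), ("GQA", gqa_bytes), ("MQA", mqa_bytes)]:
    print(f"{name}: {size / 2**20:.0f} MiB")
# MHA: 2048 MiB
# GQA: 512 MiB
# MQA: 64 MiB
```

The number of query heads does not appear in the formula at all: only the key/value heads are cached, which is why sharing them reduces decode-time memory traffic.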

Approximate the attention matrix

These methods change the math to dodge the quadratic cost, accepting some approximation. For sequence length n: Longformer uses local windowed plus global sparse attention (O(n)); Reformer uses locality-sensitive hashing (O(n log n)) and reversible layers; Linformer projects keys and values to a low rank (O(n)); Performer approximates softmax attention with FAVOR+ random features (O(n)). Unlike FlashAttention, these are approximate and may trade model quality for speed.

key idea: replace full all-to-all attention with a cheaper approximation — faster, but not exact
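To make the saving concrete, the sketch below counts how many query-key pairs get scored under full attention versus a simple local window. The window here is a bare sliding window in the spirit of Longformer's local attention, without its global tokens; the function names and sizes are illustrative.

```python
def full_pairs(n):
    """Every token scores every token: n * n pairs."""
    return n * n

def window_pairs(n, w):
    """Each token scores only tokens at most w positions away."""
    return sum(min(n - 1, i + w) - max(0, i - w) + 1 for i in range(n))

for n in (8, 16):
    print(n, full_pairs(n), window_pairs(n, w=1))
# 8 64 22
# 16 256 46
```

Doubling the length quadruples the full count but only about doubles the windowed count. The price is that tokens outside the window are not scored at all, so the result is no longer the same as full attention.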

07 Check your understanding

TJS Quiz

08 Take it with you & go deeper

"The attention mechanism" — one-page summary

The whole lesson distilled to a printable cheat-sheet.

▸ Already on the site — go deeper

Live lesson

How transformers work

Zoom back out: how attention blocks stack into the full Transformer architecture.

Live lesson

Tokens & context windows

Why the quadratic cost of attention puts a ceiling on context length.

▸ Coming next — deeper progression

Coming soon

Mixture of Experts (MoE)

Another way to scale transformers — routing tokens to specialized sub-networks.

Coming soon

Decoding & sampling

Once attention has produced its output, how the model actually picks the next token.

Concept map

The whole lesson at a glance — expand a branch to see the grounded points underneath it.

The problem attention solves

Older models compressed a whole sentence into one fixed-length vector, losing detail on long inputs.
Attention lets the model look back at all words and weight them per step — a learned "soft alignment" (Bahdanau et al., 2014).
The Transformer (Vaswani et al., 2017) built the whole model from attention, dropping recurrence and convolution.

Queries, keys & values

Query: what a word is looking for. Key: what each word advertises. Value: the information it contributes.
Q, K, and V are three learned linear projections of the same input representation.
The output is a weighted sum of value vectors, weighted by query-key match.

Scaled dot-product attention

Score: dot-product of each query with every key (QKᵀ).
Scale: divide by √dₖ to keep softmax gradients stable at large key dimensions.
Normalize then blend: softmax turns scores into weights summing to 1, then multiply by V.

Self-attention, masking & position

Self vs cross: self-attention stays within one sequence; cross-attention bridges two (decoder → encoder).
Causal masking: in generation a token may only attend to itself and earlier tokens.
Attention has no built-in notion of order (it is permutation-equivariant), so position is added explicitly: sinusoidal, relative, or rotary (RoPE).

Multi-head attention

Several attention operations run in parallel on separate projections, then are concatenated and re-projected.
Different heads learn to attend to different relations — this specialization is learned, not hand-designed.
Each head still uses the same softmax(QKᵀ/√dₖ)V math, just in a different subspace.

The catch: quadratic scaling

Cost grows quadratically with sequence length — the main bottleneck for long inputs.
Exact, just faster: FlashAttention / FlashAttention-2 reorganize memory access for identical math.
Fewer KV heads: MQA (Shazeer 2019) and GQA (Ainslie et al. 2023) cut decode-time memory traffic.
Approximate: Longformer, Reformer, Linformer, Performer trade some quality for sub-quadratic cost.

Sources & further reading

Published by Tech Jacks Solutions · Reviewed June 2026. This lesson explains established concepts and is grounded in the references below; the numbers shown in the interactive are illustrative and labelled as such.

Attention Is All You Need — Vaswani et al. (2017)
Neural Machine Translation by Jointly Learning to Align and Translate — Bahdanau, Cho & Bengio (2014)
Effective Approaches to Attention-based Neural Machine Translation — Luong, Pham & Manning (2015)
Speech and Language Processing (3rd ed.), Ch. 9 — The Transformer — Jurafsky & Martin
FlashAttention: Fast and Memory-Efficient Exact Attention — Dao et al. (2022)
FlashAttention-2: Faster Attention with Better Parallelism — Dao (2023)
Fast Transformer Decoding: One Write-Head is All You Need (MQA) — Shazeer (2019)
GQA: Training Generalized Multi-Query Transformer Models — Ainslie et al. (2023)
Self-Attention with Relative Position Representations — Shaw, Uszkoreit & Vaswani (2018)
RoFormer: Enhanced Transformer with Rotary Position Embedding (RoPE) — Su et al. (2021)
The Annotated Transformer — Harvard NLP
The Illustrated Transformer — Jay Alammar

Responsible use

This is an educational explainer about how the attention mechanism works. The query/key/value vectors and attention weights in the interactive are illustrative toy numbers chosen to show the shape of attention — they are not from a trained model, and real attention patterns depend on the model, data, and task.

AI systems can produce plausible-sounding but incorrect output. For decisions that carry real consequences, verify against primary sources and consult a qualified professional. See the NIST AI Risk Management Framework for guidance on responsible AI.

The attention mechanism — in one page

Tech Jacks Solutions · AI Knowledge Hub · educational summary

Why attention exists

Earlier models squeezed a whole sentence into one fixed vector, losing detail. Attention (Bahdanau et al., 2014) lets the model look back at every input word and weight them per step. The Transformer (Vaswani et al., 2017) built the whole model from attention, dropping recurrence and convolution.

Queries, keys & values

Each word becomes three learned linear projections: a query (what I want), a key (what I offer), and a value (what I carry). The output is a weighted sum of values, weighted by how well each query matches each key.

Scaled dot-product attention

Attention(Q,K,V) = softmax(QKᵀ / √dₖ) V. Score with the dot product, scale by √dₖ to keep softmax gradients stable, softmax to weights summing to 1, then blend the values.

Self-attention, masking & position

Self-attention: Q/K/V from one sequence; cross-attention bridges two. Causal masking limits a token to itself and earlier positions. Attention has no built-in notion of order (it is permutation-equivariant), so position is added explicitly (sinusoidal, relative, or rotary/RoPE).

Multi-head & scaling

Multi-head runs several attention heads in parallel on projected subspaces, then concatenates and re-projects; heads can specialize on different relations. Standard attention costs grow quadratically with length; FlashAttention is exact-but-faster, MQA/GQA share key/value heads, and Longformer/Reformer/Linformer/Performer are approximate variants.
