How do transformers work?
You already know what a neural network is. The transformer is the architecture that learned to read whole sequences at once, using a mechanism called attention so that each word can look at every other word. This page covers why sequences need attention, how tokens become vectors, how self-attention works as a soft lookup, multi-head attention, positional encoding, and why this design powers modern large language models.
01 Why sequences need attention
To understand a sentence, you can't read each word in isolation: what a word means depends on the words around it. Take "the animal didn't cross the street because it was too tired": to know what "it" points to, you have to look back at the rest of the sentence. Older sequence models, such as recurrent neural networks (RNNs), read text strictly left to right, one word at a time, carrying forward a memory of what came before, which made it hard to connect words that sit far apart. The transformer took a different path: let every word look directly at every other word, all at once. That mechanism is called attention.
Sequential (RNN-style): each token is read in order, passing a memory forward. Far-apart words are hard to connect, and nothing can be done in parallel.
- A sequence is ordered data — like the words of a sentence — where meaning depends on context.
- Sequential models pass information step by step, so long-range relationships between distant words are easy to lose.
- Attention lets every position relate directly to every other position, in parallel — the transformer's core idea.
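To make the "easy to lose" point concrete, here is a minimal sketch of a running memory that is updated one token at a time. The recurrence and the `keep` factor are assumptions chosen for illustration, not how any particular RNN works; the point is only that the first token's share of the memory shrinks with every later step.

```python
def first_token_influence(num_tokens, keep=0.5):
    """Share of the first token left in a running memory after num_tokens steps.

    Toy recurrence (an illustrative assumption, not a real RNN):
        memory = keep * memory + (1 - keep) * token_value
    """
    memory = 0.0
    for position in range(num_tokens):
        # Only the first token carries a signal; the rest are zeros.
        token_value = 1.0 if position == 0 else 0.0
        memory = keep * memory + (1 - keep) * token_value
    return memory

short_sentence = first_token_influence(3)   # first word, 3-token sentence
long_sentence = first_token_influence(12)   # first word, 12-token sentence
print(short_sentence, long_sentence)
```

In this toy setup the first word's trace halves at every step, so it is far weaker at the end of the 12-token sentence than at the end of the 3-token one. Attention avoids this by comparing any two positions directly, however far apart they are.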
02 From text to vectors: tokens, embeddings, and the block
A transformer doesn't read letters. First the text is split into tokens: words or word-pieces. Each token is mapped to an embedding: a list of numbers (a vector) that captures its meaning, so related words sit near each other. Because attention treats all positions in parallel, a positional encoding is added so the model still knows word order. These vectors then pass through a stack of identical transformer blocks, each combining attention with a small feed-forward network.
Tokens
Before anything else, the input text is broken into tokens — whole words or common word-pieces. Tokens, not raw characters, are the units a transformer works with; each one will be turned into a vector in the next step.
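The path from text to position-aware vectors can be sketched in a few lines. Everything here is an assumption for illustration: the whitespace tokenizer stands in for a learned word-piece vocabulary, the embedding table holds random numbers rather than learned ones, and the sinusoidal positional encoding is the scheme from the original 2017 transformer (other models use other schemes).

```python
import math
import random

def tokenize(text):
    """Toy tokenizer: lowercase and split on whitespace."""
    return text.lower().split()

def build_embedding_table(vocab, dim, seed=0):
    """Hypothetical embedding table: one random vector per token (a real model learns these)."""
    rng = random.Random(seed)
    return {token: [rng.uniform(-1, 1) for _ in range(dim)] for token in vocab}

def positional_encoding(position, dim):
    """Sinusoidal encoding: sine on even indices, cosine on odd indices."""
    encoding = []
    for i in range(dim):
        # Each sine/cosine pair shares one frequency.
        angle = position / (10000 ** ((i - i % 2) / dim))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding

def embed(text, table, dim):
    """Look up each token's embedding and add its positional encoding."""
    vectors = []
    for position, token in enumerate(tokenize(text)):
        pos_vec = positional_encoding(position, dim)
        vectors.append([e + p for e, p in zip(table[token], pos_vec)])
    return vectors

DIM = 4
sentence = "the animal was tired"
table = build_embedding_table(sorted(set(tokenize(sentence))), DIM)
vectors = embed(sentence, table, DIM)
print(len(vectors), "tokens, each a vector of length", len(vectors[0]))
```

The same word at two different positions would get the same embedding but a different positional encoding, which is how the model can tell the two occurrences apart.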
03 Self-attention: each token looks at the others
This is the heart of the transformer. Self-attention works like a soft lookup. For every token, the model derives three vectors: a Query (what am I looking for?), a Key (what do I offer?), and a Value (the information I carry). A token's Query is compared against every token's Key, including its own, to produce raw scores; a softmax turns those scores into attention weights that say how much to focus on each token. The output is a blend of all the tokens' Values, weighted by those attention weights. The result: each token's representation is updated using the context most relevant to it.
Diagram: line thickness shows attention weight (a softmax over the scores, so the weights for one token sum to 1). The weights shown are illustrative, chosen to show the idea; a trained model learns its own.
- Query · Key · Value are three different vectors derived from each token — the building blocks of the lookup.
- Comparing one token's Query to every Key gives raw scores; a softmax turns them into attention weights that add up to 1.
- The token's new representation is the weighted sum of all the Values — so it absorbs context from the tokens that matter most.
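The three bullets above can be sketched as a small function. This is a minimal illustration under stated assumptions: the Query, Key, and Value vectors are hand-picked, hypothetical numbers (a real model derives them from each token with learned weights), and the scores are divided by the square root of the key length, the scaling used in the original transformer.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(queries, keys, values):
    """For each query: score every key, softmax the scores, blend the values."""
    scale = math.sqrt(len(keys[0]))
    outputs, all_weights = [], []
    for q in queries:
        scores = [dot(q, k) / scale for k in keys]
        weights = softmax(scores)
        blended = [sum(w * v[i] for w, v in zip(weights, values))
                   for i in range(len(values[0]))]
        outputs.append(blended)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical vectors for three tokens (not taken from any trained model).
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys    = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values  = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

outputs, weights = self_attention(queries, keys, values)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row holds one token's attention weights. The first token's Query matches the first and third Keys equally well and the second Key not at all, so its weights on tokens one and three are equal and larger than its weight on token two.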
04 Multi-head attention & keeping track of order
A single attention operation tends to capture one kind of relationship. Transformers run several attention operations in parallel, known as multi-head attention, so different "heads" can focus on different patterns (one on grammar, another on which word refers to which, and so on). The heads' results are then combined. And because attention has no built-in sense of order, a positional encoding is added to each token's embedding so the model knows where each token sits.
The token's embedding is combined with a positional encoding so the model knows both what the token is and where it sits in the sequence.
From that vector the model derives a Query, a Key, and a Value: the three roles every token plays in the attention lookup.
Several heads run in parallel; each blends Values by its own attention weights, capturing a different kind of relationship. Their outputs are concatenated and combined.
A feed-forward network transforms each token's vector further. Residual connections and normalization keep the deep stack trainable.
- Multiple heads let the model attend to several types of relationship at once, then merge what they find.
- Positional encoding restores word order, which pure attention would otherwise ignore.
- Stacking many identical blocks is what turns simple lookups into deep, context-rich understanding.
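The "several heads, then merge" idea can be sketched as follows. This is a simplification under stated assumptions: each head works on its own slice of the vectors instead of using its own learned projections, the token vectors double as Queries, Keys, and Values, and the learned output projection that usually follows the concatenation is left out. The helper names are hypothetical.

```python
import math

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        outputs.append([sum(w * v[i] for w, v in zip(weights, values))
                        for i in range(len(values[0]))])
    return outputs

def head_slice(vectors, head, head_size):
    """Take the part of every vector that belongs to one head."""
    start = head * head_size
    return [vec[start:start + head_size] for vec in vectors]

def multi_head_attention(queries, keys, values, num_heads):
    head_size = len(queries[0]) // num_heads
    # Each head runs attention independently on its own slice.
    head_outputs = [
        attend(head_slice(queries, h, head_size),
               head_slice(keys, h, head_size),
               head_slice(values, h, head_size))
        for h in range(num_heads)
    ]
    # Concatenate the heads' outputs back into one vector per token.
    merged = []
    for t in range(len(queries)):
        vector = []
        for h in range(num_heads):
            vector.extend(head_outputs[h][t])
        merged.append(vector)
    return merged

# Hypothetical 4-dimensional vectors for three tokens.
tokens = [[1.0, 0.0, 0.0, 1.0],
          [0.0, 1.0, 1.0, 0.0],
          [1.0, 1.0, 0.0, 0.0]]
merged = multi_head_attention(tokens, tokens, tokens, num_heads=2)
print(merged)
```

Because each head computes its own attention weights from its own slice, the two halves of every output vector can reflect different patterns of focus over the same three tokens.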
05 Encoder, decoder & why LLMs use this design
The original transformer had two halves: an encoder that reads the input and a decoder that generates the output. Different jobs use different slices of that design. These are the three variants you'll meet most:
Encoder–decoder — the original transformer
The 2017 design pairs an encoder that builds a rich representation of the input with a decoder that produces the output one token at a time, attending back to the encoder. It fits tasks that map one sequence to another, like translation or summarization.
Encoder-only — built for understanding
An encoder-only transformer keeps just the reading half. Every token can attend to every other token in both directions, producing representations good for classifying, searching, or answering questions about a given text rather than freely generating new text.
Decoder-only — built for generation
A decoder-only transformer keeps just the writing half and predicts the next token from the tokens so far — attending only to earlier positions (masked, or "causal," attention). This next-token-prediction setup is the basis of most modern text-generating large language models.
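Causal attention can be sketched by letting each position run its softmax only over itself and the positions before it. This is a minimal illustration with made-up scores; real implementations usually get the same effect by setting the masked scores to negative infinity before the softmax.

```python
import math

def causal_attention_weights(scores):
    """scores[i][j] is the raw score of query i against key j.

    Returns weights in which position i attends only to positions j <= i.
    """
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]  # earlier positions and the token itself
        peak = max(visible)     # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        hidden = [0.0] * (len(row) - i - 1)  # later positions get zero weight
        weights.append([e / total for e in exps] + hidden)
    return weights

# Made-up scores: every pair of tokens matches equally well.
scores = [[0.0, 0.0, 0.0],
          [0.0, 0.0, 0.0],
          [0.0, 0.0, 0.0]]
weights = causal_attention_weights(scores)
for row in weights:
    print([round(w, 3) for w in row])
```

With equal scores, the first token can only attend to itself, the second splits its attention evenly between the first two tokens, and the third spreads it over all three. An encoder-only model would skip the mask, so every row would be spread evenly over all three positions.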
Why transformers power modern LLMs
Two properties made transformers the foundation of large language models: attention captures long-range relationships across a whole sequence, and the architecture parallelizes well, so it trains efficiently on huge amounts of text. Scaling that recipe up in data and parameters is what produced today's LLMs.