
Track 01 · Foundations Intermediate ~9 min

How do transformers work?

You already know what a neural network is. The transformer is the architecture that learned to read whole sequences at once, using a mechanism called attention so each word can look at every other word. This module covers why sequences need attention, how tokens become vectors, how self-attention works as a soft lookup, what multi-head attention and positional encoding add, and why this design powers modern large language models.


01 Why sequences need attention

To understand a sentence, you can't read each word in isolation: what a word means depends on the words around it. Take "the animal didn't cross the street because it was too tired": to know what "it" refers to, you have to look back at the rest of the sentence. Earlier sequence models, such as recurrent neural networks (RNNs), processed text one token at a time and carried a running memory of what came before, which made it hard to connect words that sit far apart. The transformer took a different path: let every word look directly at every other word, all at once. That mechanism is called attention.

Walkthrough: Sequential vs. parallel

Sequential (RNN-style): each token is read in order, passing a memory forward. Far-apart words are hard to connect, and nothing can be done in parallel.

A sequence is ordered data — like the words of a sentence — where meaning depends on context.
Sequential models pass information step by step, so long-range relationships between distant words are easy to lose.
Attention lets every position relate directly to every other position, in parallel — the transformer's core idea.
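The difference between the two reading styles can be made concrete with a deliberately simplified count of how many steps information must travel between two positions. This is a sketch, not a model: the function names `sequential_hops` and `attention_hops` are hypothetical, and real recurrent networks and transformers differ in many other ways.

```python
def sequential_hops(i: int, j: int) -> int:
    """Steps a one-token-at-a-time reader needs to carry information
    from position i to position j (one hop per token in between)."""
    return abs(i - j)


def attention_hops(i: int, j: int) -> int:
    """With attention, any position can read any other position directly."""
    return 0 if i == j else 1


# The example sentence from the text, split on spaces for simplicity.
sentence = "the animal didn't cross the street because it was too tired".split()
it_pos = sentence.index("it")
animal_pos = sentence.index("animal")

print("sequential:", sequential_hops(it_pos, animal_pos))
print("attention: ", attention_hops(it_pos, animal_pos))
```

The sequential count grows with the distance between the two words, while the attention count stays at one no matter how far apart they are.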

02 From text to vectors: tokens, embeddings, and the block

A transformer doesn't read letters. First the text is split into tokens: words or word-pieces. Each token is mapped to an embedding, a list of numbers (a vector) that captures its meaning, so related words sit near each other. Because attention by itself has no sense of order, a positional encoding is added so the model still knows where each token sits. These vectors then pass through a stack of transformer blocks that share one structure (each with its own learned weights), each combining attention with a small feed-forward network.

Inputs to a transformer

Tokens: text, chopped up
Embeddings: meaning as vectors
Positional encoding: where each token sits
Transformer block: attention + feed-forward

The starting point

Tokens

Before anything else, the input text is broken into tokens — whole words or common word-pieces. Tokens, not raw characters, are the units a transformer works with; each one will be turned into a vector in the next step.
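The path from text to vectors can be sketched in a few lines. Everything here is a toy stand-in, labelled as an assumption: the whitespace `tokenize` function replaces a real word-piece tokenizer, and `make_embedding_table` fills the table with random numbers where a trained model would have learned values. The helper names are hypothetical.

```python
import random


def tokenize(text: str) -> list[str]:
    # Toy tokenizer (assumption): lowercase and split on whitespace.
    # Real tokenizers usually split text into learned word-pieces.
    return text.lower().split()


def build_vocab(tokens: list[str]) -> dict[str, int]:
    # Give each distinct token an integer id, in order of first appearance.
    vocab: dict[str, int] = {}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab


def make_embedding_table(vocab_size: int, dim: int, seed: int = 0) -> list[list[float]]:
    # Random stand-in for learned embeddings: one vector of length `dim` per id.
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]


tokens = tokenize("The animal didn't cross the street")
vocab = build_vocab(tokens)
token_ids = [vocab[t] for t in tokens]          # text -> ids
table = make_embedding_table(len(vocab), dim=4)
embeddings = [table[i] for i in token_ids]      # ids -> vectors

print(token_ids)
```

Both occurrences of "the" share one id and therefore one embedding vector, which is why a separate positional signal is needed to tell them apart.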

03 Self-attention: each token looks at the others

This is the heart of the transformer. Self-attention works like a soft lookup. For every token, the model derives three vectors: a Query (what am I looking for?), a Key (what do I offer?), and a Value (the information I carry). A token's Query is compared against every token's Key, its own included, to produce attention weights: how much to focus on each token. The output is a blend of all the tokens' Values, weighted accordingly. The result: each token's representation is updated using the context most relevant to it.

In the diagram, line thickness shows attention weight (a softmax over the scores, so the weights for one token sum to 1). The weights shown are illustrative, chosen to show the idea; a trained model learns its own.

Query · Key · Value are three different vectors derived from each token — the building blocks of the lookup.
Comparing one token's Query to every Key gives raw scores; a softmax turns them into attention weights that add up to 1.
The token's new representation is the weighted sum of all the Values — so it absorbs context from the tokens that matter most.
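The three points above can be sketched as a small function. This is a minimal illustration using only the standard library, assuming the Query, Key, and Value vectors are handed in directly (a trained model derives them from each token with learned projections). Dividing the scores by the square root of the key size follows the scaled dot-product form described in "Attention Is All You Need"; the function names are hypothetical.

```python
import math


def softmax(scores):
    # Subtract the max before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def self_attention(queries, keys, values):
    """Return (outputs, weights): one output vector and one row of
    attention weights per query."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # 1. Compare this Query with every Key to get raw scores.
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        # 2. Softmax turns the scores into weights that sum to 1.
        weights = softmax(scores)
        # 3. The output is the weighted sum of all the Values.
        out = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Toy vectors (illustrative values, not from a trained model).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
outputs, weights = self_attention(Q, K, V)
```

Each row of `weights` plays the role of the line thicknesses for one token: it says how much of every Value goes into that token's new vector.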

04 Multi-head attention & keeping track of order

One round of attention captures one kind of relationship. Transformers run several attention operations in parallel, called multi-head attention, so different "heads" can focus on different patterns (one on grammar, another on which word refers to which, and so on). Their outputs are concatenated and combined. And because attention has no built-in sense of order, a positional encoding is added to each token's embedding so the model knows where each token sits. The steps below trace how a token's representation gets built and refined.

Walkthrough: the pipeline

Embed + position

Tokens become vectors. Each token's embedding is combined with a positional encoding so the model knows both what the token is and where it sits in the sequence.

Q · K · V

Project. From each vector the block derives a Query, a Key, and a Value — the three roles every token plays in the attention lookup.

Multi-head attention

Look around, many ways. Several attention heads run in parallel; each blends Values by its own attention weights, capturing a different kind of relationship. Their outputs are concatenated and combined.

Feed-forward

Process per token. A small feed-forward network transforms each token's vector further. Residual connections and normalization keep the deep stack trainable.

Stack & repeat

Go deeper. The block's output feeds the next identical block. After many layers, each token's vector is a rich, context-aware representation ready for the model's final prediction.
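The multi-head step in this walkthrough can be sketched as "split, attend, concatenate". The sketch makes a loud simplifying assumption: it skips the learned Query, Key, Value, and output projections a real model uses, so each head simply works on its own slice of the token vector. The helper names `split_heads` and `multi_head_attention` are hypothetical.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    # Scaled dot-product attention: one output vector per query.
    d_k = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, values))
                    for i in range(len(values[0]))])
    return out


def split_heads(vectors, num_heads):
    # Cut every token vector into `num_heads` equal slices, one per head.
    dim = len(vectors[0])
    if dim % num_heads != 0:
        raise ValueError("vector size must be divisible by num_heads")
    size = dim // num_heads
    return [[v[h * size:(h + 1) * size] for v in vectors] for h in range(num_heads)]


def multi_head_attention(vectors, num_heads):
    head_outputs = []
    for head in split_heads(vectors, num_heads):
        # Assumption: identity projections, so each slice serves as Q, K, and V.
        head_outputs.append(attend(head, head, head))
    # Concatenate each token's per-head outputs back into one vector.
    return [sum((head_out[t] for head_out in head_outputs), [])
            for t in range(len(vectors))]


# Toy token vectors (illustrative values only).
vectors = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
mixed = multi_head_attention(vectors, num_heads=2)
```

Because each head computes its own attention weights, the two halves of every output vector can end up blending the tokens in different proportions, which is the point of running several heads.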

Multiple heads let the model attend to several types of relationship at once, then merge what they find.
Positional encoding restores word order, which pure attention would otherwise ignore.
Stacking many identical blocks is what turns simple lookups into deep, context-rich understanding.
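One concrete way to build a positional encoding is the sinusoidal scheme from "Attention Is All You Need", sketched below. The lesson does not say which scheme a given model uses, so treat this as one option among several (many models learn their position vectors instead). The function names are hypothetical.

```python
import math


def positional_encoding(position: int, dim: int) -> list[float]:
    # Even indices use sine, odd indices use cosine; each sin/cos pair
    # shares a wavelength that grows with the index.
    encoding = []
    for k in range(dim):
        pair = k // 2
        angle = position / (10000 ** (2 * pair / dim))
        encoding.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return encoding


def add_position(embeddings):
    # Add the encoding for position `pos` to the embedding at that position.
    dim = len(embeddings[0])
    return [[e + p for e, p in zip(vec, positional_encoding(pos, dim))]
            for pos, vec in enumerate(embeddings)]


# The same toy embedding repeated three times (illustrative values).
same_word = [[0.5, -0.2, 0.1, 0.9]] * 3
positioned = add_position(same_word)
```

After `add_position`, the three identical input vectors differ from one another, so later attention layers can tell the first occurrence of a word from the third.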

05 Encoder, decoder & why LLMs use this design

The original transformer had two halves: an encoder that reads the input and a decoder that generates the output. Different jobs use different slices of that design. The three variants you'll meet most are:

Encoder–decoder — the original transformer

The 2017 design pairs an encoder that builds a rich representation of the input with a decoder that produces the output one token at a time, attending back to the encoder. It fits tasks that map one sequence to another, like translation or summarization.

good for: translation, summarization (sequence → sequence)

idea: read with the encoder, write with the decoder

Encoder-only — built for understanding

An encoder-only transformer keeps just the reading half. Every token can attend to every other token in both directions, producing representations good for classifying, searching, or answering questions about a given text rather than freely generating new text.

good for: classification, search/embeddings, understanding

idea: read the whole input both ways at once

Decoder-only — built for generation

A decoder-only transformer keeps just the writing half and predicts the next token from the tokens so far. Each position attends only to itself and earlier positions (masked, or "causal," attention). This next-token-prediction setup is the basis of most modern text-generating large language models.

good for: chat, code, open-ended text generation

idea: predict the next token, left to right
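Causal attention can be sketched as an ordinary softmax that is only allowed to see the current and earlier positions. The function below is a minimal illustration with a hypothetical name; implementations commonly get the same effect by setting the hidden scores to negative infinity before the softmax, whereas this version simply leaves them out and fills in zeros.

```python
import math


def causal_attention_weights(scores):
    """scores[i][j] is the raw score of query i against key j.
    Returns weights in which position i attends only to positions j <= i."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]                 # current and earlier positions
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        hidden = [0.0] * (len(row) - i - 1)    # later positions get zero weight
        weights.append([e / total for e in exps] + hidden)
    return weights


# Equal raw scores everywhere, so the mask alone determines the pattern.
uniform_scores = [[0.0, 0.0, 0.0] for _ in range(3)]
for row in causal_attention_weights(uniform_scores):
    print([round(w, 2) for w in row])
```

The first token can only look at itself, the second splits its attention over two positions, and so on, which is what lets the model be trained to predict each next token without seeing it.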

Why transformers power modern LLMs

Two properties made transformers the foundation of large language models: attention captures long-range relationships across a whole sequence, and the architecture parallelizes well, so it trains efficiently on huge amounts of text. Scaling that recipe up in data and parameters is what produced today's LLMs.

strengths: long-range context · parallel training · scales

result: the backbone of modern LLMs


07 Take it with you & go deeper

"Transformers in 5 minutes" — one-page summary

The whole module distilled to a printable cheat-sheet.


Sources & review

Published by Tech Jacks Solutions · Reviewed June 2026. This lesson explains established concepts and is grounded in the references below; figures shown in the interactives are illustrative and labelled as such.

But what is a GPT? Visual intro to transformers — 3Blue1Brown
The Illustrated Transformer — Jay Alammar
Hugging Face LLM Course — How transformers work — Hugging Face
Tokenizers summary — Hugging Face Transformers docs
Attention Is All You Need — Vaswani et al. (2017)
The Annotated Transformer — Harvard NLP
Attention in transformers, visually explained — 3Blue1Brown
Hugging Face LLM Course — Transformer architectures — Hugging Face

How transformers work — in 5 minutes

Tech Jacks Solutions · AI Knowledge Hub · educational summary

Why attention

Language is a sequence where meaning depends on context. Older models read step by step, losing far-apart links. The transformer lets every token relate to every other token in parallel — a mechanism called attention.

Tokens & embeddings

Text is split into tokens (words/word-pieces). Each becomes an embedding — a vector capturing meaning. A positional encoding is added so the model knows word order, since attention is order-agnostic.

Self-attention (Q/K/V)

Each token yields a Query, Key, and Value. A token's Query is compared to every Key to produce attention weights (via softmax, summing to 1); the output is the weighted sum of all the Values — so each token absorbs the most relevant context.

Multi-head & blocks

Several attention heads run in parallel to capture different relationships, then combine. A token also passes through a small feed-forward network. Stacking many identical transformer blocks builds deep, context-aware representations.

Architectures & LLMs

Encoder–decoder — the original, for sequence-to-sequence tasks. Encoder-only — for understanding/classification. Decoder-only — next-token prediction, the basis of most generative LLMs. Long-range context + parallel training is why transformers scaled into modern LLMs.
