How do large language models work?
A chatbot can feel like it "knows" things. Under the hood it's doing something simpler and stranger: chopping your words into tokens, turning them into numbers that capture meaning, and predicting the most likely next token — over and over. No database lookup. Here's the whole idea, on one page.
01 Tokens: the unit an LLM actually reads
Think of how you might break a long word into smaller chunks to sound it out. A large language model (the "LLM" behind a chatbot) does something similar before it reads anything: it chops your text into small pieces called tokens, each often a whole word or just part of one. It then does one thing astonishingly well: predict the next token, again and again, to build a reply. There's no encyclopedia being consulted; it's pattern completion learned from enormous amounts of text.
- Tokens are the model's unit of cost and context — pricing and limits are measured in tokens, not words.
- "Generation" is just next-token prediction repeated until the answer is done.
- Because it predicts plausibility, a model can sound certain while being wrong.
Simplified illustration. Real models use learned sub-word pieces (e.g. BPE) that don't line up with letters or word-length — so treat this as "roughly how text gets chopped up," not an exact token count.
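To make the chopping concrete, here is a minimal sketch in Python. It is not BPE and not any real model's tokenizer: it uses a greedy longest-match rule over a tiny, hand-written vocabulary (`TOY_VOCAB` and `toy_tokenize` are made-up names for this illustration), just to show how one word can become several tokens.

```python
# Hypothetical toy vocabulary. Real vocabularies are learned from data
# and hold tens of thousands of pieces.
TOY_VOCAB = {"token", "iz", "ation", "un", "believ", "able", "the", "cat", " "}

def toy_tokenize(text, vocab=TOY_VOCAB):
    """Greedy longest-match split; falls back to single characters."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible piece first, shrinking until one matches.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            # No vocabulary piece starts here: emit one character.
            tokens.append(text[i])
            i += 1
    return tokens

print(toy_tokenize("tokenization"))   # ['token', 'iz', 'ation']
print(toy_tokenize("the cat"))        # ['the', ' ', 'cat']
print(toy_tokenize("dog"))            # ['d', 'o', 'g']
```

One word, three tokens: that gap is why limits and pricing are counted in tokens rather than words.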
Illustrative probabilities, for intuition only — to show that the model ranks candidates rather than "knowing" one answer.
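The ranking idea can be sketched in a few lines. Models typically produce a raw score for every token in the vocabulary and turn those scores into probabilities with a softmax. The prompt, the candidate tokens, and the scores below are all invented for illustration; they are not output from any real model.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(scores.values())          # subtract the max for numerical stability
    exps = {tok: math.exp(s - highest) for tok, s in scores.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Hypothetical scores for the token that follows "The capital of France is"
toy_scores = {" Paris": 5.0, " a": 2.0, " located": 1.5, " Lyon": 1.0}

probs = softmax(toy_scores)
ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
for token, p in ranked:
    print(f"{token!r}: {p:.3f}")
```

Every candidate gets some probability, including the wrong ones. The model picks from a ranked list; it does not look up a fact.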
02 The four moves: tokenize, embed, attend, predict
Everything an LLM does chains together four ideas, and each step hands its output to the next: text becomes tokens, tokens become embeddings, attention weighs which earlier tokens matter, and the model predicts the next token.
Tokenize
The model splits your text into tokens — sub-word pieces that are its smallest unit. Common words may be one token; rarer or longer words break into several. This is also why cost and context limits are counted in tokens, not words.
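Here is how the four moves might chain together in code. Treat it as a toy under heavy assumptions: the vocabulary and the 2-number embeddings are invented, whitespace splitting stands in for a real tokenizer, and the attention step skips the learned projections, multiple heads, and stacked layers that real models use. What it does show accurately is the hand-off from one step to the next.

```python
import math

VOCAB = ["the", "cat", "sat", "mat"]      # hypothetical 4-token vocabulary
EMBED = {                                  # hypothetical 2-number embeddings
    "the": [0.1, 0.0],
    "cat": [1.0, 0.2],
    "sat": [0.3, 1.0],
    "mat": [0.9, 0.4],
}

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def tokenize(text):
    # Move 1: whitespace split stands in for a real sub-word tokenizer.
    return text.lower().split()

def embed(tokens):
    # Move 2: look up a vector of numbers for each token.
    return [EMBED[t] for t in tokens]

def attend(vectors):
    # Move 3: the last position scores every position (itself included),
    # then blends their vectors using those scores as weights.
    query = vectors[-1]
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, v) / scale for v in vectors])
    context = [
        sum(w * v[d] for w, v in zip(weights, vectors))
        for d in range(len(query))
    ]
    return weights, context

def predict(context):
    # Move 4: score every vocabulary token against the blended vector.
    probs = softmax([dot(context, EMBED[t]) for t in VOCAB])
    return dict(zip(VOCAB, probs))

tokens = tokenize("the cat sat")
weights, context = attend(embed(tokens))
probs = predict(context)
print(tokens)
print([round(w, 3) for w in weights])   # how much each earlier token matters
print(max(probs, key=probs.get))        # the top-ranked next token
```

The attention weights always sum to 1, so they read as "how much of my focus goes to each earlier token."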
03 The generation loop, step by step
Putting it together: an LLM answers by running a loop. It reads your prompt as tokens, turns them into embeddings, weighs which earlier tokens matter, predicts one likely next token, appends it, and repeats until it predicts a special stop token (or hits a length limit), which marks the answer as done.
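The loop itself is short enough to write out. In this sketch a hand-written lookup table (`TOY_MODEL`, a made-up stand-in) plays the role of the model, and it only looks at the single most recent token. A real model conditions on all earlier tokens and often samples from the probabilities instead of always taking the top one, but the predict-append-repeat shape is the same.

```python
# Hypothetical stand-in for a model: for each token, the probabilities
# of what comes next. "<end>" is the stop token.
TOY_MODEL = {
    "the":  {"cat": 0.6, "mat": 0.4},
    "cat":  {"sat": 0.7, "<end>": 0.3},
    "sat":  {"down": 0.8, "<end>": 0.2},
    "down": {"<end>": 1.0},
}

def generate(prompt_tokens, model=TOY_MODEL, max_tokens=10):
    """Greedy generation from a non-empty list of prompt tokens."""
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):
        # Predict: get the ranked candidates for the next token.
        candidates = model.get(tokens[-1], {"<end>": 1.0})
        next_token = max(candidates, key=candidates.get)  # greedy pick
        # Stop when the model predicts the stop token.
        if next_token == "<end>":
            break
        # Append, then go around again.
        tokens.append(next_token)
    return tokens

print(generate(["the"]))                 # ['the', 'cat', 'sat', 'down']
print(generate(["the"], max_tokens=2))   # ['the', 'cat', 'sat']
```

The `max_tokens` cap matters: without it, a model that never predicts the stop token would run forever.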