What are tokens and the context window?
Two ideas explain much of how AI usage is priced, why a model sometimes "forgets," and why some prompts get cut off. A model doesn't read words; it reads tokens. And it can only hold so many of them at once: its context window. Type into the live tokenizer below and watch both come to life.
01 A token is not a word
Before an AI model can read your text, it first chops it into small pieces — a bit like slicing a sentence into LEGO bricks it can snap together. Each of those pieces is called a token. A token is a sub-word unit — sometimes a whole short word, sometimes just a fragment of a longer one, sometimes a space or a piece of punctuation. So cat might be one token, while antidisestablishmentarianism could be several. The key idea: one word does not always equal one token. Type below and watch your text get chopped into token chips.
- A token is the model's smallest unit of text — not a word, not a letter.
- Common words often stay whole; longer or rarer words split into pieces.
- Spaces and punctuation are part of the split too, so formatting affects the count.
Approximation only — real tokenizers use learned sub-word pieces (e.g. BPE), so these counts are illustrative, not exact. This demo splits on spaces and punctuation and chunks long words into ~4-character pieces to suggest how text gets divided.
The 8,192-token (8K) window is just an illustrative example so the bar moves as you type — real models have their own published limits. Both your input and the model's reply share this budget.
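The demo's approximation can be sketched in a few lines of Python. This is a reconstruction from the description above, not the page's actual code: the function names, the regular expression, and the exact chunking rule are assumptions, and the 8,192 figure is the same illustrative window, not a real model limit.

```python
import re

WINDOW_TOKENS = 8192  # illustrative window from the text, not a real model limit
CHUNK = 4             # approximate characters per piece for long words


def approx_tokenize(text: str) -> list:
    """Roughly split text into words, punctuation marks, and ~4-character chunks."""
    pieces = []
    # \w+ grabs runs of letters/digits; [^\w\s] grabs single punctuation marks.
    for part in re.findall(r"\w+|[^\w\s]", text):
        if len(part) <= CHUNK:
            pieces.append(part)  # short words and punctuation stay whole
        else:
            # Long words are cut into fixed-size chunks (a crude stand-in for BPE).
            pieces.extend(part[i:i + CHUNK] for i in range(0, len(part), CHUNK))
    return pieces


def window_used(text: str, window: int = WINDOW_TOKENS) -> float:
    """Fraction of the example window consumed by the text."""
    return len(approx_tokenize(text)) / window
```

With this sketch, `approx_tokenize("The cat sat.")` gives four pieces, while `tokenization` is cut into `toke`, `niza`, `tion`. A real tokenizer would choose different, learned boundaries.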
02 How tokenization works: byte-pair encoding
How does a model decide where to cut? It doesn't use a dictionary of every possible word, which would be enormous and would still miss new words. Instead it uses a learned scheme. One of the most widely used is byte-pair encoding (BPE). The high-level idea: start from tiny pieces (individual characters or bytes), then repeatedly merge the most frequent adjacent pair into a single new token, over and over, until you have a vocabulary of useful sub-word pieces.
- BPE learns from data which pieces are worth keeping — it isn't hand-written per word.
- This lets a model represent a rare or brand-new word by combining smaller known pieces.
- Different tokenizers split the same text differently, so token counts are tokenizer-specific.
A quick mental model: the word tokenization might split into token + ization, while cat stays whole. Frequent chunks become their own tokens; uncommon spellings get assembled from smaller parts. The live tokenizer above only approximates this — to see a real split, paste text into a tokenizer tool (linked in the quiz study plan).
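The merge loop described in this section can be sketched as a toy trainer. It is a minimal illustration under simplifying assumptions: it works on characters rather than bytes, has no special handling of spaces, breaks ties by first occurrence, and uses function names (`learn_bpe`, `apply_bpe`) invented for this example. Production tokenizers are considerably more involved.

```python
from collections import Counter


def merge_pair(symbols: tuple, pair: tuple) -> tuple:
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(corpus: list, num_merges: int) -> list:
    """Learn up to `num_merges` merge rules from a list of words."""
    # Each word starts as a tuple of single characters, with its frequency.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        new_words = Counter()
        for symbols, freq in words.items():
            new_words[merge_pair(symbols, best)] += freq
        words = new_words
    return merges


def apply_bpe(word: str, merges: list) -> list:
    """Tokenize a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)
```

Trained on `["low", "low", "lower", "lowest"]` with two merges, the sketch learns `l + o` and then `lo + w`, so an unseen word like `lowly` comes out as `low`, `l`, `y`: a known chunk plus smaller leftover pieces.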
03 Why tokens matter: cost, limits & other languages
Once you know text is measured in tokens, three practical things click into place:
- Cost. API usage is billed by the token — both what you send (input) and what the model writes back (output). Fewer tokens, lower cost. That's why a concise prompt is cheaper than a padded one.
- Limits. Every model has a maximum number of tokens it can handle at once. Your text has to fit — which is exactly what the context window in the next section is about.
- Language efficiency. Tokenizers are often trained mostly on English text. The same idea written in another language can split into more tokens — meaning it can cost more and fill the window faster for equivalent content.
The takeaway for everyday use: when something gets expensive or gets cut off, token count is very often the cause. You can usually fix it by tightening your text, removing redundant formatting, or trimming what you paste in.
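The cost point comes down to simple arithmetic. The sketch below assumes input and output tokens are priced separately per million tokens; the function name and the prices in the usage note are hypothetical placeholders, so check your provider's published rates.

```python
def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,   # hypothetical rate, in your currency
    output_price_per_million: float,  # hypothetical rate, in your currency
) -> float:
    """Estimate the cost of one request from its token counts."""
    total = (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    )
    return total / 1_000_000
```

For example, with made-up prices of 1.00 per million input tokens and 2.00 per million output tokens, a request with 2,000 input tokens and 500 output tokens costs about 0.003. Halving the prompt to 1,000 tokens brings that down to about 0.002.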
04 The context window: the model's working memory
A model doesn't have a memory of everything you've ever told it. For each request, it can only "see" a limited amount of text at once — its context window. Think of it as the model's working memory: a desk of a fixed size. Whatever is on the desk, the model can use. Anything that doesn't fit on the desk simply isn't there.
- The window is measured in tokens — the same units the tokenizer produces.
- It's a shared budget: your prompt and the model's reply both have to fit inside it.
- It is not long-term memory — only what's inside the window right now influences the answer.
This is why a long chat can start to "forget" what you said early on, and why pasting a huge document can leave no room for a useful reply. The meter in the live tokenizer above is a toy version of this idea: as your text grows, you can watch it eat into an example window — and see what happens as you approach the edge.
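One way to picture why early messages drop out of a long chat: an application keeps the most recent messages that fit in the window after reserving room for the reply. The sketch below is an assumption about how such an application might behave, not a description of any specific product. `count_tokens` is a stand-in that counts whitespace-separated words instead of real tokens.

```python
def count_tokens(text: str) -> int:
    """Stand-in counter: whitespace-separated words, not real tokens."""
    return len(text.split())


def trim_history(messages: list, window: int, reserve_for_reply: int) -> list:
    """Keep the newest messages that fit once reply space is set aside."""
    budget = window - reserve_for_reply  # the prompt and reply share the window
    kept, used = [], 0
    for message in reversed(messages):   # walk from newest to oldest
        cost = count_tokens(message)
        if used + cost > budget:
            break                        # this and all older messages are dropped
        kept.append(message)
        used += cost
    kept.reverse()                       # restore chronological order
    return kept
```

With a 12-token window and 3 tokens reserved for the reply, three 4-word messages cannot all fit, so the oldest one is dropped and the model never sees it.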
05 When you run out of room: truncation, RAG & long documents
So what happens when your text won't fit? The honest answer: something has to give. Depending on the tool, an input larger than the window is either rejected with an error or truncated: the part that doesn't fit is dropped, and the model never sees it. That's why a model can confidently ignore a detail buried in a document you pasted: to it, that detail was never there.
- Truncation. Text beyond the limit is cut. The window doesn't quietly grow to fit you.
- "Lost in the middle." Even within a long window, models tend to use information at the start and end more reliably than content buried in the middle — so where you put something matters.
- Long documents & RAG. When material is far bigger than the window, you don't stuff it all in. Retrieval-augmented generation (RAG) fetches just the most relevant chunks and places those in the window.
This is why context length matters so much for real work like answering questions over long PDFs or large codebases: it sets how much relevant material you can put in front of the model at once. Practical moves when you're tight on room: summarize earlier parts of a conversation, retrieve only what's relevant instead of pasting everything, and place key information where the model uses it well.
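The "retrieve only what's relevant" move can be sketched as ranking chunks against the question and packing the best ones into a token budget. This is a deliberately simple illustration: scoring by shared words is a stand-in for the relevance scoring a real retrieval system would use, the word count is a stand-in for a real token count, and all names are invented for the example.

```python
def overlap_score(query: str, chunk: str) -> int:
    """Number of distinct words the query and chunk share (crude relevance)."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))


def retrieve(query: str, chunks: list, token_budget: int) -> list:
    """Pick the most relevant chunks that fit within the token budget."""
    # sorted() is stable, so equally scored chunks keep their original order.
    ranked = sorted(chunks, key=lambda c: overlap_score(query, c), reverse=True)
    selected, used = [], 0
    for chunk in ranked:
        if overlap_score(query, chunk) == 0:
            continue                 # skip chunks with nothing in common
        cost = len(chunk.split())    # stand-in token count
        if used + cost <= token_budget:
            selected.append(chunk)
            used += cost
    return selected
```

Given a small budget, only the best-matching chunk is placed in the window; with more room, the next most relevant chunk is added, and unrelated chunks are never included.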