What are tokens and the context window?
Two ideas explain most of how AI usage is priced, why a model sometimes "forgets," and why some prompts get cut off. A model doesn't read words; it reads tokens. And it can only hold so many of them at once: that limit is its context window. Type into the live tokenizer below and watch both come to life.
01 A token is not a word
Before an AI model can read your text, it first chops it into small pieces, a bit like slicing a sentence into LEGO bricks it can snap together. Each of those pieces is called a token. A token is a sub-word unit: sometimes a whole short word, sometimes just a fragment of a longer one, sometimes a space or a piece of punctuation. So cat might be one token, while antidisestablishmentarianism could be several. The key idea: one word does not always equal one token. Type below and watch your text get chopped into token chips.
- A token is the model's smallest unit of text, not a word, not a letter.
- Common words often stay whole; longer or rarer words split into pieces.
- Spaces and punctuation are part of the split too, so formatting affects the count.
Approximation only: real tokenizers use learned sub-word pieces (e.g. BPE), so these counts are illustrative, not exact. This demo splits on spaces and punctuation and chunks long words into ~4-character pieces to suggest how text gets divided.
The 8,192-token (8K) window is just an illustrative example so the bar moves as you type. Real models have their own published limits. Both your input and the model's reply share this budget.
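The approximation described above can be sketched in a few lines of Python. This is a rough illustration of the demo's stated rule (split on spaces and punctuation, chunk long words into pieces of about four characters), not the demo's actual code and not a real tokenizer. The names `approx_tokenize` and `window_used` are hypothetical, and discarding whitespace is a simplification made here.

```python
import re

WINDOW_TOKENS = 8192  # illustrative window from the text, not a real model's limit
CHUNK = 4             # approximate piece length for long words (assumption)


def approx_tokenize(text: str) -> list:
    """Split on whitespace and punctuation, then chunk long words into ~4-char pieces."""
    pieces = []
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation mark.
    # Whitespace itself is discarded in this sketch.
    for part in re.findall(r"\w+|[^\w\s]", text):
        if len(part) <= CHUNK:
            pieces.append(part)
        else:
            pieces.extend(part[i:i + CHUNK] for i in range(0, len(part), CHUNK))
    return pieces


def window_used(text: str, window: int = WINDOW_TOKENS) -> float:
    """Fraction of the example window that the text would occupy."""
    return len(approx_tokenize(text)) / window


sample = "The cat sat, unexpectedly."
print(approx_tokenize(sample))
# ['The', 'cat', 'sat', ',', 'unex', 'pect', 'edly', '.']
print(f"{window_used(sample):.4%} of the example window")
```

Counts from a sketch like this are only suggestive; a real tokenizer will usually split the same text differently.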
02 How tokenization works: byte-pair encoding
How does a model decide where to cut? It doesn't use a dictionary of every possible word; that would be enormous and would still miss new words. Instead it uses a learned scheme. The most common one is byte-pair encoding (BPE). The high-level idea: start from tiny pieces (individual characters), then repeatedly merge the most frequent adjacent pair into a single new token, over and over, until you have a vocabulary of useful sub-word pieces.
- BPE learns from data which pieces are worth keeping; it isn't hand-written per word.
- This lets a model represent a rare or brand-new word by combining smaller known pieces.
- Different tokenizers split the same text differently, so token counts are tokenizer-specific.
A quick mental model: the word tokenization might split into token + ization, while cat stays whole. Frequent chunks become their own tokens; uncommon spellings get assembled from smaller parts. The live tokenizer above only approximates this. To see a real split, paste text into a tokenizer tool (linked in the quiz study plan).
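To make the merge loop concrete, here is a minimal character-level sketch of BPE training and encoding. It is a toy under stated assumptions: the corpus is a tiny made-up word list, ties between equally frequent pairs go to whichever pair was seen first, and the function names (`train_bpe`, `encode`, and the helpers) are hypothetical. Production tokenizers typically work on bytes and add further steps that are omitted here.

```python
from collections import Counter
from typing import Optional


def most_frequent_pair(words: Counter) -> Optional[tuple]:
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    if not pairs:
        return None
    return max(pairs, key=pairs.get)  # ties go to the pair seen first


def merge_pair(symbols: tuple, pair: tuple) -> tuple:
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def train_bpe(corpus: list, num_merges: int) -> list:
    """Learn an ordered list of merges, starting from single characters."""
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        updated = Counter()
        for symbols, freq in words.items():
            updated[merge_pair(symbols, pair)] += freq
        words = updated
    return merges


def encode(word: str, merges: list) -> list:
    """Apply the learned merges, in order, to a new word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


corpus = ["low", "low", "low", "lower", "slow"]  # made-up training data
merges = train_bpe(corpus, num_merges=2)
print(merges)                    # [('l', 'o'), ('lo', 'w')]
print(encode("lowest", merges))  # ['low', 'e', 's', 't']
```

The word "lowest" never appears in the training list, yet it is still representable: the frequent chunk "low" has become one piece and the rest falls back to single characters.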
03 Why tokens matter: cost, limits & other languages
Once you know text is measured in tokens, three practical things click into place:
- Cost. API usage is billed by the token: both what you send (input) and what the model writes back (output). Fewer tokens, lower cost. That's why a concise prompt is cheaper than a padded one.
- Limits. Every model has a maximum number of tokens it can handle at once. Your text has to fit, which is exactly what the context window in the next section is about.
- Language efficiency. Tokenizers are often trained mostly on English text. The same idea written in another language can split into more tokens, meaning it can cost more and fill the window faster for equivalent content.
The takeaway for everyday use: when something gets expensive or gets cut off, token count is the first thing to check. You can usually fix it by tightening your text, removing redundant formatting, or trimming what you paste in.
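The cost point comes down to simple arithmetic, sketched below. The prices are placeholders invented for illustration, and charging separate rates for input and output is an assumption of this sketch; check your provider's published pricing for real figures. `estimate_cost` is a hypothetical helper, not part of any API.

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_million: float,
                  output_price_per_million: float) -> float:
    """Estimated cost of one request, given per-million-token prices."""
    total = (input_tokens * input_price_per_million
             + output_tokens * output_price_per_million)
    return total / 1_000_000


# Hypothetical prices: 1.00 per million input tokens, 4.00 per million output tokens.
padded = estimate_cost(2_000, 500, 1.00, 4.00)
concise = estimate_cost(800, 500, 1.00, 4.00)
print(f"padded prompt:  {padded:.4f}")
print(f"concise prompt: {concise:.4f}")
```

With the same reply length, the concise prompt costs less only because it sends fewer input tokens.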
04 The context window: the model's working memory
A model doesn't have a memory of everything you've ever told it. For each request, it can only "see" a limited amount of text at once: its context window. Think of it as the model's working memory: a desk of a fixed size. Whatever is on the desk, the model can use. Anything that doesn't fit on the desk simply isn't there.
- The window is measured in tokens, the same units the tokenizer produces.
- It's a shared budget: your prompt and the model's reply both have to fit inside it.
- It is not long-term memory; only what's inside the window right now influences the answer.
This is why a long chat can start to "forget" what you said early on, and why pasting a huge document can leave no room for a useful reply. The meter in the live tokenizer above is a toy version of this idea: as your text grows, you can watch it eat into an example window, and see what happens as you approach the edge.
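The shared budget can be expressed as a small calculation. The sketch below reuses the illustrative 8,192-token window from the demo; the function names are hypothetical, and real models publish their own limits.

```python
EXAMPLE_WINDOW = 8192  # illustrative only, not a real model's limit


def room_for_reply(prompt_tokens: int, window: int = EXAMPLE_WINDOW) -> int:
    """Tokens left for the model's reply after the prompt is placed in the window."""
    return max(window - prompt_tokens, 0)


def fits(prompt_tokens: int, reply_tokens: int, window: int = EXAMPLE_WINDOW) -> bool:
    """True when the prompt and the reply together stay inside the window."""
    return prompt_tokens + reply_tokens <= window


print(room_for_reply(1_000))  # 7192
print(room_for_reply(8_000))  # 192
print(fits(8_000, 500))       # False
```

A prompt of 8,000 tokens technically fits, but it leaves only 192 tokens for the answer, which is the "no room for a useful reply" situation described above.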
05 When you run out of room: truncation, RAG & long documents
So what happens when your text won't fit? The honest answer: something has to give. If your input is larger than the window, it is typically truncated: the part that doesn't fit is dropped, and the model never sees it. (Some APIs reject an oversized request with an error instead of trimming it.) That's why a model can confidently ignore a detail buried in a document you pasted: to it, that detail was never there.
- Truncation. Text beyond the limit is cut. The window doesn't quietly grow to fit you.
- "Lost in the middle." Even within a long window, models tend to use information at the start and end more reliably than content buried in the middle, so where you put something matters.
- Long documents & RAG. When material is far bigger than the window, you don't stuff it all in. Retrieval-augmented generation (RAG) fetches just the most relevant chunks and places those in the window.
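The retrieval idea in the last bullet can be sketched as: score each chunk against the question, then add the best chunks until a token budget is used up. The sketch below swaps in plain keyword overlap for scoring and word count for token count; both are deliberate simplifications for illustration, since real RAG systems typically use learned embeddings and a real tokenizer. The function names and sample chunks are hypothetical.

```python
import re


def word_set(text: str) -> set:
    """Lowercased words in the text, used for crude keyword matching."""
    return set(re.findall(r"\w+", text.lower()))


def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def retrieve(query: str, chunks: list, budget_tokens: int) -> list:
    """Pick the chunks that overlap most with the query, within a token budget."""
    query_words = word_set(query)
    ranked = sorted(chunks, key=lambda c: len(query_words & word_set(c)), reverse=True)
    selected, used = [], 0
    for chunk in ranked:
        cost = count_tokens(chunk)
        if not (query_words & word_set(chunk)) or used + cost > budget_tokens:
            continue  # skip irrelevant chunks and ones that would overflow the budget
        selected.append(chunk)
        used += cost
    return selected


chunks = [
    "A refund is issued within 14 days of purchase.",
    "Our office is closed on public holidays.",
    "To request a refund, email support with your order number.",
]
print(retrieve("How do I request a refund?", chunks, budget_tokens=12))
# ['To request a refund, email support with your order number.']
```

With a budget of 12, only the best-matching chunk fits; raising the budget to 20 would also admit the first chunk, while the unrelated one is never selected.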
This is why context length matters so much for real work like answering questions over long PDFs or large codebases: it sets how much relevant material you can put in front of the model at once. Practical moves when you're tight on room: summarize earlier parts of a conversation, retrieve only what's relevant instead of pasting everything, and place key information where the model uses it well.
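One way an application might keep a long conversation inside the window is to keep only the most recent messages that fit. The sketch below shows that approach and why it produces "forgetting". It is an assumption about how a chat front end could behave, not a description of any particular product; `trim_history` is a hypothetical helper, and word count again stands in for a real token count.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def trim_history(messages: list, budget_tokens: int) -> list:
    """Keep the newest messages that fit the budget, preserving their order."""
    kept, used = [], 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = count_tokens(message)
        if used + cost > budget_tokens:
            break  # everything older than this point is dropped
        kept.append(message)
        used += cost
    kept.reverse()
    return kept


history = [
    "My name is Ada.",
    "What is a token?",
    "A token is a sub-word unit.",
    "And what is my name?",
]
print(trim_history(history, budget_tokens=12))
# ['A token is a sub-word unit.', 'And what is my name?']
```

The message containing the name no longer fits, so a model given only the trimmed history has no way to answer the final question. Summarizing the dropped messages into a short note is one way to carry such details forward.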
A token is not a word
- A token is a sub-word unit, the model's smallest unit of text, not a word or a letter.
- Common words often stay whole; longer or rarer words split into pieces.
- Spaces and punctuation are part of the split, so one word does not always equal one token.
How tokenization works: byte-pair encoding
- Byte-pair encoding (BPE) starts from tiny pieces and repeatedly merges the most frequent adjacent pair.
- The vocabulary is learned from data, so rare or brand-new words are built from smaller known pieces.
- Different tokenizers split the same text differently, so counts are tokenizer-specific.
Why tokens matter: cost, limits & other languages
- Cost: API usage is billed per token, for both input and output.
- Limits: every model has a maximum number of tokens it can handle at once.
- Language efficiency: non-English text can split into more tokens, costing more and filling the window faster.
The context window: the model's working memory
- The most text a model can consider at once, measured in tokens.
- A shared budget: your prompt and the model's reply both have to fit inside it.
- It is not long-term memory; only what's inside the window right now influences the answer.
When you run out of room: truncation, RAG & long documents
- Truncation: text beyond the limit is dropped, and the model never sees it.
- "Lost in the middle": models use content at the start and end more reliably than the middle.
- For material bigger than the window, summarize or use RAG to place only relevant chunks inside it.
Published by Tech Jacks Solutions · Reviewed June 2026. This lesson explains established concepts and is grounded in the references below; figures shown in the interactives are illustrative and labelled as such.
Tokens & context windows in 5 minutes
Tech Jacks Solutions · AI Knowledge Hub · educational summary
A token is not a word
A model reads tokens, sub-word units. A short word may be one token; a long or rare word splits into several. One word does not always equal one token, and spaces and punctuation count too.
How tokenization works (BPE)
The split is learned, not hand-written. Byte-pair encoding (BPE) starts from small pieces and repeatedly merges the most frequent adjacent pair into a new token. This lets a model build rare words from known pieces. Different tokenizers split the same text differently.
Why tokens matter
Usage and cost are billed per token (input + output), and every model has a token limit. The same idea can take more tokens in some languages than in English, so it can cost more and fill the window faster.
The context window
The context window is how much text a model can consider at once, measured in tokens: its working memory. Your prompt and the reply share this budget. It is not long-term memory: only what's inside it influences the answer.
When you run out of room
Exceeding the window causes truncation: text that doesn't fit is dropped. In long contexts, models often use the start and end better than the middle ("lost in the middle"). To fit large material, summarize, or use retrieval (RAG) to include only the most relevant chunks.
Use it wisely
When something gets expensive or gets cut off, suspect token count. Tighten your text, count tokens with a real tokenizer tool, and put the most important information where the model will actually use it.