
Track 01 · Language & Generation · Novice · start here · ~8 min

What are tokens and the context window?

Two ideas explain almost everything about how AI usage is billed, why a model sometimes "forgets," and why some prompts get cut off. A model doesn't read words; it reads tokens. And it can only hold so many of them at once: its context window. Type into the live tokenizer below and watch both come to life.


01 A token is not a word

Before an AI model can read your text, it first chops it into small pieces — a bit like slicing a sentence into LEGO bricks it can snap together. Each of those pieces is called a token. A token is a sub-word unit — sometimes a whole short word, sometimes just a fragment of a longer one, sometimes a space or a piece of punctuation. So cat might be one token, while antidisestablishmentarianism could be several. The key idea: one word does not always equal one token. Type below and watch your text get chopped into token chips.

A token is the model's smallest unit of text — not a word, not a letter.
Common words often stay whole; longer or rarer words split into pieces.
Spaces and punctuation are part of the split too, so formatting affects the count.

Interactive · live tokenizer · Type to see tokens + the window meter


Approximation only — real tokenizers use learned sub-word pieces (e.g. BPE), so these counts are illustrative, not exact. This demo splits on spaces and punctuation and chunks long words into ~4-character pieces to suggest how text gets divided.
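To make the demo's approximation concrete, here is a minimal Python sketch of the rule described above: split on spaces and punctuation, then cut long words into pieces of about four characters. The name approx_tokens and the choice to drop spaces instead of counting them are assumptions made for illustration; the page's own demo code may differ, and a real learned tokenizer will split text differently.

```python
import re

def approx_tokens(text, chunk=4):
    """Rough, illustrative split: words and punctuation, long words chunked."""
    pieces = []
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation mark.
    # Whitespace is skipped entirely (an assumption of this sketch).
    for part in re.findall(r"\w+|[^\w\s]", text):
        # Cut each part into pieces of at most `chunk` characters.
        pieces.extend(part[i:i + chunk] for i in range(0, len(part), chunk))
    return pieces

print(approx_tokens("The cat sat."))   # ['The', 'cat', 'sat', '.']
print(approx_tokens("tokenization"))   # ['toke', 'niza', 'tion']
```

Notice that the chunks fall wherever the four-character boundary lands, not where a learned tokenizer would place them. That is why counts from a rule like this are only a rough guide.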

Context-window meter · example 8K window (8,192 tokens)


The 8,192-token (8K) window is just an illustrative example so the bar moves as you type — real models have their own published limits. Both your input and the model's reply share this budget.
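The arithmetic behind a meter like this is simple enough to sketch. The Python snippet below is an illustration only: the name window_usage and the idea of passing a separate reply_tokens figure are assumptions, and 8,192 is the lesson's example window, not any particular model's limit.

```python
WINDOW = 8192  # the lesson's illustrative 8K window, not a real model limit

def window_usage(prompt_tokens, reply_tokens=0, window=WINDOW):
    """Return (percent of window used, tokens still free)."""
    used = prompt_tokens + reply_tokens   # input and reply share one budget
    percent = round(100 * used / window, 1)
    remaining = max(window - used, 0)     # never report negative room
    return percent, remaining

print(window_usage(2048))        # (25.0, 6144)
print(window_usage(6000, 2192))  # (100.0, 0)
```

A percentage above 100 means the text plus the expected reply would not fit in the example window.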

02 How tokenization works: byte-pair encoding

How does a model decide where to cut? It doesn't use a dictionary of every possible word — that would be enormous and would still miss new words. Instead it uses a learned scheme. The most common one is byte-pair encoding (BPE). The high-level idea: start from tiny pieces (individual characters), then repeatedly merge the most frequent adjacent pair into a single new token, over and over, until you have a vocabulary of useful sub-word pieces.

BPE learns from data which pieces are worth keeping — it isn't hand-written per word.
This lets a model represent a rare or brand-new word by combining smaller known pieces.
Different tokenizers split the same text differently, so token counts are tokenizer-specific.
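The merge loop described above can be sketched in a few lines of Python. This is a toy version under stated assumptions: it starts from characters, trains on a tiny word list, and breaks ties by taking the first pair it saw. Production tokenizers typically work on bytes, train on huge corpora, and handle spaces and special tokens; the function names here (learn_bpe, merge_pair, encode) are made up for this illustration.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(words, num_merges):
    """Learn an ordered list of merges from a list of training words."""
    # Every word starts as a tuple of single characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count each adjacent pair, weighted by how often the word occurs.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # nothing left to merge
        best = max(pairs, key=pairs.get)  # ties: first pair seen wins
        merges.append(best)
        merged = Counter()
        for symbols, freq in corpus.items():
            merged[merge_pair(symbols, best)] += freq
        corpus = merged
    return merges

def encode(word, merges):
    """Split a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

merges = learn_bpe(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)                   # [('l', 'o'), ('lo', 'w')]
print(encode("lowly", merges))  # ['low', 'l', 'y']
```

The unseen word "lowly" is still representable: the frequent chunk "low" becomes one piece and the rest falls back to single characters, which is the behavior the bullets above describe.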

A quick mental model: the word tokenization might split into token + ization, while cat stays whole. Frequent chunks become their own tokens; uncommon spellings get assembled from smaller parts. The live tokenizer above only approximates this — to see a real split, paste text into a tokenizer tool (linked in the quiz study plan).

Why this matters: because tokenization is learned and tokenizer-specific, you can't reliably guess token counts by eye. When cost or limits matter, count tokens with the right tool rather than counting words.

03 Why tokens matter: cost, limits & other languages

Once you know text is measured in tokens, three practical things click into place:

Cost. API usage is billed by the token — both what you send (input) and what the model writes back (output). Fewer tokens, lower cost. That's why a concise prompt is cheaper than a padded one.
Limits. Every model has a maximum number of tokens it can handle at once. Your text has to fit — which is exactly what the context window in the next section is about.
Language efficiency. Tokenizers are often trained mostly on English text. The same idea written in another language can split into more tokens — meaning it can cost more and fill the window faster for equivalent content.
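The cost point comes down to simple arithmetic, sketched here in Python. Everything specific is an assumption for illustration: the function name estimate_cost, the idea that input and output have separate prices, and the per-million-token prices in the example are hypothetical, not any provider's real rates.

```python
def estimate_cost(input_tokens, output_tokens, price_in, price_out):
    """Estimate one request's cost from hypothetical per-million-token prices."""
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000

# Hypothetical prices: 2.00 per million input tokens, 8.00 per million output.
padded = estimate_cost(1200, 300, price_in=2.0, price_out=8.0)
concise = estimate_cost(400, 300, price_in=2.0, price_out=8.0)
print(padded > concise)  # True: the shorter prompt costs less
```

The same formula explains the language point: if equivalent text splits into more tokens, input_tokens goes up and so does the bill.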

The takeaway for everyday use: when something gets expensive or gets cut off, token count is the likely cause. You can often fix it by tightening your text, removing redundant formatting, or trimming what you paste in.

Practical tip: "make it shorter" isn't just about reading time — fewer words generally means fewer tokens, which means lower cost and more headroom in the window. To know the real number, count tokens with a tokenizer tool rather than guessing from word count.

04 The context window: the model's working memory

A model doesn't have a memory of everything you've ever told it. For each request, it can only "see" a limited amount of text at once — its context window. Think of it as the model's working memory: a desk of a fixed size. Whatever is on the desk, the model can use. Anything that doesn't fit on the desk simply isn't there.

The window is measured in tokens — the same units the tokenizer produces.
It's a shared budget: your prompt and the model's reply both have to fit inside it.
It is not long-term memory — only what's inside the window right now influences the answer.

This is why a long chat can start to "forget" what you said early on, and why pasting a huge document can leave no room for a useful reply. The meter in the live tokenizer above is a toy version of this idea: as your text grows, you can watch it eat into an example window — and see what happens as you approach the edge.
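One common way a chat ends up "forgetting" is that the application drops the oldest messages so the newest ones fit. The Python sketch below shows that idea under stated assumptions: trim_history is a hypothetical helper, count_tokens is whatever counting function you supply, and the word-count lambda in the example is only a stand-in for a real tokenizer. Actual chat products may use other strategies, such as summarizing old turns.

```python
def trim_history(messages, count_tokens, window, reply_reserve):
    """Keep the newest messages that fit, leaving room for the reply."""
    budget = window - reply_reserve  # prompt and reply share the window
    kept, used = [], 0
    # Walk from newest to oldest so the most recent turns survive.
    for msg in reversed(messages):
        cost = count_tokens(msg)
        if used + cost > budget:
            break  # this message and everything older is dropped
        kept.append(msg)
        used += cost
    kept.reverse()  # restore chronological order
    return kept

chat = ["my name is Ada", "hello there", "what is my name"]
count_words = lambda m: len(m.split())  # stand-in for a real token counter
print(trim_history(chat, count_words, window=10, reply_reserve=3))
# ['hello there', 'what is my name']
```

In this tiny example the message containing the name no longer fits, so a model given the trimmed history could not answer the final question.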

05 When you run out of room: truncation, RAG & long documents

So what happens when your text won't fit? The honest answer: something has to give. If your input is larger than the window, it typically gets truncated: the part that doesn't fit is dropped, and the model never sees it. (Some APIs reject an oversized request with an error instead of cutting it.) That's why a model can confidently ignore a detail buried in a document you pasted: to it, that detail was never there.

Truncation. Text beyond the limit is cut. The window doesn't quietly grow to fit you.
"Lost in the middle." Even within a long window, models tend to use information at the start and end more reliably than content buried in the middle — so where you put something matters.
Long documents & RAG. When material is far bigger than the window, you don't stuff it all in. Retrieval-augmented generation (RAG) fetches just the most relevant chunks and places those in the window.

This is why context length matters so much for real work like answering questions over long PDFs or large codebases: it sets how much relevant material you can put in front of the model at once. Practical moves when you're tight on room: summarize earlier parts of a conversation, retrieve only what's relevant instead of pasting everything, and place key information where the model uses it well.
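The "retrieve only what's relevant" move can be sketched as a toy in Python. This is an illustration under loud assumptions: it scores chunks by how many words they share with the question, whereas real RAG systems typically use embeddings and a vector index; the names words and retrieve are hypothetical, and the word-count lambda again stands in for a real token counter.

```python
import re

def words(text):
    """Lowercased set of the words and numbers in a piece of text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, chunks, count_tokens, budget):
    """Pick chunks that overlap most with the question, within a token budget."""
    q = words(question)
    # Highest word overlap first; ties keep their original order.
    ranked = sorted(chunks, key=lambda c: len(q & words(c)), reverse=True)
    picked, used = [], 0
    for chunk in ranked:
        cost = count_tokens(chunk)
        if not (q & words(chunk)) or used + cost > budget:
            continue  # skip irrelevant chunks and ones that don't fit
        picked.append(chunk)
        used += cost
    return picked

docs = [
    "Refunds are issued within 14 days.",
    "Our office is in Lisbon.",
    "Refunds require a receipt.",
]
count_words = lambda c: len(c.split())  # stand-in for a real token counter
print(retrieve("How long do refunds take?", docs, count_words, budget=8))
# ['Refunds are issued within 14 days.']
```

Only the selected chunks would be placed in the window alongside the question, which is how material far larger than the window can still be put to use.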

Worth knowing: a model only "knows" what's inside its context window for this request. If something important got truncated or buried in the middle, the answer can be wrong through no fault of the model — it never had the text. Treat important answers as a first draft, and check that the model actually had the information it needed before relying on it.

06 Check your understanding


07 Take it with you & go deeper

"Tokens & context windows in 5 minutes" — one-page summary

The whole module distilled to a printable cheat-sheet.

▸ Look up a term — AI glossary

Glossary · Token: What a token is and why billing and limits are measured in them.

Glossary · Context window: The model's working memory, meaning how much text it can consider at once.

▸ Build on this — related lessons

Lesson · How LLMs work: See where tokens fit in the bigger picture: embed, attend, and predict the next token.

Coming soon · Retrieval-augmented generation (RAG): How models fit a document far bigger than the window by retrieving only the relevant pieces.


Sources & review

Published by Tech Jacks Solutions · Reviewed June 2026. This lesson explains established concepts and is grounded in the references below; figures shown in the interactives are illustrative and labelled as such.

What are tokens and how to count them — OpenAI Help Center
Tokenizer — OpenAI Platform
Summary of the tokenizers — Hugging Face Transformers
Neural Machine Translation of Rare Words with Subword Units — Sennrich, Haddow & Birch (2015)
Lost in the Middle: How Language Models Use Long Contexts — Liu et al. (2023)

Tokens & context windows — in 5 minutes

Tech Jacks Solutions · AI Knowledge Hub · educational summary

A token is not a word

A model reads tokens — sub-word units. A short word may be one token; a long or rare word splits into several. One word does not always equal one token, and spaces and punctuation count too.

How tokenization works (BPE)

The split is learned, not hand-written. Byte-pair encoding (BPE) starts from small pieces and repeatedly merges the most frequent adjacent pair into a new token. This lets a model build rare words from known pieces. Different tokenizers split the same text differently.

Why tokens matter

Usage is billed per token (input + output), and every model has a token limit. The same idea can take more tokens in some languages than in English, so it can cost more and fill the window faster.

The context window

The context window is how much text a model can consider at once, measured in tokens — its working memory. Your prompt and the reply share this budget. It is not long-term memory: only what's inside it influences the answer.

When you run out of room

Exceeding the window causes truncation — text that doesn't fit is dropped. In long contexts, models often use the start and end better than the middle ("lost in the middle"). To fit large material, summarize, or use retrieval (RAG) to include only the most relevant chunks.

Use it wisely

When something gets expensive or gets cut off, suspect token count. Tighten your text, count tokens with a real tokenizer tool, and put the most important information where the model will actually use it.
