Language & Generation · learning lesson

Track 01 · Language & Generation · Novice · start here · ~8 min

What are tokens and the context window?

Two ideas explain almost everything about how you pay for an AI, why it sometimes "forgets," and why some prompts get cut off. A model doesn't read words. It reads tokens. And it can only hold so many of them at once: its context window. Type into the live tokenizer below and watch both come to life.

01 A token is not a word

Before an AI model can read your text, it first chops it into small pieces, a bit like slicing a sentence into LEGO bricks it can snap together. Each of those pieces is called a token. A token is a sub-word unit: sometimes a whole short word, sometimes just a fragment of a longer one, sometimes a space or a piece of punctuation. So cat might be one token, while antidisestablishmentarianism could be several. The key idea: one word does not always equal one token. Type below and watch your text get chopped into token chips.

A token is the model's smallest unit of text, not a word, not a letter.
Common words often stay whole; longer or rarer words split into pieces.
Spaces and punctuation are part of the split too, so formatting affects the count.

Interactive · live tokenizer · Type to see tokens + the window meter

Approximation only: real tokenizers use learned sub-word pieces (e.g. BPE), so these counts are illustrative, not exact. This demo splits on spaces and punctuation and chunks long words into ~4-character pieces to suggest how text gets divided.
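
The approximation described above can be sketched in a few lines of Python. This is a guess at the demo's behavior based only on that description, not its actual code: the regular expression, the `approx_tokens` name, and the exact 4-character chunk size are all assumptions.

```python
import re

CHUNK = 4  # assumed chunk size, taken from the "~4-character pieces" description


def approx_tokens(text: str) -> list[str]:
    """Roughly split text the way the demo describes (not a real tokenizer)."""
    pieces = []
    # \w+ grabs runs of letters/digits; [^\w\s] grabs one punctuation mark at a time.
    for part in re.findall(r"\w+|[^\w\s]", text):
        if len(part) <= CHUNK:
            pieces.append(part)
        else:
            # Chunk long words into pieces of at most CHUNK characters.
            pieces.extend(part[i:i + CHUNK] for i in range(0, len(part), CHUNK))
    return pieces


print(approx_tokens("The cat sat, unexpectedly."))
# ['The', 'cat', 'sat', ',', 'unex', 'pect', 'edly', '.']
```

Six words and punctuation marks become eight pieces here. A real tokenizer would cut in different places, which is why these counts are only illustrative.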

Context-window meter · example 8K window (8,192 tokens)

The 8,192-token (8K) window is just an illustrative example so the bar moves as you type. Real models have their own published limits. Both your input and the model's reply share this budget.
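
The arithmetic behind a meter like this is simple enough to sketch. The `meter` function below is hypothetical, and capping the percentage at 100 is an assumption about how such a bar would behave, not something the demo documents.

```python
WINDOW = 8_192  # the lesson's example window, not any real model's limit


def meter(used_tokens: int, window: int = WINDOW) -> str:
    """Format a usage line: tokens used, window size, and percent filled."""
    # Cap at 100% so an overfull window still reads as "full" (an assumption).
    percent = min(100, round(100 * used_tokens / window))
    return f"{used_tokens:,} / {window:,} tokens · {percent}%"


print(meter(0))      # 0 / 8,192 tokens · 0%
print(meter(2_048))  # 2,048 / 8,192 tokens · 25%
```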

02 How tokenization works: byte-pair encoding

How does a model decide where to cut? It doesn't use a dictionary of every possible word; that would be enormous and would still miss new words. Instead it uses a learned scheme. One of the most widely used is byte-pair encoding (BPE). The high-level idea: start from tiny pieces (individual characters), then repeatedly merge the most frequent adjacent pair into a single new token, over and over, until the vocabulary reaches a chosen size and is made up of useful sub-word pieces.

BPE learns from data which pieces are worth keeping; it isn't hand-written per word.
This lets a model represent a rare or brand-new word by combining smaller known pieces.
Different tokenizers split the same text differently, so token counts are tokenizer-specific.
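
To make the merge loop concrete, here is a minimal sketch of BPE training on a tiny made-up corpus. The word counts and function names are hypothetical, and real tokenizers add details this toy skips (byte-level starting pieces, pre-tokenization, special tokens).

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Rewrite every word, replacing each occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Start from single characters and apply `num_merges` merges."""
    words = {tuple(word): freq for word, freq in corpus.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:  # nothing left to merge
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words


# Hypothetical word frequencies, chosen only for illustration.
corpus = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
merges, words = train_bpe(corpus, num_merges=3)
print(merges)  # [('u', 'g'), ('u', 'n'), ('h', 'ug')]
print(words)
```

With this toy corpus, the frequent word hug ends up as a single token after three merges, while the rarer pug is still assembled from p + ug.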

A quick mental model: the word tokenization might split into token + ization, while cat stays whole. Frequent chunks become their own tokens; uncommon spellings get assembled from smaller parts. The live tokenizer above only approximates this. To see a real split, paste text into a tokenizer tool (linked in the quiz study plan).

Why this matters: because tokenization is learned and tokenizer-specific, you can't reliably guess token counts by eye. When cost or limits matter, count tokens with the right tool rather than counting words.

03 Why tokens matter: cost, limits & other languages

Once you know text is measured in tokens, three practical things click into place:

Cost. API usage is billed by the token: both what you send (input) and what the model writes back (output). Fewer tokens, lower cost. That's why a concise prompt is cheaper than a padded one.
Limits. Every model has a maximum number of tokens it can handle at once. Your text has to fit, which is exactly what the context window in the next section is about.
Language efficiency. Tokenizers are often trained mostly on English text. The same idea written in another language can split into more tokens, meaning it can cost more and fill the window faster for equivalent content.
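
The cost point comes down to simple arithmetic, sketched below. The prices are placeholders, not any provider's real rates, and pricing input and output separately is an assumption that happens to match how many providers publish their prices.

```python
def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Estimated cost of one request, in the same currency as the prices."""
    total = (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    )
    return total / 1_000_000


# Hypothetical rates: 1.00 per million input tokens, 2.00 per million output tokens.
concise = estimate_cost(1_200, 300, 1.00, 2.00)
padded = estimate_cost(3_000, 300, 1.00, 2.00)
print(f"concise prompt: {concise:.4f}")  # concise prompt: 0.0018
print(f"padded prompt:  {padded:.4f}")   # padded prompt:  0.0036
```

Same reply length, but the padded prompt costs twice as much under these made-up rates, purely because of the extra input tokens.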

The takeaway for everyday use: when something gets expensive or gets cut off, token count is the first thing to check. You can usually fix it by tightening your text, removing redundant formatting, or trimming what you paste in.

Practical tip: "make it shorter" isn't only about reading time. Fewer words generally means fewer tokens, which means lower cost and more headroom in the window. To know the real number, count tokens with a tokenizer tool rather than guessing from word count.

04 The context window: the model's working memory

A model doesn't have a memory of everything you've ever told it. For each request, it can only "see" a limited amount of text at once: its context window. Think of it as the model's working memory: a desk of a fixed size. Whatever is on the desk, the model can use. Anything that doesn't fit on the desk simply isn't there.

The window is measured in tokens, the same units the tokenizer produces.
It's a shared budget: your prompt and the model's reply both have to fit inside it.
It is not long-term memory; only what's inside the window right now influences the answer.
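
The shared budget is easy to see as subtraction. This sketch reuses the lesson's example 8,192-token window; the `room_for_reply` name is hypothetical, and real limits depend on the model.

```python
def room_for_reply(prompt_tokens: int, window: int = 8_192) -> int:
    """Tokens left for the model's reply once the prompt is in the window."""
    return max(0, window - prompt_tokens)


print(room_for_reply(500))    # 7692
print(room_for_reply(8_000))  # 192
print(room_for_reply(9_000))  # 0
```

A short prompt leaves most of the window for the answer; a prompt that nearly fills the window leaves room for only a few sentences.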

This is why a long chat can start to "forget" what you said early on, and why pasting a huge document can leave no room for a useful reply. The meter in the live tokenizer above is a toy version of this idea: as your text grows, you can watch it eat into an example window, and see what happens as you approach the edge.
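
One way a chat tool might produce this "forgetting" is by keeping only the most recent messages that fit. The sketch below shows that strategy under stated assumptions: `fit_recent_messages` is a hypothetical helper, different tools trim differently, and the whitespace word count is only a crude stand-in for a real token counter.

```python
from typing import Callable


def fit_recent_messages(
    messages: list[str],
    budget: int,
    count_tokens: Callable[[str], int],
) -> list[str]:
    """Keep the newest messages whose combined token count fits the budget."""
    kept, used = [], 0
    # Walk backwards from the newest message and stop at the first that won't fit.
    for message in reversed(messages):
        cost = count_tokens(message)
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


def rough_count(text: str) -> int:
    """Crude stand-in: a real tool would use the model's own tokenizer."""
    return len(text.split())


chat = ["My name is Ada.", "I like tea.", "What is BPE?", "And what is my name?"]
print(fit_recent_messages(chat, budget=11, count_tokens=rough_count))
# ['I like tea.', 'What is BPE?', 'And what is my name?']
```

The first message no longer fits, so the model is asked for a name it was never shown in this request.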

05 When you run out of room: truncation, RAG & long documents

So what happens when your text won't fit? The honest answer: something has to give. If your input is larger than the window, it gets truncated: the part that doesn't fit is simply dropped, and the model never sees it. (Some tools reject an oversized request with an error instead of trimming it silently.) That's why a model can seem to confidently ignore a detail buried in a document you pasted: to it, that detail was never there.
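
Truncation itself is just a cut at the limit. The sketch below keeps the start of the input and drops the rest; that is one assumed behavior (some tools drop the oldest text instead), and the whitespace split is only a stand-in for real tokens.

```python
def truncate(tokens: list[str], limit: int) -> tuple[list[str], list[str]]:
    """Return (kept, dropped): the first `limit` tokens and everything after them."""
    return tokens[:limit], tokens[limit:]


# Whitespace-split words stand in for tokens here.
tokens = "the invoice is due on Friday not Monday".split()
kept, dropped = truncate(tokens, limit=5)
print(kept)     # ['the', 'invoice', 'is', 'due', 'on']
print(dropped)  # ['Friday', 'not', 'Monday']
```

The model would see only the kept part, so the actual due date never reaches it.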

Truncation. Text beyond the limit is cut. The window doesn't quietly grow to fit you.
"Lost in the middle." Even within a long window, models tend to use information at the start and end more reliably than content buried in the middle, so where you put something matters.
Long documents & RAG. When material is far bigger than the window, you don't stuff it all in. Retrieval-augmented generation (RAG) fetches just the most relevant chunks and places those in the window.
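
The retrieval step in RAG can be illustrated with a toy ranking. This sketch scores chunks by how many words they share with the question; that word-overlap score is a deliberate simplification swapped in for illustration (real systems typically use learned embeddings or a search index), and the chunks and function names are made up.

```python
import re


def word_set(text: str) -> set[str]:
    """Lowercased words and numbers in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k chunks that share the most words with the question."""
    q = word_set(question)
    ranked = sorted(chunks, key=lambda c: len(q & word_set(c)), reverse=True)
    return ranked[:top_k]


# Hypothetical document chunks.
chunks = [
    "Refunds are issued within 14 days of a return.",
    "Our office is closed on public holidays.",
    "Shipping takes 3 to 5 business days.",
    "A return must include the original receipt.",
]
question = "How many days do refunds take after a return?"

for chunk in retrieve(question, chunks):
    print(chunk)
# Refunds are issued within 14 days of a return.
# A return must include the original receipt.
```

Only the two selected chunks would go into the window alongside the question, rather than the whole document.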

This is why context length matters so much for real work like answering questions over long PDFs or large codebases: it sets how much relevant material you can put in front of the model at once. Practical moves when you're tight on room: summarize earlier parts of a conversation, retrieve only what's relevant instead of pasting everything, and place key information where the model uses it well.
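
"Place key information where the model uses it well" can be turned into a small ordering rule. The sketch below takes items already ranked from most to least relevant and arranges them so the strongest sit at the start and end, with the weakest in the middle. It is one possible ordering under the "lost in the middle" observation, not a guaranteed fix, and the `edges_first` name is hypothetical.

```python
def edges_first(ranked: list[str]) -> list[str]:
    """Reorder best-first items so the best land at the edges, the weakest in the middle."""
    front, back = [], []
    for i, item in enumerate(ranked):
        # Alternate: 1st, 3rd, 5th... go to the front; 2nd, 4th... go to the back.
        (front if i % 2 == 0 else back).append(item)
    return front + back[::-1]


# "A" is the most relevant item, "E" the least.
print(edges_first(["A", "B", "C", "D", "E"]))
# ['A', 'C', 'E', 'D', 'B']
```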

Worth knowing: a model only "knows" what's inside its context window for this request. If something important got truncated or buried in the middle, the answer can be wrong through no fault of the model; it never had the text. Treat important answers as a first draft, and check that the model actually had the information it needed before relying on it.

06 Check your understanding

TJS Quiz

The lesson at a glance

A token is not a word

A token is a sub-word unit, the model's smallest unit of text, not a word or a letter.
Common words often stay whole; longer or rarer words split into pieces.
Spaces and punctuation are part of the split, so one word does not always equal one token.

How tokenization works: byte-pair encoding

Byte-pair encoding (BPE) starts from tiny pieces and repeatedly merges the most frequent adjacent pair.
The vocabulary is learned from data, so rare or brand-new words are built from smaller known pieces.
Different tokenizers split the same text differently, so counts are tokenizer-specific.

Why tokens matter: cost, limits & other languages

Cost: API usage is billed per token, for both input and output.
Limits: every model has a maximum number of tokens it can handle at once.
Language efficiency: non-English text can split into more tokens, costing more and filling the window faster.

The context window: the model's working memory

The most text a model can consider at once, measured in tokens.
A shared budget: your prompt and the model's reply both have to fit inside it.
It is not long-term memory; only what's inside the window right now influences the answer.

When you run out of room: truncation, RAG & long documents

Truncation: text beyond the limit is dropped, and the model never sees it.
"Lost in the middle": models use content at the start and end more reliably than the middle.
For material bigger than the window, summarize or use RAG to place only relevant chunks inside it.

Take it with you

One-page summary: the whole lesson on a printable cheat-sheet.

Every claim below links to its primary source so you can go straight to the original.

✓ Verified · Published by Tech Jacks Solutions · Reviewed June 2026 · Grounded in 5 sources

What are tokens and how to count them · OpenAI Help Center
Tokenizer · OpenAI Platform
Summary of the tokenizers · Hugging Face Transformers
Neural Machine Translation of Rare Words with Subword Units · Sennrich, Haddow & Birch (2015)
Lost in the Middle: How Language Models Use Long Contexts · Liu et al. (2023)

This lesson explains established concepts and is grounded in the references listed above; figures shown in the interactives are illustrative and labelled as such.

Tokens & context windows in 5 minutes

Tech Jacks Solutions · AI Knowledge Hub · educational summary

A token is not a word

A model reads tokens, sub-word units. A short word may be one token; a long or rare word splits into several. One word does not always equal one token, and spaces and punctuation count too.

How tokenization works (BPE)

The split is learned, not hand-written. Byte-pair encoding (BPE) starts from small pieces and repeatedly merges the most frequent adjacent pair into a new token. This lets a model build rare words from known pieces. Different tokenizers split the same text differently.

Why tokens matter

Usage and cost are billed per token (input + output), and every model has a token limit. The same idea can take more tokens in some languages than in English, so it can cost more and fill the window faster.

The context window

The context window is how much text a model can consider at once, measured in tokens: its working memory. Your prompt and the reply share this budget. It is not long-term memory: only what's inside it influences the answer.

When you run out of room

Exceeding the window causes truncation: text that doesn't fit is dropped. In long contexts, models often use the start and end better than the middle ("lost in the middle"). To fit large material, summarize, or use retrieval (RAG) to include only the most relevant chunks.

Use it wisely

When something gets expensive or gets cut off, suspect token count. Tighten your text, count tokens with a real tokenizer tool, and put the most important information where the model will actually use it.
