
Track 02 · Language Intermediate ~8 min

Context engineering: packing the window that drives the model

A model only "sees" what fits inside its context window, a finite budget of tokens. Context engineering is the craft of deciding what goes in that window and in what order: instructions, examples, retrieved documents, tool results, and chat history. Pack it well and the model shines; overstuff it and useful information gets squeezed out or quietly ignored.


01 What context engineering actually means

When you use a chat model, it doesn't have memory of your whole project or access to the open web by default. It only responds to the text placed in front of it for that one request — the context window. Context engineering is the practice of curating, structuring, and maintaining everything that occupies that window during inference so the most useful signal is present and the model behaves the way you want. Anthropic frames it as the natural next step after prompt engineering: prompt engineering is mostly about writing the instructions, while context engineering manages the full set of tokens present when the model runs — not just the prompt.

The window can hold many things at once: system instructions, few-shot examples, retrieved documents, tool definitions and results, and the running conversation.
Prompt engineering ≈ wording the instructions; context engineering ≈ managing everything in the window at inference time.
The goal is not "more text" — it's the right text, in the right order, within a fixed budget.

02 The window is a finite budget

Everything you put in the context window is measured in tokens: roughly, chunks of text such as words or pieces of words. The window has a fixed size, so every component competes for the same limited space. A long system prompt leaves less room for retrieved documents; a giant pasted document leaves less room for the conversation. When the total exceeds the limit, something has to give: content gets truncated (cut off) before the model ever sees it, or, in systems that do not truncate automatically, the request is rejected. That is why relevance and ordering matter so much. You are spending a budget, and what you cut is just as important as what you keep. The interactive below makes the trade-off concrete.

Interactive: drag the sliders, watch the budget

System prompt & instructions: 800 tokens. Role, tone, rules. Usually kept near the top of the window.
Few-shot examples: 1,500 tokens. Worked examples the model learns the task from in-context.
Retrieved documents: 6,000 tokens. Passages pulled in by retrieval (RAG) to ground the answer.
Chat history: 4,000 tokens. The running conversation so far. Grows every turn.

The window limit is fixed at 16,000 tokens for this demo. "Tighten retrieval" keeps only the most relevant passages (illustrative); "Summarize history" compresses old turns to free budget (illustrative). The reduction amounts here are for demonstration, not measured ratios.

Window fill: 12,300 / 16,000 tokens

Push the total over 16,000 and the demo truncates the lowest-priority content first — it never reaches the model.
The "Tighten retrieval" and "Summarize history" toggles stand for real techniques: they shrink components so the important signal still fits.
Bigger windows help, but they are still finite — the budgeting problem doesn't disappear, it just moves.
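
The budgeting logic behind the demo can be sketched in a few lines. This is a minimal illustration, not the page's actual implementation: the `Component` class, the `fit_to_window` function, and the priority ranking (system first, history last) are all assumptions made for the example, and the token counts are the demo's illustrative figures.

```python
from dataclasses import dataclass

WINDOW_LIMIT = 16_000  # the demo's illustrative limit, not a real model's


@dataclass
class Component:
    name: str
    tokens: int
    priority: int  # assumed ranking: lower number = more important


def fit_to_window(components, limit=WINDOW_LIMIT):
    """Return {name: tokens kept}, trimming the least important parts first."""
    kept = {c.name: c.tokens for c in components}
    overflow = sum(kept.values()) - limit
    # Walk from least to most important, cutting until the total fits.
    for c in sorted(components, key=lambda c: c.priority, reverse=True):
        if overflow <= 0:
            break
        cut = min(c.tokens, overflow)
        kept[c.name] -= cut
        overflow -= cut
    return kept


defaults = [
    Component("system", 800, priority=0),
    Component("examples", 1_500, priority=1),
    Component("docs", 6_000, priority=2),
    Component("history", 4_000, priority=3),
]
print(sum(c.tokens for c in defaults))  # 12300, under the 16,000 limit

# Let the chat history grow to 9,000 tokens: the total becomes 17,300.
grown = defaults[:3] + [Component("history", 9_000, priority=3)]
print(fit_to_window(grown))  # history is trimmed by 1,300 to 7,700
```

Trimming the tail of one component is the crudest possible policy. In practice you would more likely shrink a component deliberately (fewer passages, a summarized history) than cut it at an arbitrary point.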

03 "Lost in the middle": where you place it matters

Fitting information into the window is only half the battle; where it sits also changes how well the model uses it. A widely cited study, Lost in the Middle (Liu et al., 2023), tested multi-document question answering and key-value retrieval and found that language models tend to use information best when it appears at the beginning or end of a long input, with degraded recall for information buried in the middle. Performance can also fall as the context gets longer overall. Anthropic describes a related tendency it calls "context rot": as the number of tokens grows, the model's ability to accurately recall any given detail tends to decrease, sometimes before the hard limit is even reached. These are empirical tendencies, not iron laws. They differ by model and task, and some newer models mitigate them, but they explain a lot of practical advice.

Put the most important material near the top or bottom of a long context, not buried in the middle.
More tokens is not automatically better: a longer, noisier window can lower recall of the detail you actually care about.
Vendor tip (from Anthropic's long-context prompting guidance): place long documents near the top, wrap distinct inputs in tags with metadata, and ask the model to quote the relevant passages before answering.
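
One way to apply the placement advice above is to assemble the prompt programmatically, so the long material always comes first and the question comes last. The sketch below is illustrative only: `build_long_context_prompt` is a hypothetical helper, and the tag names (`documents`, `document`, `source`, `content`, `quotes`) are assumptions chosen for readability rather than a required schema.

```python
def build_long_context_prompt(documents, question):
    """Place tagged documents first and the question last.

    documents: list of (source, text) pairs.
    Tag names are illustrative, not a required schema.
    """
    parts = ["<documents>"]
    for index, (source, text) in enumerate(documents, start=1):
        parts.append(f'<document index="{index}">')
        parts.append(f"<source>{source}</source>")
        parts.append(f"<content>{text}</content>")
        parts.append("</document>")
    parts.append("</documents>")
    parts.append("")
    # Ask for supporting quotes before the answer itself.
    parts.append(
        "First quote the passages most relevant to the question inside "
        "<quotes> tags, then answer using only those quotes."
    )
    parts.append("")
    parts.append(f"Question: {question}")
    return "\n".join(parts)


docs = [
    ("handbook.txt", "Refunds are available within 30 days of purchase."),
    ("faq.txt", "Shipping takes 3 to 5 business days."),
]
print(build_long_context_prompt(docs, "How long do I have to request a refund?"))
```

Keeping the assembly in one function makes the ordering a deliberate decision rather than an accident of string concatenation, and makes it easy to test alternatives against your own model and task.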

04 The ingredients you're budgeting

Three of the most important things that compete for window space are worth understanding on their own, because each is a lever you can pull.

In-context learning & few-shot examples. Introduced with GPT-3 (Brown et al., 2020), in-context learning lets a model perform a task from a few examples placed in the prompt — without updating its weights. Add a few worked examples and the model imitates the pattern. Chain-of-thought prompting (Wei et al., 2022) is a related move: include exemplars that show intermediate reasoning steps, and complex reasoning improves. Both spend tokens to buy behaviour.
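
To make few-shot prompting concrete, here is a small sketch of how worked examples might be laid out ahead of a new input. The `build_few_shot_prompt` helper, the sentiment task, and the "Review:" / "Sentiment:" labels are all invented for illustration; any consistent pattern the model can imitate would serve.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Lay out worked examples, then the new input in the same pattern."""
    lines = [instruction, ""]
    for text, label in examples:
        lines += [f"Review: {text}", f"Sentiment: {label}", ""]
    # End on an unfinished example so the model completes the pattern.
    lines += [f"Review: {query}", "Sentiment:"]
    return "\n".join(lines)


examples = [
    ("Loved it, would buy again.", "positive"),
    ("Broke after two days.", "negative"),
]
prompt = build_few_shot_prompt(
    "Classify the sentiment of each review as positive or negative.",
    examples,
    "Does exactly what it says.",
)
print(prompt)
```

Each example added here costs tokens on every request, which is the budget trade-off this lesson describes: more exemplars can buy more reliable behaviour, at the expense of room for documents and history.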

Retrieval-augmented generation (RAG). Rather than rely only on what's baked into the weights, RAG (Lewis et al., 2020) retrieves relevant passages at inference time and adds them to the context, combining parametric memory (weights) with non-parametric memory (retrieved text) to improve factuality. Adaptive variants like Self-RAG let the model decide when to retrieve and judge whether a passage is even relevant — controlling what actually enters the window.
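
The retrieve-then-add step can be illustrated with a toy example. This sketch is not how the RAG paper or any production system retrieves: it scores passages by simple word overlap, whereas real pipelines typically use embeddings or a ranking function such as BM25. The function names and sample passages are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, passages, k=2):
    """Return up to k passages sharing the most words with the query."""
    query_words = tokenize(query)
    scored = [(len(query_words & tokenize(p)), p) for p in passages]
    # Drop passages with no overlap, then rank the rest best first.
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True)
    return [p for _, p in ranked[:k]]


def build_rag_prompt(query, passages, k=2):
    """Add only the retrieved passages to the context, then ask the question."""
    context = "\n".join(f"- {p}" for p in retrieve(query, passages, k))
    return (
        "Use only the context below to answer.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


passages = [
    "The context window is a fixed budget of tokens.",
    "Ring Attention spreads long sequences across devices.",
    "Retrieval adds relevant passages to the context at inference time.",
]
query = "What does retrieval add to the context?"
print(build_rag_prompt(query, passages))
```

The unrelated passage never enters the window, which is the point: retrieval is a selection step that decides what gets to spend the budget.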

Long context vs. retrieval. Some models now offer very large windows — Google documents 1M+ token windows for Gemini — which can sometimes replace a separate retrieval pipeline for "chat with your data." But long context and retrieval are complementary, not rivals: a big window still benefits from good selection and ordering, and retrieval keeps the window focused.

05 Techniques that stretch the budget

When the useful content won't fit, you have options beyond "delete things." These are the workhorses of context engineering, and two of them, tightening retrieval and summarizing history, appeared as toggles in the interactive above.

Prompt compression. Methods like LLMLingua / LongLLMLingua reduce a prompt's token count while preserving its key signal, lowering cost and latency — and, in long-context settings, the LongLLMLingua work reports it can help mitigate position bias.
History summarization. Instead of carrying every past turn verbatim, compress older conversation into a short summary, freeing budget for new, relevant content.
Memory management. Agent memory systems (MemGPT-style tiered/paged memory) virtually extend the usable window by swapping information in and out as needed, like an operating system paging memory.
Adaptive retrieval. Self-RAG-style approaches retrieve only when needed and filter for relevance, so retrieved text doesn't crowd out everything else.
Bigger windows under the hood. Positional methods (RoPE interpolation, LongRoPE) extend the context length a model can handle beyond what it was originally trained on, and systems techniques like Ring Attention distribute very long sequences across devices. Figures from these papers are model- and dataset-specific, so treat reported numbers as illustrative, not universal.
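
Of the techniques listed above, history summarization is the simplest to sketch. The example below rests on stated assumptions: `estimate_tokens` uses a rough four-characters-per-token heuristic instead of a real tokenizer, and `summarize` is a placeholder where a real system would ask a model to write the summary. All names are hypothetical.

```python
def estimate_tokens(text):
    # Rough assumption: about 4 characters per token for English text.
    return max(1, len(text) // 4)


def summarize(turns):
    # Placeholder: a real system would ask a model to write this summary.
    return f"[Summary of {len(turns)} earlier turns]"


def compact_history(turns, budget, keep_recent=2):
    """Keep the newest turns verbatim; fold older ones into one summary."""
    total = sum(estimate_tokens(t) for t in turns)
    if total <= budget or len(turns) <= keep_recent:
        return list(turns)
    split = len(turns) - keep_recent
    older, recent = turns[:split], turns[split:]
    return [summarize(older)] + list(recent)


turns = ["a" * 400, "b" * 400, "c" * 40, "d" * 40]  # about 220 tokens
print(compact_history(turns, budget=100))
```

This sketch compacts once and does not re-check the result against the budget, so a real implementation would need to handle the case where the recent turns alone are still too large.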

06 Check your understanding

[Interactive quiz: TJS Quiz]

07 Take it with you & go deeper

"Context engineering" — one-page summary

The whole lesson distilled to a printable cheat-sheet.

▸ Related lessons — go deeper

Chain-of-thought & reasoning prompting. A context-structuring move: include reasoning-step exemplars to improve complex reasoning.

The Model Context Protocol (MCP), explained. A standard way to feed tools and data into a model's context: context engineering in practice.

▸ Coming next — deeper progression

The attention mechanism (deep dive), coming soon. How position and attention decide what the model focuses on inside the window.

Agent memory architectures, coming soon. Tiered and paged memory that virtually extends the usable context window.

Concept map

The whole lesson on one page, showing how context engineering fits together.

What context engineering actually means

Curating, structuring, and maintaining everything in the model’s context window at inference — not just the prompt.
The window holds instructions, few-shot examples, retrieved documents, tool definitions/results, and conversation history.
Anthropic frames it as the natural next step after prompt engineering: prompts word the instructions; context engineering manages the full token set.

The window is a finite budget

Everything is measured in tokens, and the window has a fixed size, so every component competes for limited space.
When the total exceeds the limit, lower-priority content is truncated before the model ever sees it.
Relevance and ordering matter: what you cut is as important as what you keep.

“Lost in the middle”: where you place it matters

Liu et al. (2023) found models use information best at the beginning or end of a long input, with degraded recall for content buried in the middle.
Anthropic’s “context rot”: as token count grows, accurate recall of any given detail tends to decrease — sometimes before the hard limit.
These are empirical tendencies that vary by model and task, not absolute laws.

The ingredients you’re budgeting

In-context learning (GPT-3, Brown et al. 2020): perform a task from a few in-prompt examples, with no weight updates.
Retrieval-augmented generation (Lewis et al. 2020): inject retrieved passages at inference, combining parametric and non-parametric memory.
Long context and retrieval are complementary: big windows (1M+ tokens for some models) still benefit from good selection and ordering.

Techniques that stretch the budget

Prompt compression (LLMLingua / LongLLMLingua): cut token count while preserving key signal; the LongLLMLingua work reports it can also help mitigate position bias in long contexts.
History summarization and memory management (MemGPT-style tiered/paged memory) free budget by compressing or paging information in and out.
Adaptive retrieval (Self-RAG) retrieves only when needed and filters for relevance, so retrieved text doesn’t crowd out everything else.

Sources & further reading

Published by Tech Jacks Solutions · Reviewed June 2026. This lesson explains established concepts and is grounded in the references below. Figures shown in the interactive (token amounts, the 16,000-token limit) are illustrative for teaching and labelled as such; vendor- and paper-reported numbers vary by model, version, and workload and are attributed to their source.

Effective context engineering for AI agents (Anthropic Engineering)
Lost in the Middle: How Language Models Use Long Contexts (Liu et al., 2023)
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (Lewis et al., 2020)
Language Models are Few-Shot Learners (GPT-3) (Brown et al., 2020)
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (Wei et al., 2022)
LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models (Jiang et al., 2023)
MemGPT: Towards LLMs as Operating Systems (Packer et al., 2023)
Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection (Asai et al., 2023)
Long context, Gemini API (Google AI for Developers)
Long context prompting tips (Anthropic / Claude Docs)

Responsible use

This is an educational explainer, not professional advice. AI systems can produce plausible-sounding but incorrect output, and context-handling behaviours such as "lost in the middle" and "context rot" are empirical tendencies that differ across models, versions, and tasks. Context-window sizes and capabilities change frequently — verify current limits against live vendor documentation before relying on specific numbers. For decisions with real consequences, confirm with primary sources and qualified professionals.

Context engineering — in one page

Tech Jacks Solutions · AI Knowledge Hub · educational summary

What it is

Context engineering is curating, structuring, and maintaining everything in the model's context window at inference — instructions, few-shot examples, retrieved documents, tool results, and chat history — so the most useful signal fits the finite budget. It's framed as the next step after prompt engineering (which is mostly about wording instructions).

The window is a finite budget

The window has a fixed size measured in tokens. Every component competes for the same space; over budget, content is truncated and never reaches the model. Relevance and ordering matter because you're spending a budget.

Lost in the middle & context rot

Models tend to use information best at the beginning or end of a long input and worse in the middle (Liu et al., 2023). Anthropic's "context rot": recall accuracy tends to drop as token count grows. Put critical content near the top or bottom. These are tendencies, not absolute laws.

What goes in the window

In-context learning (GPT-3): learn a task from a few in-prompt examples, no weight updates. RAG (Lewis et al., 2020): retrieve passages at inference and add them to the context to improve factuality. Long context and retrieval are complementary.

Stretching the budget

Prompt compression (LLMLingua), history summarization, memory management (MemGPT-style paging), and adaptive retrieval (Self-RAG) all fit more useful signal into a fixed window. Bigger windows (RoPE interpolation, LongRoPE; Ring Attention to serve them) help but stay finite.
