RAG: giving a model your sources
Retrieval-Augmented Generation (RAG) addresses an LLM's habit of answering from fuzzy memory: it fetches relevant documents first, then hands them to the model alongside your question so the answer is grounded in real sources. This page covers the pipeline, the building blocks, and when to reach for RAG instead of fine-tuning.
01 The problem RAG solves
Imagine asking a brilliant friend a question, but they can only answer from memory — they aren't allowed to look anything up. That is how an AI chatbot normally works: it replies from fuzzy memory, the patterns it picked up while being trained. That memory is frozen at a cutoff date, can't see your private documents, and will sometimes produce a confident but wrong answer (a hallucination). Retrieval-Augmented Generation (RAG) is simply letting that friend open the right book first: it fetches the relevant documents and hands them to the model alongside your question. The model then answers grounded in those real sources — so it can stay current, point back to where the answer came from, and make things up far less often.
- RAG adds knowledge at answer-time — no retraining the model.
- Answers become traceable to sources you control, instead of opaque memory.
- Update the knowledge base, and the next answer reflects it as soon as the new content is indexed.
02 Run the pipeline
RAG is a five-stage pipeline: your question goes in, the system searches a knowledge base, retrieves the most relevant chunks, augments the prompt by adding those chunks, and the model generates a grounded answer. The comparison below shows the same model answering with and without RAG.
✕ Without RAG
The model answers from memory. If your policy changed last week, it won't know — and may invent a confident, wrong answer.
✓ With RAG
The model reads your actual policy doc first, then answers — current, specific, and traceable to the source.
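The five stages described in this section can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the function names (`search`, `retrieve`, `augment`, `answer`), the sample policy text, and the word-overlap scoring are all assumptions made for the example, and `echo_model` is a hypothetical stand-in for a real model call.

```python
import re
from typing import Callable


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search(question: str, knowledge_base: list[str]) -> list[tuple[int, str]]:
    """Stage 2: score every chunk by how many words it shares with the question."""
    q = tokenize(question)
    return [(len(q & tokenize(chunk)), chunk) for chunk in knowledge_base]


def retrieve(scored: list[tuple[int, str]], k: int = 2) -> list[str]:
    """Stage 3: keep the k best-scoring chunks, dropping chunks with no overlap."""
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [chunk for score, chunk in ranked[:k] if score > 0]


def augment(question: str, chunks: list[str]) -> str:
    """Stage 4: add the retrieved chunks to the prompt as numbered sources."""
    sources = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer using only these sources and cite them by number.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )


def answer(question: str, knowledge_base: list[str],
           generate: Callable[[str], str], k: int = 2) -> str:
    """Stages 1 to 5: question in, grounded answer out."""
    chunks = retrieve(search(question, knowledge_base), k)
    return generate(augment(question, chunks))


def echo_model(prompt: str) -> str:
    """Hypothetical stand-in for a model: returns the prompt it was given."""
    return prompt


# Made-up sample knowledge base for the demo.
knowledge_base = [
    "Refund policy: customers may return items within 30 days.",
    "Shipping: orders ship within 2 business days.",
    "Support hours: weekdays 9am to 5pm.",
]
print(answer("How many days do customers have to return items?",
             knowledge_base, echo_model, k=1))
```

Because `echo_model` returns its input, running the sketch prints the augmented prompt itself, which makes it easy to see exactly what a real model would receive. Swapping in a real model only means passing a different `generate` function.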
03 The building blocks
Under the hood, four pieces make retrieval work: documents are split into chunks; each chunk is turned into an embedding (a list of numbers that captures its meaning); the embeddings are stored in a vector store; and a retriever finds the chunks closest to your question and hands them to the model (the generator).
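To make the embedding, vector store, and retriever concrete, here is a toy sketch. It assumes a word-count vector as the "embedding" and cosine similarity as the measure of closeness; real systems typically use a learned embedding model and a dedicated vector database instead. The names `embed`, `cosine`, and `VectorStore` are invented for this example.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a sparse word-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class VectorStore:
    """Keeps each chunk next to its embedding and finds the closest ones."""

    def __init__(self) -> None:
        self.entries: list[tuple[Counter, str]] = []

    def add(self, chunk: str) -> None:
        """Embed a chunk once, at indexing time, and store it."""
        self.entries.append((embed(chunk), chunk))

    def nearest(self, question: str, k: int = 2) -> list[str]:
        """Retriever: return the k chunks most similar to the question."""
        q = embed(question)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[0]), reverse=True)
        return [chunk for _, chunk in ranked[:k]]


store = VectorStore()
store.add("Orders ship within two business days.")
store.add("Refunds are accepted within 30 days of purchase.")
store.add("Support is open on weekdays.")
print(store.nearest("When do orders ship?", k=1))
```

Word counts only match exact words, so this toy version misses paraphrases such as "delivery" versus "ship". Matching on meaning rather than spelling is the reason real pipelines use learned embeddings.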
Chunking
Long documents are split into smaller, self-contained passages called chunks — often a few paragraphs each. Smaller chunks make retrieval more precise (you fetch just the relevant bit), and they keep the added context short enough to fit in the prompt.
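One simple way to chunk, sketched below, is to split on blank lines and pack whole paragraphs together until a size limit is reached. The function name `chunk_document` and the `max_chars` limit are assumptions for illustration; real pipelines often measure size in tokens and may overlap neighbouring chunks.

```python
def chunk_document(text: str, max_chars: int = 500) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_chars characters.

    A single paragraph longer than max_chars is kept whole as its own chunk
    rather than being cut mid-sentence.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for paragraph in paragraphs:
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars or not current:
            # Still fits, or this is the first paragraph of a new chunk.
            current = candidate
        else:
            # Adding the paragraph would overflow: close the chunk, start a new one.
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


document = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
for i, chunk in enumerate(chunk_document(document, max_chars=40), start=1):
    print(f"chunk {i}: {chunk!r}")
```

Keeping paragraphs intact is a deliberate choice here: it favours self-contained passages over perfectly even chunk sizes.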
04 RAG vs fine-tuning
People often confuse these two ways of giving a model new abilities. The short version: RAG adds knowledge at answer-time, while fine-tuning changes the model's behavior and style. They solve different problems — and can be combined.
RAG — adds knowledge at answer-time
Keeps the model unchanged and feeds it relevant documents when a question is asked. Best when facts change often or live in your own sources. Easy to update — just change the documents — and answers can cite where they came from.
Fine-tuning — changes behavior & style
Continues training the model on examples so it learns a way of responding: a tone, a format, a specialized skill. This bakes patterns into the model's weights. It takes more effort and compute than updating documents, and it is not the right tool for facts that change quickly, since every update means training again.
Which to use
Reach for RAG when the gap is knowledge: the model needs current or private facts it was never trained on. Reach for fine-tuning when the gap is behavior: the model knows enough but should respond in a particular style or format. Many real systems use both — fine-tune for behavior, RAG for facts.