Language learning lesson

Track 02 · Language Intermediate ~8 min

Why a tuned model follows your instruction — and a base model just keeps typing

A freshly pretrained language model is a world-class text-continuer, not an assistant: ask it a question and it may just write more questions. Instruction tuning and alignment tuning are the steps that turn that raw predictor into something that follows what you actually asked. See the difference for yourself below.


01 The gap: what pretraining actually teaches

A large language model is first pretrained on a huge amount of text with one simple objective: predict the next token. That objective makes it extraordinarily good at continuing text in a statistically plausible way — but "continue this text" is not the same job as "do what I'm asking." If the most likely continuation of your question is another question (because that's how Q&A pages and forums are written), a raw base model may happily produce more questions instead of an answer.
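In symbols, the pretraining objective is the negative log-likelihood of each token given the tokens before it. The notation here is ours, not the source's: θ stands for the model's parameters and x_1, ..., x_T for one training sequence. Nothing in this loss mentions instructions. If the training text follows a question with another question, predicting that second question is exactly what the model is rewarded for.

```latex
\mathcal{L}_{\text{pretrain}}(\theta) \;=\; -\sum_{t=1}^{T} \log p_{\theta}\left(x_t \mid x_{<t}\right)
```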

Instruction tuning exists to close that gap. As one survey puts it, instruction tuning bridges the gap between the model's next-word pretraining objective and a user's objective of having the model follow natural-language instructions (Zhang et al., survey). The model already knows a great deal from pretraining; tuning teaches it to act on what you ask.

Pretraining objective: predict the next token — produces a fluent text-continuer, not an instruction-follower.
User objective: "answer this," "summarize this," "translate this" — a different job from "continue this."
Instruction tuning is the bridge: it re-points the same model at following instructions.

02 What instruction tuning is: SFT on (instruction, response) pairs

At its core, instruction tuning is supervised fine-tuning (SFT) of a pretrained model on a collection of tasks phrased as natural-language instructions, each paired with a good response: many (instruction, response) examples. The model keeps the same next-token machinery, but the examples it now trains on look like "Here is an instruction; here is the kind of response that follows it." Vendor tooling describes SFT in the same terms: training on input–output pairs to adapt a model's style and behaviour.
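Here is a minimal sketch of turning one (instruction, response) pair into a training example. It assumes a common but not universal choice: loss is computed only on the response tokens, so prompt positions get a label that the loss ignores. The `### Instruction:` template, the whitespace "tokenizer", and the helper names are hypothetical stand-ins; real pipelines use a subword tokenizer and their own prompt format. The value -100 matches the default `ignore_index` of PyTorch's `nn.CrossEntropyLoss`.

```python
from typing import Dict, List

# PyTorch's nn.CrossEntropyLoss skips targets equal to ignore_index, which defaults to -100.
IGNORE_INDEX = -100

# Hypothetical prompt template; real pipelines use their own format or a tokenizer's chat template.
TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n"

def build_vocab(texts: List[str]) -> Dict[str, int]:
    """Toy whitespace 'tokenizer': give each previously unseen word the next integer id."""
    vocab: Dict[str, int] = {}
    for text in texts:
        for tok in text.split():
            vocab.setdefault(tok, len(vocab))
    return vocab

def make_sft_example(instruction: str, response: str, vocab: Dict[str, int]) -> Dict[str, List[int]]:
    """Concatenate prompt + response; mask the prompt so only response tokens carry loss."""
    prompt_ids = [vocab[t] for t in TEMPLATE.format(instruction=instruction).split()]
    response_ids = [vocab[t] for t in response.split()]
    return {
        "input_ids": prompt_ids + response_ids,
        "labels": [IGNORE_INDEX] * len(prompt_ids) + response_ids,
    }

instruction = "Summarize: the cat sat on the mat"
response = "A cat sat."
vocab = build_vocab([TEMPLATE.format(instruction=instruction), response])
example = make_sft_example(instruction, response, vocab)
print(example["input_ids"])
print(example["labels"])
```

In this sketch the labels line up with `input_ids` position by position. The one-step shift for next-token prediction is assumed to happen inside the loss, as Hugging Face causal-LM models do internally.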

The crucial finding behind the technique: if you fine-tune across many different tasks written as instructions, the model doesn't just memorize those tasks — it gets better at following instructions for tasks it never saw in training. That generalization to held-out tasks is what made instruction tuning a turning point (FLAN; T0; Super-NaturalInstructions).

Instruction tuning = supervised fine-tuning on (instruction, response) pairs, on top of a pretrained model.
The payoff is zero-shot generalization: better at unseen tasks, not just the ones it was tuned on.
It surfaces capability the base model already has — it doesn't teach the model the world from scratch.
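Measuring that payoff means evaluating on tasks the model never saw during tuning. FLAN did this by grouping its datasets into task clusters and holding out whole clusters at a time. The sketch below shows only that split. The task and cluster names are invented, and the actual tuning and scoring are left as comments.

```python
from typing import Dict, List, Tuple

# Hypothetical task inventory grouped into clusters by task type.
TASK_CLUSTERS: Dict[str, List[str]] = {
    "sentiment": ["movie_review_sentiment", "product_review_sentiment"],
    "translation": ["en_fr_translation", "en_de_translation"],
    "nli": ["entailment_pairs", "contradiction_detection"],
    "summarization": ["news_summary", "dialogue_summary"],
}

def leave_one_cluster_out(
    clusters: Dict[str, List[str]], held_out: str
) -> Tuple[List[str], List[str]]:
    """Tune on every cluster except `held_out`; evaluate only on the held-out cluster."""
    tune = [task for name, tasks in clusters.items() if name != held_out for task in tasks]
    evaluate = list(clusters[held_out])
    return tune, evaluate

for cluster in TASK_CLUSTERS:
    tune, evaluate = leave_one_cluster_out(TASK_CLUSTERS, cluster)
    # In a real run: instruction-tune on `tune`, then score zero-shot on `evaluate`.
    print(f"hold out {cluster!r}: tune on {len(tune)} tasks, evaluate on {evaluate}")
```

Holding out a whole cluster, rather than single datasets, makes it harder for a closely related task to leak into tuning and inflate the "unseen task" score.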

03 See it: base model vs. instruction-tuned model

Here's the difference made concrete. Pick a prompt and watch two models respond to the same input: a BASE model that has only been pretrained (it just continues the text), and the INSTRUCTION-TUNED version of the same model (it treats your text as an instruction to follow). The outputs below are illustrative — stylized to show the behaviour pattern, not transcripts from a specific model.


Base model · pretrained only

Objective: predict the next token. It continues the text.

Instruction-tuned · same model + SFT

Objective: follow the instruction. It does what you asked.

What produced the difference

1 · Pretrain

Learn language by predicting the next token on huge text. Result: a fluent continuer.

2 · SFT on instructions

Fine-tune on many (instruction, response) pairs — e.g. mixtures like FLAN or Super-NaturalInstructions.

3 · Follows instructions

Same weights, re-pointed: the model now treats input as a task to do, and generalizes to unseen tasks.

Same underlying model — the only difference is the SFT-on-instructions step in the middle.
The base model isn't "broken": continuing text is exactly what it was trained to do.
Large instruction mixtures (FLAN, Super-NaturalInstructions with 1,600+ tasks) are how that step is built at scale.
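To make "same model, one extra step" concrete, here is a deliberately tiny toy. It is not a real language model. A bigram count table stands in for the weights and greedy decoding stands in for generation. "Fine-tuning" simply adds up-weighted counts from (instruction, response) pairs to the same table. The corpus, the pairs, and the weight of 5 are all made up for illustration.

```python
from collections import Counter, defaultdict

def update_counts(counts, lines, weight=1):
    """Add next-token counts from `lines`; `weight` up-weights fine-tuning data (an assumption)."""
    for line in lines:
        tokens = line.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += weight
    return counts

def generate(counts, prompt, stop=("?", "."), max_new_tokens=12):
    """Greedy decoding: append the most frequent next token until a stop token appears."""
    tokens = prompt.split()
    new = []
    for _ in range(max_new_tokens):
        followers = counts.get(tokens[-1])
        if not followers:
            break
        nxt = followers.most_common(1)[0][0]  # ties go to the first token seen
        tokens.append(nxt)
        new.append(nxt)
        if nxt in stop:
            break
    return " ".join(new)

# Made-up "forum" text: a question is usually followed by another question.
PRETRAIN = [
    "how do i reset my router ? how do i update the firmware ?",
    "how do i fix slow wifi ? how do i update the drivers ?",
]
# Made-up (instruction, response) pairs for the SFT step.
SFT_PAIRS = [
    ("how do i reset my router ?", "answer : unplug it for ten seconds ."),
    ("how do i fix slow wifi ?", "answer : switch channels ."),
]

base = update_counts(defaultdict(Counter), PRETRAIN)
tuned = update_counts(defaultdict(Counter), PRETRAIN)  # same starting "weights" as base
update_counts(tuned, [f"{i} {r}" for i, r in SFT_PAIRS], weight=5)

prompt = "how do i reset my router ?"
print("base: ", generate(base, prompt))
print("tuned:", generate(tuned, prompt))
```

The base table replies to the question with another question, because that is what followed question marks in its "pretraining" text. After the SFT update it produces an answer-shaped line, even for a question that was not among the pairs (for example "how do i update the drivers ?"). A bigram table can only parrot responses it has already seen, though; real models generalize far more richly.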

04 The next layer: alignment tuning

Following instructions isn't the whole story. A model can follow an instruction and still produce something unhelpful, biased, or harmful. Alignment tuning is the stage — usually applied on top of instruction tuning — that nudges behaviour toward human preferences and values: be helpful, be honest, be harmless. Instruction tuning teaches the model to follow; alignment tuning shapes how it follows.

Two families of methods dominate. RLHF (reinforcement learning from human feedback) trains a reward model on human preference comparisons, then uses reinforcement learning to push the model toward outputs people prefer. In the InstructGPT work, combining SFT on human demonstrations with RLHF produced a model whose outputs human raters preferred even over a far larger model: the 1.3B-parameter InstructGPT was preferred to the 175B-parameter GPT-3 on the authors' own prompt distribution. Constitutional AI takes a different route. It uses a written set of principles (a "constitution") plus RL from AI feedback (RLAIF) to train a harmless assistant largely without human-labeled harmful examples.
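The reward-model half of RLHF can be sketched as a pairwise loss. For each human comparison, the loss pushes the score of the preferred response above the score of the rejected one. InstructGPT trains its reward model with a loss of this form, −log σ(r_chosen − r_rejected), averaged over comparisons. The scores below are hypothetical numbers standing in for a neural reward model's outputs.

```python
import math
from typing import List, Tuple

def softplus(z: float) -> float:
    """log(1 + e^z), written to avoid overflow when |z| is large."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def pairwise_reward_loss(r_chosen: float, r_rejected: float) -> float:
    """-log sigmoid(r_chosen - r_rejected): near 0 when the preferred response scores much higher."""
    return softplus(-(r_chosen - r_rejected))

def mean_loss(comparisons: List[Tuple[float, float]]) -> float:
    """Average the pairwise loss over (preferred, rejected) score pairs."""
    return sum(pairwise_reward_loss(c, r) for c, r in comparisons) / len(comparisons)

# Hypothetical reward-model scores for (preferred, rejected) responses to three prompts.
scores = [(2.1, -0.3), (0.4, 0.9), (1.0, 1.0)]
print(round(mean_loss(scores), 4))
```

Only the score difference matters, so the reward model learns a relative ranking. A tie costs log 2, and a pair ranked the wrong way (the second one above) costs more than a correctly ranked one.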

Instruction tuning ≠ alignment tuning. The first teaches instruction-following; the second aligns behaviour with preferences and safety norms.
RLHF: reward model trained on human preferences → reinforcement learning toward preferred outputs.
Constitutional AI / RLAIF: a list of principles plus AI-generated feedback, reducing reliance on human harm labels.
Alignment reduces but does not eliminate harmful or incorrect output — it's an ongoing research area, not a solved guarantee.
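The reinforcement-learning step in RLHF is usually not "maximize reward at any cost". InstructGPT's PPO objective, written here in simplified notation, subtracts a KL-style penalty that keeps the tuned policy π_φ close to the SFT model π^SFT. In this notation, r_θ is the learned reward model and β sets the strength of the penalty. The paper's PPO-ptx variant also mixes in a pretraining log-likelihood term, which is omitted here.

```latex
\max_{\phi}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_{\phi}(\cdot \mid x)}
\left[
  r_{\theta}(x, y) \;-\; \beta \,\log \frac{\pi_{\phi}(y \mid x)}{\pi^{\mathrm{SFT}}(y \mid x)}
\right]
```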

05 Where the tuning data comes from

Instruction tuning lives or dies on its data. Researchers have explored several ways to build the (instruction, response) pairs — and have found that how you design the data matters as much as how much of it you have. Switch between the main approaches below.


Curated multitask collections (FLAN, Super-NaturalInstructions)

Take many existing NLP datasets, re-phrase each as a natural-language instruction, then fine-tune across all of them. Scaling the number and diversity of tasks, scaling model size, and adding chain-of-thought data each improve generalization to tasks the model never saw. Super-NaturalInstructions assembled 1,600+ such tasks.

instruction: "Classify the sentiment of this review as positive or negative."

scale: hundreds–thousands of distinct tasks, each as instructions
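Here is a minimal sketch of the re-phrasing step, assuming a labelled sentiment dataset in its original format. FLAN wrote several templates per dataset so the model sees the same task phrased different ways. The three templates, the rows, and the function name here are invented for illustration.

```python
import random
from typing import Dict, List

# Hypothetical labelled rows in their original, non-instruction format.
RAW = [
    {"text": "The battery lasts all day.", "label": 1},
    {"text": "It broke after a week.", "label": 0},
]
LABEL_WORDS = {0: "negative", 1: "positive"}

# Several phrasings of one task (invented templates).
TEMPLATES = [
    "Classify the sentiment of this review as positive or negative.\n\n{text}",
    "Review: {text}\nIs this review positive or negative?",
    "{text}\n\nDid the reviewer like the product? Answer positive or negative.",
]

def to_instruction_pairs(rows: List[dict], templates: List[str], seed: int = 0) -> List[Dict[str, str]]:
    """Render each labelled row as an (instruction, response) pair using a randomly chosen template."""
    rng = random.Random(seed)
    pairs = []
    for row in rows:
        template = rng.choice(templates)
        pairs.append({
            "instruction": template.format(text=row["text"]),
            "response": LABEL_WORDS[row["label"]],
        })
    return pairs

for pair in to_instruction_pairs(RAW, TEMPLATES):
    print(pair)
```

The response is the label rendered as a word rather than a class id, so every task ends up in the same text-in, text-out shape.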

Self-generated data (Self-Instruct, Alpaca)

Instead of hand-writing every example, bootstrap them from a model's own generations: have a model produce candidate instructions and responses, filter out low-quality or near-duplicate samples, then fine-tune on what remains. This reduces reliance on human-written instruction data. Stanford's Alpaca used this style to fine-tune an open model on tens of thousands of demonstrations cheaply.

step 1: model generates candidate (instruction, response) pairs

step 2: filter for quality / dedupe → fine-tune on the survivors
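The Self-Instruct paper describes a concrete version of the dedupe step: a generated instruction joins the task pool only if its ROUGE-L similarity to every existing instruction is below 0.7. The sketch below computes ROUGE-L as an F-measure over the longest common subsequence of lowercased whitespace tokens. That tokenization is a simplification, and the pool and candidates are made up.

```python
from typing import List

def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def rouge_l_f(a: str, b: str) -> float:
    """ROUGE-L F-measure on lowercased whitespace tokens (a simplified tokenization)."""
    x, y = a.lower().split(), b.lower().split()
    lcs = lcs_length(x, y)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(x), lcs / len(y)
    return 2 * precision * recall / (precision + recall)

def filter_new_instructions(pool: List[str], candidates: List[str], threshold: float = 0.7) -> List[str]:
    """Keep a candidate only if it is below `threshold` ROUGE-L against everything kept so far."""
    kept = list(pool)
    accepted = []
    for cand in candidates:
        if all(rouge_l_f(cand, existing) < threshold for existing in kept):
            kept.append(cand)
            accepted.append(cand)
    return accepted

pool = ["Write a poem about the sea.", "Translate the sentence into French."]
candidates = [
    "Write a poem about the ocean.",      # near-duplicate of a pool item
    "List three uses of baking soda.",    # new task
    "Write a short poem about the sea.",  # near-duplicate
]
accepted = filter_new_instructions(pool, candidates)
print(accepted)
```

Accepted candidates are added to `kept` immediately, so two near-identical new instructions in the same batch cannot both get through.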

Quality over quantity (LIMA)

LIMA argues that a small, carefully curated, diverse set of about 1,000 high-quality examples can already produce strong instruction-following. The interpretation (the "superficial alignment" hypothesis): most capability is acquired during pretraining, and tuning mainly surfaces it rather than teaching it. More data isn't automatically better.

claim: ~1,000 excellent examples can rival much larger tuning sets

implication: pretraining holds the knowledge; tuning unlocks the behaviour

Deliberate data design (The Flan Collection)

Beyond raw task count, design choices drive results: balancing how many examples each task contributes, enriching tasks with variations, and mixing prompt formats — zero-shot, few-shot, and chain-of-thought. Getting the mixture right materially improves instruction-tuning quality.

levers: task balancing · enrichment · zero-shot / few-shot / CoT mix

takeaway: the recipe matters, not just the ingredient count
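The balancing and format-mixing levers can be illustrated with one well-known scheme: examples-proportional mixing with a cap, from the T5 work. It is swapped in here for illustration and is not taken from the Flan Collection paper. Each task is sampled in proportion to its size, but no task counts for more than a fixed limit. The task sizes, the cap, and the zero-shot / few-shot / chain-of-thought weights are hypothetical.

```python
import random
from typing import Dict

def capped_mixing_rates(task_sizes: Dict[str, int], cap: int) -> Dict[str, float]:
    """Examples-proportional mixing with a cap: huge tasks can't crowd out small ones."""
    capped = {task: min(n, cap) for task, n in task_sizes.items()}
    total = sum(capped.values())
    return {task: c / total for task, c in capped.items()}

# Hypothetical prompt-format mix; the right proportions are an empirical question.
FORMAT_WEIGHTS = {"zero_shot": 0.5, "few_shot": 0.4, "chain_of_thought": 0.1}

def sample_format(rng: random.Random) -> str:
    """Pick how the next training example will be rendered."""
    return rng.choices(list(FORMAT_WEIGHTS), weights=list(FORMAT_WEIGHTS.values()), k=1)[0]

sizes = {"big_qa": 1_000_000, "sentiment": 50_000, "rare_task": 2_000}  # hypothetical example counts
rates = capped_mixing_rates(sizes, cap=50_000)
rng = random.Random(0)
print({task: round(rate, 3) for task, rate in rates.items()})
print([sample_format(rng) for _ in range(5)])
```

Without the cap, `big_qa` would account for about 95% of the sampled examples. With it, `big_qa` and `sentiment` get equal shares, and `rare_task` is still sampled in proportion to its small size.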


07 Take it with you & go deeper

"Instruction & alignment tuning" — one-page summary

The whole lesson distilled to a printable cheat-sheet.

Related lessons on the site

RLHF: Reinforcement Learning from Human Feedback

Go deeper on the alignment-tuning stage: reward models and how preferences become behaviour.

Chain-of-Thought & Reasoning Prompting

Chain-of-thought data is one of the ingredients that make instruction tuning generalize better.

Coming next: deeper progression

Decoding & Sampling (Temperature, Top-k, Top-p)

Once a model follows instructions, decoding settings shape how its responses come out.

Coming soon

Preference optimization (DPO & friends)

A lighter-weight alternative to full RLHF for aligning models to preferences.

Concept map

The whole lesson at a glance, with the grounded points listed under each branch.

The gap pretraining leaves

Pretraining optimizes one objective: predict the next token — producing a fluent text-continuer, not an instruction-follower.
The user's objective ("answer this," "summarize this") is a different job from "continue this text."
A raw base model may reply to a question with more questions, because that is the plausible continuation.

What instruction tuning is

Supervised fine-tuning (SFT) of a pretrained model on many (instruction, response) pairs.
It bridges the next-word pretraining objective and the user's instruction-following goal.
The payoff is zero-shot generalization — better at unseen tasks, not just the tuned ones (FLAN, T0).

Base vs. instruction-tuned

Same underlying model — the only difference is the SFT-on-instructions step in the middle.
The base model isn't "broken": continuing text is exactly what it was trained to do.
Large instruction mixtures (FLAN, Super-NaturalInstructions with 1,600+ tasks) build that step at scale.

Alignment tuning

Applied on top of instruction tuning to align behaviour with human preferences and values.
RLHF: a reward model trained on human preferences, then reinforcement learning toward preferred outputs (InstructGPT).
Constitutional AI / RLAIF: written principles plus AI feedback, reducing reliance on human harm labels.
Alignment reduces but does not eliminate harmful or incorrect output.

Where the tuning data comes from

Curated multitask sets (FLAN, Super-NaturalInstructions): re-phrase existing tasks as instructions; more diverse tasks help.
Self-generated (Self-Instruct, Alpaca): bootstrap pairs from a model's own outputs, then filter.
Quality over quantity (LIMA): ~1,000 high-quality examples can suffice — the "superficial alignment" hypothesis.
Data design (Flan Collection): task balancing and mixing zero-shot / few-shot / CoT formats drives quality.

Sources & further reading

Published by Tech Jacks Solutions · Reviewed June 2026. This lesson explains established concepts grounded in the primary references below; the base-vs-tuned outputs in the interactive are illustrative and labelled as such, not transcripts from a specific model. Performance comparisons cited (e.g. a smaller aligned model preferred over a larger base) are from the original authors' own evaluations on their prompt distributions, not universal claims.

Instruction Tuning for Large Language Models: A Survey. Zhang et al.
Finetuned Language Models Are Zero-Shot Learners (FLAN). Wei et al., Google.
Scaling Instruction-Finetuned Language Models (Flan-T5 / Flan-PaLM). Chung et al., Google.
Multitask Prompted Training Enables Zero-Shot Task Generalization (T0). Sanh et al., BigScience.
Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks. Wang et al.
Training language models to follow instructions with human feedback (InstructGPT). Ouyang et al., OpenAI.
Self-Instruct: Aligning Language Models with Self-Generated Instructions. Wang et al.
LIMA: Less Is More for Alignment. Zhou et al., Meta.
Constitutional AI: Harmlessness from AI Feedback. Bai et al., Anthropic.
The Flan Collection: Designing Data and Methods for Effective Instruction Tuning. Longpre et al., Google.
Supervised fine-tuning (OpenAI API guide). OpenAI.
SFT Trainer (Hugging Face TRL documentation). Hugging Face.

Instruction tuning & alignment tuning — in 8 minutes

Tech Jacks Solutions · AI Knowledge Hub · educational summary

The gap pretraining leaves

Pretraining optimizes one thing: predict the next token. That makes a model a fluent text-continuer, not an assistant — ask a base model a question and it may just continue the text. Instruction tuning bridges the next-word objective and the user's goal of following instructions.

What instruction tuning is

Supervised fine-tuning (SFT) on many (instruction, response) pairs, applied to a pretrained model. Tuning across many tasks yields zero-shot generalization — better at tasks it never saw (FLAN, T0, Super-NaturalInstructions).

Alignment tuning

Applied on top of instruction tuning to align behaviour with human preferences and safety. RLHF: train a reward model on human preferences, then reinforcement-learn toward preferred outputs. Constitutional AI (RLAIF): written principles + AI feedback, fewer human harm labels. Alignment reduces but does not eliminate harmful or wrong output.

Where the data comes from

Curated multitask collections (FLAN; Super-NaturalInstructions with 1,600+ tasks). Self-Instruct/Alpaca: bootstrap data from a model's own generations, then filter. LIMA: ~1,000 high-quality examples can suffice ("superficial alignment"). Flan Collection: data design (task balancing, enrichment, mixed zero-shot/few-shot/CoT formats) matters as much as quantity.

Before you rely on AI · responsible use

This lesson is an educational explainer about how language models are tuned. It is not professional advice. The interactive comparison shows illustrative behaviour patterns, not outputs from any specific named model.

Instruction-tuned and aligned models can still produce confident-sounding but incorrect or biased output — alignment reduces, but does not eliminate, that risk. For medical, legal, financial, or mental-health decisions, consult a qualified professional rather than relying on a model's response. If you are in distress, contact a local emergency service or a crisis line such as the 988 Suicide & Crisis Lifeline (US).

For responsible-AI practices and risk management, see the NIST AI Risk Management Framework. Tech Jacks Solutions maintains editorial independence; this lesson cites primary research and vendor product documentation as sources.
