Language lesson

Track 02 · Language · Intermediate · ~9 min

Teaching a model what people actually prefer

Pretraining teaches a model to predict the next word. It doesn't teach the model to be helpful, honest, or safe. RLHF — reinforcement learning from human feedback — closes that gap by turning human preferences into a reward signal the model can be optimized against. Walk the three-stage pipeline below, train a reward model by ranking responses yourself, then watch the policy shift.

01 Why next-word prediction isn't enough

A base language model is trained on one job: predict the next token across a huge pile of internet text. That makes it fluent, but fluent isn't the same as useful. Ask a base model a question and it may continue with more questions, drift off-topic, or produce confident nonsense — because imitating text is not the same as following your intent. RLHF is the technique that re-aims the model from "what text usually comes next" toward "what a person would actually prefer as a response." It does this by collecting human judgments about which outputs are better, turning those judgments into a numeric reward, and then optimizing the model to earn more of that reward.

The goal is alignment with human preferences — making outputs more helpful, honest, and harmless, and better at following the user's intent than pretraining alone (Ouyang et al., InstructGPT).
Strikingly, InstructGPT showed a 1.3B-parameter RLHF-tuned model could be preferred by labelers over the 175B-parameter GPT-3 on their prompt distribution — alignment, not just size, drives perceived quality.
Those preference results are measured on specific prompts and labeler pools; treat them as evidence of better alignment, not a universal accuracy guarantee.
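
To make the pretraining objective concrete, here is a minimal sketch of a next-token loss. It assumes a hand-written probability table (`TOY_MODEL`, a hypothetical stand-in for a real network) with invented probabilities. The loss only measures how well a piece of text matches the table; nothing in it asks whether the text is helpful.

```python
import math

# Hypothetical toy "model": for each previous word, a probability
# distribution over the next word. A real model computes this with a network.
TOY_MODEL = {
    "the": {"cat": 0.5, "dog": 0.4, "sat": 0.1},
    "cat": {"sat": 0.8, "the": 0.1, "dog": 0.1},
    "sat": {"the": 0.6, "cat": 0.2, "dog": 0.2},
}

def next_token_loss(tokens):
    """Average negative log-likelihood of each token given the previous one."""
    losses = []
    for prev, nxt in zip(tokens, tokens[1:]):
        p = TOY_MODEL[prev].get(nxt, 1e-9)  # tiny floor for unseen pairs
        losses.append(-math.log(p))
    return sum(losses) / len(losses)

likely = next_token_loss(["the", "cat", "sat"])
unlikely = next_token_loss(["the", "sat", "cat"])
print(f"likely text loss:   {likely:.3f}")
print(f"unlikely text loss: {unlikely:.3f}")
```

Text that matches the table gets a lower loss than text that does not. Pretraining pushes that number down across a huge corpus, which is why the result is fluent rather than intent-following.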

02 The three-stage pipeline, step by step

The canonical RLHF recipe has three stages: (1) supervised fine-tuning on example demonstrations, (2) training a reward model from human comparisons between candidate responses, and (3) policy optimization, usually with PPO, that nudges the model toward higher-reward outputs while a KL penalty keeps it from drifting too far. Step through each stage. In stage 2 you'll rank responses yourself to teach the reward model; in stage 3 you'll run optimization steps and watch the model's output distribution shift toward what you preferred.

Interactive: pick a stage, then act

A simpler alternative: DPO. Direct Preference Optimization (Rafailov et al., 2023) skips the separately trained reward model and the online RL loop entirely. It uses a closed-form relationship between a reward function and the optimal policy under a KL constraint, which lets the policy be trained directly on the preference pairs with a simple classification-style loss: the same human comparisons, far less moving machinery. Many recent systems use DPO or related ranking-loss methods instead of, or alongside, classic PPO-based RLHF.
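
The DPO loss for a single preference pair can be sketched in a few lines. This is a minimal illustration, not a training implementation: the function name `dpo_loss`, the default `beta=0.1`, and every log-probability below are assumptions made up for the example. The inputs stand for the summed log-probabilities a policy and a frozen reference model assign to the chosen and rejected responses.

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair, following Rafailov et al. (2023).

    Each argument is the summed log-probability of a whole response.
    """
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model.
    chosen_gain = policy_logp_chosen - ref_logp_chosen
    rejected_gain = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)), written in a numerically stable form
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Invented log-probabilities, for illustration only.
at_start = dpo_loss(-12.0, -11.0, -12.0, -11.0)   # policy equals reference
improved = dpo_loss(-10.0, -13.0, -12.0, -11.0)   # chosen up, rejected down
print(f"loss at start:  {at_start:.4f}")
print(f"loss once the policy favors the chosen response: {improved:.4f}")
```

When the policy is identical to the reference, the margin is zero and the loss is log 2. The loss falls as the policy raises the chosen response and lowers the rejected one relative to the reference, which is the "classification-style" behavior described above.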

03 The reward model: turning preferences into a number

Reinforcement learning needs a reward — a single number that says how good an action was. But "how good is this paragraph?" has no obvious score. The trick is to not ask for a score at all. Instead, humans are shown two (or more) candidate responses to the same prompt and simply pick which one they prefer. A reward model is then trained on thousands of these comparisons to predict which response a human would choose — and in doing so it learns to output a scalar reward for any response. The idea predates LLMs: Christiano et al. (2017) learned reward models from human preferences over pairs of trajectory segments for control tasks, asking for feedback on well under 1% of the agent's interactions.

Humans give comparisons, not scores: "A is better than B" is generally easier to judge reliably and consistently than a rating on a 1–10 scale.
The reward model generalizes the human's taste: once trained, it can score brand-new responses the labelers never saw.
The reward model is imperfect. It can be gamed (reward hacking) and it inherits the values and biases of the specific labeler pool — its preferences are relative to those raters, not an objective standard.
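
The step from "which one is better?" to a scalar reward can be sketched with a pairwise (Bradley-Terry style) loss. This is a toy under stated assumptions: the reward model here is a linear score over three hand-made features, and the features, comparisons, learning rate, and function names are all invented for illustration. Real reward models are neural networks scoring full text.

```python
import math

def reward(weights, features):
    """Scalar reward: a linear score over hand-made features."""
    return sum(w * f for w, f in zip(weights, features))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_reward_model(comparisons, n_features, lr=0.5, epochs=200):
    """Fit weights so that preferred responses score higher."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for chosen, rejected in comparisons:
            # Probability that `chosen` beats `rejected` under the current model
            p = sigmoid(reward(weights, chosen) - reward(weights, rejected))
            # Gradient-descent step on the pairwise loss -log(p)
            for i in range(n_features):
                weights[i] += lr * (1.0 - p) * (chosen[i] - rejected[i])
    return weights

# Invented features per response: [answers_the_question, is_polite, word_count / 100]
comparisons = [
    ([1.0, 1.0, 0.4], [0.0, 1.0, 0.9]),  # on-topic beats long and off-topic
    ([1.0, 0.0, 0.3], [0.0, 0.0, 0.3]),  # on-topic beats off-topic
    ([1.0, 1.0, 0.5], [1.0, 0.0, 0.5]),  # polite beats rude
]
weights = train_reward_model(comparisons, n_features=3)

# Score two responses the "labelers" never compared.
unseen_good = [1.0, 1.0, 0.2]
unseen_bad = [0.0, 0.0, 0.8]
print(reward(weights, unseen_good) > reward(weights, unseen_bad))
```

No labeler ever supplied a number, yet the fitted weights produce one for any response, including responses outside the training comparisons. The same sketch also hints at the weakness noted above: the model only knows what its features and its labelers' choices can express.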

04 Optimization: chase reward, but stay on a leash

With a reward model in hand, the final stage uses reinforcement learning to adjust the model's weights so its outputs earn higher reward. The most common algorithm is Proximal Policy Optimization (PPO) (Schulman et al., 2017), a clipped policy-gradient method that makes cautious, bounded updates instead of large risky ones. There's a catch: if you optimize for the reward model alone, the policy can drift into weird, degenerate text that scores high on the imperfect reward but reads badly — classic reward over-optimization. The fix is a KL-divergence penalty against a frozen reference model (usually the SFT model), introduced for language-model fine-tuning by Ziegler et al. (2019). It acts as a leash: chase reward, but don't wander too far from sensible language.
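
The leash can be written as a small adjustment to the reward. The sketch below assumes the common formulation in which the reward-model score is reduced by a coefficient times the log-probability ratio between the policy and the reference. The name `kl_shaped_reward`, the coefficient `beta=0.1`, and all numbers are illustrative assumptions rather than values from any specific system.

```python
def kl_shaped_reward(rm_score, policy_logp, ref_logp, beta=0.1):
    """Reward-model score minus a KL-style penalty for one sampled response.

    For a response sampled from the policy, policy_logp - ref_logp is a
    single-sample estimate of the KL divergence from the frozen reference.
    """
    kl_estimate = policy_logp - ref_logp
    return rm_score - beta * kl_estimate

# Invented numbers for two sampled responses.
# A sensible response: decent score, almost as likely under the reference.
sensible = kl_shaped_reward(rm_score=2.0, policy_logp=-20.0, ref_logp=-21.0)
# A degenerate response: higher raw score, but the reference finds it very unlikely.
degenerate = kl_shaped_reward(rm_score=3.0, policy_logp=-5.0, ref_logp=-60.0)
print(f"sensible:   {sensible:.2f}")
print(f"degenerate: {degenerate:.2f}")
```

The degenerate response wins on the raw reward-model score but loses once the penalty is applied, because the reference model considers it far less plausible than the policy does. Raising `beta` tightens the leash; lowering it lets the policy chase reward more freely.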

Generate: the policy answers a prompt. The current model produces a candidate response.
→ Score: the reward model rates it. The trained reward model assigns a scalar reward.
→ Update: PPO nudges the weights. A bounded update toward higher reward, minus a KL penalty for drifting from the reference.
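
The generate, score, update loop can be watched in miniature. The sketch below makes several simplifying assumptions: the "policy" is a softmax over just three fixed candidate responses, the update is plain gradient ascent on the exact expected objective rather than PPO on sampled responses, and the rewards, `beta`, and learning rate are invented.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def update_step(logits, ref_probs, rewards, beta=0.5, lr=1.0):
    """One gradient-ascent step on expected reward minus beta * KL(policy || ref)."""
    probs = softmax(logits)
    # Per-response reward after subtracting the KL penalty term
    shaped = [r - beta * math.log(p / q)
              for r, p, q in zip(rewards, probs, ref_probs)]
    baseline = sum(p * s for p, s in zip(probs, shaped))
    # Gradient of the objective with respect to each logit of a softmax policy
    return [z + lr * p * (s - baseline)
            for z, p, s in zip(logits, probs, shaped)]

# Invented numbers: three candidate responses to one prompt.
rewards = [1.0, 0.2, -0.5]        # hypothetical reward-model scores
ref_logits = [0.0, 0.0, 0.0]      # frozen reference: uniform over the candidates
ref_probs = softmax(ref_logits)

logits = list(ref_logits)         # the policy starts as a copy of the reference
for step in range(200):
    logits = update_step(logits, ref_probs, rewards)

final_probs = softmax(logits)
print([round(p, 3) for p in final_probs])
```

Probability mass shifts toward the highest-reward response, but the lowest-reward response never drops to zero: the KL term keeps the policy tethered to the reference. For this objective the optimum is proportional to `ref_prob * exp(reward / beta)`, so a smaller `beta` produces a sharper shift.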

PPO updates are clipped / bounded — small, careful steps rather than big jumps that could destabilize the model.
The KL penalty trades reward against staying close to the reference model; tuning that balance is central to stable RLHF training.
RLHF is known to be complex and sometimes unstable, and an "alignment tax" can regress performance on some tasks. It is a powerful but delicate technique, not a free win.
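
The "clipped / bounded" part of PPO comes down to one small function. Below is a minimal sketch of the clipped surrogate objective for a single action, following the form in Schulman et al. (2017). The function name, the `clip_eps=0.2` default, and the example numbers are assumptions for illustration; a real implementation averages this over batches of tokens and adds value-function and other terms.

```python
import math

def ppo_clipped_objective(new_logp, old_logp, advantage, clip_eps=0.2):
    """Clipped surrogate objective for one action (to be maximized)."""
    ratio = math.exp(new_logp - old_logp)  # pi_new / pi_old
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum removes any extra gain from pushing the ratio
    # beyond the clip range.
    return min(ratio * advantage, clipped_ratio * advantage)

# Invented example: a response with positive advantage (better than expected).
small_step = ppo_clipped_objective(math.log(1.1), 0.0, advantage=2.0)
big_step = ppo_clipped_objective(math.log(3.0), 0.0, advantage=2.0)
print(f"ratio 1.1 -> objective {small_step:.2f}")
print(f"ratio 3.0 -> objective {big_step:.2f}")
```

Making the good response 3x more likely earns no more than making it 1.2x more likely, so the optimizer has no reason to take the large, risky step.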

05 Beyond the textbook recipe

The SFT → reward model → PPO pipeline is the historical reference, but production systems mix and match. DPO drops the reward model and RL loop, optimizing preferences directly. RLAIF / Constitutional AI (Bai et al., 2022, Anthropic) substitutes AI-generated feedback for some human harm labels: a model critiques and revises its own responses against a written set of principles (a "constitution"). Lighter-weight options like reward-ranked / best-of-n fine-tuning (RAFT) and ranking-loss methods (RRHF) avoid a full online RL loop while still using preference signals. The constant across all of them is the core idea: human (or human-derived) preferences shape the model's behavior.

DPO — preference pairs, no reward model, no RL loop; a simpler, increasingly common path.
RLAIF / Constitutional AI — AI feedback against written principles replaces some human labels, scaling the feedback step.
RAFT / RRHF — reward-ranked or ranking-loss alignment that sidesteps full online RL; treat all of these as active alternatives, not a single settled "best" method.
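
The reward-ranked idea is the easiest of these to sketch: sample several responses, keep the one the reward model likes best, and fine-tune on the winners. The code below shows only that selection step and is not the RAFT authors' implementation. `toy_reward` is a hypothetical stand-in for a trained reward model (it just counts prompt words echoed in the response), and the prompt and candidates are invented.

```python
def best_of_n(prompt, candidates, reward_fn):
    """Return the highest-reward candidate for a prompt."""
    return max(candidates, key=lambda response: reward_fn(prompt, response))

def build_finetuning_set(samples, reward_fn):
    """Keep only the top-ranked response per prompt as fine-tuning data."""
    return [(prompt, best_of_n(prompt, candidates, reward_fn))
            for prompt, candidates in samples.items()]

def toy_reward(prompt, response):
    """Hypothetical reward: number of response words that appear in the prompt."""
    prompt_words = set(prompt.lower().split())
    return sum(1 for word in response.lower().split() if word in prompt_words)

# Invented samples: one prompt with three candidate responses.
samples = {
    "how do plants make food": [
        "plants make food through photosynthesis",
        "i like turtles",
        "food is great",
    ],
}
finetuning_set = build_finetuning_set(samples, toy_reward)
print(finetuning_set)
```

The selected pairs can then be used as ordinary supervised fine-tuning data, which is how this family of methods uses a preference signal without running an online RL loop.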

Concept map

The whole lesson at a glance: each branch lists the key ideas that sit under it.

Why next-word prediction isn’t enough

Pretraining only teaches a model to predict the next token — fluent, but not the same as helpful, honest, or intent-following.
RLHF re-aims the model from “what text usually comes next” toward “what a person would actually prefer.”
InstructGPT showed a 1.3B RLHF-tuned model could be preferred over the 175B GPT-3 on the labeler prompt distribution.
Those preference wins are tied to specific prompts and labeler pools — alignment evidence, not a universal accuracy guarantee.

The three-stage pipeline

Stage 1 — supervised fine-tuning (SFT) on demonstration data.
Stage 2 — train a reward model from human comparisons between candidate responses.
Stage 3 — policy optimization (usually PPO) toward higher reward, with a KL penalty keeping it near a frozen reference.
DPO is a simpler alternative that skips the separate reward model and online RL loop, training directly on preference pairs.

The reward model

Humans give comparisons (“A is better than B”), not 1–10 scores — far more reliable and consistent.
The model learns to output a scalar reward for any response and generalizes the labelers’ taste to brand-new outputs.
The idea predates LLMs: Christiano et al. (2017) learned reward models from human preferences over trajectory pairs.
It is imperfect — it can be gamed (reward hacking) and inherits the values and biases of its labeler pool.

Optimization, PPO & the KL leash

PPO (Schulman et al., 2017) makes cautious, clipped policy updates rather than large risky ones.
A KL-divergence penalty against a frozen reference (Ziegler et al., 2019) keeps the policy from drifting too far.
Without the leash, the policy can reward-hack into degenerate high-scoring text — reward over-optimization.
RLHF is powerful but delicate: training can be unstable and an “alignment tax” can regress some tasks.

Beyond the textbook recipe

DPO — optimizes preference pairs directly, no reward model and no RL loop.
RLAIF / Constitutional AI (Bai et al., 2022, Anthropic) — AI feedback against written principles replaces some human harm labels.
RAFT and RRHF — reward-ranked or ranking-loss methods that sidestep a full online RL loop.
The constant across all of them: human (or human-derived) preferences shape the model’s behavior.

Sources & further reading

Published by Tech Jacks Solutions · Reviewed June 2026. This lesson explains established techniques and is grounded in the primary references below; the pipeline interactive uses invented, illustrative reward and probability values to make the mechanics visible — they are labelled as such and are not the output of any specific model.

Training language models to follow instructions with human feedback (InstructGPT) — Ouyang et al. (OpenAI)
Deep Reinforcement Learning from Human Preferences — Christiano et al.
Proximal Policy Optimization Algorithms — Schulman et al. (OpenAI)
Fine-Tuning Language Models from Human Preferences — Ziegler et al. (OpenAI)
Learning to summarize from human feedback — Stiennon et al. (OpenAI)
Direct Preference Optimization (DPO) — Rafailov et al. (Stanford)
Constitutional AI: Harmlessness from AI Feedback — Bai et al. (Anthropic)
Illustrating Reinforcement Learning from Human Feedback (RLHF) — Lambert et al. (Hugging Face)

Responsible use & transparency

This is an educational explainer, not professional advice. The pipeline visualization uses invented, illustrative reward scores and output probabilities to make the mechanics visible — they are not measurements from any specific model, and exact algorithms, hyperparameters, and recipes differ by provider and change over time. Verify against the current primary literature and official documentation before relying on any detail.

RLHF aligns a model with the preferences of a specific group of human raters; it does not make a model objectively correct, and it can inherit those raters' biases or be undermined by reward hacking. AI systems can produce plausible-sounding but incorrect output. For decisions in medical, legal, financial, or other high-stakes domains, always consult a qualified professional and verify AI output against authoritative sources. See the NIST AI Risk Management Framework for responsible-AI guidance.

RLHF — in one page

Tech Jacks Solutions · AI Knowledge Hub · educational summary

What it is

RLHF (Reinforcement Learning from Human Feedback) aligns a model with human preferences: human judgments are turned into a reward signal, and the model is optimized to earn more of that reward. Goal: more helpful, honest, and harmless output that follows user intent than pretraining alone.

The three stages

1. Supervised fine-tuning (SFT): fine-tune the base model on example demonstrations.
2. Reward model: humans compare candidate responses; a reward model learns to predict the preferred one and output a scalar reward.
3. Policy optimization (PPO): reinforcement learning nudges the model toward higher reward, with a KL penalty against a frozen reference model to prevent drift and reward hacking.

The reward model

Humans give comparisons ("A is better than B"), not absolute scores — more reliable. The reward model generalizes that taste to new responses. It is imperfect: it can be gamed and reflects its labelers' biases. The idea predates LLMs (Christiano et al., 2017).

Alternatives

DPO drops the reward model and RL loop, optimizing preference pairs directly. RLAIF / Constitutional AI uses AI feedback against written principles. RAFT / RRHF use reward-ranked or ranking-loss alignment without a full online RL loop. The SFT → reward model → PPO recipe is the historical reference; production systems mix and match.

Caveats

RLHF is powerful but complex and sometimes unstable, susceptible to reward hacking and to an "alignment tax" on some tasks. It aligns to specific raters, not an objective standard.
