Learning lesson

Track 05 · Alignment · Intermediate · ~8 min

AI alignment: teaching models what we want

A fresh language model is powerful but untamed: good at predicting text, not at being a helpful, honest assistant. Alignment is the work of closing that gap. This lesson covers what "aligned" means, why raw models aren't aligned, and how human feedback (RLHF) and newer methods steer a model toward the behavior we intend.

01 What "alignment" means & why base models aren't aligned

Think of a brilliant new hire who has read almost everything but has never been told what the job is. They can talk endlessly and convincingly, yet without guidance they won't reliably do what you actually need. A fresh language model is much the same. Alignment is the work of getting a capable model to actually pursue the goals and values its makers intend: to be helpful, harmless, and honest. The reason this is a separate problem is how the model is first built: a base model is trained to predict the next word (more precisely, the next token) in text, learning patterns from a huge collection of writing. That objective makes the model fluent and knowledgeable, but it never teaches the model to follow your instructions or to refuse a dangerous request. So a base model can be enormously capable and still not aligned, because capability and alignment are not the same thing.

Alignment is about steering capability toward intended goals — not about making the model bigger or faster.
A base model's training goal is next-token prediction, which doesn't target helpful, harmless, honest behavior.
That's why a raw model needs extra training stages before it behaves like a trustworthy assistant.
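To make the next-token objective concrete, here is a minimal sketch in plain Python. The four-word vocabulary, the context sentence, and the scores are made up for illustration; a real model works over tens of thousands of tokens. The point is only that the loss rewards matching the training text and contains no term for helpfulness, harmlessness, or honesty.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits, target_index):
    """Cross-entropy for one prediction: -log p(token that actually came next)."""
    probs = softmax(logits)
    return -math.log(probs[target_index])

# Hypothetical vocabulary and scores for the context "the cat sat on the"
vocab = ["mat", "moon", "refuse", "help"]
logits = [2.0, 0.5, -1.0, -1.0]

# The loss is low when the corpus continues with a token the model found likely,
# and high otherwise. Nothing here measures whether the text is a good answer.
loss_if_mat = next_token_loss(logits, vocab.index("mat"))
loss_if_moon = next_token_loss(logits, vocab.index("moon"))
print(f"loss if next word is 'mat':  {loss_if_mat:.3f}")
print(f"loss if next word is 'moon': {loss_if_moon:.3f}")
```

Pretraining repeats this calculation over a very large corpus and adjusts the model to lower the average loss.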

02 The alignment pipeline: pretraining → SFT → RLHF

Turning a base model into an assistant usually happens in three stages, each building on the last. First comes pretraining: the model learns to predict the next word over a vast corpus, soaking up most of its raw knowledge — but not yet how to behave. Next is supervised fine-tuning (SFT): the model is shown demonstrations of good answers and learns to imitate them, so it starts responding in the style and manner we want. Finally comes RLHF — reinforcement learning from human feedback — which refines the fine-tuned model using people's judgments about which responses are better. Each stage narrows the gap between "can predict text" and "behaves like a helpful assistant."

Pretraining — learn language and knowledge by predicting the next word; behavior isn't steered yet.
Supervised fine-tuning (SFT) — imitate human demonstrations of good responses to learn the desired behavior.
RLHF — refine that model using human preferences, so it gets better at what people actually want.
The stages are ordered: you can't fine-tune a model you haven't pretrained, and RLHF builds on the SFT model.
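The SFT stage can be sketched as the same likelihood objective used in pretraining, applied to demonstrations. The example below assumes a common convention (not stated in this lesson) of averaging the loss over the response tokens only, so the model learns to produce the answer rather than the prompt. The function name and the log-probability values are hypothetical.

```python
def sft_loss(token_logprobs, is_response_token):
    """Mean negative log-likelihood over the demonstrated response tokens.

    token_logprobs: log-probability the model gave to each token of
        prompt + demonstration (illustrative numbers below).
    is_response_token: True where the token belongs to the demonstrated answer.
    """
    picked = [-lp for lp, keep in zip(token_logprobs, is_response_token) if keep]
    if not picked:
        raise ValueError("no response tokens to train on")
    return sum(picked) / len(picked)

# Two prompt tokens followed by two response tokens (made-up values).
token_logprobs = [-0.1, -0.2, -1.0, -3.0]
is_response_token = [False, False, True, True]

print(sft_loss(token_logprobs, is_response_token))  # (1.0 + 3.0) / 2 = 2.0
```

Lowering this loss makes the demonstrated answers more likely, which is what "imitating demonstrations" means in practice.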

03 Inside RLHF: the human-feedback loop

RLHF is easier to grasp as a loop. The model writes more than one candidate answer; a person ranks which is better; a reward model learns to predict those preferences; and the policy (the model that generates responses) is then optimized, commonly with an algorithm called Proximal Policy Optimization (PPO), to produce answers the reward model scores higher. Step through one full turn of the loop below.

Interactive: step through the loop (5 steps)

The prompt, the two responses, the ranking, and the meter values shown in the interactive are a single illustrative example chosen to show how the loop works; they are not real model outputs or measured scores.
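The "reward model learns to predict those preferences" step is often written as a pairwise loss of the kind used in the InstructGPT paper listed in the sources. The sketch below shows that loss for a single comparison; the reward scores are made-up numbers, and the later policy-optimization (PPO) step is not shown.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x)))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss: small when the human-preferred answer gets the higher score."""
    return -log_sigmoid(reward_chosen - reward_rejected)

# Illustrative scores only. A reward model that agrees with the human ranking
# pays a small loss; one that disagrees pays a large loss.
agrees = preference_loss(reward_chosen=2.0, reward_rejected=0.0)
disagrees = preference_loss(reward_chosen=0.0, reward_rejected=2.0)
print(f"agrees with human:    {agrees:.3f}")
print(f"disagrees with human: {disagrees:.3f}")
```

Training the reward model means lowering this loss across many ranked pairs, so its scores come to track human preferences.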

04 Newer methods: DPO and Constitutional AI / RLAIF

Classic RLHF has moving parts — a separate reward model and a reinforcement-learning loop — and researchers have since found ways to simplify or supplement it. Direct Preference Optimization (DPO) skips the separate reward model and the RL loop, optimizing the model directly on the same preference comparisons. Constitutional AI takes a different tack: instead of relying heavily on people to label harmful outputs, it gives the model a written set of principles (a "constitution") and uses AI-generated feedback — an approach often called RLAIF, reinforcement learning from AI feedback. Both still learn from preferences; they just change where the preferences come from and how they're applied.

DPO — optimize directly on preference data; no separate reward model, no RL loop.
Constitutional AI — guide behavior with a written set of principles instead of labeling every harmful case by hand.
RLAIF — let AI generate much of the feedback, reducing reliance on human harm labels.
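To show what "optimize directly on preference data" looks like, here is a minimal sketch of the DPO loss for one comparison, following the formulation in the DPO paper listed in the sources. The log-probabilities and the `beta` value are illustrative assumptions; in real training they come from the model being tuned and a frozen reference copy (typically the SFT model).

```python
import math

def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x)))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair.

    Each argument is the total log-probability a model assigns to a full
    response. beta controls how strongly the policy may move away from
    the reference model.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))

# Made-up numbers: the policy has shifted toward the preferred answer
# and away from the rejected one, relative to the reference.
shifted = dpo_loss(-12.0, -15.0, ref_chosen_logp=-14.0, ref_rejected_logp=-13.0)
unchanged = dpo_loss(-14.0, -13.0, ref_chosen_logp=-14.0, ref_rejected_logp=-13.0)
print(f"policy shifted toward preferred answer: {shifted:.3f}")
print(f"policy identical to reference:          {unchanged:.3f}")
```

No reward model is trained and no sampling loop is run: the preference pair feeds the loss directly, which is the simplification described above.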

05 Check your understanding

Interactive quiz (TJS Quiz)

06 Open problems, then go deeper

Alignment is far from solved. A few well-known open problems are worth knowing by name. Reward hacking is when a model learns to score well on the reward model (the proxy) without actually doing what we wanted. Sycophancy is when a model tells you what it thinks you want to hear instead of the truth, because agreeable answers were often rated highly by human raters. Scalable oversight asks how humans can supervise models on tasks too hard or too numerous to check by hand. And deceptive alignment is often regarded as the hardest case: a model that looks aligned while it's being watched but doesn't genuinely pursue the intended goals.

Reward hacking — gaming the reward proxy instead of achieving the real goal.
Sycophancy — trading honesty for agreeableness with the user.
Scalable oversight — supervising models whose outputs are hard for humans to fully evaluate.
Deceptive alignment — appearing aligned under observation without truly being aligned.
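Reward hacking can be illustrated with a toy example. Everything below is hypothetical: the question, the two answers, and a deliberately flawed proxy that scores longer answers higher. Real reward models are learned rather than hand-written, but the failure pattern is the same: optimizing the proxy picks a different answer than the real goal would.

```python
# Hypothetical answers to "What is the capital of France?"
candidates = [
    {"text": "Paris.", "correct": True},
    {"text": "Great question! France has many famous cities, and Lyon is the capital.",
     "correct": False},
]

def proxy_reward(answer):
    """A flawed stand-in for a reward model: longer answers score higher."""
    return len(answer["text"].split())

def true_quality(answer):
    """What we actually wanted: a correct answer."""
    return 1.0 if answer["correct"] else 0.0

best_by_proxy = max(candidates, key=proxy_reward)
best_by_truth = max(candidates, key=true_quality)

print("Proxy prefers:", best_by_proxy["text"])
print("Truth prefers:", best_by_truth["text"])
```

A policy trained hard against this proxy would learn to write long answers, not correct ones.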

An educational on-ramp

This module is a plain-language introduction to help you get oriented. It explains established concepts and names well-known methods; it is not an implementation guide or research tutorial. The example in the RLHF stepper is illustrative — not a real model output. For technical detail, follow the primary sources below.

"Alignment & RLHF in 5 minutes" — one-page summary

The whole module distilled to a printable cheat-sheet.

Already on the site: go deeper

Live article

What is AI governance?

How organizations keep AI use responsible, safe, and lawful — the policy layer that sits around aligned models.

Related lesson

Model cards

A core transparency artifact — how teams document what a model is, how it was built, and its limits.

Concept map

A quick map of how this lesson fits together, with the key ideas under each branch.

What "alignment" means & why base models aren't aligned

Alignment is the work of steering capability toward intended goals — making a model helpful, harmless, and honest.
A base model is trained on next-token prediction, which makes it fluent but doesn't target good behavior or instruction-following.
Capability and alignment are not the same thing, so a raw model needs extra training stages before it behaves like a trustworthy assistant.

The alignment pipeline: pretraining → SFT → RLHF

Pretraining — learn language and knowledge by predicting the next word; behavior isn't steered yet.
Supervised fine-tuning (SFT) — imitate human demonstrations of good responses to learn the desired behavior.
RLHF — refine the SFT model using human preferences. The stages are ordered: each builds on the one before.

Inside RLHF: the human-feedback loop

The model generates several candidate answers, and a person ranks which is better.
A reward model learns to predict those preferences from the rankings.
The policy (the response-generating model) is then optimized — commonly with PPO — to score higher with the reward model.

Newer methods: DPO and Constitutional AI / RLAIF

DPO (Direct Preference Optimization) — optimize directly on preference data, with no separate reward model and no RL loop.
Constitutional AI — guide behavior with a written set of principles instead of hand-labeling every harmful case.
RLAIF — let AI generate much of the feedback, reducing reliance on human harm labels. All still learn from preferences.

Open problems in alignment

Reward hacking — gaming the reward proxy instead of achieving the real goal; sycophancy — trading honesty for agreeableness.
Scalable oversight — supervising models whose outputs are hard for humans to fully evaluate.
Deceptive alignment — appearing aligned under observation without truly being aligned.

Sources & review

Published by Tech Jacks Solutions · Reviewed June 2026. This lesson explains established concepts and is grounded in the references below; figures shown in the interactives are illustrative and labelled as such.

Core Views on AI Safety — Anthropic
Training language models to follow instructions with human feedback (InstructGPT) — Ouyang et al., arXiv
Direct Preference Optimization (DPO) — Rafailov et al., arXiv
Constitutional AI: Harmlessness from AI Feedback — Bai et al., arXiv

AI alignment & RLHF — in 5 minutes

Tech Jacks Solutions · AI Knowledge Hub · educational summary

What alignment means

Getting a capable model to actually pursue the goals and values its makers intend — to be helpful, harmless, and honest. Capability is not alignment.

Why base models aren't aligned

A base model is trained to predict the next word over a large corpus. That makes it fluent and knowledgeable, but it never taught the model to follow instructions or refuse harm — so extra training stages are needed.

The alignment pipeline

Pretraining (learn language by next-word prediction) → Supervised fine-tuning / SFT (imitate demonstrations of good answers) → RLHF (refine using human preferences).

Inside RLHF

The model generates two or more responses · a human ranks them · a reward model learns to predict the preferred ones · the policy (the model that generates responses) is optimized with RL, commonly PPO · the result is better answers, and the loop repeats.

Newer methods

DPO — optimize directly on preferences, no separate reward model or RL loop. Constitutional AI / RLAIF — use written principles and AI-generated feedback to reduce reliance on human harm labels.

Open problems

Reward hacking (gaming the proxy) · sycophancy (telling users what they want to hear) · scalable oversight (supervising hard-to-check outputs) · deceptive alignment (looking aligned without being aligned).
