Agentic lesson

Track 03 · Agentic Intermediate ~8 min

Teaching models with data they made themselves

When real data is scarce, private, or expensive, teams increasingly train models on data that other models generated. It's how small open models learned to follow instructions on a shoestring budget — and it carries a quiet risk: feed a model too much of its own output and quality can spiral downward. Learn how synthetic data is made, why it works, and where it bites back.

01 What "synthetic data" actually means

Synthetic data is artificially generated data that imitates the statistical patterns and structure of real-world data, but is produced by a model or algorithm rather than collected from real records. Instead of scraping the web or paying people to write examples, you have a capable model manufacture the training examples — questions and answers, labelled images, tabular records — and then train another model on them.

Why bother? Three motivations show up again and again. Scarcity: there may not be enough real examples of the task you care about. Privacy: generating statistically similar data with no one-to-one link to real individuals lets teams work with healthcare or financial patterns without exposing real people — often paired with differential-privacy guarantees. Cost and speed: a model can produce thousands of examples in the time it takes to commission a handful from human annotators.

Made, not collected: the data comes from a generator that mimics real data's structure, not from real observations.
A means, not an end: synthetic data exists to train or evaluate another model — quality of the downstream model is the real test.
Privacy is a headline use case: statistically similar records with no direct mapping to real people (per Gretel, MOSTLY AI, and AWS guidance).

02 The breakthrough: bootstrapping instructions from a model

The idea that made synthetic data famous for language models is Self-Instruct (Wang et al., 2022). You start with a small seed set — the paper used 175 hand-written tasks — and ask a base language model to generate new instructions, plus inputs and outputs for them. Invalid and near-duplicate samples are filtered out, and the surviving examples are used to fine-tune the same model. Reported result: vanilla GPT-3 improved by roughly 33% absolute on the Super-NaturalInstructions benchmark — a large jump from data the model essentially wrote for itself.

Stanford's Alpaca turned this into a recipe heard around the world: it generated about 52,000 Self-Instruct-style examples from a stronger teacher model (text-davinci-003) and used them to fine-tune LLaMA 7B, at a total cost the authors estimated in 2023 at under $600. Vicuna took a sibling approach: it fine-tuned on about 70K user-shared ChatGPT conversations and reported, with GPT-4 as the judge, roughly 90% of ChatGPT's quality for about $300 in training cost. Both are examples of distillation: transferring a bigger "teacher" model's behaviour into a smaller "student."

Self-Instruct: a model writes, filters, and learns from its own instruction data, starting from a tiny seed pool.
Distillation: rooted in Hinton, Vinyals & Dean (2015) — train a student on a teacher's "soft" outputs; the modern version uses a teacher's generated text.
Caveat: the 2023 cost and quality figures are point-in-time estimates. They cover fine-tuning (and, for Alpaca, the teacher API calls used to generate data), not the cost of pretraining the base model or building the teacher.
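
To make the "soft outputs" idea concrete, here is a minimal sketch of a soft-target loss in the spirit of Hinton, Vinyals & Dean: the teacher's and the student's logits are both softened with a temperature, and the student is scored by the cross-entropy between the two distributions. The logits, the temperature of 2.0, and the function names are illustrative assumptions, not values from any system named above. Text-based distillation of the Alpaca kind skips this step and simply fine-tunes on the teacher's generated text.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; a higher temperature flattens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_target_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy between the softened teacher and student distributions."""
    p = softmax(teacher_logits, temperature)   # teacher's soft targets
    q = softmax(student_logits, temperature)   # student's softened prediction
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

# Made-up logits over three classes (assumption, for illustration only).
teacher = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]   # roughly agrees with the teacher
far_student = [0.5, 1.0, 4.0]     # prefers a different class

loss_close = soft_target_loss(teacher, close_student)
loss_far = soft_target_loss(teacher, far_student)
print(f"close student: {loss_close:.3f}, far student: {loss_far:.3f}")
```

The student that ranks the classes the way the teacher does gets the lower loss, which is the signal a training loop would minimise.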

03 Making the data better: evolve, explain, then filter

Raw generated examples are a starting point, not a finished dataset. Several techniques push quality higher. Evol-Instruct (WizardLM) iteratively rewrites seed instructions into harder variants (in-depth) and more varied ones (in-breadth), synthesizing a complexity-graded set rather than a flat one. Orca uses "explanation tuning" — training a smaller model on detailed step-by-step reasoning traces from a teacher, so it imitates the reasoning process, not just the final answer. And Microsoft's phi models showed that curated, "textbook-quality" synthetic data can substitute for raw scale: phi-1 (1.3B parameters) reached 50.6% pass@1 on HumanEval trained partly on generated textbooks and exercises.

The other half of modern pipelines is filtering. A common pattern is generate-then-rank: an instruct model produces candidates, and a reward model scores them so weak samples are dropped. NVIDIA's Nemotron-4 340B pipeline pairs an Instruct generator with a Reward model that grades on helpfulness, correctness, coherence, complexity, and verbosity; NVIDIA reported that over 98% of the model family's alignment data was synthetically generated. Good synthetic data is as much about what you throw away as what you generate.

Evolve: rewrite instructions into harder and broader variants for a complexity-graded set (Evol-Instruct).
Explain: teach the reasoning trace, not just the answer (Orca's explanation tuning).
Filter: generate many, then rank/keep with a reward model — quality over raw quantity.
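
The generate-then-rank pattern can be sketched in a few lines. This is a toy under stated assumptions: in a real pipeline the per-attribute scores come from a reward model, whereas here they are hard-coded; the five attribute names follow the Nemotron-4 description above, but the score scale, the weights, the threshold, and the names `Candidate`, `overall`, and `keep_best` are hypothetical.

```python
from dataclasses import dataclass

ATTRIBUTES = ("helpfulness", "correctness", "coherence", "complexity", "verbosity")

@dataclass
class Candidate:
    prompt: str
    response: str
    scores: dict  # attribute -> score; hard-coded here, from a reward model in practice

def overall(candidate, weights):
    """Collapse the per-attribute scores into one weighted number."""
    return sum(weights[a] * candidate.scores[a] for a in ATTRIBUTES)

def keep_best(candidates, weights, per_prompt=1, min_score=0.0):
    """Rank candidates within each prompt, keep the top few that clear the threshold."""
    by_prompt = {}
    for c in candidates:
        by_prompt.setdefault(c.prompt, []).append(c)
    kept = []
    for group in by_prompt.values():
        ranked = sorted(group, key=lambda c: overall(c, weights), reverse=True)
        kept.extend(c for c in ranked[:per_prompt] if overall(c, weights) >= min_score)
    return kept

# Assumed weights: reward helpful, correct, coherent answers; ignore the other two.
weights = {"helpfulness": 0.4, "correctness": 0.4, "coherence": 0.2,
           "complexity": 0.0, "verbosity": 0.0}

candidates = [
    Candidate("Explain recursion.",
              "A function that calls itself on a smaller input until it hits a base case.",
              {"helpfulness": 4, "correctness": 4, "coherence": 4, "complexity": 2, "verbosity": 1}),
    Candidate("Explain recursion.", "Recursion is recursion.",
              {"helpfulness": 1, "correctness": 2, "coherence": 3, "complexity": 0, "verbosity": 0}),
    Candidate("Define overfitting.", "It is when stuff is bad.",
              {"helpfulness": 1, "correctness": 1, "coherence": 2, "complexity": 0, "verbosity": 0}),
]

kept = keep_best(candidates, weights, per_prompt=1, min_score=2.0)
for c in kept:
    print(c.prompt, "->", c.response)
```

Note that the second prompt ends up with no surviving example: a threshold drops a prompt entirely when even its best candidate is weak, which is the "what you throw away" half of the pipeline.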

04 See it work: generate, then watch it collapse

This interactive has two parts. First, a Self-Instruct-style generator: it expands a few seed tasks into many synthetic examples and filters out the duplicates, the bootstrapping loop in miniature. Second, a model-collapse demonstrator: train on your own outputs over successive generations and watch the spread of the data narrow and quality decay (the "curse of recursion"). Flip the "Mix in real data" switch to see how keeping real examples in the loop slows the decline. It is a preventative measure: it anchors future generations but won't restore diversity already lost. The numbers and shapes are illustrative; they show the mechanism, not measured results from any specific model.

A toy of Self-Instruct: real pipelines start from a small seed set (the paper used 175 tasks), prompt a model to write more, then drop near-duplicates and invalid samples before training. Counts here are illustrative.
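
The same loop can be sketched in code. This is a toy, not the paper's pipeline: `fake_generator` is a hypothetical stand-in for the language-model call, and the near-duplicate check uses Python's `difflib.SequenceMatcher` with an assumed 0.7 threshold where Self-Instruct itself compares instructions with ROUGE-L. The seed tasks are made up.

```python
import random
from difflib import SequenceMatcher

def too_similar(candidate, pool, threshold=0.7):
    """True if the candidate is a near-duplicate of anything already in the pool."""
    return any(
        SequenceMatcher(None, candidate.lower(), kept.lower()).ratio() >= threshold
        for kept in pool
    )

def fake_generator(examples, rng):
    """Hypothetical stand-in for an LLM call.

    A real pipeline would place some of `examples` in the prompt as demonstrations;
    this stub ignores them and just recombines a small vocabulary.
    """
    verbs = ["Summarize", "Translate", "Classify", "Rewrite"]
    objects = ["this email", "this product review", "this news headline"]
    return f"{rng.choice(verbs)} {rng.choice(objects)}."

def bootstrap(seeds, rounds, per_round, seed=0):
    """Grow the task pool round by round, filtering empty and near-duplicate candidates."""
    rng = random.Random(seed)
    pool = list(seeds)
    filtered = 0
    for _ in range(rounds):
        for _ in range(per_round):
            candidate = fake_generator(pool, rng)
            if not candidate.strip() or too_similar(candidate, pool):
                filtered += 1
            else:
                pool.append(candidate)
    return pool, filtered

seeds = [
    "Write a haiku about autumn.",
    "Convert Celsius to Fahrenheit.",
    "List three uses of a paperclip.",
]
pool, filtered = bootstrap(seeds, rounds=3, per_round=8)
print(f"kept {len(pool) - len(seeds)} new tasks, filtered {filtered}")
```

Because the stub can only produce a handful of distinct instructions, most candidates are rejected after the first round, which is a small-scale picture of why real pipelines need a capable, varied generator.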

Illustrates model collapse (Shumailov et al., 2023; Nature 2024): recursively training on generated data can irreversibly lose the distribution's tails. Mixing in real data is preventative, not a cure — toggle it on and it slows the decline from the next generation forward, but the diversity already lost in earlier generations stays lost. Train a few generations with it off, then switch it on, to see the curve hold where it is rather than recover.
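
A minimal simulation shows the narrowing itself. Each generation fits a Gaussian to the current data, then builds the next dataset by sampling from that fit, optionally topped up with fresh real examples. Everything here is an assumption chosen for illustration (the Gaussian "model", 50 points per generation, 1,000 generations, a 50% real share, the name `run_generations`), and the toy only demonstrates the loss of spread; it does not model the irreversibility discussed above.

```python
import random
import statistics

def run_generations(real_pool, generations, sample_size, real_fraction, seed=0):
    """Return the spread (standard deviation) of the training data at each generation."""
    rng = random.Random(seed)
    n_real = int(sample_size * real_fraction)
    data = rng.sample(real_pool, sample_size)        # generation 0: real data only
    spreads = [statistics.pstdev(data)]
    for _ in range(generations):
        mu = statistics.fmean(data)                  # "train": fit a Gaussian to the data
        sigma = statistics.pstdev(data)
        synthetic = [rng.gauss(mu, sigma) for _ in range(sample_size - n_real)]
        data = synthetic + rng.sample(real_pool, n_real)   # next generation's training set
        spreads.append(statistics.pstdev(data))
    return spreads

# A fixed pool of "real" observations drawn from a standard normal distribution.
pool_rng = random.Random(42)
real_pool = [pool_rng.gauss(0.0, 1.0) for _ in range(1000)]

recursive = run_generations(real_pool, generations=1000, sample_size=50, real_fraction=0.0)
mixed = run_generations(real_pool, generations=1000, sample_size=50, real_fraction=0.5)

for g in (0, 10, 100, 1000):
    print(f"generation {g:>4}: recursive spread {recursive[g]:.4f}, mixed spread {mixed[g]:.4f}")
```

With no real data, each fit slightly underestimates the spread on average and the errors compound, so the standard deviation drifts toward zero; with real examples in every generation the spread stays anchored near the original.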

05 Using synthetic data responsibly

Synthetic data is powerful, but it comes with strings attached. Model collapse is a real risk, but how likely it is in practice is still debated. The strongest collapse results assume fully recursive training, where each generation's synthetic data replaces the real data. Later work argues that mixing or accumulating real data alongside synthetic data, combined with careful filtering, mitigates or avoids it. Treat collapse as a documented risk to manage, not as an inevitability: keep real data in the loop and watch your distributions.

There's also a legal and ethical dimension. Distilling from a proprietary teacher model, as Alpaca and Vicuna did with another provider's outputs, can violate that provider's terms of service, so it's not purely a technical recipe. And a survey of synthetic data for language models (Google DeepMind, 2024) stresses three criteria for responsible use: factuality (don't generate confident falsehoods), fidelity (match the real distribution you care about), and unbiasedness (don't amplify the generator's biases). Vendor claims about fidelity and privacy should be read as vendor-reported; lean on peer-reviewed work for general mechanism claims.

Keep real data in the loop: the simplest, best-supported guard against collapse.
Mind the terms of service: generating training data from a closed model may breach its usage rules.
Check factuality, fidelity, and bias: generated data can be confidently wrong, off-distribution, or skewed.
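
One simple way to put a number on fidelity is to compare how often each category appears in the real and the synthetic data. The sketch below uses total variation distance (0 means identical proportions, 1 means no overlap). It is one possible check among many, not a method prescribed by the sources above, and the support-ticket labels and function names are made up for illustration.

```python
from collections import Counter

def distribution(labels):
    """Map each label to its share of the dataset."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

def total_variation(real_labels, synthetic_labels):
    """Half the summed absolute difference between the two label distributions."""
    p = distribution(real_labels)
    q = distribution(synthetic_labels)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))

# Hypothetical ticket categories: the generator over-produces "billing"
# and never produces "returns" at all.
real = ["billing"] * 50 + ["shipping"] * 30 + ["returns"] * 20
synthetic = ["billing"] * 80 + ["shipping"] * 20

tv = total_variation(real, synthetic)
missing = set(real) - set(synthetic)
print(f"total variation distance: {tv:.2f}; categories missing from synthetic data: {missing}")
```

A category that is present in the real data but absent from the synthetic set is exactly the kind of lost tail the collapse discussion warns about, so it is worth reporting separately from the overall distance.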

07 Take it with you & go deeper

"Synthetic data generation" — one-page summary

The whole module distilled to a printable cheat-sheet.

▸ Already on the site — go deeper

Instruction tuning & alignment tuning (live lesson): where most synthetic instruction data is actually used, turning a base model into one that follows instructions.
Fine-tuning, explained (live lesson): the training step that consumes synthetic datasets, and where distillation from a teacher pays off.

▸ Related — keep building the picture

RLHF: reinforcement learning from human feedback (live lesson): the other major alignment data source, and the reward models that also filter synthetic data.
Evaluating synthetic datasets (coming soon): how to measure factuality, fidelity, and bias before you train on generated data.

Sources & further reading

Published by Tech Jacks Solutions · Reviewed June 2026. This lesson explains established methods and is grounded in the references below; the figures in the interactive are illustrative and labelled as such, and 2023-era cost/quality numbers are point-in-time estimates.

Self-Instruct: Aligning Language Models with Self-Generated Instructions. Wang et al. (arXiv 2212.10560)
Alpaca: A Strong, Replicable Instruction-Following Model. Stanford CRFM
Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality. LMSYS Org
WizardLM: Empowering Large Language Models to Follow Complex Instructions (introduces Evol-Instruct). Xu et al. (arXiv 2304.12244)
Orca: Progressive Learning from Complex Explanation Traces of GPT-4. Mukherjee et al. (arXiv 2306.02707)
Textbooks Are All You Need (phi-1). Gunasekar et al. (arXiv 2306.11644)
Distilling the Knowledge in a Neural Network. Hinton, Vinyals & Dean (arXiv 1503.02531)
The Curse of Recursion: Training on Generated Data Makes Models Forget. Shumailov et al. (arXiv 2305.17493)
AI models collapse when trained on recursively generated data. Shumailov et al., Nature 631 (2024)
Best Practices and Lessons Learned on Synthetic Data for Language Models. Google DeepMind (arXiv 2404.07503)
NVIDIA Open Synthetic Data Generation Pipeline (Nemotron-4). NVIDIA
Build an enterprise synthetic data strategy using Amazon Bedrock. AWS

Concept map

A bird's-eye view of synthetic data: the key ideas from this lesson, branch by branch.

What synthetic data is

Artificially generated data that mimics the statistical structure of real records, produced by a model or algorithm rather than collected.
A means, not an end: it exists to train or evaluate another model, so the downstream model's quality is the real test.
Three motivations recur: scarcity of real examples, privacy, and lower cost/faster production than human annotation.

Bootstrapping & distillation

Self-Instruct (Wang et al., 2022): a base LM generates instructions from a 175-task seed set, filters them, and fine-tunes itself — ~33% absolute gain on Super-NaturalInstructions.
Alpaca distilled ~52K examples from text-davinci-003 into LLaMA 7B for under $600 (2023 estimate).
Distillation traces back to Hinton, Vinyals & Dean (2015): train a student on a teacher's soft outputs.

Making the data better

Evolve: Evol-Instruct (WizardLM) rewrites seeds into harder and broader variants for a complexity-graded set.
Explain: Orca's explanation tuning trains on step-by-step reasoning traces, not just final answers.
Filter: generate-then-rank pipelines drop weak samples; NVIDIA's Nemotron-4 340B reported >98% synthetic alignment data.

Model collapse

The "curse of recursion" (Shumailov et al., 2023; Nature 2024): recursively training on generated data can irreversibly lose the distribution's tails.
The strongest results assume fully recursive training that replaces real data each generation.
Mixing or accumulating real data is preventative, not a cure — it slows decline going forward but won't restore lost diversity.

Using it responsibly

Keep real data in the loop — the simplest, best-supported guard against collapse.
Mind the terms of service: distilling from a proprietary teacher model can breach its usage rules.
Check factuality, fidelity, and unbiasedness (Google DeepMind survey, 2024); treat vendor fidelity/privacy claims as vendor-reported.

Responsible use & transparency

This is an educational explainer, not professional or legal advice. The interactive figures are simplified illustrations of how synthetic-data generation and model collapse behave — they are not measured results from any specific model or product. Specific cost, benchmark, and quality numbers (for example Alpaca's ~$600, Vicuna's "90% of ChatGPT," and phi-1's 50.6% HumanEval) are point-in-time 2023 figures measured with the methods of that period, including LLM-as-judge evaluation, which has known biases; treat them as historical anchors, not current results.

Generating training data from a closed, proprietary model may violate that provider's terms of service. Synthetic data can also be confidently incorrect, off-distribution, or biased — verify factuality, fidelity, and bias before training on it, and keep real data in the loop to guard against model collapse. For guidance on managing these risks, see the NIST AI Risk Management Framework.

Synthetic data generation — in 8 minutes

Tech Jacks Solutions · AI Knowledge Hub · educational summary

What it is

Artificially generated data that imitates real data's statistics and structure, made by a model/algorithm rather than collected. Used when real data is scarce, private, or expensive. Privacy is a headline use case: statistically similar records with no one-to-one link to real people.

Bootstrapping & distillation

Self-Instruct: a model generates instructions/inputs/outputs from a small seed set (175 tasks), filters them, and fine-tunes itself. Alpaca fine-tuned LLaMA 7B on ~52K such examples from a teacher model (~$600, 2023). Distillation transfers a teacher model's behaviour into a smaller student.

Making data better

Evol-Instruct rewrites instructions into harder/broader variants. Orca trains on reasoning traces, not just answers. phi showed curated "textbook-quality" synthetic data can substitute for scale. Then filter: generate-then-rank with a reward model keeps the strong samples.

Model collapse

Recursively training on generated data can irreversibly lose the distribution's tails and degrade quality (Shumailov et al., 2023; Nature 2024). Worst under fully recursive training; mixing in real data and filtering mitigate it.

Use it responsibly

Distilling from a proprietary teacher may breach its terms of service. Check factuality, fidelity, and bias; keep real data in the loop.
