Track 01 · Foundations · Novice · ~7 min

What an AI model is really made of: its training data

A model is only as good as what it learned from. Before it can answer anything, it studies a huge pile of examples, and the quality of those examples quietly decides almost everything else. This lesson covers what training data is, why "garbage in, garbage out" matters so much, and what happens when AI starts learning from data made by other AI.

01 What training data actually is

Think of a model like a student who has never seen the world: the only thing it ever learns from is the pile of examples you hand it. That pile is the training data. If you want a model to recognise cats, you show it many pictures of cats; if you want it to write, you show it lots of writing. The model studies those examples and slowly works out the patterns inside them. Nothing else teaches it. So the data isn't a side ingredient; it is the lesson.

There are two flavours of examples. Labelled data comes with the right answer already attached: a photo tagged "cat", an email marked "spam". Unlabelled data is just the raw examples with no answer attached — a pile of photos with nothing written on them. Labelled data is more expensive because a person usually has to add each label by hand, but it lets a model learn a clear input-to-answer mapping. The big idea to carry through this whole lesson: a model can only ever know what its training data showed it. Whatever is missing from the data tends to become a blind spot in the model.

Training data is the set of examples a model learns from — it's the model's entire education.
Labelled data has the correct answer attached to each example; unlabelled data does not.
A model's knowledge is bounded by its data — gaps in the data become blind spots in the model.
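
To make the two flavours concrete, here is a minimal Python sketch. The example emails and the `count_labels` helper are made up for illustration and do not come from any real dataset or library.

```python
from collections import Counter

# Labelled data: each example is paired with its correct answer.
# These records are invented for illustration.
labelled_emails = [
    ("Win a free prize now", "spam"),
    ("Meeting moved to 3pm", "not spam"),
    ("Claim your free reward", "spam"),
]

# Unlabelled data: the same kind of raw examples, with no answer attached.
unlabelled_emails = [
    "Lunch on Friday?",
    "Free gift card inside",
]

def count_labels(examples):
    """Count how often each label appears in a labelled dataset."""
    return Counter(label for _text, label in examples)

label_counts = count_labels(labelled_emails)
print(label_counts)
```

A label that never appears in the data, such as a hypothetical "newsletter" category, has a count of zero. A model trained on these examples has never seen that category, which is the blind-spot idea in miniature.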

02 Garbage in, garbage out

Here is the single most important idea about training data: "garbage in, garbage out." A model faithfully learns whatever is in its data, including the flaws. It will not notice that the data is biased, too small, or mislabelled and quietly fix it for you; it learns the problems right along with the patterns. Four things about the data shape what kind of model you end up with: its quality (how accurate and clean the examples are), its quantity (whether there are enough of them), its representativeness (whether it reflects the real situations the model will face), and the bias it carries (skews the model will learn and can even amplify).

Garbage in, garbage out: a model learns the flaws in its data and won't fix them on its own.
Quality, quantity, representativeness, and bias in the data each shape what the model becomes.
Bias usually enters through the data — and a model can amplify a skew it was trained on.
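
A toy example can show "garbage in, garbage out" in action. The sketch below is a deliberately simple word-counting classifier, not how real models work; the `train` and `predict` functions and both datasets are assumptions made up for this illustration. The only difference between the two runs is the quality of the labels.

```python
from collections import Counter, defaultdict

def train(examples):
    """Learn, for each word, how often it appeared under each label."""
    word_label_counts = defaultdict(Counter)
    for text, label in examples:
        for word in text.lower().split():
            word_label_counts[word][label] += 1
    return word_label_counts

def predict(model, text):
    """Add up the label counts of every known word and pick the top label."""
    votes = Counter()
    for word in text.lower().split():
        votes.update(model.get(word, Counter()))
    if not votes:
        return "unknown"  # nothing in the training data covers this input
    return votes.most_common(1)[0][0]

# Clean data: the labels are correct.
clean_data = [
    ("free prize inside", "spam"),
    ("claim your free prize", "spam"),
    ("meeting agenda attached", "not spam"),
    ("project meeting tomorrow", "not spam"),
]

# Flawed data: the same emails, but the spam was mislabelled.
mislabelled_data = [
    ("free prize inside", "not spam"),
    ("claim your free prize", "not spam"),
    ("meeting agenda attached", "not spam"),
    ("project meeting tomorrow", "not spam"),
]

clean_model = train(clean_data)
flawed_model = train(mislabelled_data)

print(predict(clean_model, "free prize today"))
print(predict(flawed_model, "free prize today"))
print(predict(clean_model, "quarterly invoice"))
```

The model trained on mislabelled data repeats the labelling mistake without complaint, and the last call shows a gap in the data: none of the words were ever seen, so the model has nothing to go on.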

03 Where the examples come from: collecting, labelling & cleaning

Good training data rarely arrives ready to use — it gets collected, labelled, and cleaned first. Collection means gathering the raw examples: photos, text, recordings, records. Labelling (also called annotation) is the work of attaching the correct answer to each example — marking a photo as "cat", or tagging which part of a sentence is a name. A lot of this is done by people, one example at a time, which is slow and careful work; the model is only as trustworthy as those human judgments. For chat-style models there's a special kind of label: in RLHF — reinforcement learning from human feedback — people compare two model responses and mark which is better, and those preference judgments become the signal that shapes the model toward what humans actually want.
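
As a rough picture of what a preference label looks like, here is a small Python sketch. The `PreferencePair` class and its field names are hypothetical; real pipelines store this information in many different ways, and the sketch shows only the shape of the label, not how it is later used in training.

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    """One human judgment: which of two responses to a prompt is better."""
    prompt: str
    response_a: str
    response_b: str
    preferred: str  # "a" or "b", chosen by a human annotator

    def __post_init__(self):
        if self.preferred not in ("a", "b"):
            raise ValueError("preferred must be 'a' or 'b'")

    def chosen(self) -> str:
        """The response the annotator marked as better."""
        return self.response_a if self.preferred == "a" else self.response_b

    def rejected(self) -> str:
        """The response the annotator marked as worse."""
        return self.response_b if self.preferred == "a" else self.response_a

# An invented example record.
pair = PreferencePair(
    prompt="Explain what training data is.",
    response_a="Training data is the set of examples a model learns from.",
    response_b="It is data.",
    preferred="a",
)
print(pair.chosen())
```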

Before any of that data is used, teams clean it: they remove errors and noise, fix or drop broken examples, and deduplicate, that is, strip out repeated copies. Deduplication matters more than it sounds: if the same example appears hundreds of times, the model treats it as far more common than it really is and over-learns it. Cleaning and deduplication usually make a dataset smaller but more accurate and less repetitive, which is what tends to make a better model.
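
Deduplication can be sketched in a few lines of Python. This version assumes that two examples count as copies when they match after lowercasing and collapsing whitespace; the `normalise` and `deduplicate` helpers are illustrative names, and catching near-duplicates (paraphrases, small edits) takes more than this.

```python
def normalise(text):
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(text.lower().split())

def deduplicate(examples):
    """Keep the first occurrence of each example, preserving order."""
    seen = set()
    unique = []
    for text in examples:
        key = normalise(text)
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique

# Invented raw data: three copies of one sentence and one different sentence.
raw = [
    "The cat sat on the mat.",
    "the cat  sat on the mat.",
    "The cat sat on the mat.",
    "A dog barked at the postman.",
]

cleaned = deduplicate(raw)
print(len(raw), "examples before,", len(cleaned), "after")
```

Before cleaning, the cat sentence makes up three quarters of the data; afterwards each sentence counts once, so neither is over-represented.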

Labelling (annotation) attaches the correct answer to each example — often careful human work.
RLHF labels are human judgments of which response is better, used to align chat models with human preferences.
Cleaning and deduplication remove errors and repeated copies so the model learns from accurate, varied examples.

04 Synthetic data: when AI makes its own fuel

Real-world data can be expensive, scarce, or sensitive — so teams increasingly use synthetic data: examples generated by AI rather than collected from the real world. It's appealing for a few clear reasons. Privacy — generated records can stand in for sensitive real data about people. Scarcity and edge cases — you can manufacture examples of rare situations that almost never show up in real data. And cost — generating examples can be cheaper than collecting and hand-labelling real ones.

But synthetic data carries real risks, and the headline one is model collapse. If each new model is trained largely on output from earlier models, which were themselves trained on AI-generated output, small errors compound with every generation. The model loses variety, forgets the rare cases, and slowly drifts away from reality, like a photocopy of a photocopy of a photocopy. A second risk is amplified bias: if the model generating the synthetic data is skewed, every example it produces inherits that skew and can concentrate it. Synthetic data is a powerful tool, but it works best alongside real data, not as a replacement for it.

Synthetic data is AI-generated training data, used for privacy, scarcity/edge cases, and cost.
Model collapse: training on too much AI output degrades a model — it loses variety and drifts from reality.
Synthetic data can amplify bias from the model that generated it; it's best used to complement real data.
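
The photocopy effect can be imitated with a toy simulation. This is not the method from the model-collapse research cited in the sources; it is a simplified sketch in which the "model" is just a frequency table, and all names (`train`, `generate`, `simulate`) and numbers are assumptions chosen for illustration. Each generation is trained only on the previous generation's output.

```python
import random
from collections import Counter

def train(dataset):
    """'Training' here just means measuring how often each item appears."""
    return Counter(dataset)

def generate(model, n, rng):
    """Sample n synthetic examples in proportion to the learned frequencies."""
    items = list(model.keys())
    weights = list(model.values())
    return rng.choices(items, weights=weights, k=n)

def simulate(real_data, generations, sample_size, seed=0):
    """Train each generation only on the previous generation's output.

    Returns the number of distinct kinds of example at each step.
    """
    rng = random.Random(seed)
    dataset = list(real_data)
    variety = [len(set(dataset))]
    for _ in range(generations):
        model = train(dataset)
        dataset = generate(model, sample_size, rng)
        variety.append(len(set(dataset)))
    return variety

# Invented "real" data: one common case, one uncommon case, one rare case.
real_data = ["common"] * 90 + ["uncommon"] * 9 + ["rare"] * 1

variety_per_generation = simulate(real_data, generations=20, sample_size=100)
print(variety_per_generation)
```

In this toy setup, variety can only stay the same or shrink: once a rare case fails to appear in one generation's output, no later generation can bring it back, because each one can only reproduce what the previous one contained. Mixing real data back in at every step is what breaks that cycle.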

05 Where the data comes from: consent, copyright & governance

One question runs underneath everything else: where did the data actually come from? Training data is often drawn from real people and real creators — their photos, their writing, their records — and that raises genuine questions of consent (was it okay to use information about these people?) and copyright (was it okay to use work someone else created?). These aren't abstract concerns; they're at the centre of ongoing debate about how AI systems are built.

Data governance is the umbrella term for handling all of this responsibly: knowing a dataset's provenance (where it came from), respecting consent and licensing, and being accountable for how data is sourced and used. A foundational practice is simply documenting a dataset — recording what it contains, how it was collected, and what it's meant for — so the people who build on it can use it knowingly. You don't need to be a lawyer to take the core lesson: a model inherits not just the patterns in its data, but the responsibilities that came with collecting it.

Consent and copyright matter because training data often comes from real people and creators with rights.
Data governance covers provenance, consent, licensing, and accountability for how data is used.
Documenting a dataset's origin and intended use is a foundation of responsible, transparent AI.
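
One lightweight way to picture dataset documentation is as a record with a few required fields. The sketch below is a hypothetical structure, not the Datasheets for Datasets template or any standard format; the `DatasetRecord` class, its field names, and the example values are all assumptions made for illustration.

```python
from dataclasses import dataclass, fields

@dataclass
class DatasetRecord:
    """A minimal, hypothetical documentation record for a dataset."""
    name: str
    contents: str = ""           # what the dataset contains
    collection_method: str = ""  # how it was collected
    provenance: str = ""         # where it came from
    licence: str = ""            # terms it can be used under
    consent_basis: str = ""      # why it was okay to use data about people
    intended_use: str = ""       # what it is meant for

def missing_fields(record):
    """Return the names of any documentation fields left empty."""
    return [
        f.name for f in fields(record)
        if not getattr(record, f.name).strip()
    ]

# An invented record with two gaps in its documentation.
record = DatasetRecord(
    name="example-support-emails",
    contents="Customer support emails with personal details removed",
    collection_method="Exported from a hypothetical internal helpdesk",
    provenance="",
    licence="Internal use only",
    consent_basis="",
    intended_use="Training a spam filter",
)

gaps = missing_fields(record)
print("Undocumented:", gaps)
```

A check like this does not settle any consent or copyright question by itself; it only makes the unanswered questions visible before anyone builds on the data.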

Sources & review

Published by Tech Jacks Solutions · Reviewed June 2026. This lesson explains established concepts and is grounded in the references below; figures shown in the interactives are illustrative and labelled as such.

Datasheets for Datasets — Gebru, Morgenstern, Vecchione et al.
Data-Centric AI — overview — Data-Centric AI community (MIT)
Data Cascades in High-Stakes AI — Sambasivan et al. (Google Research)
The Curse of Recursion: Training on Generated Data Makes Models Forget (model collapse) — Shumailov, Shumaylov, Zhao et al.

Training data & synthetic data — in 5 minutes

Tech Jacks Solutions · AI Knowledge Hub · educational summary

What training data is

The examples a model learns from — its entire education. Labelled data has the right answer attached to each example; unlabelled data does not. A model can only know what its data showed it; gaps become blind spots.

Garbage in, garbage out

A model learns the flaws in its data and won't fix them on its own. Four things shape the result: quality, quantity, representativeness, and bias. Unrepresentative data produces a biased model; a model can even amplify a skew it was trained on.

Collecting, labelling & cleaning

Labelling (annotation) attaches the correct answer to each example — often human work. RLHF labels are human judgments of which response is better. Cleaning and deduplication remove errors and repeated copies so the model learns from accurate, varied examples.

Synthetic data

Synthetic data is AI-generated training data, used for privacy, scarcity/edge cases, and cost. Its biggest risk is model collapse — training on too much AI output degrades a model, losing variety and drifting from reality. It can also amplify bias. Best used alongside real data.

Consent, copyright & governance

Training data often comes from real people and creators, raising consent and copyright questions. Data governance means knowing a dataset's provenance, respecting consent and licensing, and documenting how it was collected and what it's for.
