What an AI model is really made of: its training data
A model is only as good as what it learned from. Before it can answer anything, it studies a huge pile of examples, and the quality of those examples quietly decides almost everything that follows. This lesson covers what training data is, why "garbage in, garbage out" matters so much, and what happens when AI starts learning from data made by other AI.
01 What training data actually is
Think of a model like a student who has never seen the world — the only thing it ever learns from is the pile of examples you hand it. That pile is the training data. If you want a model to recognise cats, you show it many pictures; if you want it to write, you show it lots of writing. The model studies those examples and slowly works out the patterns inside them. Nothing else teaches it. So the data isn't a side ingredient — it is the lesson.
There are two flavours of examples. Labelled data comes with the right answer already attached: a photo tagged "cat", an email marked "spam". Unlabelled data is just the raw examples with no answer attached — a pile of photos with nothing written on them. Labelled data is more expensive because a person usually has to add each label by hand, but it lets a model learn a clear input-to-answer mapping. The big idea to carry through this whole lesson: a model can only ever know what its training data showed it. Whatever is missing from the data tends to become a blind spot in the model.
- Training data is the set of examples a model learns from — it's the model's entire education.
- Labelled data has the correct answer attached to each example; unlabelled data does not.
- A model's knowledge is bounded by its data — gaps in the data become blind spots in the model.
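To make the labelled/unlabelled distinction and the "blind spot" idea concrete, here is a minimal Python sketch. Everything in it is invented for illustration: the example sentences, the `train` and `predict` helpers, and the word-overlap scoring are toy stand-ins, not how a real model is trained.

```python
from collections import Counter

# Labelled data: each example carries its correct answer.
labelled = [
    ("small furry animal that purrs", "cat"),
    ("furry animal that purrs and meows", "cat"),
    ("loyal animal that barks", "dog"),
    ("animal that barks and fetches", "dog"),
]

# Unlabelled data: raw examples with no answer attached.
unlabelled = [
    "animal that purrs on the sofa",
    "animal that hops and has long ears",  # a rabbit: never seen in training
]

def train(examples):
    """Count which words appear under each label (a toy 'model')."""
    model = {}
    for text, label in examples:
        model.setdefault(label, Counter()).update(text.split())
    return model

def predict(model, text):
    """Pick the label whose training words overlap most with the text."""
    words = text.split()
    scores = {label: sum(counts[w] for w in words) for label, counts in model.items()}
    return max(scores, key=scores.get)

model = train(labelled)
known_labels = set(model)  # the only answers this model can ever give
predictions = [predict(model, text) for text in unlabelled]
```

The rabbit description still gets answered with "cat" or "dog", because those are the only labels the training data contained. The gap in the data has become a blind spot in the model.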
02 Garbage in, garbage out
Here is the single most important idea about training data: "garbage in, garbage out." A model faithfully learns whatever is in its data, including the flaws. It will not notice that the data is biased, too small, or mislabelled and quietly fix it for you; it learns the problems right along with the patterns. Four things about the data shape what kind of model you end up with: its quality (how accurate and clean the examples are), its quantity (whether there are enough of them), its representativeness (whether it reflects the real situations the model will face), and the bias it carries (skews the model will learn and can even amplify).
- Garbage in, garbage out: a model learns the flaws in its data and won't fix them on its own.
- Quality, quantity, representativeness, and bias in the data each shape what the model becomes.
- Bias usually enters through the data — and a model can amplify a skew it was trained on.
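Because a model will not flag these problems itself, they have to be looked for in the data before training. The sketch below is one simple way to do that in Python. The `audit` function, the `min_per_label` threshold of 3, and the sample emails are all assumptions made up for this example; real audits use far larger datasets and richer checks.

```python
from collections import Counter, defaultdict

def audit(examples, min_per_label=3):
    """Report simple warning signs in a list of (text, label) pairs."""
    label_counts = Counter(label for _, label in examples)

    # Quantity: labels with too few examples to learn from.
    too_few = sorted(l for l, n in label_counts.items() if n < min_per_label)

    # Representativeness / bias: how lopsided is the label mix?
    imbalance = max(label_counts.values()) / min(label_counts.values())

    # Quality: the same text carrying two different labels is a likely mislabel.
    labels_by_text = defaultdict(set)
    for text, label in examples:
        labels_by_text[text].add(label)
    conflicts = sorted(t for t, labels in labels_by_text.items() if len(labels) > 1)

    return {"counts": dict(label_counts), "too_few": too_few,
            "imbalance": imbalance, "conflicts": conflicts}

emails = [
    ("win a free prize now", "spam"),
    ("claim your free reward", "spam"),
    ("free prize inside", "spam"),
    ("cheap pills online", "spam"),
    ("meeting moved to 3pm", "not spam"),
    ("win a free prize now", "not spam"),  # contradicts the first row
]
report = audit(emails)
```

Here the report flags "not spam" as under-represented, shows spam outnumbering it two to one, and surfaces the one text that carries contradictory labels. None of these checks fix the data; they only show where a person needs to look.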
03 Where the examples come from: collecting, labelling & cleaning
Good training data rarely arrives ready to use — it gets collected, labelled, and cleaned first. Collection means gathering the raw examples: photos, text, recordings, records. Labelling (also called annotation) is the work of attaching the correct answer to each example — marking a photo as "cat", or tagging which part of a sentence is a name. A lot of this is done by people, one example at a time, which is slow and careful work; the model is only as trustworthy as those human judgments. For chat-style models there's a special kind of label: in RLHF — reinforcement learning from human feedback — people compare two model responses and mark which is better, and those preference judgments become the signal that shapes the model toward what humans actually want.
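A preference label of this kind can be pictured as a small record holding a prompt and the two responses. The Python sketch below shows that record, plus one common way such judgments are turned into a training signal: a pairwise loss that is low when a scoring model rates the human-preferred response higher. The lesson itself does not specify this loss, so treat it as an assumption; the `PreferencePair` class, the example text, and the scores are all hypothetical.

```python
import math
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # the response the human labeller preferred
    rejected: str  # the response the labeller ranked lower

def pairwise_loss(score_chosen: float, score_rejected: float) -> float:
    """-log(sigmoid(chosen - rejected)): small when the preferred response scores higher."""
    margin = score_chosen - score_rejected
    return math.log1p(math.exp(-margin))

pair = PreferencePair(
    prompt="Explain what training data is.",
    chosen="Training data is the set of examples a model learns from.",
    rejected="It is data.",
)

# Made-up scores standing in for a scoring model's output.
agrees = pairwise_loss(2.0, 0.0)     # scorer ranks the pair the way the human did
disagrees = pairwise_loss(0.0, 2.0)  # scorer ranks the pair the opposite way
```

The loss is much larger when the scorer disagrees with the human judgment, which is what pushes the scorer, and eventually the chat model, toward the responses people preferred.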
Before any of that data is used, teams clean it: they remove errors and noise, fix or drop broken examples, and deduplicate — strip out repeated copies. Deduplication matters more than it sounds: if the same example appears hundreds of times, the model treats it as far more common than it really is and over-learns it. Cleaning and dedup usually make a dataset smaller, but more accurate and more varied — which is exactly what makes a better model.
- Labelling (annotation) attaches the correct answer to each example — often careful human work.
- RLHF labels are human judgments of which response is better, used to align chat models with human preferences.
- Cleaning and deduplication remove errors and repeated copies so the model learns from accurate, varied examples.
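The cleaning and deduplication step can be sketched in a few lines of Python. This is a minimal illustration under simple assumptions: the `normalise` and `clean_and_dedupe` helpers are invented here, "broken" means an empty example, and only exact repeats (after lowercasing and whitespace cleanup) are caught. Detecting near-duplicates needs more than this.

```python
def normalise(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())

def clean_and_dedupe(examples):
    """Drop empty examples and exact repeats, keeping first-seen order."""
    seen = set()
    kept = []
    for text, label in examples:
        key = normalise(text)
        if not key:               # broken example: nothing to learn from
            continue
        if (key, label) in seen:  # repeated copy
            continue
        seen.add((key, label))
        kept.append((key, label))
    return kept

raw = [
    ("Win a FREE prize now", "spam"),
    ("win a free   prize now", "spam"),
    ("win a free prize now", "spam"),
    ("   ", "spam"),
    ("Meeting moved to 3pm", "not spam"),
]
cleaned = clean_and_dedupe(raw)
```

Five raw rows shrink to two. The dataset is smaller, but the spam example no longer counts three times over, so the model sees it at its true frequency.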
04 Synthetic data: when AI makes its own fuel
Real-world data can be expensive, scarce, or sensitive — so teams increasingly use synthetic data: examples generated by AI rather than collected from the real world. It's appealing for a few clear reasons. Privacy — generated records can stand in for sensitive real data about people. Scarcity and edge cases — you can manufacture examples of rare situations that almost never show up in real data. And cost — generating examples can be cheaper than collecting and hand-labelling real ones.
But synthetic data carries real risks, and the headline one is model collapse. If models are trained on too much AI-generated output, and the models that produced it were themselves trained on AI output, small errors compound with each generation. The model loses variety, forgets the rare cases, and slowly drifts away from reality, like a photocopy of a photocopy of a photocopy. A second risk is amplified bias: if the model generating the synthetic data is skewed, every example it produces inherits and concentrates that skew. Synthetic data is a powerful tool, but it works best alongside real data, not as a replacement for it.
- Synthetic data is AI-generated training data, used for privacy, scarcity/edge cases, and cost.
- Model collapse: training on too much AI output degrades a model — it loses variety and drifts from reality.
- Synthetic data can amplify bias from the model that generated it; it's best used to complement real data.
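The loss of variety can be seen in a toy simulation. The Python sketch below is an illustration, not a model of any real training pipeline: the "generator" simply resamples the previous generation's data, and the long-tailed starting data, the `make_real_data` and `simulate` names, and the seed are all made up for the example.

```python
import random

def make_real_data():
    """Long-tailed 'real' data: a few common categories and many rare ones."""
    data = []
    for category in range(50):
        data.extend([category] * max(1, 60 // (category + 1)))
    return data

def simulate(generations=10, seed=0):
    """Each generation 'trains' on the previous one by resampling its outputs."""
    rng = random.Random(seed)
    data = make_real_data()
    variety = [len(set(data))]  # distinct categories present at the start
    for _ in range(generations):
        # A toy generator can only reproduce what it has seen, in proportion.
        data = rng.choices(data, k=len(data))
        variety.append(len(set(data)))
    return variety

variety = simulate()
```

In this setup the number of distinct categories can only stay level or fall from one generation to the next, because a category that fails to be sampled once is gone for good. The rare categories disappear first, which mirrors the "forgets the rare cases" behaviour described above. Mixing fresh real data back in at each generation is what would let lost categories return.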
05 Where the data comes from: consent, copyright & governance
One question runs underneath everything else: where did the data actually come from? Training data is often drawn from real people and real creators — their photos, their writing, their records — and that raises genuine questions of consent (was it okay to use information about these people?) and copyright (was it okay to use work someone else created?). These aren't abstract concerns; they're at the centre of ongoing debate about how AI systems are built.
Data governance is the umbrella term for handling all of this responsibly: knowing a dataset's provenance (where it came from), respecting consent and licensing, and being accountable for how data is sourced and used. A foundational practice is simply documenting a dataset — recording what it contains, how it was collected, and what it's meant for — so the people who build on it can use it knowingly. You don't need to be a lawyer to take the core lesson: a model inherits not just the patterns in its data, but the responsibilities that came with collecting it.
- Consent and copyright matter because training data often comes from real people and creators with rights.
- Data governance covers provenance, consent, licensing, and accountability for how data is used.
- Documenting a dataset's origin and intended use is a foundation of responsible, transparent AI.
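Documenting a dataset can start as a small structured record. The Python sketch below is one possible shape, loosely in the spirit of a datasheet; the `DatasetRecord` fields, the `missing_fields` helper, and the sample values are all hypothetical and chosen for this example, not a standard schema.

```python
from dataclasses import dataclass, fields

@dataclass
class DatasetRecord:
    name: str
    contents: str      # what the dataset contains
    collection: str    # how it was collected
    intended_use: str  # what it is meant for
    provenance: str    # where it came from
    licence: str       # terms it can be used under
    consent: str       # how consent was obtained, if people are involved

def missing_fields(record):
    """Return the names of fields left blank."""
    return [f.name for f in fields(record) if not getattr(record, f.name).strip()]

# Hypothetical example values, for illustration only.
record = DatasetRecord(
    name="support-emails-v1",
    contents="Customer support emails with a spam / not spam label",
    collection="Exported from the helpdesk system, labelled by two reviewers",
    intended_use="Training a spam filter for the support inbox",
    provenance="Internal helpdesk export",
    licence="",
    consent="",
)
gaps = missing_fields(record)
```

A check like this cannot tell whether the answers are good, only whether they exist. Its value is in making blank licence and consent fields visible before anyone builds on the dataset.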
06 Check your understanding
07 Take it with you & go deeper
- What is bias in AI? (coming soon): How skewed data turns into unfair outcomes, and what teams do about it.
- What is data labelling? (coming soon): A closer look at annotation, human feedback, and the people behind the labels.
Published by Tech Jacks Solutions · Reviewed June 2026. This lesson explains established concepts and is grounded in the references below; figures shown in the interactives are illustrative and labelled as such.
- Datasheets for Datasets — Gebru, Morgenstern, Vecchione et al.
- Data-Centric AI — overview — Data-Centric AI community (MIT)
- Data Cascades in High-Stakes AI — Sambasivan et al. (Google Research)
- The Curse of Recursion: Training on Generated Data Makes Models Forget (model collapse) — Shumailov, Shumaylov, Zhao et al.
Training data & synthetic data — in 5 minutes
Tech Jacks Solutions · AI Knowledge Hub · educational summary
What training data is
The examples a model learns from — its entire education. Labelled data has the right answer attached to each example; unlabelled data does not. A model can only know what its data showed it; gaps become blind spots.
Garbage in, garbage out
A model learns the flaws in its data and won't fix them on its own. Four things shape the result: quality, quantity, representativeness, and bias. Unrepresentative data produces a biased model; a model can even amplify a skew it was trained on.
Collecting, labelling & cleaning
Labelling (annotation) attaches the correct answer to each example — often human work. RLHF labels are human judgments of which response is better. Cleaning and deduplication remove errors and repeated copies so the model learns from accurate, varied examples.
Synthetic data
Synthetic data is AI-generated training data, used for privacy, scarcity/edge cases, and cost. Its biggest risk is model collapse — training on too much AI output degrades a model, losing variety and drifting from reality. It can also amplify bias. Best used alongside real data.
Consent, copyright & governance
Training data often comes from real people and creators, raising consent and copyright questions. Data governance means knowing a dataset's provenance, respecting consent and licensing, and documenting how it was collected and what it's for.