How are AI models tested — and how do you read a leaderboard?
Every week a new model "tops the charts." But what chart, measuring what, and can you trust the number? This module is a literacy kit: what benchmarks actually measure, how human-preference leaderboards work, the ways a score can mislead you — and how to read any ranking with a healthy dose of skepticism.
01 Why evaluation matters
A vendor will always say its model is "state of the art" — so how would you check? You give it a test, the way a school does, and that act of testing-and-grading is what we call evaluation. The standardized test itself is a benchmark: a fixed set of tasks plus a way to score them, so two different models can be measured on the same footing. The catch — which this whole module is about — is that a single number can be precise and still mislead you.
- A benchmark = a dataset (the tasks) + a scoring method (how answers are graded).
- A score measures one construct on one set of tasks, not overall fitness for your job.
- "Tops the leaderboard" is the start of a question, not the end of one.
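To make the "dataset + scoring method" idea concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption: the four-question `DATASET`, the `exact_match` grader, and the `toy_model` lookup table are hypothetical stand-ins, not any real benchmark or model.

```python
from typing import Callable

# Hypothetical toy dataset: (prompt, reference answer) pairs.
DATASET = [
    ("2 + 2 = ?", "4"),
    ("Capital of France?", "Paris"),
    ("Opposite of hot?", "cold"),
    ("3 * 3 = ?", "9"),
]


def exact_match(prediction: str, reference: str) -> bool:
    """Scoring method: case- and whitespace-insensitive exact match."""
    return prediction.strip().lower() == reference.strip().lower()


def evaluate(model: Callable[[str], str]) -> float:
    """Run the model on every task and return the fraction graded correct."""
    correct = sum(exact_match(model(prompt), ref) for prompt, ref in DATASET)
    return correct / len(DATASET)


def toy_model(prompt: str) -> str:
    """Stand-in for a real model: a lookup table that knows three answers."""
    known = {"2 + 2 = ?": "4", "Capital of France?": "paris", "3 * 3 = ?": "9"}
    return known.get(prompt, "I don't know")


score = evaluate(toy_model)
print(f"accuracy = {score:.2f}")  # prints "accuracy = 0.75"
```

Swapping in a different scoring method (say, a stricter case-sensitive match) would change the number without changing the model, which is one reason a score only means something alongside its grading rule.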
02 What the major benchmarks actually measure
Benchmarks come in families: knowledge, coding, math, reasoning, and human preference. For each family, ask what it measures, what a high score does and doesn't tell you, and which caveat to keep in mind. Contamination deserves special attention, because a leaked test can inflate a score without the model getting any smarter.
When the test is clean, the score reflects real ability. If the test answers leak into training, the displayed score jumps even though the model is no better. This is an illustration of the mechanism, not a measured score for any model.
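The contamination mechanism can be sketched in a few lines. This is a toy model under stated assumptions: the question indices, the set of "truly solved" questions, and the set of leaked questions are all made up for illustration, and a leaked item is assumed to be answered correctly by recall alone.

```python
def displayed_score(solved: set[int], leaked: set[int], n_questions: int) -> float:
    """Fraction graded correct: items solved by real ability or recalled from a leak."""
    return len(solved | leaked) / n_questions


N_QUESTIONS = 10
solved_by_ability = {0, 1, 2, 3, 4}  # hypothetical: questions the model can truly solve
leaked_into_training = {5, 6, 7}     # hypothetical: test items that leaked into training

clean = displayed_score(solved_by_ability, set(), N_QUESTIONS)
contaminated = displayed_score(solved_by_ability, leaked_into_training, N_QUESTIONS)

print(f"clean test:        {clean:.0%}")         # prints "clean test:        50%"
print(f"contaminated test: {contaminated:.0%}")  # prints "contaminated test: 80%"
```

The set `solved_by_ability` is identical in both runs. Only the displayed number moved.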
03 Human-preference leaderboards & Elo
Not all rankings come from fixed tests. On a human-preference leaderboard like LMArena (Chatbot Arena), people are shown two anonymous answers to the same prompt and vote for the one they think is better. Those blind, head-to-head votes are aggregated into an Elo-style rating, the same relative-ranking idea used in chess. A higher rating means a model is preferred more often, not that it scored a particular percentage.
- Elo is relative. It ranks who wins head-to-head — there's no absolute "right answers" percentage.
- Preference voting captures perceived helpfulness and style in open-ended chat.
- But people can prefer a confident, fluent answer that's actually wrong — preference is not the same as correctness.
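For intuition about how head-to-head votes become a rating, here is a sketch of the classic chess-style Elo update. The starting rating of 1000 and the `k=32` step size are conventional illustrative choices, and a real leaderboard may fit its ratings with a different statistical procedure, so treat this as the underlying idea rather than any site's actual method.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Predicted probability that A is preferred over B, given current ratings."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return new (rating_a, rating_b) after one head-to-head vote."""
    expected_a = expected_score(rating_a, rating_b)
    actual_a = 1.0 if a_won else 0.0
    delta = k * (actual_a - expected_a)  # a surprising result moves ratings more
    return rating_a + delta, rating_b - delta


# Two hypothetical models start level; one vote goes to model A.
a, b = update(1000.0, 1000.0, a_won=True)
print(a, b)  # prints "1016.0 984.0"

# A 200-point gap corresponds to being preferred in roughly 76% of matchups.
print(round(expected_score(1200.0, 1000.0), 2))  # prints "0.76"
```

Only the difference between two ratings enters the formula, which is why an Elo number carries no meaning on its own and says nothing about whether the preferred answer was correct.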
04 Why a benchmark score can mislead
A clean-looking score can hide several traps. These are the four worth knowing by name — they're the questions a skeptical reader asks before trusting any ranking.
- Data contamination. If the test questions (or close variants) leaked into the training data, a high score may reflect memorization rather than capability. This is why contamination-aware benchmarks like LiveCodeBench time-stamp their problems and evaluate a model only on tasks released after its training cutoff.
- Overfitting to the benchmark. A model can be tuned to ace a specific test in ways that don't carry over to real work — sometimes called "benchmark gaming." A great score on the test, an ordinary tool in practice.
- "Lost in the middle." On long inputs, models often use information at the start or end well but miss what's buried in the middle — so a strong long-context claim can quietly fail where it matters.
- Construct validity. The deepest question: does the benchmark actually measure the real-world ability it claims to, or just a convenient proxy that correlates with it?
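The time-stamping defense against contamination comes down to a date filter. The sketch below assumes a hypothetical `Problem` record with a release date and made-up problem names and dates. It illustrates the idea only and is not LiveCodeBench's actual code or data.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Problem:
    name: str
    released: date


def post_cutoff(problems: list[Problem], training_cutoff: date) -> list[Problem]:
    """Keep only problems released strictly after the model's training cutoff."""
    return [p for p in problems if p.released > training_cutoff]


# Hypothetical problem pool and cutoff date, for illustration only.
pool = [
    Problem("two-sum-variant", date(2023, 1, 15)),
    Problem("interval-merge", date(2023, 12, 31)),
    Problem("graph-repair", date(2024, 6, 1)),
]
cutoff = date(2023, 12, 31)

eligible = post_cutoff(pool, cutoff)
print([p.name for p in eligible])  # prints "['graph-repair']"
```

The strict `>` comparison is a deliberately conservative choice here: a problem published on the cutoff date itself is treated as possibly seen. The filter only helps if the stated cutoff is accurate and the problems were not circulating before their official release date.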