Language lesson
Track 02 · Language Intermediate ~8 min

How a model picks the next word

At every step a language model produces a probability for every possible next token. Decoding is how one token gets chosen from that list — and three knobs shape the choice: temperature, top-k, and top-p. Learn what each one does, then reshape a live probability chart and draw tokens yourself.

01 It all starts with a list of probabilities

A language model doesn't "decide" a sentence all at once. It works one token at a time (a token is roughly a word or word-piece). For the next position it outputs a raw score — a logit — for every token in its vocabulary, then a softmax turns those scores into a probability distribution: a number for each candidate, all adding up to 1. After the prompt "The weather today is", the model might put high probability on "sunny", "cloudy", and "warm", and tiny slivers on thousands of other words. Decoding is simply the rule you use to turn that list of probabilities into one chosen token — and then you repeat for the next position.

  • Every step produces a full probability distribution over the whole vocabulary, not a single answer.
  • Greedy decoding just takes the single highest-probability token every time — simple, but it tends to produce bland, repetitive text (Holtzman et al., 2019).
  • Sampling instead draws a token according to the probabilities, trading some predictability for variety.
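
The logit-to-probability step and the two decoding rules above can be sketched in a few lines of plain Python. This is a minimal illustration, not any model's real code: the five-token vocabulary and its logit values are invented, and the helper names `softmax`, `greedy`, and `sample` are hypothetical.

```python
import math
import random

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def greedy(probs):
    """Greedy decoding: always return the highest-probability token."""
    return max(probs, key=probs.get)

def sample(probs, rng=random):
    """Sampling: draw one token in proportion to its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Invented logits for the position after "The weather today is".
logits = {"sunny": 3.0, "cloudy": 2.2, "warm": 1.8, "purple": -1.0, "banana": -2.5}
probs = softmax(logits)

print(greedy(probs))                      # the same token on every call
print([sample(probs) for _ in range(5)])  # varies from run to run
```

A real vocabulary has tens of thousands of entries rather than five, but the mechanics are the same: one softmax, then one selection rule, repeated for each position.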

02 Temperature: reshaping the whole distribution

Temperature divides every logit by a constant T before the softmax is applied (Hinton et al., 2015). A low temperature sharpens the distribution: the top token's lead grows, so output becomes more deterministic and focused. A high temperature flattens it: long-shot tokens get more of a chance, so output becomes more diverse and surprising. In the limit T → 0, temperature sampling becomes greedy decoding, because the argmax token takes all the probability mass; implementations typically special-case T = 0 as greedy, since dividing by zero is undefined. T = 1 leaves the model's raw distribution unchanged. Crucially, temperature changes how probable each token is but never removes a token from the running, so it acts on a different axis from the truncation knobs you'll meet next.

  • Low T (toward 0): sharper, more deterministic, repeats the model's favourites.
  • High T: flatter, more random, more willing to pick unusual tokens.
  • Ranges and defaults are provider-specific: OpenAI accepts 0–2 (default 1); Anthropic accepts 0.0–1.0 (default 1.0). Always check the provider's current docs.
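
To make the sharpening and flattening concrete, here is a small sketch that divides a list of logits by T and applies a softmax. The logit values are invented and the function name `softmax_with_temperature` is hypothetical; rejecting T ≤ 0 is one possible design choice, made here because T = 0 cannot be computed by division.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide each logit by the temperature, then apply a softmax."""
    if temperature <= 0:
        # T = 0 cannot be computed by division; treat it as greedy instead.
        raise ValueError("temperature must be > 0")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [3.0, 2.2, 1.8, -1.0, -2.5]  # invented scores, highest first

for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}:", [round(p, 3) for p in probs])
```

Running it shows the first probability shrinking as T rises while the last ones grow. Every token keeps a nonzero probability at every temperature.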

03 Top-k and top-p: cutting off the tail

Even after temperature scaling, the model still assigns a tiny probability to each of thousands of implausible tokens. Each one is unlikely on its own, but together this "long tail" can hold enough probability mass that sampling from the full distribution occasionally picks one and derails the text into gibberish. Truncation fixes this by throwing the tail away before you sample. Top-k sampling keeps only the k highest-probability tokens, renormalizes over them, and samples from that fixed-size set (Fan et al., 2018). Top-p sampling, also called nucleus sampling, instead keeps the smallest set of tokens whose probabilities add up to at least p, then renormalizes over that "nucleus" (Holtzman et al., 2019). The key difference: top-k fixes the count of candidates, while top-p fixes the probability mass, so the number of candidates in top-p adapts to each step, staying small when the model is confident and widening when it's unsure.

  • Top-k = a fixed number of candidates (e.g. keep the top 40). Gemini's documented default is topK = 40; Hugging Face's top_k default is 50.
  • Top-p (nucleus) = a fixed cumulative probability (e.g. 0.9). The candidate set size changes per token: few tokens survive when the distribution is peaked, many when it is flat.
  • Truncation answers which tokens are eligible; temperature answers how their probabilities are shaped. They're complementary, not interchangeable.
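
The two truncation rules can be sketched as small filters over a probability table. The sketch below is a plain-Python illustration under stated assumptions: the `peaked` and `flat` distributions are invented, and the helper names `renormalize`, `top_k_filter`, and `top_p_filter` are hypothetical.

```python
def renormalize(probs):
    """Rescale the surviving probabilities so they sum to 1 again."""
    total = sum(probs.values())
    return {tok: p / total for tok, p in probs.items()}

def top_k_filter(probs, k):
    """Keep the k highest-probability tokens, then renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    return renormalize(dict(ranked[:k]))

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose mass reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = {}, 0.0
    for tok, prob in ranked:
        kept[tok] = prob
        cumulative += prob
        if cumulative >= p:
            break
    return renormalize(kept)

# Two invented distributions: one confident step, one uncertain step.
peaked = {"sunny": 0.85, "cloudy": 0.08, "warm": 0.04, "purple": 0.02, "banana": 0.01}
flat = {"sunny": 0.30, "cloudy": 0.25, "warm": 0.20, "purple": 0.15, "banana": 0.10}

# Top-k keeps the same number of candidates on both steps.
print(len(top_k_filter(peaked, 3)), len(top_k_filter(flat, 3)))
# Top-p keeps fewer candidates on the peaked step than on the flat one.
print(len(top_p_filter(peaked, 0.8)), len(top_p_filter(flat, 0.8)))
```

With p = 0.8, the peaked step keeps a single token because "sunny" alone already exceeds the threshold, while the flat step needs four tokens to reach it.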

04 See it work: reshape the distribution, then sample

Here is a toy next-token distribution after the prompt "The weather today is", over a small make-believe vocabulary. Move the three sliders and watch the bars: temperature reshapes every bar, while top-k and top-p grey out the tokens that get cut. Faded bars are out of the running. Then hit Sample to draw a token according to the current settings — at temperature near 0 you'll almost always get the same word; crank the heat and widen the cutoffs and the draws spread out. These probabilities are illustrative, chosen to make the mechanics visible.

[Interactive: a bar chart with three sliders and a Sample button. Temperature: 0 = greedy (always the top bar), higher = flatter and more random. Top-k: keep only the k highest bars (0 = keep all); fixes the count. Top-p (nucleus): keep the smallest set of bars summing to ≥ p; fixes the mass. Drawn tokens are listed newest first.]

  • Slide temperature to 0 (or hit Greedy): the top bar takes essentially all the mass and every draw is the same token.
  • Lower top-p on a peaked distribution and very few bars survive; the same p on a flatter (high-T) distribution keeps more — that's the dynamic candidate set.
  • Many providers suggest tuning either temperature or top-p, not both at once, because their effects compound (OpenAI; Google).
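
If you want to reproduce the first observation offline, the sketch below draws many tokens at a very low and a high temperature and tallies the results. It assumes an invented five-token distribution; `temperature_probs` and `draw_counts` are hypothetical helper names, and the seed is fixed only so that repeated runs match.

```python
import math
import random
from collections import Counter

def temperature_probs(logits, temperature):
    """Softmax over logits divided by the temperature (must be > 0)."""
    scaled = {tok: v / temperature for tok, v in logits.items()}
    m = max(scaled.values())
    exps = {tok: math.exp(v - m) for tok, v in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def draw_counts(logits, temperature, n_draws=1000, seed=0):
    """Sample n_draws tokens and count how often each one appears."""
    rng = random.Random(seed)
    probs = temperature_probs(logits, temperature)
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return Counter(rng.choices(tokens, weights=weights, k=n_draws))

# Invented logits for the position after "The weather today is".
logits = {"sunny": 3.0, "cloudy": 2.2, "warm": 1.8, "purple": -1.0, "banana": -2.5}

cold = draw_counts(logits, temperature=0.05)
hot = draw_counts(logits, temperature=2.0)
print("T=0.05:", cold.most_common())
print("T=2.0: ", hot.most_common())
```

At the low temperature the tally is dominated by one token; at the high temperature the draws spread across several.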

05 In practice: defaults, order, and newer methods

In real APIs these knobs are exposed as parameters with provider-specific defaults. In Hugging Face Transformers the defaults are top_k=50, top_p=1.0, and temperature=1.0, but they only take effect when you set do_sample=True; otherwise you get greedy or beam decoding. When the knobs are combined, a common order (and the one this lesson's interactive uses) is temperature first, since it scales the logits, then top-k, then top-p, then the final sample. That matches frameworks like Hugging Face Transformers, which apply the temperature warper before the top-k and top-p warpers. The exact order is implementation-dependent, though: some providers document a different sequence (Anthropic, for example), so don't assume one universal order.

Two honest caveats: even at temperature 0, outputs aren't guaranteed to be perfectly deterministic in practice (Anthropic), and some newer models restrict or drop these parameters entirely, so always re-check current docs.

Top-k and top-p aren't the end of the story either. Researchers have proposed typical sampling (Meister et al., 2022), eta/epsilon-sampling (Hewitt et al., 2022), min-p (Nguyen et al., 2024), Mirostat (Basu et al., 2020), and deterministic contrastive search (Su et al., 2022) as alternatives.

  • Defaults and valid ranges differ by provider and model — cite the specific docs, not a universal number.
  • Application order (temperature → k → p) is a common pattern, not a guarantee.
  • "Temperature 0" usually means very low randomness, not a hard determinism guarantee.
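
The temperature → top-k → top-p order described above can be sketched as a single decoding step. This is an illustrative toy under stated assumptions, not how any particular provider or library implements it: `decode_step` is a hypothetical function, the logits are invented, T ≤ 0 is treated as greedy by choice, and top-p is measured on the mass that remains after top-k.

```python
import math
import random

def decode_step(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """One decoding step: temperature -> top-k -> top-p -> sample."""
    if temperature <= 0:
        # Design choice: treat T = 0 as greedy rather than dividing by zero.
        return max(logits, key=logits.get)
    # 1. Temperature: divide the logits, then softmax.
    m = max(logits.values())
    exps = {t: math.exp((v - m) / temperature) for t, v in logits.items()}
    total = sum(exps.values())
    ranked = sorted(((t, e / total) for t, e in exps.items()),
                    key=lambda kv: kv[1], reverse=True)
    # 2. Top-k: keep a fixed count of candidates (0 means keep all).
    if top_k > 0:
        ranked = ranked[:top_k]
    # 3. Top-p: keep the smallest prefix whose renormalized mass reaches p.
    mass = sum(p for _, p in ranked)
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p / mass
        if cumulative >= top_p:
            break
    # 4. Sample from the survivors; choices() normalizes the weights itself.
    tokens, weights = zip(*kept)
    return rng.choices(tokens, weights=weights, k=1)[0]

# Invented logits for the position after "The weather today is".
logits = {"sunny": 3.0, "cloudy": 2.2, "warm": 1.8, "purple": -1.0, "banana": -2.5}

print(decode_step(logits, temperature=0))                         # greedy
print(decode_step(logits, temperature=0.8, top_k=3, top_p=0.95))  # sampled
```

Reordering the three numbered steps changes which tokens survive, which is one reason the same parameter values can behave differently across implementations.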

06 Check your understanding

[Interactive quiz]

07 Take it with you & go deeper

"Decoding & sampling" — one-page summary
Temperature, top-k and top-p distilled to a printable cheat-sheet.
▸ Coming next — deeper progression

Mixture of Experts (MoE)

How modern models route each token through a subset of specialist sub-networks.

Inference optimization (KV-cache, batching)

What makes generating each of those tokens fast and cheap at scale.

Concept map

Expand each branch to see how this lesson's ideas connect — from the raw probability list to the three knobs that reshape and trim it.

It all starts with a list of probabilities
  • Each step produces a full probability distribution over every token in the vocabulary, not a single answer.
  • Decoding is the rule that turns that list into one chosen token, then repeats for the next position.
  • Greedy decoding always picks the argmax (highest-probability) token — simple, but tends toward bland, repetitive text.
Temperature: reshaping the whole distribution
  • Temperature divides (scales) the logits before the softmax, reshaping the entire distribution.
  • Low temperature sharpens it (more deterministic); high temperature flattens it (more diverse).
  • Greedy decoding is the temperature → 0 limit; at T = 1 the raw distribution is unchanged.
  • Ranges and defaults are provider-specific — OpenAI 0–2 (default 1), Anthropic 0.0–1.0 (default 1.0).
Top-k and top-p: cutting off the tail
  • Top-k keeps the k highest-probability tokens, renormalizes, and samples from that fixed-size set.
  • Top-p (nucleus) keeps the smallest set of tokens whose cumulative probability is ≥ p.
  • Top-k fixes the count of candidates; top-p fixes the probability mass, so its set adapts per step.
  • Truncation decides which tokens are eligible; temperature decides how their probabilities are shaped.
See it work: reshape, then sample
  • At temperature near 0 the top bar takes essentially all the mass, so every draw is the same token.
  • The same top-p keeps few bars on a peaked distribution but more on a flatter one — the dynamic candidate set.
  • Many providers suggest tuning either temperature or top-p, not both, because their effects compound.
In practice: defaults, order, and newer methods
  • In Hugging Face Transformers, sampling only applies when do_sample=True; defaults are top_k=50, top_p=1.0, temperature=1.0.
  • Application order (temperature → k → p) is a common pattern, not a universal guarantee.
  • Even at temperature 0, outputs aren't guaranteed to be perfectly deterministic in practice.
  • Alternatives exist — typical sampling, eta/epsilon-sampling, min-p, Mirostat, and contrastive search.
Sources & further reading

Published by Tech Jacks Solutions · Reviewed June 2026. This lesson explains established decoding concepts and is grounded in the references below; the probabilities in the interactive are illustrative and labelled as such. Parameter defaults and ranges differ by provider and model — always check current vendor docs.

Responsible use & transparency

This is an educational explainer, not professional advice. The bar chart and sampled draws use invented, illustrative probabilities to make the mechanics visible — they are not the output of any specific model. Decoding-parameter names, defaults, and valid ranges differ by provider and model and change over time; verify against the current official documentation before relying on any value.

Tuning these knobs changes how varied or deterministic a model's output is, but it does not make output factual. Higher temperature can increase plausible-sounding but incorrect text. For decisions in medical, legal, financial, or other high-stakes domains, always consult a qualified professional and verify AI output against authoritative sources. See the NIST AI Risk Management Framework for responsible-AI guidance.

Decoding & sampling — in one page

Tech Jacks Solutions · AI Knowledge Hub · educational summary

The setup

At each step a model outputs a probability for every possible next token (a softmax over the vocabulary). Decoding is the rule that turns that list into one chosen token, repeated step by step.

Greedy decoding

Always take the single highest-probability token. Simple, but tends toward bland, repetitive text (Holtzman et al., 2019). It is the temperature → 0 limit of temperature sampling.

Temperature

Scales (divides) the logits before the softmax. Low temperature sharpens the distribution (more deterministic); high temperature flattens it (more diverse). T=1 leaves the raw distribution unchanged. Ranges/defaults are provider-specific (OpenAI 0–2; Anthropic 0.0–1.0).

Top-k vs top-p (truncation)

Top-k keeps the k highest-probability tokens (a fixed count) and renormalizes. Top-p (nucleus) keeps the smallest set of tokens whose cumulative probability is at least p (a fixed mass), so its candidate count adapts per step. Truncation chooses which tokens are eligible; temperature shapes their probabilities.

In practice

In Hugging Face Transformers, sampling applies only with do_sample=True (defaults top_k=50, top_p=1.0, temperature=1.0). A common order (and the one this lesson's interactive uses) is temperature → top-k → top-p: temperature scales the logits before truncation, matching Hugging Face's warper sequence. The order is implementation-dependent, though (Anthropic documents a different order). Even at temperature 0, output isn't guaranteed to be perfectly deterministic. Many providers suggest tuning either temperature or top_p, not both. Newer methods: typical, eta/epsilon, min-p, Mirostat, contrastive search.