How does text-to-image work?
Type a sentence, get a brand-new picture. The trick behind most image generators is diffusion: start from pure random noise and remove a little of it at a time, step by step, steering toward what your prompt described. This page covers the idea, how the model is trained, how the prompt steers it, and how to use it responsibly.
01 The big idea: denoise your way to a picture
Imagine starting with a screen of TV static and slowly wiping away the fuzz until a clear picture appears underneath. That gradual clearing is diffusion. A diffusion model does not paint a picture in one stroke, and it does not paste together bits of stored photos. It begins with a square of pure random noise - the TV static - and then takes many small steps, removing a little noise at each one. Every step nudges the image toward something that matches your prompt, until a clear picture emerges.
- Generation starts from random noise, not from a blank canvas or a stored image.
- It proceeds iteratively - many small denoising steps, each one removing a bit more noise.
- At each step the model estimates the noise to remove, moving the sample toward a coherent image.
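The loop described in the points above can be sketched in a few lines. This is only a toy: `toy_noise_estimate` is a hypothetical stand-in for the trained network and "cheats" by knowing the target, and the names `generate`, `denoise_step`, and `step_size` are assumptions made for illustration. A real model estimates the noise from the noisy sample and the prompt alone.

```python
import random

def toy_noise_estimate(sample, target):
    """Hypothetical stand-in for the trained model (assumption):
    it knows the target, so the 'noise' is just sample minus target."""
    return [x - t for x, t in zip(sample, target)]

def denoise_step(sample, predicted_noise, step_size):
    """Remove a fraction of the estimated noise from every value."""
    return [x - step_size * n for x, n in zip(sample, predicted_noise)]

def generate(target, steps=20, step_size=0.3, seed=0):
    rng = random.Random(seed)
    # Step 0: pure random noise, not a blank canvas or a stored image.
    sample = [rng.gauss(0.0, 1.0) for _ in target]
    errors = []
    for _ in range(steps):
        noise = toy_noise_estimate(sample, target)
        sample = denoise_step(sample, noise, step_size)
        # Track how far the sample still is from the target.
        errors.append(max(abs(x - t) for x, t in zip(sample, target)))
    return sample, errors

# A tiny one-dimensional "image" standing in for real pixels.
target = [0.0, 0.5, 1.0, 0.5, 0.0]
image, errors = generate(target)
print("error after first step:", errors[0])
print("error after last step: ", errors[-1])
```

Each pass removes only part of the estimated noise, so the remaining error shrinks gradually rather than vanishing in one jump, which mirrors the "many small steps" idea.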
02 Training: add noise, then learn to undo it
How does a model learn to denoise? Cleverly - by being shown the problem in reverse. During training, real images are deliberately corrupted with noise, a little more at each step, until almost nothing of the original is left. The model's job is to learn the opposite direction: given a noisy image, predict the noise so it can be removed. Generation is then just running that learned reverse process, starting from fresh noise.
- The forward process adds noise to real images; it is only used to create training examples.
- The model learns the reverse: predict the noise so it can be removed, step by step.
- Sampling is the reverse process run from scratch - noise in, new image out - so each output is generated fresh rather than looked up from stored images.
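A small sketch of how one training example could be built. The page does not give a formula, so the square-root blend below is an assumption borrowed from a common DDPM-style formulation, and `alpha_bar`, `add_noise`, `remove_noise`, and `training_loss` are illustrative names. The "prediction" here is simply the true noise, to show what a perfect model would achieve.

```python
import math
import random

def add_noise(image, alpha_bar, rng):
    """Forward process (assumed DDPM-style blend): mix a clean image
    with Gaussian noise. alpha_bar near 1 keeps most of the image;
    alpha_bar near 0 leaves almost only noise."""
    noise = [rng.gauss(0.0, 1.0) for _ in image]
    noisy = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
             for x, e in zip(image, noise)]
    return noisy, noise

def training_loss(predicted_noise, true_noise):
    """Mean squared error between predicted and actual noise."""
    return sum((p - t) ** 2
               for p, t in zip(predicted_noise, true_noise)) / len(true_noise)

def remove_noise(noisy, predicted_noise, alpha_bar):
    """Invert the blend using a noise prediction."""
    return [(y - math.sqrt(1.0 - alpha_bar) * e) / math.sqrt(alpha_bar)
            for y, e in zip(noisy, predicted_noise)]

rng = random.Random(0)
clean = [0.2, 0.8, 0.5, 0.1]   # a tiny stand-in for a real image
noisy, true_noise = add_noise(clean, alpha_bar=0.3, rng=rng)

# A perfect prediction has zero loss and recovers the clean image.
perfect_loss = training_loss(true_noise, true_noise)
recovered = remove_noise(noisy, true_noise, alpha_bar=0.3)

# A model that always predicts "no noise" is penalized.
lazy_loss = training_loss([0.0] * len(clean), true_noise)
print(perfect_loss, lazy_loss)
```

The training signal is the gap between the predicted and the actual noise, which is known exactly because the noise was added on purpose.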
03 Steering the result: text conditioning & guidance
Plain denoising would produce some image, but not necessarily the one you asked for. Text conditioning is what ties the picture to your words: the prompt is fed in as a condition at every denoising step, so the emerging image matches the description. A few related ideas make this work - and let you dial how literally the model follows you. Explore them below.
The prompt - a condition at every step
Your text is supplied to the model as a condition during denoising. Instead of resolving toward just any image, each step is steered toward one that fits the words - colors, objects, style, and composition all follow the description.
The text-image link (CLIP-style)
To compare "what the words mean" with "what the image shows," generators lean on text-image models such as CLIP. These learn a shared representation of text and pictures, in which a caption and a matching image land close together, so the prompt can meaningfully guide the generator toward a matching image.
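To make the shared-representation idea concrete, here is a minimal sketch that scores text-image matches with cosine similarity. The three-number vectors are made up by hand for illustration and do not come from CLIP or any real model; real embeddings have hundreds of dimensions and are produced by trained encoders.

```python
import math

def cosine_similarity(a, b):
    """Similarity of two vectors by angle: near 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings in a shared text-image space (invented values).
text_red_apple = [0.9, 0.1, 0.2]    # the prompt "a red apple"
image_red_apple = [0.8, 0.2, 0.1]   # a picture of a red apple
image_blue_car = [0.1, 0.9, 0.3]    # a picture of a blue car

match = cosine_similarity(text_red_apple, image_red_apple)
mismatch = cosine_similarity(text_red_apple, image_blue_car)
print("prompt vs. apple image:", round(match, 2))
print("prompt vs. car image:  ", round(mismatch, 2))
```

Because text and images live in the same space, a single number can say how well a picture fits the words, and that signal is what makes prompt guidance possible.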
Guidance strength (classifier-free guidance)
Classifier-free guidance lets you turn up how strongly the image follows the prompt - without needing a separate classifier. Push it higher and the result hews closer to the words; push too high and you can lose variety or realism. It is a trade-off dial, not a quality knob.
Saying what to avoid (negative prompts)
Many tools let you add a negative prompt - a description of things to steer away from. It is the same conditioning idea, applied in reverse: the denoising is nudged to keep the unwanted traits out of the final image.
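The guidance dial and negative prompts can both be sketched with one small formula: take a baseline noise prediction and push past it toward the prompted prediction. The numbers below are invented, and `guide` and `guidance_scale` are illustrative names. Using the negative prompt's prediction as the baseline is how several popular implementations appear to handle it, but treat that as an assumption rather than a rule for every tool.

```python
def guide(baseline, prompted, guidance_scale):
    """Start at the baseline prediction and move toward (and past)
    the prompted one. Scale 0 ignores the prompt; scale 1 follows it
    as-is; larger scales exaggerate the difference."""
    return [b + guidance_scale * (p - b) for b, p in zip(baseline, prompted)]

# Hypothetical noise predictions from a single denoising step.
noise_empty_prompt = [0.10, -0.20, 0.30]   # model run with no prompt
noise_prompt = [0.30, -0.10, 0.10]         # model run with the prompt
noise_negative = [0.50, -0.20, 0.00]       # model run with a negative prompt

ignored = guide(noise_empty_prompt, noise_prompt, 0.0)
plain = guide(noise_empty_prompt, noise_prompt, 1.0)
strong = guide(noise_empty_prompt, noise_prompt, 7.5)

# Negative prompt (assumed approach): it replaces the empty-prompt
# baseline, so each step is pushed away from the unwanted traits.
avoiding = guide(noise_negative, noise_prompt, 7.5)

print("scale 1.0:", plain)
print("scale 7.5:", strong)
print("with negative prompt:", avoiding)
```

Higher scales exaggerate whatever separates the prompted prediction from the baseline, which is why pushing the dial too far can cost variety or realism.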
- The prompt conditions every step, steering the picture toward the words.
- CLIP-style text-image models give the prompt and image a shared language so guidance works.
- Classifier-free guidance trades stronger prompt-following against variety and naturalness.