Quantization: shrink a model to run it on your own machine
A big AI model is just billions of numbers. Store each number with fewer digits and the whole model gets smaller and faster — with only a small dip in quality. That trick is called quantization, and it's what lets a model that once needed a datacenter run on your laptop. Learn what changes, what it costs, and when running locally beats the cloud.
01 Why big models are heavy
Imagine a recipe book with billions of measurements written inside it. A large AI model is much like that: it's a giant collection of numbers called weights (or parameters), learned during training. When you "run" a model, your computer has to load every one of those numbers into memory. The more weights there are (and large models have billions of them), the more memory the model takes up. That's the simple reason today's best models feel "heavy": there is an enormous number of values to hold. And here's the part that matters for the rest of this lesson: each weight is stored as a number, and how precisely you store each number decides how much space it needs. Store the same billions of weights with fewer bits each, and the whole model shrinks.
- A model's size comes from its weights — billions of learned numbers it loads into memory to run.
- Memory used is roughly the number of weights times the space each number takes.
- So there are two ways to shrink a model: fewer weights, or fewer digits per weight — quantization does the second.
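The "number of weights times the space each number takes" rule can be turned into a few lines of arithmetic. This is a rough sketch, not a sizing tool: the 7-billion-weight model is a hypothetical example, the function name is made up for illustration, and the estimate covers the weights only (a running model also needs memory for its working data).

```python
def weight_memory_gb(num_weights: int, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    bytes_total = num_weights * bits_per_weight / 8  # 8 bits per byte
    return bytes_total / 1e9

# Hypothetical model size, chosen only to make the arithmetic concrete.
num_weights = 7_000_000_000

for bits in (32, 16, 8, 4):
    size = weight_memory_gb(num_weights, bits)
    print(f"{bits:2d} bits per weight: ~{size:.1f} GB")
```

Halving the bits per weight halves the estimate, which is the whole idea behind the rest of this lesson.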
02 What quantization is: the precision ladder
Quantization means storing each weight at lower numeric precision, using fewer bits per number. Think of rounding 3.14159265 down to 3.14: you lose a little detail, but the number takes far less space to write. Models step down a ladder of formats, from full precision to very compact integers; a common ladder runs FP32 → FP16 → INT8 → INT4.
FP32 — full precision
The "original" format many models are trained in: a 32-bit floating-point number per weight. It captures the most detail, which means the best fidelity — and the largest memory footprint. It's the baseline the other formats shrink down from.
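To make the rounding idea concrete, here is a minimal sketch of one simple way to quantize: scale a handful of weights so the largest one maps to 127, round each to a whole number, then scale back. The weight values and function names are made up for illustration, and real tools use more refined schemes (for example, separate scales for small groups of weights), so treat this as a picture of the idea rather than how any particular library does it.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Turn the stored integers back into approximate floats."""
    return [q * scale for q in quantized]

weights = [0.12, -0.5, 0.33, 0.9, -0.07]  # made-up example values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

for w, qi, r in zip(weights, q, restored):
    print(f"original {w:+.4f} -> stored as {qi:+4d} -> restored {r:+.4f}")
```

Each restored value lands within half a step (`scale / 2`) of the original. That small gap is the "little detail" lost to rounding.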
03 See the trade-off: the precision slider
Now look at the trade-off directly. Step from FP32 down to INT4 and three things move: the model's size shrinks, its speed rises, and its quality dips a little. That's quantization in one picture: you give up a small amount of quality to get a smaller, faster model.
FP32 — full precision
The baseline format: maximum fidelity, but the largest and slowest to run. Every step down trades a little quality for a smaller, faster model.
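You can see part of this trade-off with nothing but Python's standard library. The sketch below stores the same made-up weights at each rung of the FP32 → FP16 → INT8 → INT4 ladder and measures how far the stored values drift from the originals. Rounding error is only a rough stand-in for model quality, speed is not measured at all, and the integer scheme (one shared scale for all weights) is a simplifying assumption.

```python
import struct

def roundtrip_float(value, fmt):
    """Store a value in a float format ('f' = 32-bit, 'e' = 16-bit) and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]

def roundtrip_int(weights, bits):
    """Symmetric integer quantization with one shared scale, then back to floats."""
    levels = 2 ** (bits - 1) - 1  # 127 for INT8, 7 for INT4
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / levels if max_abs > 0 else 1.0
    return [round(w / scale) * scale for w in weights]

def mean_abs_error(original, restored):
    """Average distance between the original and the stored-then-restored values."""
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)

weights = [0.12, -0.5, 0.33, 0.9, -0.07, 0.61, -0.28, 0.05]  # made-up values

ladder = {
    "FP32": (32, [roundtrip_float(w, "f") for w in weights]),
    "FP16": (16, [roundtrip_float(w, "e") for w in weights]),
    "INT8": (8, roundtrip_int(weights, 8)),
    "INT4": (4, roundtrip_int(weights, 4)),
}

errors = {}
for name, (bits, restored) in ladder.items():
    errors[name] = mean_abs_error(weights, restored)
    print(f"{name}: {bits:2d} bits per weight, mean abs error {errors[name]:.6f}")
```

For these values the error grows at every step down the ladder while the bits per weight halve, which mirrors the size-versus-quality trade described above.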
04 Two ways to quantize: after training or during it
There are two points at which you can quantize a model. The common, quick way is post-training quantization (PTQ): you take a model that's already finished training and convert its weights to lower precision, with no retraining required. It's fast and convenient, and for many models the quality cost is small. The more involved way is quantization-aware training (QAT): you account for quantization while the model is still training, so the model learns to cope with lower precision from the start. QAT takes extra training effort, but because the model adapts as it learns, it can hold on to more quality at very low precision. Put simply, PTQ quantizes a finished model; QAT trains the model with quantization in mind.
- Post-training quantization (PTQ): quantize an already-trained model. Quick, with no retraining, at a possible small quality cost.
- Quantization-aware training (QAT): bake quantization into training so the model adapts. More effort, but often better quality at low precision.
- PTQ is the usual starting point; reach for QAT when the quality at low precision isn't good enough.
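The difference between the two approaches is mostly about where the rounding step sits. The toy sketch below trains a one-weight model (`y = w * x`) both ways. The QAT loop uses a common trick, often called a straight-through estimator, in which the forward pass sees the rounded weight while updates are applied to a full-precision copy. That trick, the coarse 0.25 grid, and the data are all assumptions made for illustration. With only one weight there is nothing for the model to adapt, so read this as a picture of the mechanics, not a quality comparison.

```python
def quantize(w, scale=0.25):
    """Snap a weight to the nearest multiple of `scale` (a deliberately coarse grid)."""
    return round(w / scale) * scale

def gradient(w_used, xs, ys):
    """Gradient of mean squared error for the model y = w * x, evaluated at w_used."""
    return sum(2 * (w_used * x - y) * x for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0]
ys = [0.8 * x for x in xs]  # toy data: the ideal weight is 0.8
lr, steps = 0.05, 50

# PTQ: train in full precision, then quantize the finished weight once.
w = 0.0
for _ in range(steps):
    w -= lr * gradient(w, xs, ys)
ptq_weight = quantize(w)

# QAT: every training step already runs the model with the quantized weight;
# the update is applied to the full-precision copy (straight-through estimator).
w = 0.0
for _ in range(steps):
    w -= lr * gradient(quantize(w), xs, ys)
qat_weight = quantize(w)

print("PTQ weight:", ptq_weight)
print("QAT weight:", qat_weight)
```

In a real model with many weights, the QAT loop gives the other weights a chance to compensate for each weight's rounding, which is where the quality benefit described above comes from.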
05 Running it locally: tooling, formats & when it makes sense
Because a quantized model is small enough to fit on everyday hardware, you can run it locally, on your own laptop or a consumer GPU, instead of calling a cloud service. A small ecosystem of tools makes that practical. The three parts below cover the tooling, the trade-offs, and when local beats cloud.
The tooling, at a high level
A handful of widely used, neutral tools make local quantized models practical. GGUF is a model file format for quantized models; llama.cpp is an inference engine that popularized running them on CPUs and laptops. Ollama downloads and runs models locally with a single command. bitsandbytes is a library that provides low-bit quantization, often used from frameworks like Hugging Face Transformers.
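As one illustration of the bitsandbytes route, the sketch below asks Hugging Face Transformers to load a model with 4-bit weights. It assumes the `transformers`, `bitsandbytes`, and `accelerate` packages are installed and that your hardware is supported by bitsandbytes; the helper name and the model ID in the comment are placeholders, not recommendations.

```python
def load_model_in_4bit(model_id: str):
    """Load a causal language model with 4-bit weights via bitsandbytes."""
    # Imported lazily so this file can be read and imported without the
    # heavy dependencies installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(load_in_4bit=True)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quant_config,
        device_map="auto",  # let the library place layers on available devices
    )
    return tokenizer, model

# Example call (placeholder model ID; substitute one you have access to):
# tokenizer, model = load_model_in_4bit("your-org/your-model")
```

This is post-training quantization applied at load time: the model's published weights are converted to 4-bit as they are read in, with no retraining.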
The trade-offs, size vs quality vs speed
Lower precision pulls three levers at once. Size: fewer bits per weight means a smaller memory footprint, the reason a model fits on your machine at all. Speed: smaller, lower-precision weights often run faster. Quality: there's a small cost that tends to grow the further you push precision down. The skill is picking the lowest precision whose quality is still good enough for your task.
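"Pick the lowest precision whose quality is still good enough" can be written as a small selection rule. In the sketch below, the quality scores are made-up placeholders on an arbitrary 0 to 1 scale, and the model size, memory budget, and function name are hypothetical; in practice you would plug in scores measured on your own task.

```python
BITS = {"FP32": 32, "FP16": 16, "INT8": 8, "INT4": 4}

def pick_precision(num_weights, memory_budget_gb, quality, min_quality):
    """Return (format, size_gb) for the lowest precision that fits and is good enough."""
    candidates = []
    for name, bits in BITS.items():
        size_gb = num_weights * bits / 8 / 1e9  # weights only
        if size_gb <= memory_budget_gb and quality[name] >= min_quality:
            candidates.append((bits, name, size_gb))
    if not candidates:
        return None  # nothing fits the budget and meets the quality bar
    bits, name, size_gb = min(candidates)  # fewest bits wins
    return name, size_gb

# Placeholder scores for illustration only; measure these on your own task.
measured_quality = {"FP32": 1.00, "FP16": 0.99, "INT8": 0.98, "INT4": 0.94}

choice = pick_precision(
    num_weights=7_000_000_000,  # hypothetical model
    memory_budget_gb=16,        # hypothetical machine
    quality=measured_quality,
    min_quality=0.95,
)
print(choice)
```

With these placeholder inputs, FP32 does not fit the budget and INT4 falls below the quality bar, so the rule settles on INT8.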
When local makes sense, and when cloud wins
Running a quantized model locally is attractive for three reasons: privacy (your data stays on the device), cost (no per-query cloud charge), and offline use (no connection required). The cloud still wins when you need the very largest models or the highest quality, or when you would rather not manage hardware. Many setups mix both: local for routine, private work; cloud for the heaviest lifts.
Concept map
The whole lesson as one tree, with the key ideas listed under each branch.
Why big models are heavy
- A model's size comes from its weights — billions of learned numbers it loads into memory to run.
- Memory used is roughly the number of weights times the space each number takes.
- Two ways to shrink a model: fewer weights, or fewer digits per weight — quantization does the second.
What quantization is: the precision ladder
- Quantization stores each weight at lower numeric precision — fewer bits per number.
- Like rounding 3.14159 to 3.14: a little detail is lost, but it takes far less space.
- A common ladder from higher to lower precision: FP32 → FP16 → INT8 → INT4.
See the trade-off: size, speed, quality
- Step precision down and three things move: size shrinks, speed rises, quality dips a little.
- The quality cost is small at first and tends to grow the further you push precision down.
- The skill is picking the lowest precision whose quality is still good enough for your task.
Two ways to quantize: after training or during it
- Post-training quantization (PTQ): quantize an already-trained model — quick, no retraining, possible small quality cost.
- Quantization-aware training (QAT): bake quantization into training so the model adapts — more effort, often better quality at low precision.
- PTQ is the usual starting point; reach for QAT when low-precision quality isn't good enough.
Running it locally: tooling & when it makes sense
- A small ecosystem makes local models practical: GGUF (file format), llama.cpp (engine), Ollama (one-command run), bitsandbytes (low-bit library).
- Local wins for privacy, cost, and offline use.
- Cloud still wins for the largest models, peak quality, or when you'd rather not manage hardware — many setups mix both.
Published by Tech Jacks Solutions · Reviewed June 2026. This lesson explains established concepts and is grounded in the references below; figures shown in the interactives are illustrative and labelled as such.
- Quantization — Hugging Face Transformers
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale — Dettmers et al. (2022)
- QLoRA: Efficient Finetuning of Quantized LLMs — Dettmers et al. (2023)
- llama.cpp — ggerganov (GitHub)
- Ollama — ollama.com
Quantization & running AI locally — in 5 minutes
Tech Jacks Solutions · AI Knowledge Hub · educational summary
Why big models are heavy
A model is billions of learned numbers (weights). Memory used is roughly the number of weights times the space each number takes. So you can shrink a model with fewer weights, or fewer digits per weight — quantization does the second.
What quantization is
Quantization stores each weight at lower numeric precision (fewer bits) to cut memory and often speed up inference, at a small quality cost. A common ladder from higher to lower precision: FP32 (full) → FP16 (half) → INT8 (8-bit) → INT4 (4-bit).
The trade-off
Lower precision makes the model smaller and often faster, which is what lets large models fit on a laptop or consumer GPU — but quality dips slightly, and the cost grows the further you push precision down. Pick the lowest precision whose quality is still good enough.
PTQ vs QAT
Post-training quantization (PTQ) converts an already-trained model to lower precision — quick, no retraining. Quantization-aware training (QAT) accounts for quantization during training so the model adapts, often preserving more quality at low precision, at the cost of extra training effort.
Running locally & tooling
A smaller model can run locally instead of in the cloud. High-level tooling: GGUF (file format), llama.cpp (engine for laptops/CPUs), Ollama (one-command local run), bitsandbytes (low-bit quantization from a framework). Local makes sense for privacy, cost, and offline use; cloud wins for the largest models and peak quality.