Learning lesson
Track 03 · Applied & Agentic · Intermediate · ~8 min

Quantization: shrink a model to run it on your own machine

A big AI model is just billions of numbers. Store each number with fewer digits and the whole model gets smaller and faster — with only a small dip in quality. That trick is called quantization, and it's what lets a model that once needed a datacenter run on your laptop. Learn what changes, what it costs, and when running locally beats the cloud.

01 Why big models are heavy

Imagine a recipe book with billions of measurements written inside it. A large AI model is much like that: it's a giant collection of numbers called weights (or parameters), learned during training. When you "run" a model, your computer has to load every one of those numbers into memory. The more weights there are — and large models have billions of them — the more memory the model takes up. That's the simple reason today's best models feel "heavy": there are just an enormous number of values to hold. And here's the part that matters for the rest of this lesson: each weight is stored as a number, and how precisely you store each number decides how much space it needs. Store the same billions of weights with smaller numbers, and the whole model shrinks.

  • A model's size comes from its weights — billions of learned numbers it loads into memory to run.
  • Memory used is roughly the number of weights times the space each number takes.
  • So there are two ways to shrink a model: fewer weights, or fewer digits per weight — quantization does the second.
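
The "weights times space per weight" rule is easy to try yourself. Here is a minimal sketch, assuming a hypothetical model with 7 billion weights (the count is only an example) and counting the weights alone; real files and running models need some extra memory on top.

```python
# Bits used to store one weight in each format named in this lesson.
BITS_PER_WEIGHT = {"FP32": 32, "FP16": 16, "INT8": 8, "INT4": 4}


def weight_memory_gb(num_weights, bits_per_weight):
    """Approximate memory for the weights alone, in gigabytes (1 GB = 10**9 bytes)."""
    total_bytes = num_weights * bits_per_weight / 8
    return total_bytes / 1e9


num_weights = 7_000_000_000  # hypothetical 7-billion-weight model
for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name}: {weight_memory_gb(num_weights, bits):.1f} GB")
```

For this example the loop prints 28.0 GB, 14.0 GB, 7.0 GB, and 3.5 GB: halving the bits per weight halves the footprint.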

02 What quantization is: the precision ladder

Quantization means storing each weight at lower numeric precision, using fewer bits per number. Think of rounding 3.14159265 down to 3.14: you lose a little detail, but the number takes far less space to write. Models step down a ladder of formats, from full precision to very compact integers.

The ladder, from higher precision to lower precision:

  • FP32: full precision (the highest rung)
  • FP16: half precision
  • INT8: 8-bit integer
  • INT4: 4-bit integer

FP32 — full precision

The "original" format many models are trained in: a 32-bit floating-point number per weight. It captures the most detail, which means the best fidelity — and the largest memory footprint. It's the baseline the other formats shrink down from.
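
To make the rounding idea concrete, here is one simple way to turn a handful of weights into 8-bit integers and back. It is a sketch, not how any particular tool does it: the single shared scale and the function names are assumptions made for illustration, and real tools typically use finer-grained scales.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale (an assumption)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Turn the stored integers back into approximate floats."""
    return [c * scale for c in codes]


weights = [0.5, -1.27, 0.031, 1.0]  # made-up example values
codes, scale = quantize_int8(weights)
print(codes)  # [50, -127, 3, 100]
print(dequantize(codes, scale))
```

Each weight is now a small integer plus one shared scale. The value 0.031 comes back as roughly 0.03, which is the "little detail" lost to rounding.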

03 See the trade-off: the precision slider

Now look at the trade-off directly. Step from FP32 down to INT4 and three things move: the model's size shrinks, its speed rises, and its quality dips a little. That's quantization in one picture: you give up a small amount of quality to get a smaller, faster model.

FP32 — full precision

The baseline format: maximum fidelity, but the largest and slowest to run. Every step down trades a little quality for a smaller, faster model.

Slider values at FP32: Size 100 · Speed 40 · Quality 100

Illustrative, not measured. These numbers show the direction of the trade-off only; they are not benchmark results for any specific model.
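
You can see the direction of the quality cost with a rough stand-in: how far weights drift when they are rounded to fewer bits. This sketch assumes randomly generated weights and one shared scale per bit width, and it measures rounding error on the numbers themselves, not the quality of any real model's output.

```python
import random


def quantization_error(weights, bits):
    """Mean absolute round-trip error when storing weights as signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    total = 0.0
    for w in weights:
        code = max(-qmax, min(qmax, round(w / scale)))
        total += abs(w - code * scale)
    return total / len(weights)


rng = random.Random(0)  # fixed seed so the run is repeatable
weights = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # made-up weights
for bits in (8, 4):
    print(f"{bits}-bit mean error: {quantization_error(weights, bits):.4f}")
```

The 4-bit error comes out clearly larger than the 8-bit error, matching the idea that the cost grows as precision drops.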

04 Two ways to quantize: after training or during it

There are two moments you can quantize a model. The common, quick way is post-training quantization (PTQ): you take a model that's already finished training and convert its weights to lower precision, with no retraining required. It's fast and convenient, and for many models the quality cost is small. The more involved way is quantization-aware training (QAT): you account for quantization while the model is still training, so the model learns to cope with lower precision from the start. QAT takes extra training effort, but because the model adapts as it learns, it can hold on to more quality at very low precision. Put simply, PTQ quantizes a finished model; QAT trains the model with quantization in mind.

  • Post-training quantization (PTQ): quantize an already-trained model. Quick, with no retraining and a possible small quality cost.
  • Quantization-aware training (QAT): bake quantization into training so the model adapts. More effort, but often better quality at low precision.
  • PTQ is the usual starting point; reach for QAT when the quality at low precision isn't good enough.
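
The difference between the two is where the rounding happens, and a toy example can show just that. The sketch below fits a single weight and makes several assumptions: the grid spacing `STEP`, the data points, and the gradient shortcut that treats rounding as the identity (often called a straight-through estimator) are all chosen for illustration, since the lesson does not say how a real QAT setup handles rounding. A one-weight model is far too small to show QAT's quality benefit, so read it for the mechanics only.

```python
STEP = 0.25  # hypothetical grid spacing, standing in for a low-precision format


def fake_quantize(w):
    """Snap a weight to the nearest multiple of STEP."""
    return round(w / STEP) * STEP


def train(data, quantize_in_forward, lr=0.05, epochs=200):
    """Fit y = w * x by gradient descent; return the full-precision weight."""
    w = 0.0
    for _ in range(epochs):
        for x, y in data:
            # QAT: the forward pass sees the rounded weight. PTQ: it sees w as-is.
            w_used = fake_quantize(w) if quantize_in_forward else w
            error = w_used * x - y
            # Gradient of squared error, with rounding treated as the identity,
            # applied to the full-precision copy of the weight.
            w -= lr * 2 * error * x
    return w


data = [(1.0, 0.6), (2.0, 1.2), (3.0, 1.8)]  # made-up points on the line y = 0.6 * x

ptq_weight = fake_quantize(train(data, quantize_in_forward=False))  # train, then round once
qat_weight = fake_quantize(train(data, quantize_in_forward=True))   # round inside every step
print("PTQ weight:", ptq_weight)
print("QAT weight:", qat_weight)
```

In the PTQ path the model never sees rounding until training is over; in the QAT path every training step runs with the rounded weight, so the model's errors during training already include the effect of low precision.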

05 Running it locally: tooling, formats & when it makes sense

Because a quantized model is small enough to fit on everyday hardware, you can run it locally, on your own laptop or a consumer GPU, instead of calling a cloud service. A small ecosystem of tools makes that practical. The three parts below cover the tooling, the trade-offs, and when local beats cloud.

The tooling, at a high level

A handful of widely-used, neutral tools make local quantized models practical. GGUF is a model file format for quantized models; llama.cpp is an inference engine that popularized running them on CPUs and laptops. Ollama downloads and runs models locally with a single command. bitsandbytes is a library that provides low-bit quantization, often used from frameworks like Hugging Face Transformers.

  • GGUF: a file format for packaging quantized models
  • llama.cpp: an engine to run them on laptops / CPUs
  • Ollama: one-command local download & run
  • bitsandbytes: low-bit quantization from a framework
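
Once a tool like Ollama is running a model on your machine, your own code can talk to it over a local HTTP address. Here is a minimal sketch using only Python's standard library. It assumes Ollama is installed and serving on its default local port, and that you have already downloaded a model; the model name in the usage comment is a placeholder, not a recommendation.

```python
import json
import urllib.request

# Ollama's default local endpoint (assumes a default install on this machine).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model, prompt):
    """Build the POST request for a single, non-streamed completion."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model, prompt):
    """Send the prompt to the local server and return the generated text."""
    with urllib.request.urlopen(build_generate_request(model, prompt)) as resp:
        return json.loads(resp.read())["response"]


# Usage (needs a running local server and a model you have pulled):
# print(generate("your-model-name", "Explain quantization in one sentence."))
```

Nothing in this request leaves your machine, which is the privacy point in practice: the prompt goes to `localhost`, not to a cloud service.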

The trade-offs: size vs quality vs speed

Lower precision pulls three levers at once. Size: fewer bits per weight means a smaller memory footprint, the reason a model fits on your machine at all. Speed: smaller, lower-precision weights often run faster. Quality: there's a small cost that tends to grow the further you push precision down. The skill is picking the lowest precision whose quality is still good enough for your task.

  • size: smaller footprint · fits on consumer hardware
  • speed: often faster inference at lower precision
  • quality: small cost that grows at very low precision
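
"Pick the lowest precision whose quality is still good enough" can be written as a small rule. This sketch assumes you have measured a quality score for each format on your own task; the scores and the threshold below are made-up placeholders, not benchmark results.

```python
# Formats ordered from lowest to highest precision.
LADDER = ["INT4", "INT8", "FP16", "FP32"]


def lowest_acceptable_precision(quality_by_format, min_quality):
    """Return the lowest-precision format whose score meets the bar, or None."""
    for fmt in LADDER:
        score = quality_by_format.get(fmt)
        if score is not None and score >= min_quality:
            return fmt
    return None


# Hypothetical scores from your own evaluation (higher is better).
scores = {"FP32": 0.90, "FP16": 0.90, "INT8": 0.89, "INT4": 0.84}
print(lowest_acceptable_precision(scores, min_quality=0.88))  # INT8
```

With these placeholder numbers INT4 falls below the bar, so the rule stops at INT8. A stricter threshold pushes the answer back up the ladder, and a threshold no format meets returns `None`.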

When local makes sense, and when cloud wins

Running a quantized model locally is attractive for three reasons: privacy (your data stays on the device), cost (no per-query cloud charge), and offline use (no connection required). The cloud still wins when you need the very largest models or the highest quality, or when you would rather not manage hardware. Many setups mix both: local for routine, private work; cloud for the heaviest lifts.

  • local: privacy · no per-call cost · works offline
  • cloud: largest models · peak quality · no hardware to manage
  • often: mix the two by task

07 Take it with you & go deeper

"Quantization & running AI locally" — one-page summary
The whole module distilled to a printable cheat-sheet.
Coming next: deeper progression

Local-inference deep dive (GGUF, llama.cpp, Ollama)

How to pick a quantized build, run it with a local engine, and choose the right precision for your hardware.

Choosing a precision for your task

How to weigh size, speed, and quality — and decide how far down the precision ladder you can safely go.

Concept map

The whole lesson in one tree, with the key ideas listed under each branch.

Why big models are heavy
  • A model's size comes from its weights — billions of learned numbers it loads into memory to run.
  • Memory used is roughly the number of weights times the space each number takes.
  • Two ways to shrink a model: fewer weights, or fewer digits per weight — quantization does the second.
What quantization is: the precision ladder
  • Quantization stores each weight at lower numeric precision — fewer bits per number.
  • Like rounding 3.14159 to 3.14: a little detail is lost, but it takes far less space.
  • A common ladder from higher to lower precision: FP32 → FP16 → INT8 → INT4.
See the trade-off: size, speed, quality
  • Step precision down and three things move: size shrinks, speed rises, quality dips a little.
  • The quality cost is small at first and tends to grow the further you push precision down.
  • The skill is picking the lowest precision whose quality is still good enough for your task.
Two ways to quantize: after training or during it
  • Post-training quantization (PTQ): quantize an already-trained model — quick, no retraining, possible small quality cost.
  • Quantization-aware training (QAT): bake quantization into training so the model adapts — more effort, often better quality at low precision.
  • PTQ is the usual starting point; reach for QAT when low-precision quality isn't good enough.
Running it locally: tooling & when it makes sense
  • A small ecosystem makes local models practical: GGUF (file format), llama.cpp (engine), Ollama (one-command run), bitsandbytes (low-bit library).
  • Local wins for privacy, cost, and offline use.
  • Cloud still wins for the largest models, peak quality, or when you'd rather not manage hardware — many setups mix both.
Sources & review

Published by Tech Jacks Solutions · Reviewed June 2026. This lesson explains established concepts and is grounded in the references below; figures shown in the interactives are illustrative and labelled as such.

Quantization & running AI locally — in 5 minutes

Tech Jacks Solutions · AI Knowledge Hub · educational summary

Why big models are heavy

A model is billions of learned numbers (weights). Memory used is roughly the number of weights times the space each number takes. So you can shrink a model with fewer weights, or fewer digits per weight — quantization does the second.

What quantization is

Quantization stores each weight at lower numeric precision (fewer bits) to cut memory and often speed up inference, at a small quality cost. A common ladder from higher to lower precision: FP32 (full) → FP16 (half) → INT8 (8-bit) → INT4 (4-bit).

The trade-off

Lower precision makes the model smaller and often faster, which is what lets large models fit on a laptop or consumer GPU — but quality dips slightly, and the cost grows the further you push precision down. Pick the lowest precision whose quality is still good enough.

PTQ vs QAT

Post-training quantization (PTQ) converts an already-trained model to lower precision — quick, no retraining. Quantization-aware training (QAT) accounts for quantization during training so the model adapts, often preserving more quality at low precision, at the cost of extra training effort.

Running locally & tooling

A smaller model can run locally instead of in the cloud. High-level tooling: GGUF (file format), llama.cpp (engine for laptops/CPUs), Ollama (one-command local run), bitsandbytes (low-bit quantization from a framework). Local makes sense for privacy, cost, and offline use; cloud wins for the largest models and peak quality.

Before you act on AI output. This is an educational module. AI systems can produce plausible-sounding but incorrect guidance. For decisions that carry real consequences — security, legal, financial, medical, or compliance — verify with a qualified professional before acting. The size, speed, and quality figures in the precision-slider interactive are illustrative teaching values, not measured results or benchmark claims about any specific product, vendor, or model. External links are provided for learning and may change; confirm against the official source. See sources.json for grounding and editorial cautions.