Small language models & on-device AI
The biggest models live in giant data centers, but a growing class of small models runs right on your phone or laptop. Learn what makes a model "small," how distillation and quantization shrink it to fit, what you trade away, and why running AI on-device changes the privacy, latency, and cost picture.
01 What makes a model "small"?
There is no single official cutoff, but researchers who survey the field tend to converge on a working definition: a small language model (SLM) is a decoder-only transformer roughly in the 100 million to 5 billion parameter range. A complementary way to define it leans on capability and constraints — an SLM is one that can still do useful, often specialized work while being small enough to run in resource-constrained settings like a phone or a laptop. Both definitions matter, and crucially, "small" is relative and keeps moving: as frontier models grow, yesterday's "large" can become tomorrow's "small."
- A common convention puts SLMs in the ~100M–5B parameter range — a guideline from surveys, not a hard standard.
- An SLM is also defined by what it enables: useful capability that fits resource-constrained, on-device settings.
- "Small" is a moving target — it's measured relative to the frontier, which keeps rising over time.
02 Why a small model can punch above its weight
For years the dominant idea was simple: bigger is better — more parameters and more data meant more capability. Small models challenge that in two ways. First, data quality over raw quantity: Microsoft's Phi line argued that training on carefully curated, "textbook-quality" and synthetic data lets a small model deviate from raw scaling laws. Their phi-1 model (1.3B parameters) reached strong coding performance on a tiny data budget, and Phi-3-mini (3.8B) is reported by Microsoft to rival much larger models. Second, good architecture and training choices: open models like TinyLlama (1.1B) and Google's Gemma (2B/7B) show compact models trained well can outperform older models of similar size.
One important caveat: claims that a small model "rivals" or "beats" a much larger one are almost always vendor-reported benchmark results. They're meaningful signals, but they are not the same as independent, real-world proof — read them as "the maker says," not settled fact.
- Data quality thesis: curated + synthetic "textbook-quality" data lets small models punch above their weight (the Phi line).
- Architecture & training matter: TinyLlama and Gemma show well-trained compact models can beat older same-size ones.
- Read parity claims carefully: "rivals a much bigger model" is usually a vendor-reported benchmark, not independent fact.
03 Two ways to shrink a model: distillation & quantization
Beyond training a small model from scratch, there are two well-established techniques for fitting capability into a smaller footprint — and they work on different things.
Knowledge distillation shrinks the model itself. A small "student" model is trained to reproduce a large "teacher" model's outputs — specifically its soft probabilities (how confident it was across all the options), not just the single hard answer. Those soft outputs carry richer information, so the student learns more than it would from labels alone. This idea (Hinton, Vinyals & Dean, 2015) is now used in shipping products: Google reports that Gemma 2's 2B and 9B models were trained with distillation.
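To make the soft-probability idea concrete, here is a minimal sketch of a distillation loss in plain Python. It is an illustration under stated assumptions, not the training recipe of any model named above: the logit values, the function names (`softmax`, `distillation_loss`), and the temperature of 2.0 are all hypothetical. The loss is the KL divergence between temperature-softened teacher and student distributions, multiplied by the squared temperature as Hinton et al. suggest.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw logits into probabilities; a higher temperature flattens them."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    Multiplied by temperature**2 so the loss scale stays comparable
    when the temperature changes (Hinton et al., 2015).
    """
    p = softmax(teacher_logits, temperature)  # soft targets from the teacher
    q = softmax(student_logits, temperature)  # student's current guess
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

# Hypothetical logits over a 3-token vocabulary (made-up values).
teacher = [4.0, 2.5, 0.5]
close_student = [3.8, 2.4, 0.6]  # roughly agrees with the teacher
far_student = [0.5, 2.5, 4.0]    # ranks the tokens in the opposite order

print("close student loss:", distillation_loss(close_student, teacher))
print("far student loss:  ", distillation_loss(far_student, teacher))
```

A student whose logits track the teacher's gets a loss near zero, while one that disagrees is penalized in proportion to how much probability it misplaces. In practice this term is usually combined with an ordinary hard-label loss.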
Quantization shrinks how the weights are stored. Model weights are normally 16-bit numbers; quantization stores them in lower precision (8-bit, 4-bit, even lower), which cuts file size and memory and can speed up inference, at the cost of some accuracy. Smart methods limit that cost: GPTQ quantizes to 3–4 bits with little accuracy loss, AWQ protects the small fraction of "salient" weights that matter most, and QLoRA introduced a 4-bit data type (NF4) that keeps a frozen base model accurate enough to fine-tune small adapter weights on top of it on a single GPU.
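The core move, storing weights as low-bit integers plus a scale factor, can be sketched with the simplest possible scheme: symmetric round-to-nearest quantization with one scale per tensor. This is deliberately naive and is not how GPTQ, AWQ, or NF4 work (those methods choose the rounding or the scaling more carefully). The weight values and function names are made up for illustration.

```python
def quantize_absmax(weights, bits=8):
    """Symmetric round-to-nearest quantization with a single scale per tensor."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax
    if scale == 0:  # all-zero tensor: any scale works
        scale = 1.0
    # Round each weight to the nearest integer step and clamp to the valid range.
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]

def max_error(weights, bits):
    """Largest absolute difference between original and restored weights."""
    codes, scale = quantize_absmax(weights, bits)
    restored = dequantize(codes, scale)
    return max(abs(w - r) for w, r in zip(weights, restored))

# Hypothetical weight values, not taken from any real model.
weights = [0.42, -1.30, 0.07, 0.88, -0.55, 0.01]

for bits in (8, 4):
    print(f"{bits}-bit max round-trip error: {max_error(weights, bits):.5f}")
```

Fewer bits means a coarser grid and a larger round-trip error, which is the accuracy cost the paragraph above describes. The named methods exist to keep that error from landing on the weights that matter most.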
- Distillation = a small student learns from a big teacher's soft probabilities — compressing capability into fewer parameters.
- Quantization = store the same weights at lower numeric precision (16-bit → 8/4/lower bit) to cut size, memory, and latency.
- Methods like GPTQ, AWQ, and QLoRA keep quality loss small by being selective about precision — they're the levers behind on-device deployment.
04 See the tradeoff: size ↔ capability ↔ where it runs
Picking a model size is a balancing act. Bigger usually means more capable, but also heavier: more memory, slower responses, and harder to run anywhere but a server. Quantization shrinks the memory footprint, and distillation pulls capability back up at a given size, so together they decide where a model of that size could realistically live.
Any figures for this tradeoff are illustrative: directional, not measured. Real capability, memory, and speed depend on the specific model, quantization level, hardware, and runtime. Examples are anchored on public reports for Phi-3-mini (3.8B), Gemma (2B/7B), TinyLlama (1.1B), and Meta's sub-billion MobileLLM.
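The memory side of the tradeoff can be estimated with simple arithmetic: weight memory is roughly parameter count times bits per weight, divided by 8 to get bytes. The sketch below applies that to the nominal sizes quoted above. It counts weights only and ignores the KV cache, activations, and the scale metadata that real quantized formats add, so treat the results as rough lower bounds. The function name is hypothetical.

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9

# Nominal parameter counts as quoted in the text, in billions.
models = [("TinyLlama", 1.1), ("Phi-3-mini", 3.8), ("Gemma 7B", 7.0)]

for name, size_b in models:
    row = ", ".join(
        f"{bits}-bit: {weight_memory_gb(size_b, bits):.2f} GB"
        for bits in (16, 8, 4)
    )
    print(f"{name} ({size_b}B) -> {row}")
```

Halving the bits per weight halves the weight memory, which is why 4-bit quantization is often what moves a model of a few billion parameters from server-only territory toward a laptop or phone.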
05 Why run AI on-device at all?
If the cloud has the biggest, most capable models, why bother running a smaller one locally? Because moving inference onto the device changes four things at once: privacy, latency, offline availability, and cost. Real shipped examples include Apple's ~3B on-device model (tuned for Apple silicon with KV-cache sharing and 2-bit quantization-aware training), Google's Gemini Nano on Android via AICore, and Meta's MobileLLM family of sub-billion-parameter models.
Privacy — your data stays on the device
When the model runs locally, your input never has to leave the device. Android's AICore (which serves Gemini Nano) is documented to isolate each request and keep no record of inputs or outputs after processing — a stronger privacy posture than sending text to a remote server.
Latency — no network round-trip
A cloud call has to travel to a server and back. On-device inference removes that round-trip, so responses can feel more immediate — especially valuable for interactive features like autocomplete or live rewriting.
Offline — it works with no connection
Because nothing is sent to a server, an on-device model keeps working on a plane, in a tunnel, or anywhere the network is unreliable. The capability travels with the device.
Cost — no per-call cloud bill
Cloud inference is usually billed by usage, per token or per request. Running the model on the user's own hardware avoids that per-call cost entirely; the tradeoff is that you ship a smaller, less capable model and use the device's compute and battery.
Concept map
The core ideas of this lesson at a glance: how small, on-device models are defined, made capable, shrunk, and run locally.
What makes a model “small”?
- No single official cutoff — surveys converge on a working range of ~100M–5B parameters.
- Also defined by capability and constraints: useful work that fits resource-constrained, on-device settings.
- “Small” is relative and keeps moving — measured against an ever-rising frontier.
Why a small model can punch above its weight
- Data quality over quantity: curated, “textbook-quality” and synthetic data lets small models deviate from raw scaling laws (the Phi line).
- Good architecture and training choices — well-trained compact models like TinyLlama and Gemma can outperform older same-size models.
- “Rivals a much larger model” is usually a vendor-reported benchmark, not independent fact.
Two ways to shrink a model: distillation & quantization
- Distillation shrinks the model: a small “student” learns from a large “teacher’s” soft probabilities.
- Quantization shrinks how weights are stored: lower precision (16-bit → 8/4/lower bit) cuts size, memory, and latency.
- Methods like GPTQ, AWQ, and QLoRA keep the quality cost small — the levers behind on-device deployment.
Why run AI on-device at all?
- Privacy: your input never has to leave the device.
- Latency: no network round-trip, so responses can feel more immediate.
- Offline: it keeps working with no connection.
- Cost: no per-call cloud bill — the tradeoff is a smaller model using the device’s compute and battery.
Published by Tech Jacks Solutions · Reviewed June 2026. This lesson explains established concepts grounded in the primary references below. The small/on-device model landscape moves quickly: model names, parameter counts, and "state-of-the-art" claims date fast, and parity/benchmark figures are vendor-reported. Treat the linked sources as the live source of truth.
- Small Language Models: Survey, Measurements, and Insights — Lu et al.
- A Comprehensive Survey of Small Language Models — Wang et al.
- Phi-3 Technical Report (3.8B Phi-3-mini) — Abdin et al., Microsoft
- Textbooks Are All You Need (phi-1) — Gunasekar et al., Microsoft Research
- TinyLlama: An Open-Source Small Language Model — Zhang et al.
- Gemma: Open Models Based on Gemini Research — Google DeepMind
- Gemma 2 (2B/9B trained via distillation) — Google DeepMind
- MobileLLM: Sub-billion Parameter Models for On-Device Use — Meta (ICML 2024)
- Distilling the Knowledge in a Neural Network — Hinton, Vinyals & Dean
- GPTQ: Accurate Post-Training Quantization — Frantar et al.
- AWQ: Activation-aware Weight Quantization — Lin et al. (MLSys 2024)
- QLoRA: Efficient Finetuning of Quantized LLMs — Dettmers et al.
- GGUF and interaction with Transformers — Hugging Face
- llama.cpp — LLM inference in C/C++ (GGUF, 2–8 bit) — ggml-org
- Gemini Nano (Android AI / AICore) — Android Developers
- Apple Intelligence Foundation Language Models (Tech Report 2025) — Apple ML Research
- On-Device Language Models: A Comprehensive Review — arXiv:2409.00088
- NIST AI Risk Management Framework (AI RMF 1.0) — NIST
Small language models & on-device AI — in 8 minutes
Tech Jacks Solutions · AI Knowledge Hub · educational summary
What "small" means
A small language model (SLM) is a decoder-only transformer roughly in the 100M–5B parameter range — a survey convention, not a hard standard. It's also defined by fitting resource-constrained, on-device settings. "Small" is relative and keeps shifting as the frontier grows.
Punching above their weight
Small models challenge "bigger is better" via data quality (the Phi line's curated, textbook-quality data) and good architecture/training (TinyLlama, Gemma). Claims that a small model "rivals" a larger one are vendor-reported benchmarks, not independent proof.
Two ways to shrink a model
Distillation shrinks the model: a small student learns from a large teacher's soft probabilities. Quantization shrinks storage: weights are kept at lower precision (16-bit → 8/4-bit) via methods like GPTQ, AWQ, and QLoRA, cutting size and memory at some accuracy cost. GGUF is a storage format (Q8…Q2 levels), not a quantization algorithm.
Why on-device
Running locally gives privacy (data stays on device), lower latency (no round-trip), offline capability, and no per-call cost — trading away some capability and using the device's compute. Real examples: Apple's ~3B on-device model, Gemini Nano (Android/AICore), Meta's MobileLLM.