Gallery

Contacts

405 W. Greenlawn Ave Lansing, Michigan 48910

contact@techjacksolutions.com

+1-616-320-4064

Agentic learning lesson
Track 03 · Agentic Advanced ~9 min

How vector search gets fast: HNSW, IVF & PQ

When a retrieval system or AI agent searches millions of embeddings, comparing the query to every single one is too slow. Vector index algorithms trade a tiny, tunable bit of accuracy for enormous speed and memory savings. Learn how the three core ideas — graphs (HNSW), cell-pruning (IVF), and compression (PQ) — work, and watch the speed-vs-recall trade-off happen on the page.

Module progress
0%

01The problem: exact search doesn't scale

Semantic search and retrieval-augmented agents turn text into vectors (embeddings) and find the items whose vectors sit closest to the query's vector. The simplest way to do that is brute force: compare the query against every stored vector and return the true closest ones. This is called exact or flat search, and it returns the genuinely nearest neighbors — but its cost grows with the size of the dataset, so checking every vector becomes too slow once you have millions or billions of them.

The fix is Approximate Nearest Neighbor (ANN) search. Instead of checking everything, an ANN index prunes the search space — it looks at only a clever fraction of the data. That makes queries far faster (and, with compression, far lighter on memory), at the cost of occasionally missing a true neighbor. We measure that miss rate with recall: the fraction of the genuine nearest neighbors the index actually returns. The whole field is about sliding along one frontier — recall vs. latency vs. memory — and the knobs each algorithm gives you to choose your spot on it.

  • Exact / flat search compares the query to every vector and returns the true neighbors — but the cost grows with the dataset, so it gets slow at scale.
  • ANN trades a small, tunable drop in recall for big gains in speed and memory by searching only part of the data.
  • There's no free lunch: every index is a position on the recall vs. latency vs. memory frontier, set by its tuning knobs.

02HNSW: navigate a graph of neighbors

Hierarchical Navigable Small World (HNSW) builds a multi-layer proximity graph: every vector becomes a node connected to some of its nearest neighbors. Each node is randomly assigned a maximum layer drawn from an exponentially decaying distribution, so the upper layers are sparse "express lanes" and the bottom layer (layer 0) holds all the points. A query enters at the top layer's entry point and greedily hops toward closer and closer nodes, dropping down a layer each time it can't get closer, until it refines among near neighbors at the bottom. Because each descent covers a lot of ground, search cost scales roughly logarithmically with dataset size.

Three knobs shape it: M (how many neighbor links each node keeps), efConstruction (how hard the builder works to find good links — higher means a better graph but a slower build), and efSearch / ef (how big the candidate list is at query time — higher means better recall but higher latency). Different engines rename these — Qdrant calls query-time effort hnsw_ef, Elasticsearch calls it num_candidates — but the idea is identical. HNSW delivers state-of-the-art recall-for-speed, with one catch: it's memory-heavy, because it stores the whole graph plus (usually) the original vectors.

InteractiveSwitch the strategy · click the cloud to move the query
query visited true neighbors not checked
Points visited
vs brute force
Pick a strategy and press Run search to see how many points it checks and how its results compare to the exact answer.

Illustrative 2-D demo. Real embeddings live in hundreds or thousands of dimensions and indexes hold millions of vectors; the point counts here are scaled down so the trade-off is visible. The "recall" shown is computed live against this demo's own brute-force result.

  • HNSW is a layered graph: sparse upper "express lanes" plus a bottom layer containing every point; search greedily hops toward the query.
  • Tune it with M (links per node), efConstruction (build effort), and efSearch (query-time effort — the main recall/latency dial).
  • Top-tier recall and speed, but memory-heavy unless paired with compression.

03IVF: only search the nearest cells

Inverted File (IVF) takes a different route: partition the space first. A "coarse quantizer" — usually k-means with nlist centroids — divides the dataset into nlist clusters (think Voronoi cells). Each vector is filed into the inverted list of its nearest centroid. At query time the index finds the query's nearest centroids and scans only the nprobe closest cells, ignoring the rest. That's non-exhaustive search: instead of every vector, you touch only the handful that live in the cells nearest the query.

The two knobs trade off cleanly. A larger nlist makes finer cells (each holds fewer points, so scans are faster, but you must train more centroids up front). A larger nprobe scans more cells, raising recall at the cost of latency — set nprobe to nlist and you're back to exact search. IVF needs a training step on representative data before you add vectors, because the centroids have to be learned first. On its own (IVF_FLAT stores vectors uncompressed in the lists) it uses modest extra memory; its real power shows up when combined with compression, below.

  • IVF clusters the data into nlist cells; queries scan only the nprobe nearest cells — fewer comparisons, lower latency.
  • nlist sets cell granularity; nprobe is the recall dial (more cells probed = higher recall, slower query).
  • Requires a one-time training step to learn the centroids before vectors are added.

04PQ & OPQ: compress the vectors, then combine

Product Quantization (PQ) attacks memory rather than speed. It splits each vector into m equal sub-vectors and learns a small codebook for each subspace (commonly 256 centroids, so each sub-vector becomes a single byte). A vector is then stored as just m bytes — a tiny "code" — and distances are estimated in the compressed domain using precomputed lookup tables. Comparing an uncompressed query against compressed database codes (asymmetric distance computation, ADC) is more accurate than compressing both sides. PQ shrinks memory dramatically; the price is approximation error from the compression.

Optimized Product Quantization (OPQ) improves on PQ by first learning a rotation of the space so variance is balanced across the sub-vectors before they're quantized — better recall at the same code length. In practice the big three combine: IVFPQ / IVFADC uses IVF to restrict the search to nprobe cells and PQ to compress the vectors inside them, which is the workhorse for billion-scale, memory-bounded search (FAISS even lets you stack them in one string like OPQ16_64,IVF4096,PQ16). Google's ScaNN pipelines a partitioning step, an anisotropic quantization that weights the error most relevant to ranking, and an optional exact re-ranking of the top candidates. And graph indexes have cousins too: NSG trims graph connections while staying connected, while DiskANN keeps a graph on SSD to reach billion-point scale on modest RAM.

IndexCore ideaStrengthMain cost
FlatCompare every vectorExact, simple, no trainingSlow at scale; full memory
HNSWNavigable layered graphTop recall & speedHigh memory
IVFCluster into cells, probe a fewCuts query cost; modest memoryNeeds training; recall drops if nprobe too low
PQ / OPQCompress vectors to short codesMinimal memoryApproximation error
IVFPQIVF cells + PQ compression comboBillion-scale, memory-boundedMore tuning; layered error
  • PQ compresses each vector into a few bytes of codebook indices; OPQ rotates the space first for better recall at the same size.
  • IVFPQ / IVFADC combines cell-pruning with compression — the standard recipe for billion-scale, memory-bounded search.
  • Related ideas — ScaNN (anisotropic quantization + re-ranking), NSG, and DiskANN — push the same recall/latency/memory frontier in different directions.

05Choosing: it's a budget, not a winner

There is no single "best" index — there's the index that fits your budget. If everything fits in RAM and you want the best recall-for-latency, HNSW is the common default (it's what pgvector, Weaviate, Qdrant, Milvus, OpenSearch and Elasticsearch expose for dense-vector ANN). If memory is the binding constraint at very large scale, IVFPQ (or OPQ+IVF+PQ) trades some recall for a far smaller footprint; DiskANN pushes that further by living on SSD. Whatever you pick, the tuning dials — efSearch, nprobe, code length m, nlist — let you slide along the frontier after the fact.

Two honest cautions. First, parameter names differ by engine for the same concept, so verify against your specific tool's docs before quoting. Second, benchmark numbers are not universal: recall, latency, and throughput depend heavily on the dataset, dimensionality, hardware, and settings, so treat any single figure as a configuration-specific data point, not a law. The durable takeaway is the shape of the trade-off, not a leaderboard.

  • Fits in RAM, want best recall/latency → HNSW. Memory-bounded at scale → IVFPQ; SSD-scale → DiskANN.
  • Tuning knobs let you move along the frontier post-build — you don't have to rebuild to trade recall for speed.
  • Verify parameter names per engine, and treat benchmark figures as configuration-specific, never universal.

06Check your understanding

TJS Quiz

07Take it with you & go deeper

"Vector index algorithms" — one-page summary
HNSW, IVF, and PQ distilled to a printable cheat-sheet.
▸ Related lessons — go deeper

Sources & further reading

Sources & review

Published by Tech Jacks Solutions · Reviewed June 2026. This lesson explains established algorithms and is grounded in the primary papers and official documentation below. The 2-D visualizer is illustrative and labelled as such; any performance figures depend on dataset, dimensionality, hardware, and settings, so treat them as configuration-specific. This is an educational resource, not engineering or operational advice — validate index choices and parameters against your own data and your engine's documentation before relying on them.

Concept map

Expand each branch to see how the core ideas in this lesson connect.

Exact vs. approximate search
  • Exact (flat) search compares the query against every vector, returning the true neighbors, but its cost grows with the dataset.
  • Approximate Nearest Neighbor (ANN) search prunes the search space, trading a small, tunable drop in recall for big speed & memory gains.
  • Recall measures the fraction of genuine nearest neighbors an index actually returns.
  • Every index sits somewhere on the recall vs. latency vs. memory frontier.
HNSW — navigate a graph
  • A multi-layer proximity graph: sparse upper “express lanes” plus a bottom layer holding every point.
  • A query enters at the top and greedily hops toward closer nodes, descending layers to refine.
  • Knobs: M (links per node), efConstruction (build effort), efSearch (query-time recall/latency dial).
  • Top recall and speed, but memory-heavy unless paired with compression.
IVF — search only the nearest cells
  • A coarse quantizer clusters the data into nlist cells; each vector is filed under its nearest centroid.
  • Queries scan only the nprobe nearest cells — non-exhaustive search that cuts comparisons.
  • nlist sets cell granularity; nprobe is the recall dial (more cells = higher recall, slower query).
  • Requires a one-time training step to learn the centroids before vectors are added.
PQ & OPQ — compress, then combine
  • Product Quantization splits each vector into m sub-vectors and encodes each with a codebook, shrinking it to a short code.
  • Asymmetric Distance Computation (ADC) compares an uncompressed query against compressed codes for better accuracy.
  • OPQ first rotates the space so variance is balanced across sub-vectors, improving recall at the same code length.
  • IVFPQ / IVFADC combines cell-pruning with compression — the workhorse for billion-scale, memory-bounded search.
Choosing — a budget, not a winner
  • Fits in RAM, want best recall-for-latency → HNSW (the common default across most vector databases).
  • Memory-bounded at very large scale → IVFPQ; SSD-scale → DiskANN.
  • Tuning dials (efSearch, nprobe, code length m, nlist) let you slide along the frontier after the build.
  • Parameter names differ by engine, and benchmark figures are configuration-specific, never universal.

Vector index algorithms — HNSW, IVF & PQ in one page

Tech Jacks Solutions · AI Knowledge Hub · educational summary

The core trade-off

Exact (flat) search compares the query to every vector and returns the true neighbors, but cost grows with the dataset. Approximate Nearest Neighbor (ANN) indexes trade a small, tunable drop in recall for big gains in speed and/or memory. Every index is a point on the recall vs. latency vs. memory frontier.

HNSW — graph

A multi-layer navigable graph: sparse upper "express lanes" plus a bottom layer with every point. Queries greedily hop toward closer nodes. Knobs: M (links/node), efConstruction (build effort), efSearch (query-time recall/latency dial). Top recall/speed, but memory-heavy.

IVF — cells

Cluster the data into nlist cells (k-means centroids); a query scans only the nprobe nearest cells. nlist sets granularity; nprobe is the recall dial. Requires a training step first.

PQ & OPQ — compression

PQ splits each vector into m sub-vectors and encodes each with a small codebook (often 1 byte each), shrinking memory; ADC compares an uncompressed query to compressed codes. OPQ rotates the space first for better recall at the same code length.

Combining & choosing

IVFPQ/IVFADC = cell pruning + compression, the workhorse for billion-scale memory-bounded search. ScaNN, NSG, and DiskANN push the same frontier differently. Fits in RAM → HNSW; memory-bounded at scale → IVFPQ; SSD-scale → DiskANN. Parameter names differ per engine and benchmark numbers are configuration-specific — verify both.