Track 01 · Foundations · Intermediate · ~9 min

How a network sees: convolutional neural networks

Feed an image to an ordinary neural network and it sees a flat list of pixels — position and shape get lost. CNNs fix that by sliding small learnable filters across the image, detecting edges and textures locally, then building those up into objects. This lesson shows you the convolution itself, step by step, right on the page.

01 Why ordinary networks struggle with images

A plain fully connected neural network treats its input as one long list of numbers. To feed it a picture, you have to flatten the grid of pixels into that list — and the moment you do, you throw away the thing that makes an image an image: which pixels sit next to which. An eye in the top-left corner and the same eye shifted slightly to the right look like completely different inputs. Worse, every pixel gets its own connection to every neuron, so even a small image needs an enormous number of weights to learn.

A convolutional neural network (CNN) is a neural network built specifically for grid-structured data such as images. Instead of connecting everything to everything, it slides small learnable filters across the input to produce feature maps — preserving spatial layout and reusing the same pattern detector everywhere it looks (LeCun et al., 1998; Goodfellow, Bengio & Courville, Ch. 9).

Spatial structure matters. In an image, a pixel only makes sense in the context of its neighbours — flattening destroys that.
Parameters explode. Fully connecting every pixel to every neuron is wasteful and hard to train.
CNNs are purpose-built for grids — they keep the 2-D layout and scan it with small filters.
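
To put rough numbers on the parameter explosion, here is a minimal sketch that counts weights for one fully connected layer and one convolutional layer. The image size, the 1,000 hidden units, and the 64 filters are assumptions picked for illustration; they do not come from this lesson, and biases are ignored.

```python
# Hypothetical sizes chosen for illustration only.
height, width, channels = 224, 224, 3   # an RGB image
hidden_units = 1000                     # neurons in a first fully connected layer

# Fully connected: every input value has its own weight to every neuron.
inputs = height * width * channels
dense_weights = inputs * hidden_units

# Convolutional: each filter holds kernel * kernel * channels weights,
# and those same weights are reused at every position on the image.
kernel, filters = 3, 64
conv_weights = kernel * kernel * channels * filters

print(f"flattened inputs:      {inputs:,}")         # 150,528
print(f"fully connected layer: {dense_weights:,}")  # 150,528,000
print(f"3x3 conv, 64 filters:  {conv_weights:,}")   # 1,728
```

Under these assumptions the convolutional layer needs several orders of magnitude fewer weights, because its weight count does not depend on the image's height or width.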

02 Three ideas that make a CNN work

Almost everything special about a CNN comes from three closely related ideas. They sound technical, but each one is just a practical answer to a problem the plain network had.

Local receptive fields. Each unit in a convolutional layer connects only to a small local region of the previous layer — its receptive field. The network learns local patterns first (an edge here, a corner there) before combining them into anything bigger.
Parameter sharing. The same filter weights are reused at every position on the image. A detector that finds a vertical edge in one corner finds it everywhere, which slashes the number of parameters versus a fully connected layer and gives the network translation equivariance — shift the input, and the feature map shifts the same way.
Pooling. Pooling layers (most commonly max or average pooling) downsample a feature map by summarising each small neighbourhood into a single value. That shrinks the data, saves computation, and adds a degree of translation invariance: a small shift in the input often leaves the pooled output unchanged.
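
The pooling idea can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation: `max_pool2d` is a hypothetical helper, and the two 4×4 feature maps are made-up values that differ only by a one-pixel shift of the strongest response.

```python
def max_pool2d(feature_map, size=2, stride=2):
    """Downsample a 2-D list by keeping the maximum of each size x size window."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - size + 1, stride):
        pooled_row = []
        for c in range(0, cols - size + 1, stride):
            window = [feature_map[r + i][c + j]
                      for i in range(size) for j in range(size)]
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled

# Illustrative feature map with a strong response in the top-left corner.
original = [
    [9, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 5],
    [0, 0, 0, 0],
]
# The same map with that strong response shifted one pixel to the right.
shifted = [
    [0, 9, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 5],
    [0, 0, 0, 0],
]

print(max_pool2d(original))  # [[9, 0], [0, 5]]
print(max_pool2d(shifted))   # [[9, 0], [0, 5]]
```

Both inputs pool to the same 2×2 map because the shift stays inside one pooling window. A shift that crosses a window boundary would change the result, which is why the invariance is only partial.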

These ideas have a long pedigree: they are loosely inspired by the visual cortex's hierarchy of simple and complex cells with local receptive fields (Hubel & Wiesel, 1962), an idea first turned into an architecture in Fukushima's neocognitron (1980). The analogy is motivational, not a literal model of the brain.

03 Watch a convolution happen

Here is the operation at the heart of every CNN. A small kernel (a grid of weights) is laid over a patch of the input — the receptive field. Each overlapping pair of numbers is multiplied, all the products are added up, and that single sum becomes one cell of the output feature map. Then the kernel slides over by the stride and does it again. Step through it below: the highlighted patch is what the kernel currently sees, and the multiply-add is written out in full.

Interactive: pick a kernel and a stride, then step the window across the 6×6 input to fill in the feature map one window at a time (16 windows at stride 1).

A 3×3 kernel on a 6×6 input with stride 1 produces a 4×4 feature map; stride 2 produces 2×2 — the window jumps further, so there are fewer output cells. Pixel values are illustrative (a 0–9 scale), and brightness is shown for readability.

Multiply, add, place one cell. Every feature-map cell is one dot product between the kernel and the patch beneath it.
The kernel is shared. The exact same nine weights are reused at every window — that is parameter sharing in action.
Stride and kernel size set the output size. A bigger stride skips positions, producing a smaller feature map (Dumoulin & Visin, 2016).
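
The multiply, add, and slide procedure described above can be written out in plain Python. Treat this as a sketch under stated assumptions: `conv2d` is a hypothetical helper with no padding, the 6×6 image is an invented vertical edge on a 0–9 scale, and the kernel is a simple vertical-edge detector chosen for illustration.

```python
def conv2d(image, kernel, stride=1):
    """Slide a square kernel over a 2-D image (lists of lists), no padding."""
    k = len(kernel)
    out_rows = (len(image) - k) // stride + 1
    out_cols = (len(image[0]) - k) // stride + 1
    output = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            top, left = r * stride, c * stride
            # One output cell = dot product of the kernel and the patch under it.
            # The kernel is not flipped, as is usual in deep-learning code
            # (strictly speaking this is cross-correlation).
            total = sum(image[top + i][left + j] * kernel[i][j]
                        for i in range(k) for j in range(k))
            row.append(total)
        output.append(row)
    return output

# Illustrative 6x6 image: dark on the left, bright on the right.
image = [[0, 0, 0, 9, 9, 9] for _ in range(6)]

# Illustrative kernel that responds to dark-to-bright vertical edges.
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

for row in conv2d(image, kernel, stride=1):
    print(row)                            # four rows of [0, 27, 27, 0]
print(conv2d(image, kernel, stride=2))    # [[0, 27], [0, 27]]
```

The same nine weights are used for every window, and the large values line up with the windows that straddle the edge. Raising the stride to 2 visits fewer windows, so the map shrinks from 4×4 to 2×2.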

04 From edges to objects: stacking layers

One convolution finds simple patterns. The power comes from stacking them. A typical CNN alternates a convolution plus a nonlinear activation with a pooling layer, again and again, then finishes with one or more fully connected layers (or a global pooling layer) that turn the final features into a prediction. Because each layer feeds on the layer below, the network learns a feature hierarchy: early layers detect edges and textures, and deeper layers combine those into parts and whole objects. The milestone architectures below are points on that same idea, not a single fixed design.
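
One way to see why deeper layers can respond to larger structures is to track how much of the input a single unit can "see" as layers are stacked. The sketch below uses the standard receptive-field recurrence for layers without dilation. The four-layer stack is a hypothetical example, not an architecture from this lesson.

```python
def receptive_fields(layers):
    """Return the input-side receptive field size after each (kernel, stride) layer."""
    rf = 1     # a raw input pixel sees only itself
    jump = 1   # distance, in input pixels, between neighbouring units
    sizes = []
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
        sizes.append(rf)
    return sizes

# Hypothetical stack: 3x3 conv, 2x2 pool (stride 2), 3x3 conv, 2x2 pool (stride 2).
stack = [(3, 1), (2, 2), (3, 1), (2, 2)]
print(receptive_fields(stack))  # [3, 4, 8, 10]
```

A unit after the first convolution covers a 3×3 patch of the input, while a unit at the end of this small stack covers a 10×10 patch, which is enough room for combinations of edges rather than single edges.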

Interactive: step through the lineage

LeNet-5 — the blueprint

LeCun and colleagues used gradient-based training end-to-end to recognise handwritten digits. It established the canonical pattern still in use: convolution and subsampling (pooling) layers for feature extraction, followed by fully connected layers for the final decision.

takeaway: the conv → pool → fully connected stack, trained by gradient descent, was born here.

AlexNet — deep learning goes mainstream

Krizhevsky, Sutskever and Hinton showed a deep CNN could win large-scale image classification on ImageNet, using ReLU activations, dropout for regularisation, and GPU training to make depth practical. This result is widely credited with kicking off the modern deep-learning era for vision.

takeaway: ReLU + dropout + GPUs made deep CNNs trainable at scale.

VGG — depth from small filters

Simonyan and Zisserman showed that stacking many small (3×3) convolutional filters, layer on layer, was an effective way to build very deep, uniform networks — a design principle that influenced almost everything after it.

takeaway: many small 3×3 filters, stacked deep, beat a few large ones.
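
A rough weight count helps explain the appeal of small filters. Three stacked 3×3 convolutions cover the same 7×7 region of the input as a single 7×7 convolution, but with fewer weights and more nonlinearities in between. The channel count of 64 is an assumption for illustration, `conv_weight_count` is a hypothetical helper, and biases are ignored.

```python
def conv_weight_count(kernel, channels_in, channels_out):
    """Weights in one convolutional layer, ignoring biases."""
    return kernel * kernel * channels_in * channels_out

channels = 64  # hypothetical; input and output channel counts kept equal

three_small = 3 * conv_weight_count(3, channels, channels)
one_large = conv_weight_count(7, channels, channels)

print(three_small)  # 110592
print(one_large)    # 200704
```

With these numbers the stacked 3×3 design uses a little over half the weights of the single 7×7 layer.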

ResNet — making very deep trainable

He and colleagues added residual (skip) connections that add a block's input directly to its output, so the block only has to learn a correction to that input. This eases gradient flow and makes it possible to train much deeper networks without the drop in accuracy that plain deep stacks showed. (GoogLeNet/Inception, from the same era, attacked efficiency with multi-scale "inception" modules.)

takeaway: skip connections let gradients flow, unlocking very deep CNNs.
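
The skip connection itself is a one-line idea: output = input + transform(input). The sketch below shows it on a plain list of numbers. `residual_block`, `zero_transform`, and `half_transform` are hypothetical stand-ins for learned layers, not code from the ResNet paper.

```python
def residual_block(x, transform):
    """Return x + transform(x), element by element."""
    fx = transform(x)
    return [xi + fi for xi, fi in zip(x, fx)]

def zero_transform(x):
    """Stand-in for a layer whose learned correction is zero."""
    return [0.0 for _ in x]

def half_transform(x):
    """Stand-in for a layer that has learned a small correction."""
    return [0.5 * v for v in x]

x = [1.0, -2.0, 3.0]
print(residual_block(x, zero_transform))  # [1.0, -2.0, 3.0]
print(residual_block(x, half_transform))  # [1.5, -3.0, 4.5]
```

When the transform contributes nothing, the block passes its input through untouched, so adding more blocks should not make the network worse than a shallower one by construction.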

05 Where CNNs stand today

CNNs drove a decade of progress in computer vision, powered by large benchmark datasets such as ImageNet. They remain a strong, efficient default for many image tasks, and their core ideas — local receptive fields, parameter sharing, pooling — show up well beyond images. But they are no longer the only state of the art: Vision Transformers and other attention-based architectures now rival or exceed CNNs on many vision benchmarks. The right choice depends on the task, the data you have, and your compute budget — not on any single architecture being universally best.

Still a strong default for image classification, detection, and segmentation, especially when efficiency matters.
Not the only option. Attention-based models (Vision Transformers) compete with or beat CNNs on many benchmarks.
The ideas generalise. Local pattern detection and weight sharing apply to audio, time series, and more.

07 Take it with you & go deeper

"Convolutional neural networks" — one-page summary

The whole lesson distilled to a printable cheat-sheet.

▸ Already on the site — go deeper

How neural networks work. The building block under every CNN: neurons, layers, weights, and the forward pass.
Computer vision, explained. Where CNNs are put to work: classifying, detecting, and segmenting what is in an image.

▸ Related foundations

How transformers work. The attention-based architecture that now rivals CNNs on many vision tasks.
AI vs machine learning vs deep learning. Zoom out: where deep learning, and CNNs with it, sit inside the bigger picture.

Concept map

Expand each branch to see how a CNN is built up — from why images break ordinary networks, through the convolution itself, to where CNNs stand today.

Why ordinary networks struggle with images

Flattening a grid of pixels into one long list throws away which pixels sit next to which — the spatial structure.
Fully connecting every pixel to every neuron makes parameters explode and the network hard to train.
A CNN is purpose-built for grid data: it slides small learnable filters across the input to make feature maps.

Three ideas that make a CNN work

Local receptive fields — each unit connects only to a small local region, learning local patterns first.
Parameter sharing — the same filter weights are reused at every position, slashing parameters and giving translation equivariance.
Pooling (max / average) downsamples feature maps, saving compute and adding a degree of translation invariance.

Watch a convolution happen

Every feature-map cell is one dot product between the kernel and the patch beneath it — multiply, add, place one cell.
The exact same kernel weights are reused at every window — parameter sharing in action.
Stride and kernel size set the output size: a 3×3 kernel on a 6×6 input gives 4×4 at stride 1, 2×2 at stride 2.

From edges to objects: stacking layers

A typical CNN alternates convolution + activation with pooling, then ends with fully connected (or global pooling) layers.
Stacking builds a feature hierarchy: early layers detect edges and textures, deeper layers detect parts and whole objects.
Milestones on this idea: LeNet (1998), AlexNet (2012), VGG (2014), and ResNet (2015) with skip connections.

Where CNNs stand today

Still a strong, efficient default for image classification, detection, and segmentation.
No longer the only option — Vision Transformers now rival or exceed CNNs on many vision benchmarks.
The core ideas (local detection, weight sharing) generalise well beyond images, to audio and time series.

Sources & further reading

Published by Tech Jacks Solutions · Reviewed June 2026. This lesson explains established concepts and is grounded in the references below; pixel values and grids shown in the interactive are illustrative and labelled as such.

Gradient-Based Learning Applied to Document Recognition (LeNet-5) — LeCun, Bottou, Bengio & Haffner (Proc. IEEE, 1998)
ImageNet Classification with Deep Convolutional Neural Networks (AlexNet) — Krizhevsky, Sutskever & Hinton (NeurIPS, 2012)
Very Deep Convolutional Networks for Large-Scale Image Recognition (VGG) — Simonyan & Zisserman (2014)
Deep Residual Learning for Image Recognition (ResNet) — He, Zhang, Ren & Sun (2015)
A guide to convolution arithmetic for deep learning — Dumoulin & Visin (2016)
Deep Learning — Ch. 9: Convolutional Networks — Goodfellow, Bengio & Courville
CS231n — Convolutional Networks — Stanford CS231n
Dive into Deep Learning — Convolutional Neural Networks — Zhang, Lipton, Li & Smola
torch.nn.Conv2d — PyTorch
ImageNet — Deng, Dong, Socher, Li, Li & Fei-Fei

Responsible use & transparency

This is an educational explainer, not professional or technical advice. Computer-vision systems, including CNNs, can be biased or wrong, and their outputs should be reviewed by a qualified person before they inform decisions that affect people. For guidance on managing these risks, see the NIST AI Risk Management Framework.

Tech Jacks Solutions maintains editorial independence; the references above are cited for accuracy, not endorsement. Figures, pixel values, and diagrams in the interactive are illustrative and clearly labelled. If you spot an error, we welcome corrections.

Convolutional neural networks — in 9 minutes

Tech Jacks Solutions · AI Knowledge Hub · educational summary

What a CNN is

A neural network built for grid-structured data such as images. It slides small learnable filters (kernels) across the input to produce feature maps, preserving spatial layout instead of flattening it away.

Three core ideas

Local receptive fields — each unit connects to a small local region. Parameter sharing — the same filter weights are reused at every position, cutting parameters and giving translation equivariance. Pooling — downsamples feature maps for some translation invariance and less computation.

The convolution operation

Lay the kernel over a patch, multiply each overlapping pair, add the products: that sum is one feature-map cell. Slide by the stride and repeat. With no padding, output size = floor((input − kernel) / stride) + 1, so a 3×3 kernel on a 6×6 input gives a 4×4 map at stride 1 and a 2×2 map at stride 2.
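
The output-size rule can be checked with a couple of lines of Python. This sketch assumes no padding and rounds the division down, which is what makes the stride-2 case come out to 2 instead of 2.5. `output_size` is a hypothetical helper name.

```python
def output_size(input_size, kernel_size, stride=1):
    """Feature-map width (or height) for an unpadded convolution."""
    return (input_size - kernel_size) // stride + 1

print(output_size(6, 3, stride=1))  # 4
print(output_size(6, 3, stride=2))  # 2
```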

Feature hierarchy & milestones

Stacked conv+pool layers learn edges first, then parts, then objects. Milestones: LeNet (1998), AlexNet (2012, ReLU+dropout+GPUs), VGG (2014, small 3×3 filters), ResNet (2015, skip connections for very deep nets).

Where they stand

Still a strong, efficient default for images, but attention-based Vision Transformers now rival or exceed CNNs on many benchmarks. The brain analogy is motivational, not literal.
