How does a machine see?
A photo, to a computer, is just a grid of numbers. Computer vision is how we teach machines to turn that grid into meaning: to say what is in a picture, where it is, and which pixels belong to it. This lesson walks through how a model "sees," the three core tasks, and where it still gets fooled.
01 An image is just a grid of numbers
Imagine a sheet of graph paper where every little square is filled in with a shade of gray. A black-and-white photo on a computer is exactly that: a grid of tiny squares called pixels, and each pixel is simply a number that says how bright that spot is. A color photo is the same idea, just with three numbers per pixel — one for red, one for green, one for blue (together, RGB). So a picture the size of your phone screen is really millions of numbers stacked in a grid. That's the whole starting point: to a machine, "seeing" means doing math on that grid of numbers. The hard part is that the same object — a cat in sunlight, a cat in shadow, a cat turned sideways — becomes a wildly different set of numbers each time, so the machine has to learn patterns that survive all that change.
- A pixel is one tiny square of the image, stored as a number (brightness) — or three numbers for color (R, G, B).
- A whole image is a grid: height × width × channels. Vision is math performed on that grid.
- The challenge: the same object maps to different numbers under different light, angle, and position — so meaning lives in patterns, not single pixels.
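To make the grid idea concrete, here is a minimal sketch in plain Python (nested lists rather than an array library, purely to stay dependency-free). The pixel values and the helper names `shape`, `to_rgb`, and `brighten` are invented for illustration. It shows the height × width × channels layout, and how a simple lighting change alters every number while the underlying pattern survives.

```python
# A tiny 3x4 grayscale "image": each number is a pixel brightness
# (0 = black, 255 = white). Values are made up for illustration.
gray = [
    [10, 10, 200, 200],
    [10, 10, 200, 200],
    [10, 10, 200, 200],
]


def shape(image):
    """Return (height, width, channels) for a nested-list image."""
    height, width = len(image), len(image[0])
    first = image[0][0]
    channels = len(first) if isinstance(first, (list, tuple)) else 1
    return height, width, channels


def to_rgb(image):
    """Turn a grayscale grid into RGB by repeating each value for R, G and B."""
    return [[(v, v, v) for v in row] for row in image]


def brighten(image, amount):
    """Crude stand-in for stronger lighting: add `amount` to every pixel, capped at 255."""
    return [[min(255, v + amount) for v in row] for row in image]


rgb = to_rgb(gray)
sunlit = brighten(gray, 40)

print(shape(gray))  # (3, 4, 1)
print(shape(rgb))   # (3, 4, 3)

# Same scene, different numbers: count how many pixel values changed.
changed = sum(a != b for ra, rb in zip(gray, sunlit) for a, b in zip(ra, rb))
print(changed, "of", 3 * 4, "pixels changed")

# The pattern survives: the dark-to-bright step between columns 1 and 2
# has the same size before and after the lighting change.
print(gray[0][2] - gray[0][1], sunlit[0][2] - sunlit[0][1])
```

Every single pixel value changes under the brighter lighting, yet the step between the dark and bright halves is identical. That difference between neighbors, not any one pixel, is the kind of pattern a vision model has to pick up on.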
02 Why CNNs work: filters that build up from edges to objects
If meaning lives in patterns, how does a machine find them? The workhorse of vision is the convolutional neural network (CNN). The trick is a small filter: a tiny detector that slides across the whole image looking for one simple thing, like an edge. Because the same filter is reused everywhere, it can spot that pattern no matter where it appears. Stack many layers of filters and something powerful happens: early layers find edges, the next combine edges into textures, then into shapes, and finally into whole objects.
Sliding filter
A filter is a small window, say 3×3 pixels, that slides across the whole image. At each spot it checks for one local pattern. Crucially, the same filter is reused at every position, so it can detect that pattern anywhere while using far fewer weights than a design that connects every pixel to every neuron.
- A filter is a small, reused detector that finds a local pattern anywhere in the image — that reuse (weight sharing) is what makes CNNs efficient.
- Depth builds a hierarchy: edges → textures → shapes → objects. Each layer composes the simpler features below it.
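The sliding filter can be sketched in a few lines of plain Python. This is an illustration under assumptions, not a real CNN layer: the 5×5 image and the vertical-edge filter values are hand-picked rather than learned, and `slide_filter` is a hypothetical helper name. The multiply-and-sum it performs is what deep learning libraries call convolution (strictly speaking, cross-correlation).

```python
def slide_filter(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for top in range(out_h):
        row = []
        for left in range(out_w):
            # The same kernel weights are used at every position: weight sharing.
            total = sum(
                kernel[i][j] * image[top + i][left + j]
                for i in range(kh)
                for j in range(kw)
            )
            row.append(total)
        feature_map.append(row)
    return feature_map


# Illustrative 5x5 image: dark on the left, bright on the right.
image = [[0, 0, 0, 9, 9] for _ in range(5)]

# Hand-picked (not learned) filter that responds where brightness
# rises from left to right, i.e. at a vertical edge.
vertical_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

edges = slide_filter(image, vertical_edge)
for row in edges:
    print(row)  # each row prints [0, 27, 27]

# Weight sharing in numbers: one 3x3 filter versus wiring every one of the
# 25 input pixels to every one of the 9 output values.
filter_weights = 3 * 3
dense_weights = (5 * 5) * (3 * 3)
print(filter_weights, dense_weights)  # 9 225
```

The feature map is zero over the flat dark region and lights up wherever the window overlaps the dark-to-bright boundary. Even on this tiny image, the reused filter needs 9 weights where a fully connected layout would need 225.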
03 Watch a model "see": from pixels to a label
Let's put it together. Picture a tiny illustrative image, a grid of pixel numbers, and follow what a model does to it: first it runs edge detection, then it forms feature maps (where useful patterns light up), and finally it settles on a label. That last step is classification. The same picture can also be handled by object detection (a bounding box: what and where) or by segmentation (a pixel mask: which pixels belong to the object). Everything here is a simplified illustration, not a real model's output.
The model starts with nothing but a grid of numbers — brighter squares are higher values. There's no "cat" here yet, only pixels.
- Classification ends at a single label ("what is it"). Detection adds a box ("what + where"). Segmentation marks each pixel ("which pixels").
- The earlier stages (edges, feature maps) are shared groundwork — the same features can feed any of the three tasks.
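The pixels-to-label path described above can be sketched as a toy pipeline. Everything in it is an assumption made for illustration: the two filters are hand-picked instead of learned, the "labels" are just the filter names, and `classify` picks the strongest pooled response where a trained model would apply a learned final layer.

```python
def slide_filter(image, kernel):
    """Feature map from sliding a square kernel over a square image (no padding)."""
    k = len(kernel)
    size = len(image) - k + 1
    return [
        [
            sum(kernel[i][j] * image[r + i][c + j] for i in range(k) for j in range(k))
            for c in range(size)
        ]
        for r in range(size)
    ]


def relu(feature_map):
    """Keep positive responses, zero out the rest."""
    return [[max(0, v) for v in row] for row in feature_map]


def global_max(feature_map):
    """Pool a whole feature map down to its single strongest response."""
    return max(max(row) for row in feature_map)


# Hand-picked filters standing in for what a trained network would learn.
FILTERS = {
    "vertical edge": [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]],
    "horizontal edge": [[-1, -1, -1], [0, 0, 0], [1, 1, 1]],
}


def classify(image):
    """Pixels -> feature maps -> one pooled score per filter -> label."""
    scores = {
        name: global_max(relu(slide_filter(image, kernel)))
        for name, kernel in FILTERS.items()
    }
    label = max(scores, key=scores.get)
    return label, scores


left_dark = [[0, 0, 0, 9, 9] for _ in range(5)]
top_dark = [[0] * 5, [0] * 5, [0] * 5, [9] * 5, [9] * 5]

print(classify(left_dark))  # label is "vertical edge"
print(classify(top_dark))   # label is "horizontal edge"
```

The feature maps are the shared groundwork here: the same maps could be handed to a detection or segmentation head instead of being pooled into one label.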
04 The three core tasks: what, what+where, which pixels
Almost everything in computer vision is a version of three questions. They differ in how precisely they pin down an object in the image: from a single word, to a box, to an exact outline.
Classification — "what is it?"
The simplest task: assign one (or a few) labels to the whole image. It tells you a photo contains a dog, but not where the dog is. It's the foundation the other two tasks build on.
Object detection — "what AND where?"
Detection adds location: it draws a bounding box around each object and labels it. One image can have many boxes — useful for counting cars, finding faces, or spotting products on a shelf.
Segmentation — "which pixels belong to what?"
The most precise: label every pixel with what it belongs to, producing a mask that hugs the object's exact outline. Essential when a box is too coarse — e.g. separating road from sidewalk, or a tumor from healthy tissue.
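One way to compare the three tasks is to look at the shape of each answer. The sketch below assumes a hand-drawn mask and a made-up label, and `mask_to_box` is a hypothetical helper: real detectors typically predict boxes directly rather than deriving them from a mask. It is only meant to show how much coarser a box is than a per-pixel outline.

```python
# Hypothetical segmentation output for a 5x6 image: 1 = pixel belongs to the object.
mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]


def mask_to_box(mask):
    """Tightest box around the mask as (top, left, bottom, right), inclusive."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    if not rows:
        return None  # empty mask: nothing to box
    return rows[0], cols[0], rows[-1], cols[-1]


def box_area(box):
    """Number of pixels covered by an inclusive (top, left, bottom, right) box."""
    top, left, bottom, right = box
    return (bottom - top + 1) * (right - left + 1)


label = "dog"                         # classification: one label for the whole image
box = mask_to_box(mask)               # detection: the label plus a box
object_pixels = sum(map(sum, mask))   # segmentation: the exact pixels

print(label)
print(box, box_area(box))  # (1, 1, 3, 4) 12
print(object_pixels)       # 8
```

The box covers 12 pixels but only 8 of them belong to the object. That gap is why segmentation matters when a box is too coarse.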
05 Reusing knowledge, Vision Transformers, and where models still fail
You rarely start from scratch. Training a strong vision model needs a lot of images, so practitioners lean on transfer learning: take a model already trained on a huge image collection, and adapt it to your task. The reusable part, a general feature extractor, is called a pretrained backbone; you attach a small task-specific "head" on top for classification, detection, or segmentation. With only a few hundred labeled images, fine-tuning a backbone usually beats training a big network from scratch, which tends to overfit: it memorizes the tiny dataset instead of learning patterns that generalize.
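The backbone-plus-head split can be illustrated with a deliberately tiny sketch. It rests on several assumptions: `backbone` is a hand-written function standing in for a pretrained network, the head is a two-weight linear classifier, and the perceptron update rule is swapped in for the gradient descent used in practice. It also shows the simplest variant of transfer learning, where the backbone stays fully frozen and only the head is trained; fine-tuning would update the backbone as well.

```python
def backbone(image):
    """Stand-in for a pretrained feature extractor; nothing in here is ever updated.

    Returns two features: mean absolute brightness change left-to-right
    and top-to-bottom.
    """
    h, w = len(image), len(image[0])
    dx = sum(abs(image[r][c + 1] - image[r][c]) for r in range(h) for c in range(w - 1))
    dy = sum(abs(image[r + 1][c] - image[r][c]) for r in range(h - 1) for c in range(w))
    return [dx / (h * (w - 1)), dy / ((h - 1) * w)]


def train_head(examples, epochs=20, lr=0.1):
    """Fit a small linear head on frozen backbone features (perceptron rule)."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for image, target in examples:
            features = backbone(image)  # frozen: only weights and bias change
            score = sum(w * f for w, f in zip(weights, features)) + bias
            error = target - (1 if score > 0 else 0)
            weights = [w + lr * error * f for w, f in zip(weights, features)]
            bias += lr * error
    return weights, bias


def predict(head, image):
    """1 = vertical stripes, 0 = horizontal stripes."""
    weights, bias = head
    score = sum(w * f for w, f in zip(weights, backbone(image))) + bias
    return 1 if score > 0 else 0


# Four made-up labeled 3x3 images: a very small dataset.
examples = [
    ([[0, 1, 0], [0, 1, 0], [0, 1, 0]], 1),
    ([[0, 0, 0], [1, 1, 1], [0, 0, 0]], 0),
    ([[1, 0, 1], [1, 0, 1], [1, 0, 1]], 1),
    ([[1, 1, 1], [0, 0, 0], [1, 1, 1]], 0),
]
head = train_head(examples)

# Unseen 4x4 images: the reused features carry over to a new size.
new_vertical = [[0, 1, 0, 1] for _ in range(4)]
new_horizontal = [[0] * 4, [1] * 4, [0] * 4, [1] * 4]
print(predict(head, new_vertical), predict(head, new_horizontal))  # 1 0
```

Because the head has only three numbers to learn, four examples are enough. The heavy lifting sits in the reused feature extractor, which is the point of starting from a pretrained backbone.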
For most of the last decade, CNNs dominated computer vision. More recently, much of the field has moved toward the Vision Transformer (ViT): instead of sliding filters, a ViT chops the image into small patches, treats each patch like a word, and uses attention (the same mechanism behind modern language models) to weigh how every patch relates to every other. That lets it capture long-range structure across the whole image at once.
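The patch-and-attention idea can be sketched in plain Python. This is a simplification under stated assumptions: it uses raw pixel values as the patch vectors, whereas a real ViT first applies learned projections and adds position information, and the helper names `to_patches` and `attention_weights` are hypothetical. It assumes the image height and width are multiples of the patch size.

```python
import math


def to_patches(image, size):
    """Cut an image into non-overlapping size x size patches, each flattened to a vector."""
    patches = []
    for top in range(0, len(image), size):
        for left in range(0, len(image[0]), size):
            patches.append(
                [image[top + i][left + j] for i in range(size) for j in range(size)]
            )
    return patches


def softmax(values):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(tokens):
    """Scaled dot-product attention: row i says how much patch i attends to each patch."""
    scale = math.sqrt(len(tokens[0]))
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / scale for k in tokens])
        for q in tokens
    ]


# Illustrative 4x4 image: bright blocks in opposite corners.
image = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

patches = to_patches(image, 2)  # 4 patches, in row-major order
weights = attention_weights(patches)

for row in weights:
    print([round(w, 2) for w in row])
```

In the first row, the top-left patch gives as much weight to the bottom-right patch as to itself, even though the two never touch. A 3×3 sliding filter would need several stacked layers to connect them; attention relates them in one step.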
However a model sees, it can still be wrong in revealing ways. Bias: a model inherits the skews of its training images, and can perform unevenly across groups or conditions it rarely saw. Adversarial examples: a tiny, often invisible-to-us change to an image can make a confident model misclassify it entirely. Domain shift: a model trained mostly on clear daytime photos may stumble at night or in fog, because the new images don't match what it learned from. These limits are why computer vision in the real world demands testing, monitoring, and human judgement — not blind trust.
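The adversarial-example failure is easy to reproduce on a toy model. The sketch below assumes a made-up linear "cat" scorer with hand-picked weights over 100 pixels on a 0 to 1 brightness scale. The attack nudges each pixel slightly against the sign of its weight, in the spirit of the fast gradient sign method; real attacks target trained deep networks, but the arithmetic is the same idea.

```python
def score(weights, pixels):
    """Linear score: positive means the toy model says 'cat'."""
    return sum(w * p for w, p in zip(weights, pixels))


def predict(weights, pixels):
    return "cat" if score(weights, pixels) > 0 else "not cat"


def sign(value):
    """Return 1, -1 or 0 depending on the sign of `value`."""
    return (value > 0) - (value < 0)


def adversarial(weights, pixels, epsilon):
    """Move every pixel by at most `epsilon` in the direction that lowers the score."""
    return [p - epsilon * sign(w) for w, p in zip(weights, pixels)]


# Hand-picked, hypothetical model and image (100 pixels, brightness in 0..1).
weights = [1.0, -1.0] * 50
pixels = [0.52, 0.50] * 50

attacked = adversarial(weights, pixels, epsilon=0.02)
largest_change = max(abs(a - b) for a, b in zip(attacked, pixels))

print(predict(weights, pixels))    # cat
print(predict(weights, attacked))  # not cat
print(round(largest_change, 3))    # 0.02
```

No pixel moves by more than 2% of the brightness range, yet the label flips. Each nudge is tiny, but 100 of them all push the score the same way, and a real image has millions of pixels to push on.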
07 Take it with you & go deeper
How object detection works (in the pipeline)
Bounding boxes, anchors, and IoU: from sliding windows to modern single-shot detectors.
Vision Transformers explained (in the pipeline)
Patches, attention, and how a ViT differs from a convolutional network.
Published by Tech Jacks Solutions · Reviewed June 2026. This lesson explains established concepts and is grounded in the references below; figures shown in the interactives are illustrative and labelled as such.
- CS231n — Convolutional Neural Networks for Visual Recognition — Stanford
- CS231n — Image Classification notes — Stanford
- CS231n — Convolutional Networks notes — Stanford
- Deep Residual Learning for Image Recognition (ResNet) — He et al. (2015)
- An Image is Worth 16×16 Words (Vision Transformer) — Dosovitskiy et al. (2020)
- But what is a convolution? — 3Blue1Brown
- PyTorch vision tutorials — classifiers, detection, transfer learning
How a machine sees — computer vision in 5 minutes
Tech Jacks Solutions · AI Knowledge Hub · educational summary
An image is a grid of numbers
To a computer, a photo is a grid of pixels, each a number for brightness — or three numbers (R, G, B) for color. Vision is math performed on that grid. The hard part: the same object becomes different numbers under different lighting, angle, and position.
Why CNNs work
A convolutional neural network uses small filters that slide across the image, each detecting a local pattern anywhere. Stacked layers build a hierarchy: edges → textures → shapes → objects.
The three core tasks
Classification — what is it (a label). Object detection — what + where (bounding boxes). Segmentation — which pixels (a per-pixel mask).
Transfer learning & ViT
Transfer learning reuses a pretrained backbone (feature extractor) and fine-tunes it on a new task — better than training from scratch on little data. Vision Transformers (ViT) split an image into patches and use attention instead of convolution.
Limits
Bias from skewed training data; adversarial examples (tiny changes that fool the model); domain shift (worse on images unlike the training set). Real-world vision needs testing, monitoring, and human judgement.