PyTorch Tutorial for Beginners: From Tensors to Neural Networks
You don't need a math PhD to build a working neural network in PyTorch. You need four concepts: tensors (the data), autograd (the learning), nn.Module (the architecture), and the training loop (the process). This guide teaches all four and ends with a complete digit classifier you can run yourself.
Prerequisites
Before writing any code, make sure your environment is ready. If you haven't installed PyTorch yet, follow the official install guide, which covers CUDA, macOS MPS, and CPU-only setups.
- python -c "import torch; print(torch.__version__)": should print 2.x or higher. See the install guide for version support details
- python --version
- pip install torchvision: needed for the dataset loader in the final section
This tutorial builds progressively. Each section introduces one concept, shows it in code, and connects it to the next. By the end you'll have a complete working classifier.
What Is a Tensor?
A tensor is an n-dimensional array: the same idea as a NumPy array, except tensors can live on a GPU. A scalar is a 0D tensor. A vector is 1D. A matrix is 2D. An image batch might be 4D: [batch, channels, height, width]. That's the only new vocabulary you need.
The GPU speedup can be what separates "runs in minutes" from "runs overnight." Moving a tensor or a model to the GPU takes one .to(device) call; PyTorch then runs the math and manages the memory on that device. The only constraint: your model and all its input tensors must be on the same device (covered in the GPU section).
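To make the dimension vocabulary concrete, here is a minimal sketch that builds one tensor of each rank and prints its number of dimensions and shape. The variable names and the image batch size are illustrative choices, not part of the tutorial's later code.

```python
import torch

scalar = torch.tensor(3.14)              # 0D: a single number
vector = torch.tensor([1.0, 2.0, 3.0])   # 1D: three elements
matrix = torch.zeros(2, 3)               # 2D: 2 rows, 3 columns
images = torch.zeros(8, 1, 28, 28)       # 4D: [batch, channels, height, width]

for name, t in [("scalar", scalar), ("vector", vector),
                ("matrix", matrix), ("images", images)]:
    # .ndim is the number of dimensions; .shape lists the size of each one
    print(f"{name}: ndim={t.ndim}, shape={tuple(t.shape)}")
```

The 4D shape here matches a batch of 8 single-channel 28×28 images, the layout the MNIST loader produces later on.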
    import torch

    # Create from Python list
    x = torch.tensor([1.0, 2.0, 3.0])
    print(x)  # tensor([1., 2., 3.])

    # Random tensor, shape (3, 4)
    y = torch.rand(3, 4)

    # Zeros and ones
    z = torch.zeros(2, 3)
    o = torch.ones(5)

    # Check shape, dtype, device
    print(y.shape)   # torch.Size([3, 4])
    print(y.dtype)   # torch.float32
    print(y.device)  # cpu

Tensors support all the arithmetic you'd expect (addition, multiplication, matrix multiplication), and these operations compose cleanly with autograd, which we'll cover next.

    # Element-wise ops
    a = torch.tensor([1.0, 2.0, 3.0])
    b = torch.tensor([4.0, 5.0, 6.0])
    print(a + b)  # tensor([5., 7., 9.])
    print(a * b)  # tensor([ 4., 10., 18.])

    # Matrix multiplication
    m1 = torch.rand(3, 4)
    m2 = torch.rand(4, 2)
    result = torch.matmul(m1, m2)  # shape: (3, 2)

    # Reshape
    x = torch.arange(12).float()
    x = x.reshape(3, 4)  # shape: (3, 4)
One attribute you'll see throughout training: requires_grad=True. When set on a tensor, PyTorch tracks every operation involving that tensor so it can compute gradients later. This is the bridge to autograd.
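As a small illustration of how the requires_grad flag spreads, the sketch below combines a tracked tensor with an untracked one and inspects the result. The tensor names are made up for this example.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)  # tracked, like a model weight
x = torch.tensor([3.0, 4.0])                      # untracked, like input data

# Any result computed from a tracked tensor is itself tracked
y = (w * x).sum()
print(x.requires_grad)        # False
print(y.requires_grad)        # True
print(y.grad_fn is not None)  # True: y remembers the operation that produced it

# .detach() returns a view of the same value that is cut off from tracking
frozen = y.detach()
print(frozen.requires_grad)   # False
```

The grad_fn attribute is the link autograd follows backward when it computes gradients.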
Autograd: How PyTorch Learns
Neural network training boils down to one question: how do I change my model's parameters to reduce the error? Autograd answers it automatically. You don't calculate derivatives by hand; PyTorch does it for you by watching your code run.
The computation graph is dynamic: it is rebuilt on every forward pass. This means the operations your model runs can change between iterations (for example, through ordinary Python if statements and loops), which is one reason PyTorch is so widely used in research.
    import torch

    # Mark tensor for gradient tracking
    x = torch.tensor(3.0, requires_grad=True)

    # Forward pass: builds the computation graph
    y = x ** 2 + 2 * x + 1  # y = x² + 2x + 1

    # Backward pass: computes dy/dx
    y.backward()

    # Gradient stored in .grad (dy/dx = 2x + 2 = 8 at x=3)
    print(x.grad)  # tensor(8.)

Each .backward() call adds to the .grad attribute rather than replacing it. In a training loop you must call optimizer.zero_grad() once per step, before the backward pass, or gradients from the previous batch will corrupt the current one.

    import torch

    x = torch.tensor(2.0, requires_grad=True)

    # First backward: grad = 2x = 4
    loss = x ** 2
    loss.backward()
    print(x.grad)  # tensor(4.)

    # Second backward without zeroing: grad ACCUMULATES to 8
    loss = x ** 2
    loss.backward()
    print(x.grad)  # tensor(8.): wrong, should be 4.

    # Correct: zero before each backward
    x.grad.zero_()
    loss = x ** 2
    loss.backward()
    print(x.grad)  # tensor(4.): correct

In practice you rarely call .backward() on individual tensors directly. You use a loss function and an optimizer, which wrap this logic cleanly. That's the training loop, which comes right after the next section on nn.Module.
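Because the graph is rebuilt on each forward pass, ordinary Python control flow can decide which operations get recorded. The sketch below uses a hypothetical function, branchy, that takes a different path depending on its input; autograd differentiates whichever branch actually ran.

```python
import torch

def branchy(x):
    # Data-dependent control flow: the branch taken decides the graph
    if x.sum() > 0:
        return (x ** 2).sum()   # gradient is 2x
    return (3 * x).sum()        # gradient is 3 for every element

a = torch.tensor([1.0, 2.0], requires_grad=True)
branchy(a).backward()
print(a.grad)  # tensor([2., 4.])

b = torch.tensor([-1.0, -2.0], requires_grad=True)
branchy(b).backward()
print(b.grad)  # tensor([3., 3.])
```

No special graph-building API is involved; the same function produced two different graphs on two calls.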
Building a Neural Network with nn.Module
PyTorch's torch.nn module provides all the building blocks you need. Every neural network in PyTorch is a Python class that inherits from nn.Module. That parent class does the heavy lifting: it tracks all learnable parameters, moves them with the model when you call .to(device), and plugs directly into optimizers.
The structure is always the same:
- __init__: define your layers (nn.Linear, nn.Conv2d, etc.)
- forward: define how data flows through those layers, including activation functions

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SimpleNet(nn.Module):
        def __init__(self):
            super().__init__()
            # Define layers: parameters are auto-tracked
            self.fc1 = nn.Linear(784, 256)  # input → hidden 1
            self.fc2 = nn.Linear(256, 128)  # hidden 1 → hidden 2
            self.fc3 = nn.Linear(128, 10)   # hidden 2 → output

        def forward(self, x):
            # Define the data flow
            x = F.relu(self.fc1(x))  # apply ReLU after layer 1
            x = F.relu(self.fc2(x))  # apply ReLU after layer 2
            x = self.fc3(x)          # raw logits: no activation on output
            return x

    model = SimpleNet()
    print(model)

    # Count trainable parameters
    total = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"Parameters: {total:,}")  # 235,146
Notice: the forward method doesn't explicitly enable gradient tracking. That's handled automatically because model parameters have requires_grad=True by default, so you just write the data flow. The output of fc3 is a vector of raw logits: unnormalized scores before any probability conversion. We leave them raw because CrossEntropyLoss expects logits and applies log-softmax internally.
The SimpleNet example uses F.relu (from torch.nn.functional) as an inline activation. The MNIST example below uses self.relu = nn.ReLU() as a module attribute. These are functionally equivalent; both apply the same ReLU activation. The module form (nn.ReLU()) is preferred when you want activations to appear in print(model) or when using hooks for inspection.
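To see why the output stays as raw logits, the following sketch checks numerically that CrossEntropyLoss on logits matches log-softmax followed by negative log-likelihood, and shows that applying softmax first changes the loss. The logits and labels are random stand-ins, not model outputs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)           # stand-in for model output: 4 samples, 10 classes
labels = torch.tensor([1, 0, 7, 3])   # stand-in class indices

criterion = nn.CrossEntropyLoss()

# CrossEntropyLoss on raw logits...
ce = criterion(logits, labels)
# ...equals log-softmax followed by negative log-likelihood
manual = F.nll_loss(F.log_softmax(logits, dim=1), labels)
print(torch.allclose(ce, manual))  # True

# Applying softmax first means softmax is effectively applied twice
double = criterion(F.softmax(logits, dim=1), labels)
print(ce.item(), double.item())    # the two values differ
```

The second loss is computed on probabilities that were already squashed into the range 0 to 1, which flattens the differences between classes and weakens the training signal.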
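A quick check of the equivalence claim, as a minimal sketch: both forms produce identical outputs, and only the module form shows up when a model is printed. The input values and the small nn.Sequential model are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 1.5])

functional_out = F.relu(x)   # functional form, called inline
module_out = nn.ReLU()(x)    # module form, usually stored as an attribute

print(torch.equal(functional_out, module_out))  # True

# Only the module form is visible in the printed architecture
print(nn.Sequential(nn.Linear(4, 2), nn.ReLU()))
```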
Don't apply softmax in forward() when using CrossEntropyLoss: that loss function applies log-softmax internally. Applying softmax yourself first leads to double-application and numerically worse results.
Don't call forward() directly. Call the model object like a function: output = model(x). Registered hooks (forward hooks and forward pre-hooks) only run through the module's __call__, which wraps forward().
The Training Loop
The training loop is the heartbeat of every neural network. Its outer loop runs once per epoch (one full pass over your dataset). Inside, you process data in batches of typically 32–256 samples, update the model after each batch, and track your loss.
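The relationship between dataset size, batch size, and batches per epoch can be sketched with a toy dataset. The sizes below (1,000 samples, 20 features) are arbitrary assumptions chosen to make the arithmetic easy to follow.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy dataset: 1,000 samples, 20 features, 10 classes
features = torch.randn(1000, 20)
targets = torch.randint(0, 10, (1000,))
loader = DataLoader(TensorDataset(features, targets), batch_size=64, shuffle=True)

# One epoch = one pass over the loader = ceil(1000 / 64) batches
print(len(loader))  # 16

batch_sizes = [inputs.shape[0] for inputs, _ in loader]
print(batch_sizes[0], batch_sizes[-1])  # 64 40 (the last batch holds the remainder)
```

With the default drop_last=False the final, smaller batch is kept, so every sample is seen exactly once per epoch.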
    import torch
    import torch.nn as nn
    import torch.optim as optim

    # Assumes model, train_loader defined; see MNIST section for complete example
    num_epochs = 5
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)

    for epoch in range(num_epochs):
        model.train()  # training mode: activates dropout + batch norm statistics
        running_loss = 0.0
        for inputs, labels in train_loader:
            optimizer.zero_grad()              # step 1: clear gradients (before forward)
            outputs = model(inputs)            # step 2: forward pass
            loss = criterion(outputs, labels)  # step 3: compute loss
            loss.backward()                    # step 4: backward pass
            optimizer.step()                   # step 5: update weights
            running_loss += loss.item()
        avg_loss = running_loss / len(train_loader)
        print(f"Epoch {epoch+1}: loss = {avg_loss:.4f}")
The model.train() call isn't cosmetic: it activates dropout layers and batch normalization's training-mode behavior (tracking running stats). When you switch to evaluation, call model.eval() and wrap inference in torch.no_grad() to disable gradient tracking and cut memory usage. This pair is so common that the MNIST example below demonstrates exactly this pattern.
Note on zero_grad() placement: gradients must be cleared before each backward pass, and never between loss.backward() and optimizer.step(), because the optimizer reads the gradients during step(). This tutorial zeroes at the top of each iteration, before the forward pass; zeroing right after optimizer.step() is equivalent. Either way, each batch starts clean with no carryover from the previous iteration.
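The effect of the two modes is easiest to see on a single dropout layer. This is a minimal sketch with an arbitrary dropout probability of 0.5, not part of the classifier built later.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Dropout(p=0.5)
x = torch.ones(8)

layer.train()          # training mode: randomly zero elements, scale the rest by 1/(1-p)
train_out = layer(x)
print(train_out)       # a mix of 0.0 and 2.0

layer.eval()           # eval mode: dropout becomes a no-op
eval_out = layer(x)
print(eval_out)        # all ones, identical to the input

# torch.no_grad() is separate from eval(): it turns off gradient tracking
w = torch.tensor(2.0, requires_grad=True)
with torch.no_grad():
    y = w * 3
print(y.requires_grad)  # False
```

Note that model.eval() changes layer behavior while torch.no_grad() changes gradient bookkeeping; inference code usually wants both.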
Multi-class classification: nn.CrossEntropyLoss, which combines log-softmax and negative log-likelihood. Binary classification: nn.BCEWithLogitsLoss. Regression: nn.MSELoss (mean squared error). Recommendation: start with Adam at lr=1e-3; it's robust and rarely needs tuning early on.
Complete MNIST Digit Classifier
MNIST is 70,000 grayscale images of handwritten digits (0–9), each 28×28 pixels. It's been the "hello world" of neural networks since 1998. Here's a complete, runnable classifier using everything covered so far.
The architecture: 3 fully connected layers (784 → 256 → 128 → 10). Input is a flattened 28×28 pixel image. Output is 10 raw logits, one per digit class.

    import torch
    import torch.nn as nn
    import torch.optim as optim
    from torchvision import datasets, transforms
    from torch.utils.data import DataLoader, random_split

    # ── Data ──────────────────────────────────────────────────────
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))  # MNIST mean, std
    ])
    full_train = datasets.MNIST(root='./data', train=True, download=True, transform=transform)
    test_data = datasets.MNIST(root='./data', train=False, transform=transform)

    # Split training set: 55k train, 5k validation
    train_data, val_data = random_split(full_train, [55000, 5000])

    train_loader = DataLoader(train_data, batch_size=64, shuffle=True)
    val_loader = DataLoader(val_data, batch_size=64)
    test_loader = DataLoader(test_data, batch_size=64)

    # ── Model ─────────────────────────────────────────────────────
    class MNISTNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.fc1 = nn.Linear(784, 256)
            self.fc2 = nn.Linear(256, 128)
            self.fc3 = nn.Linear(128, 10)
            self.relu = nn.ReLU()

        def forward(self, x):
            x = x.view(-1, 784)  # flatten 28x28 → 784
            x = self.relu(self.fc1(x))
            x = self.relu(self.fc2(x))
            return self.fc3(x)   # raw logits

    model = MNISTNet()
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)

    # ── Training + per-epoch validation ───────────────────────────
    for epoch in range(5):
        model.train()  # training mode: enables dropout + batch norm stats
        for inputs, labels in train_loader:
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

        # Validate after each epoch to track progress
        model.eval()  # eval mode: disables dropout, fixes batch norm
        val_correct = val_total = 0
        with torch.no_grad():  # disable gradient tracking for speed
            for inputs, labels in val_loader:
                outputs = model(inputs)
                _, predicted = torch.max(outputs, 1)
                val_total += labels.size(0)
                val_correct += (predicted == labels).sum().item()
        print(f"Epoch {epoch+1}: val accuracy = {100*val_correct/val_total:.1f}%")

    # ── Evaluation ────────────────────────────────────────────────
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for inputs, labels in test_loader:
            outputs = model(inputs)
            _, predicted = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    print(f"Test accuracy: {100 * correct / total:.2f}%")
    # Typical result: ~97% after 5 epochs (varies by run and hardware)

Run this script as-is; torchvision.datasets.MNIST downloads the data automatically the first time. After 5 epochs (roughly 2–3 minutes on CPU), you should see ~97% test accuracy (results vary by hardware and random initialization). That's a working neural network. The normalization values (0.1307,) and (0.3081,) are the MNIST-specific mean and standard deviation. For other datasets, compute them from your own data.
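For a dataset other than MNIST, the per-channel mean and standard deviation could be computed along these lines. This is a sketch under stated assumptions: the images tensor is random stand-in data, and it assumes the whole dataset fits in memory as one [N, C, H, W] tensor already scaled to the range 0 to 1.

```python
import torch

# Stand-in for a real dataset: 100 fake single-channel 28x28 images in [0, 1]
torch.manual_seed(0)
images = torch.rand(100, 1, 28, 28)

# Reduce over batch, height, and width, leaving one value per channel
mean = images.mean(dim=(0, 2, 3))
std = images.std(dim=(0, 2, 3))
print(mean, std)

# Applying the statistics by hand: broadcast over the channel dimension
normalized = (images - mean[None, :, None, None]) / std[None, :, None, None]
print(normalized.mean().item(), normalized.std().item())  # close to 0 and 1
```

The resulting values would then be passed to transforms.Normalize in place of the MNIST constants, for example as mean.tolist() and std.tolist(). Compute them on the training split only, so that no information from validation or test data leaks into preprocessing.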
GPU Acceleration
Switching from CPU to GPU requires exactly three changes: select the device, move the model, move the data. That's it. PyTorch handles everything else: no rewriting the model class, no changing the training loop logic.
    import torch
    import torch.nn as nn
    import torch.optim as optim

    # Assumes MNISTNet and train_loader from the MNIST script above

    # Detect best available hardware
    if torch.cuda.is_available():
        device = torch.device('cuda')  # NVIDIA GPU
    elif torch.backends.mps.is_available():
        device = torch.device('mps')   # Apple Silicon
    else:
        device = torch.device('cpu')
    print(f"Using: {device}")

    # Move model to device
    model = MNISTNet().to(device)

    # This is a new model instance, so build the loss and optimizer for it
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)

    # In training loop: move each batch to device
    for inputs, labels in train_loader:
        inputs = inputs.to(device)
        labels = labels.to(device)
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

RuntimeError: Expected all tensors to be on the same device. If you get this error, trace which tensor you forgot to move with .to(device). Common culprits: the labels tensor, or a tensor you created inside forward() with a call like torch.zeros(n), which defaults to CPU.

    # Save model weights after training
    torch.save(model.state_dict(), 'mnist_model.pt')

    # Load on any device (map_location handles GPU → CPU and vice versa)
    model = MNISTNet()
    model.load_state_dict(torch.load('mnist_model.pt', map_location=device))
    model.to(device)
    model.eval()  # set to eval mode before inference
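One way to avoid the device-mismatch error for tensors created inside forward() is to create them on the input's device. The module below, OffsetNet, is a hypothetical example written for this sketch; it is not part of the MNIST classifier.

```python
import torch
import torch.nn as nn

class OffsetNet(nn.Module):
    """Hypothetical module that builds a helper tensor inside forward()."""

    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 4)

    def forward(self, x):
        # torch.ones(x.shape[0], 4) alone would always land on the CPU.
        # Passing device= and dtype= keeps the new tensor next to the input.
        offset = torch.ones(x.shape[0], 4, device=x.device, dtype=x.dtype)
        return self.fc(x) + offset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = OffsetNet().to(device)
out = model(torch.randn(2, 4, device=device))
print(out.device)  # matches the selected device
```

The same forward() now runs unchanged on CPU and GPU, because nothing in it hard-codes a device.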
What to Learn Next
You've covered the complete foundation: tensors, autograd, nn.Module, the training loop, and GPU acceleration. The MNIST classifier is a real, working model. Where you go from here depends on what you want to build.
- Computer vision: Add convolutional layers (nn.Conv2d) and explore torchvision.models for pretrained architectures like ResNet and EfficientNet
- NLP and LLMs: Hugging Face Transformers builds on PyTorch. Start with AutoModelForSequenceClassification and fine-tune on your own data
- Faster training: Enable torch.compile(model) for a potential speedup on supported models and hardware, or try mixed precision with torch.autocast (speedup varies substantially by model architecture and GPU generation)
- Scale up: PyTorch Lightning reduces boilerplate for multi-GPU training; torch.distributed handles full cluster setups
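As a taste of the mixed-precision option in the list above, here is a minimal sketch of torch.autocast around a forward pass. The dtype choice (float16 on CUDA, bfloat16 on CPU) and the single-layer model are assumptions made for this example, and it covers inference only; training in float16 usually also involves gradient scaling, which is not shown.

```python
import torch
import torch.nn as nn

device_type = "cuda" if torch.cuda.is_available() else "cpu"
# Assumption for this sketch: float16 on CUDA, bfloat16 on CPU
low_precision = torch.float16 if device_type == "cuda" else torch.bfloat16

model = nn.Linear(784, 10).to(device_type)
x = torch.randn(64, 784, device=device_type)

# Inside the context, eligible ops such as linear layers run in lower precision
with torch.autocast(device_type=device_type, dtype=low_precision):
    out = model(x)

print(out.dtype)           # the lower-precision dtype chosen above
print(model.weight.dtype)  # torch.float32: the stored weights are unchanged
```

The model's parameters stay in float32; autocast only changes the precision used for selected operations inside the context.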
Real-world applications (computer vision diagnostics, recommendation systems, on-device AI) all use the same five-step training loop you just learned. The building blocks don't change; the architectures get more complex.
With PyTorch Lightning, you subclass pl.LightningModule, implement training_step() and configure_optimizers(), and Lightning handles device placement, logging, checkpointing, and multi-GPU training automatically. It does not replace the manual loop; it's a layer on top. Learn the manual loop first.