What is a PyTorch tensor?

A PyTorch tensor is an n-dimensional array, conceptually similar to a NumPy array but able to run on a GPU. Tensors can represent scalars (0D), vectors (1D), matrices (2D), and higher-dimensional data. Moving a tensor to a GPU can accelerate numeric computations by 50x or more (an illustrative magnitude; actual speedup varies by operation type and GPU model), making tensors the foundational data structure for deep learning.

How does PyTorch autograd work?

PyTorch autograd uses tape-based automatic differentiation. During the forward pass, PyTorch builds a directed acyclic graph (DAG) recording every tensor operation. When you call loss.backward(), PyTorch traverses the DAG in reverse and applies the chain rule to compute gradients for every parameter. Gradients are stored in the .grad attribute of leaf tensors.

What is nn.Module in PyTorch?

nn.Module is PyTorch's base class for all neural networks. You subclass it, define your layers (nn.Linear, nn.Conv2d, etc.) in __init__, and define the data flow through those layers in forward(). PyTorch automatically tracks all learnable parameters inside nn.Module subclasses, making it easy to update them with an optimizer.

What are the steps in a PyTorch training loop?

A standard PyTorch training loop has five steps: (1) optimizer.zero_grad(): clear gradients from the previous step; (2) outputs = model(inputs): run the forward pass; (3) loss = criterion(outputs, labels): compute the loss; (4) loss.backward(): compute gradients via autograd; (5) optimizer.step(): update the model parameters.

How do you run a PyTorch model on GPU?

Use torch.device to select the hardware, then move both your model and data to that device with .to(device). On CUDA (NVIDIA): device = torch.device('cuda'). On Apple Silicon: device = torch.device('mps'). The model and all input tensors must be on the same device; mixing devices causes a runtime error.


PyTorch Tutorial for Beginners: From Tensors to Neural Networks

You don't need a math PhD to build a working neural network in PyTorch. You need four concepts: tensors (the data), autograd (the learning), nn.Module (the architecture), and the training loop (the process). This guide teaches all four and ends with a complete digit classifier you can run yourself.

Prerequisites

Before writing any code, make sure your environment is ready. If you haven't installed PyTorch yet, follow the official install guide; it covers CUDA, macOS MPS, and CPU-only setups.

Before You Start

PyTorch installed

Run python -c "import torch; print(torch.__version__)". It should print 2.x or higher. See the install guide for version support details.

Python 3.10 or later

PyTorch 2.x supports Python 3.10 – 3.13 (stable as of 2026; 3.14 is pre-release). Check with python --version

Basic Python fluency

You should be comfortable with classes, functions, and list operations; no advanced math required

torchvision (for MNIST example)

Install with pip install torchvision; it is needed for the dataset loader in the MNIST section

This tutorial builds progressively. Each section introduces one concept, shows it in code, and connects it to the next. By the end you'll have a complete working classifier.

What Is a Tensor?

A tensor is an n-dimensional array: the same idea as a NumPy array, except tensors can live on a GPU. A scalar is a 0D tensor. A vector is 1D. A matrix is 2D. An image batch might be 4D: [batch, channels, height, width]. That's the only new vocabulary you need.
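To make the dimension vocabulary concrete, here is a minimal sketch that builds one tensor of each rank mentioned above and prints its number of dimensions and shape. The variable names and the batch size of 8 are illustrative choices only.

```python
import torch

scalar = torch.tensor(3.14)               # 0D: a single number
vector = torch.tensor([1.0, 2.0, 3.0])    # 1D: three elements
matrix = torch.zeros(2, 3)                # 2D: 2 rows, 3 columns
images = torch.zeros(8, 1, 28, 28)        # 4D: [batch, channels, height, width]

for name, t in [("scalar", scalar), ("vector", vector),
                ("matrix", matrix), ("images", images)]:
    print(name, t.ndim, tuple(t.shape))
# scalar 0 ()
# vector 1 (3,)
# matrix 2 (2, 3)
# images 4 (8, 1, 28, 28)
```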

50×+

GPU acceleration over CPU for numeric tensor operations

PyTorch official documentation (illustrative of magnitude; varies by operation and GPU model)

The GPU speedup is what separates "runs in minutes" from "runs overnight." Moving to GPU takes one .to(device) call; PyTorch handles the math, the memory, and the result routing. The only constraint: your model and all its input tensors must be on the same device (covered in the GPU section).

Python: creating tensors

import torch

# Create from Python list
x = torch.tensor([1.0, 2.0, 3.0])
print(x)          # tensor([1., 2., 3.])

# Random tensor, shape (3, 4)
y = torch.rand(3, 4)

# Zeros and ones
z = torch.zeros(2, 3)
o = torch.ones(5)

# Check shape, dtype, device
print(y.shape)    # torch.Size([3, 4])
print(y.dtype)    # torch.float32
print(y.device)   # cpu

Tensors support all the arithmetic you'd expect (addition, multiplication, matrix multiplication), and these operations compose cleanly with autograd, which we'll cover next.

Python: tensor operations

# Element-wise ops
a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([4.0, 5.0, 6.0])
print(a + b)       # tensor([5., 7., 9.])
print(a * b)       # tensor([ 4., 10., 18.])

# Matrix multiplication
m1 = torch.rand(3, 4)
m2 = torch.rand(4, 2)
result = torch.matmul(m1, m2)   # shape: (3, 2)

# Reshape
x = torch.arange(12).float()
x = x.reshape(3, 4)           # shape: (3, 4)

One attribute you'll see throughout training: requires_grad=True. When set on a tensor, PyTorch tracks every operation involving that tensor so it can compute gradients later. This is the bridge to autograd.
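As a small illustration of what "tracking" means, the sketch below compares a plain tensor with one created with requires_grad=True. The names a, w, and out are made up for this example; grad_fn is the attribute PyTorch sets on results that are part of a tracked computation.

```python
import torch

a = torch.tensor([1.0, 2.0])                        # requires_grad defaults to False
w = torch.tensor([0.5, -1.0], requires_grad=True)   # tracked leaf tensor

out = (a * w).sum()   # any result that depends on w is tracked too

print(a.requires_grad, w.requires_grad, out.requires_grad)   # False True True
print(out.grad_fn is not None)                               # True
```

Only tensors that require gradients (typically model weights) pay the bookkeeping cost; input data usually stays untracked.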

Autograd: How PyTorch Learns

Neural network training boils down to one question: how do I change my model's parameters to reduce the error? Autograd answers it automatically. You don't calculate derivatives by hand; PyTorch does it for you by recording the operations as your code runs.

How Autograd Works: Three Concepts

DAG

Directed Acyclic Graph: the computation graph built during the forward pass; every operation is a node

.backward()

Traverses the DAG in reverse, applying the chain rule to compute all gradients

.grad

Attribute where gradients accumulate on leaf tensors after backward()

The computation graph is dynamic: it is rebuilt on every forward pass. This means the operations your model runs can change between iterations, which is one reason PyTorch is so widely used in research.
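The dynamic graph can be seen in a small sketch where ordinary Python control flow decides how many operations are recorded. The function repeated_product and its n_steps argument are hypothetical names used only for this illustration.

```python
import torch

def repeated_product(x, n_steps):
    # A plain Python loop: the recorded graph has as many multiplications
    # as the loop actually executes on this call.
    y = x
    for _ in range(n_steps):
        y = y * x
    return y

x = torch.tensor(2.0, requires_grad=True)

y = repeated_product(x, 2)   # y = x^3
y.backward()
print(x.grad)                # tensor(12.)  since d/dx x^3 = 3x^2 = 12 at x=2

x.grad.zero_()               # reset before the next, differently shaped graph
y = repeated_product(x, 3)   # y = x^4
y.backward()
print(x.grad)                # tensor(32.)  since d/dx x^4 = 4x^3 = 32 at x=2
```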

Python: autograd basics

import torch

# Mark tensor for gradient tracking
x = torch.tensor(3.0, requires_grad=True)

# Forward pass: builds the computation graph
y = x ** 2 + 2 * x + 1   # y = x² + 2x + 1

# Backward pass: computes dy/dx
y.backward()

# Gradient stored in .grad  (dy/dx = 2x + 2 = 8 at x=3)
print(x.grad)     # tensor(8.)

Why zero_grad() matters

By default, PyTorch accumulates gradients: each .backward() call adds to the .grad attribute rather than replacing it. In a training loop you must call optimizer.zero_grad() once per step, or gradients from the previous batch will corrupt the current one.

Python: demonstrating gradient accumulation

import torch

x = torch.tensor(2.0, requires_grad=True)

# First backward: grad = 2x = 4
loss = x ** 2
loss.backward()
print(x.grad)     # tensor(4.)

# Second backward without zeroing: grad ACCUMULATES to 8
loss = x ** 2
loss.backward()
print(x.grad)     # tensor(8.): accumulated; a single backward pass gives 4.

# Correct: zero before each backward
x.grad.zero_()
loss = x ** 2
loss.backward()
print(x.grad)     # tensor(4.), correct

In practice you rarely call .backward() on individual tensors directly. You use a loss function and an optimizer, which wrap this logic cleanly. That's the training loop, covered after the next section.
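As a bridge between raw autograd and a full training loop, here is a minimal sketch that lets an optimizer drive a single parameter toward the minimum of (w - 3)². The target value 3.0, the learning rate, and the 50 iterations are arbitrary choices for illustration.

```python
import torch

w = torch.tensor(0.0, requires_grad=True)    # the only "parameter"
optimizer = torch.optim.SGD([w], lr=0.1)

for _ in range(50):
    optimizer.zero_grad()        # clear the gradient from the previous step
    loss = (w - 3.0) ** 2        # forward pass: minimized at w = 3
    loss.backward()              # autograd fills w.grad
    optimizer.step()             # w <- w - lr * w.grad

print(round(w.item(), 3))        # 3.0
```

The optimizer never sees the loss formula; it only reads the .grad values that backward() left on the parameters it was given.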

Building a Neural Network with nn.Module

PyTorch's torch.nn module provides all the building blocks you need. Every neural network in PyTorch is a Python class that inherits from nn.Module. That parent class does the heavy lifting: it tracks all learnable parameters, moves them with the model when you call .to(device), and plugs directly into optimizers.

The structure is always the same:

__init__: define your layers (nn.Linear, nn.Conv2d, etc.)
forward: define how data flows through those layers, including activation functions

Python: defining a neural network

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Define layers; parameters are auto-tracked
        self.fc1 = nn.Linear(784, 256)   # input → hidden 1
        self.fc2 = nn.Linear(256, 128)   # hidden 1 → hidden 2
        self.fc3 = nn.Linear(128, 10)    # hidden 2 → output

    def forward(self, x):
        # Define the data flow
        x = F.relu(self.fc1(x))           # apply ReLU after layer 1
        x = F.relu(self.fc2(x))           # apply ReLU after layer 2
        x = self.fc3(x)                   # raw logits, no activation on output
        return x

model = SimpleNet()
print(model)

# Count trainable parameters
total = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Parameters: {total:,}")  # 235,146

Notice: the forward method doesn't explicitly enable gradient tracking; that's handled automatically because model parameters have requires_grad=True by default. The output of fc3 is a raw logit: an unnormalized score before any probability conversion. We leave it raw because CrossEntropyLoss expects logits; it applies log-softmax internally. You just write the data flow.
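To see the automatic parameter tracking for yourself, the following sketch lists what nn.Module registered for a deliberately tiny model. TinyNet and its layer sizes are hypothetical and exist only for this illustration.

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):   # hypothetical two-layer model
    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(4, 3)
        self.out = nn.Linear(3, 2)

    def forward(self, x):
        return self.out(torch.relu(self.hidden(x)))

tiny = TinyNet()
for name, param in tiny.named_parameters():
    print(name, tuple(param.shape), param.requires_grad)
# hidden.weight (3, 4) True
# hidden.bias (3,) True
# out.weight (2, 3) True
# out.bias (2,) True
```

Assigning a layer as an attribute in __init__ is all it takes for its weights to show up in parameters(), which is what you later hand to the optimizer.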

The SimpleNet example uses F.relu (from torch.nn.functional) as an inline activation. The MNIST example below uses self.relu = nn.ReLU() as a module attribute. These are functionally equivalent; both apply the same ReLU activation. The module form (nn.ReLU()) is preferred when you want activations to appear in print(model) or when using hooks for inspection.
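A quick way to convince yourself of the equivalence is to apply both forms to the same input. The sample values here are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.tensor([-1.0, 0.0, 2.0])

relu_module = nn.ReLU()            # module form: can be stored as a layer
functional_out = F.relu(x)         # functional form: called inline
module_out = relu_module(x)

print(functional_out)                            # tensor([0., 0., 2.])
print(torch.equal(functional_out, module_out))   # True

# Only the module form shows up when a model is printed
print(nn.Sequential(nn.Linear(3, 2), nn.ReLU()))
```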

💡

Raw logits on the output layer

Don't apply softmax in forward() when using CrossEntropyLoss; that loss function applies log-softmax internally. Applying softmax yourself first leads to double-application and numerically worse results.
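The sketch below checks this claim on random data: CrossEntropyLoss on raw logits matches a manual log-softmax followed by negative log-likelihood, while feeding it softmax outputs produces a different value. The batch of 4 samples and the label values are made up for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)             # 4 samples, 10 classes, raw scores
labels = torch.tensor([1, 0, 3, 9])

criterion = nn.CrossEntropyLoss()

on_logits = criterion(logits, labels)
manual = F.nll_loss(F.log_softmax(logits, dim=1), labels)
print(torch.allclose(on_logits, manual))    # True: log-softmax happens inside the loss

# Mistake: softmax first, then CrossEntropyLoss applies log-softmax again
on_probs = criterion(F.softmax(logits, dim=1), labels)
print(on_logits.item(), on_probs.item())    # the two values differ
```

Because softmax outputs all lie between 0 and 1, the second loss is squeezed into a narrow range no matter how confident the model is, which weakens the training signal.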

⚠

Common nn.Module misconception

Don't call forward() directly. Call the model object like a function: output = model(x). Registered hooks (forward hooks and forward pre-hooks) only fire through the module's __call__, which wraps forward().
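The difference is easy to observe with a forward hook. In this sketch, record_shape is a hypothetical hook function that logs output shapes; register_forward_hook is the standard nn.Module method for attaching it.

```python
import torch
import torch.nn as nn

layer = nn.Linear(4, 2)
calls = []

def record_shape(module, inputs, output):
    # Forward hooks receive the module, its inputs, and its output
    calls.append(tuple(output.shape))

handle = layer.register_forward_hook(record_shape)

x = torch.rand(3, 4)
layer(x)            # goes through __call__: the hook fires
layer.forward(x)    # bypasses __call__: the hook is skipped

print(calls)        # [(3, 2)]
handle.remove()     # detach the hook when done
```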

The Training Loop

The training loop is the heartbeat of every neural network. Its outer loop runs once per epoch, where an epoch is one full pass over your dataset. Inside, you process data in batches of typically 32–256 samples, update the model after each batch, and track your loss.
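To show how an epoch breaks down into batches, here is a minimal sketch using DataLoader on synthetic data. The 100 random samples, 8 features, and batch size of 32 are placeholders, not values from a real dataset.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

features = torch.rand(100, 8)               # 100 synthetic samples
targets = torch.randint(0, 2, (100,))       # synthetic binary labels
loader = DataLoader(TensorDataset(features, targets),
                    batch_size=32, shuffle=True)

print(len(loader))                          # 4 batches per epoch

batch_sizes = [len(xb) for xb, yb in loader]   # one full pass = one epoch
print(batch_sizes)                          # [32, 32, 32, 4]
```

The last batch is smaller because 100 is not a multiple of 32; pass drop_last=True to DataLoader if you need uniform batch sizes.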

The Five Steps: Every Batch, Every Epoch

optimizer.zero_grad()

Clear gradients left over from the previous batch; otherwise they accumulate

outputs = model(inputs)

Forward pass: data flows through the network and the computation graph is built

loss = criterion(outputs, labels)

Compute the loss: how wrong are the predictions? Use CrossEntropyLoss for classification

loss.backward()

Backward pass: traverse the DAG in reverse and compute gradients for every parameter

optimizer.step()

Update weights using the computed gradients; SGD, Adam, etc. each apply their own update rule

Python: training loop template (complete example in MNIST section below)

import torch
import torch.nn as nn
import torch.optim as optim

# Assumes model and train_loader are defined; see MNIST section for complete example
num_epochs = 5
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

for epoch in range(num_epochs):
    model.train()   # training mode: activates dropout + batch norm statistics
    running_loss = 0.0

    for inputs, labels in train_loader:
        optimizer.zero_grad()               # step 1: clear gradients (BEFORE forward)
        outputs = model(inputs)             # step 2: forward pass
        loss = criterion(outputs, labels)   # step 3: compute loss
        loss.backward()                     # step 4: backward pass
        optimizer.step()                    # step 5: update weights
        running_loss += loss.item()

    avg_loss = running_loss / len(train_loader)
    print(f"Epoch {epoch+1}: loss = {avg_loss:.4f}")

The model.train() call isn't cosmetic: it activates dropout layers and batch normalization's training-mode behavior (normalizing with batch statistics and updating the running stats). When you switch to evaluation, call model.eval() and wrap inference in torch.no_grad() to disable gradient tracking and cut memory usage. The MNIST example below demonstrates exactly this pair.
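A small sketch makes the two switches visible. The network here is a throwaway nn.Sequential with a dropout layer, chosen only because dropout behaves differently in the two modes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5))
x = torch.ones(1, 4)

net.eval()                       # dropout becomes a no-op
with torch.no_grad():            # no graph is recorded inside this block
    first = net(x)
    second = net(x)
print(torch.equal(first, second))    # True: eval-mode output is deterministic
print(first.requires_grad)           # False: nothing to backpropagate through

net.train()                      # dropout randomly zeroes activations again
out = net(x)
print(out.requires_grad)             # True: graph is recorded for backward()
```

model.eval() changes layer behavior, while torch.no_grad() changes gradient bookkeeping; they are independent, which is why evaluation code uses both.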

Note on zero_grad() placement: it must run after the previous optimizer.step() and before the next loss.backward(). Calling it at the top of the loop, before the forward pass, is the common convention and the one used here; calling it immediately after optimizer.step() is equivalent. What breaks training is zeroing between loss.backward() and optimizer.step(), which wipes the gradients before the optimizer reads them.

Choosing a loss function

Classification: nn.CrossEntropyLoss, which combines log-softmax and negative log-likelihood. Binary classification: nn.BCEWithLogitsLoss. Regression: nn.MSELoss (mean squared error). For the optimizer, start with Adam at lr=1e-3; it's robust and rarely needs tuning early on.
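The three losses expect differently shaped inputs and targets, which is a common source of confusion. The sketch below calls each one on tiny made-up tensors to show the expected shapes and dtypes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Multi-class: logits of shape (N, C), integer class labels of shape (N,)
ce = nn.CrossEntropyLoss()(torch.randn(4, 3), torch.tensor([0, 2, 1, 1]))

# Binary: one logit per sample, float targets of the same shape
bce = nn.BCEWithLogitsLoss()(torch.randn(4), torch.tensor([0.0, 1.0, 1.0, 0.0]))

# Regression: predictions and targets of the same shape
mse = nn.MSELoss()(torch.tensor([1.0, 2.0]), torch.tensor([1.5, 2.5]))

print(ce.shape, bce.shape, mse.shape)   # each is a 0D tensor: torch.Size([])
print(mse)                              # tensor(0.2500)
```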

Complete MNIST Digit Classifier

MNIST is 70,000 grayscale images of handwritten digits (0–9), each 28×28 pixels. It's been the "hello world" of neural networks since 1998. Here's a complete, runnable classifier using everything covered so far.

The architecture: 3 fully connected layers (784 → 256 → 128 → 10). Input is a flattened 28×28 pixel image. Output is 10 raw logits, one per digit class.

Python: complete MNIST classifier

import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader, random_split

# ── Data ─────────────────────────────────────────────────────
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))   # MNIST mean, std
])

full_train = datasets.MNIST(root='./data', train=True,
                             download=True, transform=transform)
test_data  = datasets.MNIST(root='./data', train=False,
                             transform=transform)

# Split training set: 55k train, 5k validation
train_data, val_data = random_split(full_train, [55000, 5000])

train_loader = DataLoader(train_data, batch_size=64, shuffle=True)
val_loader   = DataLoader(val_data,   batch_size=64)
test_loader  = DataLoader(test_data,  batch_size=64)

# ── Model ─────────────────────────────────────────────────────
class MNISTNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 256)
        self.fc2 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(128, 10)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = x.view(-1, 784)          # flatten 28x28 → 784
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        return self.fc3(x)             # raw logits

model     = MNISTNet()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# ── Training + per-epoch validation ──────────────────────────
for epoch in range(5):
    model.train()    # training mode: enables dropout + batch norm stats
    for inputs, labels in train_loader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss    = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

    # Validate after each epoch to track progress
    model.eval()     # eval mode: disables dropout, fixes batch norm
    val_correct = val_total = 0
    with torch.no_grad():    # disable gradient tracking for speed
        for inputs, labels in val_loader:
            outputs = model(inputs)
            _, predicted = torch.max(outputs, 1)
            val_total   += labels.size(0)
            val_correct += (predicted == labels).sum().item()
    print(f"Epoch {epoch+1}: val accuracy = {100*val_correct/val_total:.1f}%")

# ── Evaluation ────────────────────────────────────────────────
model.eval()
correct, total = 0, 0
with torch.no_grad():
    for inputs, labels in test_loader:
        outputs    = model(inputs)
        _, predicted = torch.max(outputs, 1)
        total     += labels.size(0)
        correct   += (predicted == labels).sum().item()

print(f"Test accuracy: {100 * correct / total:.2f}%")
# Typical result: ~97% after 5 epochs (varies by run and hardware)

Run this script as-is; torchvision.datasets.MNIST downloads the data automatically the first time. After 5 epochs (roughly 2–3 minutes on CPU), you should see ~97% test accuracy (results vary by hardware and random initialization). That's a working neural network. The normalization values (0.1307,) and (0.3081,) are the MNIST-specific mean and standard deviation; for other datasets, compute them from your own data.
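For a different dataset, the per-channel statistics can be computed along these lines. This is a sketch on random stand-in data, assuming your images are already loaded as one float tensor of shape [N, channels, height, width] scaled to [0, 1]; large datasets would need a batched version.

```python
import torch

torch.manual_seed(0)
images = torch.rand(1000, 1, 28, 28)    # stand-in for a grayscale dataset

# Reduce over batch, height, and width; keep one value per channel
mean = images.mean(dim=(0, 2, 3))
std = images.std(dim=(0, 2, 3))
print(mean.shape, std.shape)            # torch.Size([1]) torch.Size([1])

# Same arithmetic that transforms.Normalize(mean, std) applies per image
normalized = (images - mean[None, :, None, None]) / std[None, :, None, None]
print(normalized.mean().item(), normalized.std().item())   # close to 0 and 1
```

Compute these statistics on the training split only, then reuse the same values for validation and test data.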

GPU Acceleration

Switching from CPU to GPU requires three changes: select the device, move the model, move the data. PyTorch handles everything else: no rewriting the model class, no changing the training loop logic.

Python: device-agnostic training

import torch

# Detect best available hardware
if torch.cuda.is_available():
    device = torch.device('cuda')       # NVIDIA GPU
elif torch.backends.mps.is_available():
    device = torch.device('mps')        # Apple Silicon
else:
    device = torch.device('cpu')

print(f"Using: {device}")

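# Move model to device, then build the optimizer from the moved model's parameters
# (assumes MNISTNet, train_loader, nn, and optim from the MNIST section)
model = MNISTNet().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# In training loop, move each batch to device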
for inputs, labels in train_loader:
    inputs = inputs.to(device)
    labels = labels.to(device)

    optimizer.zero_grad()
    outputs = model(inputs)
    loss    = criterion(outputs, labels)
    loss.backward()
    optimizer.step()

The model and all its input tensors must be on the same device

Mixing devices causes a runtime error: RuntimeError: Expected all tensors to be on the same device. If you get this error, trace which tensor you forgot to .to(device). Common culprits: the labels tensor, or a tensor you created inside forward() with a literal like torch.zeros(n); those default to CPU.
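One way to avoid the second culprit is to create helper tensors on the input's device rather than relying on the default. AddOffset below is a hypothetical module written only to show the pattern.

```python
import torch
import torch.nn as nn

class AddOffset(nn.Module):   # hypothetical module for illustration
    def forward(self, x):
        # torch.ones(n) alone would be created on the CPU; passing
        # device=x.device keeps the new tensor next to the input.
        offset = torch.ones(x.shape[-1], device=x.device, dtype=x.dtype)
        return x + offset

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
layer = AddOffset().to(device)
inputs = torch.zeros(2, 3, device=device)

result = layer(inputs)
print(result.device == inputs.device)   # True
```

For tensors that never change, such as fixed masks or constants, registering them with self.register_buffer in __init__ lets .to(device) move them together with the model.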

Python: save and load model checkpoint

# Save model weights after training
torch.save(model.state_dict(), 'mnist_model.pt')

# Load on any device (map_location handles GPU → CPU and vice versa)
model = MNISTNet()
model.load_state_dict(torch.load('mnist_model.pt', map_location=device))
model.to(device)
model.eval()    # set to eval mode before inference

What to Learn Next

You've covered the complete foundation: tensors, autograd, nn.Module, the training loop, and GPU acceleration. The MNIST classifier is a real, working model. Where you go from here depends on what you want to build.

Computer vision: Add convolutional layers (nn.Conv2d) and explore torchvision.models for pretrained architectures like ResNet and EfficientNet
NLP and LLMs: Hugging Face Transformers builds on PyTorch. Start with AutoModelForSequenceClassification and fine-tune on your own data
Faster training: Enable torch.compile(model) for significant speedup on supported models and hardware, or try mixed precision with torch.autocast (speedup varies substantially by model architecture and GPU generation)
Scale up: PyTorch Lightning reduces boilerplate for multi-GPU training; torch.distributed (for example, DistributedDataParallel) handles multi-node cluster setups
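As a first taste of the computer vision path, this sketch pushes an MNIST-sized batch through one convolution and one pooling layer to show how the shapes change. The channel count of 8 and the kernel size of 3 are arbitrary choices, not a recommended architecture.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, padding=1)
pool = nn.MaxPool2d(kernel_size=2)

x = torch.rand(16, 1, 28, 28)     # [batch, channels, height, width]

features = conv(x)                # padding=1 keeps the 28x28 spatial size
print(features.shape)             # torch.Size([16, 8, 28, 28])

pooled = pool(features)           # 2x2 max pooling halves height and width
print(pooled.shape)               # torch.Size([16, 8, 14, 14])
```

Unlike the fully connected layers used earlier, Conv2d consumes the image without flattening, so the [batch, channels, height, width] layout is preserved until you decide to flatten before a final nn.Linear.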

Real-world applications (computer vision diagnostics, recommendation systems, on-device AI) all build on the same five-step training loop you just learned. The building blocks don't change; the architectures get more complex.

PyTorch Lightning: optional abstraction for experienced users

Once you have the manual training loop fully under your belt, PyTorch Lightning reduces boilerplate for larger projects. Subclass pl.LightningModule, implement training_step() and configure_optimizers(), and Lightning handles device placement, logging, checkpointing, and multi-GPU training automatically. It does not replace the manual loop; it's a layer on top. Learn the manual loop first.

Video Resources

Patrick Loeber

PyTorch Beginner Tutorial: Tensors & Autograd

Python Engineer

PyTorch Tutorial: Full Training Pipeline from Scratch

freeCodeCamp

PyTorch for Deep Learning: Full Course
