PyTorch Tutorial for Beginners: From Tensors to Neural Networks

You don't need a math PhD to build a working neural network in PyTorch. You need four concepts: tensors (the data), autograd (the learning), nn.Module (the architecture), and the training loop (the process). This guide teaches all four and ends with a complete digit classifier you can run yourself.

Prerequisites

Before writing any code, make sure your environment is ready. If you haven't installed PyTorch yet, follow the official install guide, which covers CUDA, macOS MPS, and CPU-only setups.

Before You Start

  • PyTorch installed: run python -c "import torch; print(torch.__version__)"; it should print 2.x or higher. See the install guide for version support details.
  • Python 3.10 or later: PyTorch 2.x supports Python 3.10–3.13 (stable as of 2026; 3.14 is pre-release). Check with python --version.
  • Basic Python fluency: you should be comfortable with classes, functions, and list operations. No advanced math required.
  • torchvision (for the MNIST example): install with pip install torchvision. It provides the dataset loader used in the final section.

This tutorial builds progressively. Each section introduces one concept, shows it in code, and connects it to the next. By the end you'll have a complete working classifier.

What Is a Tensor?

A tensor is an n-dimensional array: the same idea as a NumPy array, except tensors can live on a GPU. A scalar is a 0D tensor. A vector is 1D. A matrix is 2D. An image batch might be 4D: [batch, channels, height, width]. That's the only new vocabulary you need.
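To make the dimension vocabulary above concrete, here is a minimal sketch that builds one tensor of each kind and prints its number of dimensions and shape. The batch size of 8 and the 28×28 image size are arbitrary values picked for illustration.

```python
import torch

scalar = torch.tensor(3.14)               # 0D: a single number
vector = torch.tensor([1.0, 2.0, 3.0])    # 1D: shape (3,)
matrix = torch.zeros(2, 3)                # 2D: shape (2, 3)
images = torch.zeros(8, 1, 28, 28)        # 4D: [batch, channels, height, width]

for name, t in [("scalar", scalar), ("vector", vector),
                ("matrix", matrix), ("images", images)]:
    # .ndim is the number of dimensions, .shape the size along each one
    print(name, t.ndim, tuple(t.shape))
```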

50×+: GPU acceleration over CPU for numeric tensor operations (PyTorch official documentation; illustrative of magnitude, varies by operation and GPU model)

The GPU speedup is what separates "runs in minutes" from "runs overnight." Moving a tensor or a model to the GPU takes one .to(device) call; PyTorch handles the math, the memory, and the result routing. The only constraint: your model and all its input tensors must be on the same device (covered in the GPU section).

Python: creating tensors
import torch

# Create from Python list
x = torch.tensor([1.0, 2.0, 3.0])
print(x)          # tensor([1., 2., 3.])

# Random tensor, shape (3, 4)
y = torch.rand(3, 4)

# Zeros and ones
z = torch.zeros(2, 3)
o = torch.ones(5)

# Check shape, dtype, device
print(y.shape)    # torch.Size([3, 4])
print(y.dtype)    # torch.float32
print(y.device)   # cpu

Tensors support all the arithmetic you'd expect (addition, multiplication, matrix multiplication), and these operations compose cleanly with autograd, which we'll cover next.

Python: tensor operations
# Element-wise ops
a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([4.0, 5.0, 6.0])
print(a + b)       # tensor([5., 7., 9.])
print(a * b)       # tensor([ 4., 10., 18.])

# Matrix multiplication
m1 = torch.rand(3, 4)
m2 = torch.rand(4, 2)
result = torch.matmul(m1, m2)   # shape: (3, 2)

# Reshape
x = torch.arange(12).float()
x = x.reshape(3, 4)           # shape: (3, 4)

One attribute you'll see throughout training: requires_grad=True. When set on a tensor, PyTorch tracks every operation involving that tensor so it can compute gradients later. This is the bridge to autograd.
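As a small illustration of the tracking described above, the sketch below compares a tensor created with requires_grad=True against one created without it. The values are made up; the point is that any result computed from a tracked tensor carries a grad_fn that records how it was produced.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)   # tracked
c = torch.tensor([3.0, 4.0])                        # not tracked (the default)

out = (w * c).sum()

print(w.requires_grad, c.requires_grad)   # True False
print(out.requires_grad)                  # True: it depends on a tracked tensor
print(out.grad_fn is not None)            # True: the operation was recorded

out.backward()
print(w.grad)    # d(out)/dw equals c, so tensor([3., 4.])
print(c.grad)    # None: nothing was tracked for c
```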

Autograd: How PyTorch Learns

Neural network training boils down to one question: how do I change my model's parameters to reduce the error? Autograd answers it automatically. You don't calculate derivatives by hand; PyTorch does it for you by watching your code run.

How Autograd Works: Three Concepts

  • DAG: Directed Acyclic Graph, the computation graph built during the forward pass; every operation is a node
  • .backward(): traverses the DAG in reverse, applying the chain rule to compute all gradients
  • .grad: the attribute where gradients accumulate on leaf tensors after backward()

The computation graph is dynamic: it is rebuilt on every forward pass. This means the operations your model runs can change between iterations (for example, through ordinary Python if statements and loops), which is one reason PyTorch is so widely used in research.
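Here is a minimal sketch of what "dynamic" means in practice. The forward function below is a made-up example, not part of any PyTorch API: a plain Python if decides which operations run, and autograd differentiates whichever branch was actually taken.

```python
import torch

def forward(x):
    # Ordinary Python control flow decides which operations get recorded
    if x.item() > 0:
        return x ** 2      # dy/dx = 2x
    return -3 * x          # dy/dx = -3

grads = {}
for value in (2.0, -1.0):
    x = torch.tensor(value, requires_grad=True)
    y = forward(x)         # a fresh graph is built on each call
    y.backward()
    grads[value] = x.grad.item()

print(grads)   # {2.0: 4.0, -1.0: -3.0}
```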

Python: autograd basics
import torch

# Mark tensor for gradient tracking
x = torch.tensor(3.0, requires_grad=True)

# Forward pass: builds the computation graph
y = x ** 2 + 2 * x + 1   # y = x² + 2x + 1

# Backward pass: computes dy/dx
y.backward()

# Gradient stored in .grad  (dy/dx = 2x + 2 = 8 at x=3)
print(x.grad)     # tensor(8.)

Why zero_grad() matters
By default, PyTorch accumulates gradients: each .backward() call adds to the .grad attribute rather than replacing it. In a training loop you must call optimizer.zero_grad() at the start of each step, or gradients from the previous batch will corrupt the current one.

Python: demonstrating gradient accumulation
import torch

x = torch.tensor(2.0, requires_grad=True)

# First backward: grad = 2x = 4
loss = x ** 2
loss.backward()
print(x.grad)     # tensor(4.)

# Second backward without zeroing: grad ACCUMULATES to 8
loss = x ** 2
loss.backward()
print(x.grad)     # tensor(8.), wrong! should be 4.

# Correct: zero before each backward
x.grad.zero_()
loss = x ** 2
loss.backward()
print(x.grad)     # tensor(4.), correct

In practice you rarely call .backward() on individual tensors directly. You use a loss function and an optimizer, which wrap this logic cleanly. That's the training loop, which comes right after the next section on nn.Module.
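As a preview of how a loss function and an optimizer wrap the autograd calls above, here is a minimal sketch that fits a single weight w so that w * x matches y = 2x. The data, the learning rate of 0.5, and the 50 steps are all arbitrary choices for illustration.

```python
import torch

# Toy data: the target relationship is y = 2x
x = torch.linspace(-1, 1, 20)
y = 2.0 * x

w = torch.tensor(0.0, requires_grad=True)    # the one learnable parameter
optimizer = torch.optim.SGD([w], lr=0.5)
loss_fn = torch.nn.MSELoss()

for step in range(50):
    optimizer.zero_grad()        # clear the previous gradient
    loss = loss_fn(w * x, y)     # how wrong is the current w?
    loss.backward()              # autograd fills in w.grad
    optimizer.step()             # nudge w against the gradient

print(w.item())   # ends up very close to 2.0
```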

Building a Neural Network with nn.Module

PyTorch's torch.nn module provides all the building blocks you need. Every neural network in PyTorch is a Python class that inherits from nn.Module. That parent class does the heavy lifting: it tracks all learnable parameters, moves them with the model when you call .to(device), and plugs directly into optimizers.

The structure is always the same:

  • __init__: define your layers (nn.Linear, nn.Conv2d, etc.)
  • forward: define how data flows through those layers, including activation functions

Python: defining a neural network
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Define layers; parameters are auto-tracked
        self.fc1 = nn.Linear(784, 256)   # input → hidden 1
        self.fc2 = nn.Linear(256, 128)   # hidden 1 → hidden 2
        self.fc3 = nn.Linear(128, 10)    # hidden 2 → output

    def forward(self, x):
        # Define the data flow
        x = F.relu(self.fc1(x))           # apply ReLU after layer 1
        x = F.relu(self.fc2(x))           # apply ReLU after layer 2
        x = self.fc3(x)                   # raw logits, no activation on output
        return x

model = SimpleNet()
print(model)

# Count trainable parameters
total = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Parameters: {total:,}")  # 235,146

Notice that the forward method doesn't explicitly enable gradient tracking. That's handled automatically because model parameters have requires_grad=True by default, so you just write the data flow. The output of fc3 is a vector of raw logits: unnormalized scores before any probability conversion. We leave them raw because CrossEntropyLoss expects logits; it applies log-softmax internally.
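If you do want probabilities at inference time, you can convert logits yourself outside the loss. The sketch below uses made-up scores for three classes; it shows that softmax turns logits into values that sum to 1 and that the predicted class is the same whether you take the argmax of the logits or of the probabilities.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[1.0, 3.0, 0.5]])    # made-up scores: 1 sample, 3 classes

probs = F.softmax(logits, dim=1)            # normalize across the class dimension
predicted = logits.argmax(dim=1)            # index of the highest score

print(probs)
print(predicted)                            # tensor([1])
print(torch.equal(probs.argmax(dim=1), predicted))   # True: softmax keeps the ordering
```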

The SimpleNet example uses F.relu (from torch.nn.functional) as an inline activation. The MNIST example below uses self.relu = nn.ReLU() as a module attribute. These are functionally equivalent: both apply the same ReLU activation. The module form (nn.ReLU()) is preferred when you want activations to appear in print(model) or when using hooks for inspection.
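A quick way to convince yourself of the equivalence is to run both forms on the same input. This is a minimal sketch with an arbitrary test vector; the nn.Sequential at the end is only there to show that the module form appears in the printed model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 1.5])

relu_module = nn.ReLU()
same = torch.equal(F.relu(x), relu_module(x))
print(same)   # True: both forms compute max(0, x) element-wise

# The module form is listed when you print a model; the functional form is not
print(nn.Sequential(nn.Linear(4, 2), nn.ReLU()))
```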

Raw logits on the output layer
Don't apply softmax in forward() when using CrossEntropyLoss; that loss function applies log-softmax internally. Applying softmax yourself first means the normalization is applied twice, which distorts the loss and gives numerically worse results.
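The sketch below checks both halves of that note on made-up logits: CrossEntropyLoss on raw logits matches log-softmax followed by negative log-likelihood, while feeding it softmax output gives a different (here larger) loss for the same prediction.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]])   # made-up scores for 3 classes
target = torch.tensor([0])                  # the correct class is index 0

criterion = nn.CrossEntropyLoss()

ce = criterion(logits, target)                               # intended usage
manual = F.nll_loss(F.log_softmax(logits, dim=1), target)    # what it does internally
double = criterion(F.softmax(logits, dim=1), target)         # softmax applied twice

print(ce.item(), manual.item())   # the two values match
print(double.item())              # a different, larger loss
```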
Common nn.Module misconception
Don't call forward() directly. Call the model object like a function: output = model(x). Registered hooks (forward pre-hooks and forward hooks) only run through the module's __call__, which wraps forward().
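To see the difference, the sketch below registers a forward hook on a small layer and then calls it both ways. The record_shape function and the calls list are hypothetical names for this example; register_forward_hook is the standard nn.Module method.

```python
import torch
import torch.nn as nn

layer = nn.Linear(3, 2)
calls = []

def record_shape(module, inputs, output):
    # Forward hooks receive the module, its inputs, and its output
    calls.append(tuple(output.shape))

handle = layer.register_forward_hook(record_shape)

x = torch.rand(4, 3)
layer(x)            # goes through __call__, so the hook runs
layer.forward(x)    # bypasses __call__, so the hook is skipped

handle.remove()     # detach the hook when you're done
print(calls)        # [(4, 2)]: only one of the two calls was recorded
```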

The Training Loop

The training loop is the heartbeat of every neural network. Its outer loop runs once per epoch, where an epoch is one full pass over your dataset. Inside, you process data in batches of typically 32–256 samples, update the model after each batch, and track your loss.
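To make the epoch and batch arithmetic concrete, here is a minimal sketch with a synthetic dataset of 1,000 random samples (an arbitrary size, not MNIST). With a batch size of 64, one epoch is 16 batches, and the last batch holds the 40 leftover samples.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in data: 1,000 samples with 784 features and a label 0-9
features = torch.rand(1000, 784)
labels = torch.randint(0, 10, (1000,))

loader = DataLoader(TensorDataset(features, labels), batch_size=64, shuffle=True)

# One pass over the loader is one epoch
batch_sizes = [len(batch_labels) for _, batch_labels in loader]

print(len(loader))       # 16 batches per epoch (1000 / 64, rounded up)
print(batch_sizes[-1])   # 40: the final batch is smaller
```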

The Five Steps: Every Batch, Every Epoch

  1. optimizer.zero_grad(): clear gradients left over from the previous batch; otherwise they accumulate
  2. outputs = model(inputs): forward pass; data flows through the network and the computation graph is built
  3. loss = criterion(outputs, labels): compute the loss (how wrong are the predictions?); use CrossEntropyLoss for classification
  4. loss.backward(): backward pass; traverse the DAG in reverse and compute gradients for every parameter
  5. optimizer.step(): update weights using the computed gradients; SGD, Adam, etc. each apply their own formula

Python: training loop template (complete example in MNIST section below)
import torch
import torch.nn as nn
import torch.optim as optim

# Assumes model, train_loader defined; see MNIST section for complete example
num_epochs = 5
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

for epoch in range(num_epochs):
    model.train()   # training mode: activates dropout + batch norm statistics
    running_loss = 0.0

    for inputs, labels in train_loader:
        optimizer.zero_grad()               # step 1: clear gradients (BEFORE forward)
        outputs = model(inputs)             # step 2: forward pass
        loss = criterion(outputs, labels)   # step 3: compute loss
        loss.backward()                     # step 4: backward pass
        optimizer.step()                    # step 5: update weights
        running_loss += loss.item()

    avg_loss = running_loss / len(train_loader)
    print(f"Epoch {epoch+1}: loss = {avg_loss:.4f}")

The model.train() call isn't cosmetic: it activates dropout layers and batch normalization's training-mode behavior (tracking running stats). When you switch to evaluation, call model.eval() and wrap inference in torch.no_grad() to disable gradient tracking and cut memory usage. The MNIST example below uses exactly this pair.
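The effect of the two modes is easy to see on a single dropout layer. In this minimal sketch (p=0.5 and the input of ones are arbitrary), training mode zeroes some elements and rescales the rest, eval mode passes the input through unchanged, and torch.no_grad() stops results from being tracked.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()
train_out = drop(x)    # some elements zeroed, survivors scaled by 1 / (1 - p) = 2

drop.eval()
eval_out = drop(x)     # dropout is a no-op in eval mode

w = torch.tensor(1.0, requires_grad=True)
with torch.no_grad():
    y = w * 2          # computed without building a graph

print(train_out.unique())          # tensor([0., 2.])
print(torch.equal(eval_out, x))    # True
print(y.requires_grad)             # False
```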

Note on zero_grad() placement: it must run before loss.backward(), and never between loss.backward() and optimizer.step(). Zeroing between those two calls erases the gradients the optimizer is about to use, so the weights never update. Calling it at the top of each iteration, as in the template above, is the common convention and ensures each batch starts clean with no carryover from the previous iteration.
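The sketch below compares two placements on a one-parameter problem. The run helper and its order argument are hypothetical names for this example. Zeroing at the top of the loop lets the weight move toward its target of 3.0, while zeroing between backward() and step() leaves it stuck at its starting value.

```python
import torch

def run(order):
    w = torch.tensor(1.0, requires_grad=True)
    opt = torch.optim.SGD([w], lr=0.1)
    for _ in range(3):
        if order == "zero_first":
            opt.zero_grad()          # conventional placement
        loss = (w - 3.0) ** 2
        loss.backward()
        if order == "zero_between":
            opt.zero_grad()          # wipes the gradient step() needs
        opt.step()
    return w.item()

print(run("zero_first"))     # has moved from 1.0 toward 3.0
print(run("zero_between"))   # 1.0: no update ever happened
```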

Choosing a loss function
Classification: nn.CrossEntropyLoss, which combines log-softmax and negative log-likelihood. Binary classification: nn.BCEWithLogitsLoss. Regression: nn.MSELoss (mean squared error). For the optimizer, a reasonable starting point is Adam at lr=1e-3; it's robust and rarely needs tuning early on.
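Each of these losses expects a different input and target layout, which is a common source of shape errors. The sketch below calls all three on small made-up tensors; the batch size of 4 and the values are arbitrary.

```python
import torch
import torch.nn as nn

# Multi-class: logits of shape (batch, classes), integer class labels of shape (batch,)
ce = nn.CrossEntropyLoss()(torch.randn(4, 10), torch.tensor([3, 0, 9, 1]))

# Binary: one raw logit per sample, float labels of 0.0 or 1.0
bce = nn.BCEWithLogitsLoss()(torch.randn(4), torch.tensor([1.0, 0.0, 0.0, 1.0]))

# Regression: predictions and targets with the same shape
mse = nn.MSELoss()(torch.tensor([2.5, 0.0]), torch.tensor([3.0, -1.0]))

print(ce.shape, bce.shape, mse.shape)   # each loss is a 0D tensor
print(mse.item())                       # (0.5**2 + 1.0**2) / 2 = 0.625
```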

Complete MNIST Digit Classifier

MNIST is 70,000 grayscale images of handwritten digits (0–9), each 28×28 pixels. It's been the "hello world" of neural networks since 1998. Here's a complete, runnable classifier using everything covered so far.

The architecture: 3 fully connected layers (784 → 256 → 128 → 10). Input is a flattened 28×28 pixel image. Output is 10 raw logits, one per digit class.

Python: complete MNIST classifier
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader, random_split

# ── Data ─────────────────────────────────────────────────────
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))   # MNIST mean, std
])

full_train = datasets.MNIST(root='./data', train=True,
                             download=True, transform=transform)
test_data  = datasets.MNIST(root='./data', train=False,
                             transform=transform)

# Split training set: 55k train, 5k validation
train_data, val_data = random_split(full_train, [55000, 5000])

train_loader = DataLoader(train_data, batch_size=64, shuffle=True)
val_loader   = DataLoader(val_data,   batch_size=64)
test_loader  = DataLoader(test_data,  batch_size=64)

# ── Model ─────────────────────────────────────────────────────
class MNISTNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 256)
        self.fc2 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(128, 10)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = x.view(-1, 784)          # flatten 28x28 → 784
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        return self.fc3(x)             # raw logits

model     = MNISTNet()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# ── Training + per-epoch validation ──────────────────────────
for epoch in range(5):
    model.train()    # training mode: enables dropout + batch norm stats
    for inputs, labels in train_loader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss    = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

    # Validate after each epoch to track progress
    model.eval()     # eval mode: disables dropout, fixes batch norm
    val_correct = val_total = 0
    with torch.no_grad():    # disable gradient tracking for speed
        for inputs, labels in val_loader:
            outputs = model(inputs)
            _, predicted = torch.max(outputs, 1)
            val_total   += labels.size(0)
            val_correct += (predicted == labels).sum().item()
    print(f"Epoch {epoch+1}: val accuracy = {100*val_correct/val_total:.1f}%")

# ── Evaluation ────────────────────────────────────────────────
model.eval()
correct, total = 0, 0
with torch.no_grad():
    for inputs, labels in test_loader:
        outputs    = model(inputs)
        _, predicted = torch.max(outputs, 1)
        total     += labels.size(0)
        correct   += (predicted == labels).sum().item()

print(f"Test accuracy: {100 * correct / total:.2f}%")
# Typical result: ~97% after 5 epochs (varies by run and hardware)

Run this script as-is; torchvision.datasets.MNIST downloads the data automatically the first time. After 5 epochs (roughly 2–3 minutes on CPU), you should see ~97% test accuracy (results vary by hardware and random initialization). That's a working neural network. The normalization values (0.1307,) and (0.3081,) are the MNIST-specific mean and standard deviation; for other datasets, compute them from your own data.
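One way to compute those statistics for your own data is sketched below. It uses a random tensor as a stand-in for a real image set that fits in memory and is already scaled to [0, 1]; for a large dataset you would accumulate the sums batch by batch instead.

```python
import torch

torch.manual_seed(0)
# Stand-in for a real dataset: 100 single-channel 28x28 images in [0, 1]
images = torch.rand(100, 1, 28, 28)

# Reduce over batch, height, and width to get one value per channel
mean = images.mean(dim=(0, 2, 3))
std = images.std(dim=(0, 2, 3))
print(mean, std)    # these are the values you would pass to transforms.Normalize

# Sanity check: after normalizing, the data has mean ~0 and std ~1
normalized = (images - mean[None, :, None, None]) / std[None, :, None, None]
print(normalized.mean().item(), normalized.std().item())
```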

GPU Acceleration

Switching from CPU to GPU requires exactly three changes: select the device, move the model, move the data. That's it. PyTorch handles everything else: no rewriting the model class, no changing the training loop logic.

Python: device-agnostic training
import torch

# Detect best available hardware
if torch.cuda.is_available():
    device = torch.device('cuda')       # NVIDIA GPU
elif torch.backends.mps.is_available():
    device = torch.device('mps')        # Apple Silicon
else:
    device = torch.device('cpu')

print(f"Using: {device}")

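# Move model to device, then build the optimizer from the moved model's parameters
# (assumes MNISTNet and train_loader from the MNIST section)
model = MNISTNet().to(device)
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# In training loop, move each batch to device
for inputs, labels in train_loader:
    inputs = inputs.to(device)
    labels = labels.to(device)

    optimizer.zero_grad()
    outputs = model(inputs)
    loss    = criterion(outputs, labels)
    loss.backward()
    optimizer.step()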

The model and all its input tensors must be on the same device
Mixing devices causes a runtime error: RuntimeError: Expected all tensors to be on the same device. If you get this error, trace which tensor you forgot to .to(device). Common culprits: the labels tensor, or a tensor you created inside forward() with a literal like torch.zeros(n), which defaults to CPU.

Python: save and load model checkpoint
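One way to avoid the second culprit is to create new tensors on the same device as the input. The AddOffset module below is a hypothetical example written for this sketch; the device= and dtype= arguments are standard on tensor factory functions such as torch.ones and torch.zeros.

```python
import torch
import torch.nn as nn

class AddOffset(nn.Module):    # hypothetical module for illustration
    def forward(self, x):
        # torch.ones(n) alone would land on the CPU; follow the input instead
        offset = torch.ones(x.shape[-1], device=x.device, dtype=x.dtype)
        return x + offset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = torch.zeros(2, 3, device=device)

out = AddOffset()(x)
print(out.device == x.device)   # True on CPU and on GPU
```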
# Save model weights after training
torch.save(model.state_dict(), 'mnist_model.pt')

# Load on any device (map_location handles GPU → CPU and vice versa)
model = MNISTNet()
model.load_state_dict(torch.load('mnist_model.pt', map_location=device))
model.to(device)
model.eval()    # set to eval mode before inference
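To check that a checkpoint round-trips correctly, you can compare the outputs of the original and restored models on the same input. This is a minimal sketch that uses a small nn.Linear and a temporary directory as stand-ins for the MNIST model and its file path.

```python
import os
import tempfile

import torch
import torch.nn as nn

model = nn.Linear(4, 2)    # stand-in for a trained model

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.pt")
    torch.save(model.state_dict(), path)

    # Build the same architecture, then load the saved weights into it
    restored = nn.Linear(4, 2)
    restored.load_state_dict(torch.load(path, map_location="cpu"))

restored.eval()
x = torch.rand(3, 4)
with torch.no_grad():
    same = torch.equal(model(x), restored(x))

print(same)   # True: identical weights give identical outputs
```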

What to Learn Next

You've covered the complete foundation: tensors, autograd, nn.Module, the training loop, and GPU acceleration. The MNIST classifier is a real, working model. Where you go from here depends on what you want to build.

  • Computer vision: add convolutional layers (nn.Conv2d) and explore torchvision.models for pretrained architectures like ResNet and EfficientNet
  • NLP and LLMs: Hugging Face Transformers builds on PyTorch. Start with AutoModelForSequenceClassification and fine-tune on your own data
  • Faster training: enable torch.compile(model) for a speedup on supported models and hardware, or try mixed precision with torch.autocast (gains vary substantially by model architecture and GPU generation)
  • Scale up: PyTorch Lightning reduces boilerplate for multi-GPU training; torch.distributed (with DistributedDataParallel) handles full cluster setups
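As a taste of the mixed-precision option in the list above, here is a minimal sketch of torch.autocast around a forward pass. The fallback to CPU with bfloat16 when no CUDA GPU is present is an assumption made so the snippet runs anywhere; on real training workloads you would typically use it on a GPU.

```python
import torch
import torch.nn as nn

# Assumption: use float16 on a CUDA GPU, bfloat16 on CPU
use_cuda = torch.cuda.is_available()
device_type = "cuda" if use_cuda else "cpu"
low_precision = torch.float16 if use_cuda else torch.bfloat16

layer = nn.Linear(16, 4).to(device_type)
x = torch.rand(8, 16, device=device_type)

with torch.autocast(device_type=device_type, dtype=low_precision):
    out = layer(x)    # eligible ops run in the lower-precision dtype

print(layer.weight.dtype)   # parameters stay in float32
print(out.dtype)            # the output uses the lower-precision dtype
```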

The real-world applications (computer vision diagnostics, recommendation systems, on-device AI) all use the same five-step training loop you just learned. The building blocks don't change; the architectures get more complex.

PyTorch Lightning: optional abstraction for experienced users
Once you have the manual training loop fully under your belt, PyTorch Lightning reduces boilerplate for larger projects. Subclass pl.LightningModule, implement training_step() and configure_optimizers(), and Lightning handles device placement, logging, checkpointing, and multi-GPU training automatically. It does not replace the manual loop; it's a layer on top. Learn the manual loop first.
Research verified. Code examples derived from PyTorch official tutorials (pytorch.org/tutorials) and GeeksForGeeks PyTorch tutorial (geeksforgeeks.org). GPU speedup claim (50×+) is illustrative of magnitude; actual speedup varies by operation and GPU model. MNIST accuracy (~97%) is a well-established benchmark result; individual runs vary by initialization and hardware.
PyTorch is a trademark of The Linux Foundation. The PyTorch name and logo are used here for informational and editorial purposes only. Tech Jacks Solutions is not affiliated with Meta, The Linux Foundation, or the PyTorch Foundation. All trademarks are the property of their respective owners.
Your Rights & Our Transparency

Under GDPR and CCPA, you have rights to access, correct, and delete data processed about you. Tech Jacks Solutions does not sell personal data. This article contains no affiliate links to PyTorch or related commercial services.

Editorial independence: this content was produced using publicly available documentation, official PyTorch tutorials, and third-party educational sources. All code examples are derived from the official PyTorch documentation (pytorch.org/tutorials). This article is subject to the EU AI Act's transparency requirements for AI-assisted content.