Deep Learning
Neural networks, backpropagation, and the architectures — CNN, RNN, Transformer, Diffusion — that power image recognition, language models, speech synthesis, and modern generative AI.
Deep Learning (DL) is a subset of Machine Learning that uses artificial neural networks with many layers to learn hierarchical representations from raw data. It's not a different field -- it's ML taken to extremes of scale and depth.
Classical Machine Learning:
- Needs manual feature engineering
- Works well with small-to-medium datasets
- Interpretable (decision trees, linear models)
- Gradient boosting dominates tabular data
- Training is fast (minutes)
- Deployed on CPUs without issue

Deep Learning:
- Learns features automatically from raw data
- Scales with data -- better with millions of samples
- Often a black box (needs XAI tools)
- Dominates images, text, audio, video
- Training can take hours to weeks on GPUs
- Requires GPU/TPU hardware for efficient training
The key driver of DL's dominance was the convergence of ImageNet (millions of labelled images), NVIDIA CUDA (GPU-accelerated matrix math), and architectural innovations (AlexNet in 2012 -> ResNet -> Transformer -> GPT-4). DL now achieves superhuman performance in vision, language, speech, and game playing.
A single neuron takes multiple inputs, multiplies each by a weight, sums them up, adds a bias, then passes the result through an activation function:
output = activation(w1*x1 + w2*x2 + ... + wn*xn + bias)

# In matrix form:
output = activation(W * x + b)

# In PyTorch:
import torch.nn as nn
neuron = nn.Linear(in_features=3, out_features=1)  # 3 inputs, 1 output
Without activation functions, stacking layers would just be repeated linear transformations -- equivalent to a single linear layer. Activation functions introduce non-linearity, enabling networks to learn complex patterns.
| Function | Formula | Used Where | Notes |
|---|---|---|---|
| ReLU | max(0, x) | Hidden layers (default) | Fast, simple; suffers from "dying ReLU" for x<0 |
| Leaky ReLU | max(0.01x, x) | Hidden layers | Fixes dying ReLU with a small slope for negatives |
| GELU | x * Φ(x) | Transformers, BERT, GPT | Smooth version of ReLU; standard in modern LLMs |
| Sigmoid | 1 / (1 + e⁻ˣ) | Binary output layer | Squashes to (0,1); saturation causes vanishing gradients |
| Softmax | exp(xᵢ) / Σⱼ exp(xⱼ) | Multi-class output layer | Outputs a probability distribution over classes |
| Tanh | (eˣ − e⁻ˣ)/(eˣ + e⁻ˣ) | RNNs, hidden layers | Squashes to (−1, 1); zero-centred unlike sigmoid |
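To see how these behave on a few sample values, here is a quick PyTorch comparison (functional forms; the corresponding modules such as nn.ReLU and nn.GELU behave the same, and the sample values are arbitrary):

import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 1.5, 3.0])

print(F.relu(x))              # negatives clipped to 0
print(F.leaky_relu(x, 0.01))  # small slope for negatives instead of 0
print(F.gelu(x))              # smooth ReLU-like curve used in Transformers
print(torch.sigmoid(x))       # squashed into (0, 1)
print(torch.tanh(x))          # squashed into (-1, 1)
print(F.softmax(x, dim=0))    # all five values turned into one probability distribution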
A neural network is organised into layers:
- Input layer -- receives raw features (pixels, token embeddings, numerical values).
- Hidden layers -- learn progressively abstract representations. Early layers learn edges; deep layers learn concepts.
- Output layer -- produces the final prediction. One neuron per class for classification; one for regression.
Depth (number of layers) enables learning hierarchical features. A CNN trained on faces learns: edges -> curves -> eyes/nose -> face. The "deep" in Deep Learning refers to networks with many layers -- modern LLMs have 96+ layers.
import torch
import torch.nn as nn
class FeedForwardNet(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.3),  # regularisation
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim, output_dim)
        )

    def forward(self, x):
        return self.net(x)
# Instantiate
model = FeedForwardNet(input_dim=784, hidden_dim=256, output_dim=10)
print(model) # see architecture summary
Backpropagation is the algorithm that trains neural networks. It's the chain rule of calculus applied recursively through a computation graph to compute gradients of the loss with respect to every weight.
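To make the chain rule concrete, here is a tiny autograd example (the values are arbitrary):

import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 2          # dy/dx = 2x
z = 3 * y           # dz/dy = 3

z.backward()        # backprop: dz/dx = dz/dy * dy/dx = 3 * 2x
print(x.grad)       # tensor(12.) at x = 2

The full training loop below relies on exactly this mechanism: loss.backward() walks the computation graph and fills every parameter's .grad.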
import torch
import torch.nn as nn
import torch.optim as optim
model = FeedForwardNet(784, 256, 10)
optimizer = optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
for epoch in range(50):
    model.train()
    for X_batch, y_batch in train_loader:
        # 1. Forward pass -- compute predictions
        logits = model(X_batch)

        # 2. Compute loss
        loss = criterion(logits, y_batch)

        # 3. Backward pass -- compute gradients via backprop
        optimizer.zero_grad()  # clear previous gradients
        loss.backward()        # backpropagate through the graph

        # 4. Update weights -- gradient descent step
        optimizer.step()

    # Validation pass (no gradient tracking needed)
    model.eval()
    with torch.no_grad():
        val_preds = model(X_val)
        val_loss = criterion(val_preds, y_val)

    print(f"Epoch {epoch+1}: train={loss:.4f} val={val_loss:.4f}")
As gradients are multiplied through many layers, they can shrink exponentially (vanishing gradients), making early layers learn very slowly, or grow exponentially (exploding gradients), causing training to diverge. Solutions:
- Residual connections (ResNets) -- skip connections that let gradients flow directly from output to early layers, bypassing the multiplication chain (sketched after this list).
- Batch Normalisation -- normalises layer inputs, keeping activations in a healthy range and stabilising training significantly.
- Gradient clipping -- caps the gradient norm at a maximum value. Standard practice in RNN and Transformer training.
- Weight initialisation -- He init for ReLU networks, Xavier/Glorot for tanh. Good init means gradients start in a stable range.
- LSTM / GRU gating -- designed specifically to mitigate vanishing gradients in sequential models.
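A minimal residual block in PyTorch, illustrating the skip connection (fully connected rather than convolutional, purely for brevity):

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.block = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x):
        # Skip connection: the identity term contributes a gradient of 1,
        # so the signal reaching earlier layers never has to pass through
        # the full chain of multiplications.
        return x + self.block(x)

x = torch.randn(8, 64)
print(ResidualBlock(64)(x).shape)  # torch.Size([8, 64])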
| Optimiser | Key Idea | When to Use |
|---|---|---|
| SGD + Momentum | Accumulates velocity across steps; smooths noisy gradients | Computer vision, when you want tight control |
| Adam | Adaptive per-parameter learning rates using first & second moments | Default for most DL tasks -- fast convergence |
| AdamW | Adam + decoupled weight decay (L2 reg applied correctly) | Transformers, LLMs -- standard in Hugging Face |
| RMSProp | Divides learning rate by running average of squared gradients | RNNs; predecessor to Adam |
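In PyTorch each of these is a one-liner; a sketch assuming the FeedForwardNet model defined earlier (the learning rates shown are common starting points, not universal recommendations):

import torch.optim as optim

sgd     = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)            # SGD + Momentum
adam    = optim.Adam(model.parameters(), lr=1e-3)                        # adaptive moments
adamw   = optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)    # decoupled weight decay
rmsprop = optim.RMSprop(model.parameters(), lr=1e-3)                     # running avg of squared grads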
The Transformer, introduced in "Attention is All You Need" (2017), replaced recurrence with self-attention. For each token, self-attention computes a weighted sum over all other tokens -- allowing the model to relate "it" to "cat" across an entire paragraph in one step.
# Each token is projected into Query, Key, and Value vectors
Q = X @ W_q   # "what am I looking for?"
K = X @ W_k   # "what do I contain?"
V = X @ W_v   # "what information do I pass on?"

# Attention scores: how much each token should attend to each other
scores = Q @ K.T / sqrt(d_k)   # scaled dot product
weights = softmax(scores)      # normalise to probabilities

# Weighted sum of Values
attention_output = weights @ V   # attended representation

# Multi-Head: run h parallel attention heads, then concatenate
# Each head learns different relationship types (syntax, coreference, etc.)
Key advantages of self-attention over RNNs: processes all tokens in parallel (faster training), no vanishing gradient over long sequences, and explicit attention weights make the model somewhat interpretable.
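The same computation, written out as runnable single-head PyTorch code (the dimensions are chosen arbitrarily for illustration):

import math
import torch
import torch.nn.functional as F

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))   # (seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)                        # each row sums to 1
    return weights @ V                                         # (seq_len, d_k)

seq_len, d_model, d_k = 5, 16, 8
X = torch.randn(seq_len, d_model)                              # 5 token embeddings
W_q, W_k, W_v = (torch.randn(d_model, d_k) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)                  # torch.Size([5, 8])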
| Technique | Purpose | How It Works |
|---|---|---|
| Batch Normalisation | Training stability | Normalises layer activations to zero mean/unit variance within each mini-batch. Reduces internal covariate shift, allows higher learning rates. |
| Layer Normalisation | Training stability (NLP) | Like BatchNorm but normalises across features of a single example. Standard in Transformers -- doesn't depend on batch size. |
| Dropout | Regularisation | Randomly zeros out a fraction p of neurons during each training step. Forces redundant learning. At inference, all neurons active, weights scaled by (1−p). |
| Weight Decay (L2) | Regularisation | Penalises large weights. In AdamW, applied directly to weights, not the gradient -- fixes a subtle bug in Adam + L2. |
| Learning Rate Scheduling | Better convergence | Warm-up then cosine decay is the standard in Transformer training. CyclicLR or ReduceLROnPlateau for CNNs. |
| Gradient Clipping | Prevent exploding gradients | Clips gradient norm to a maximum value (typically 1.0). Standard in RNN and LLM training. |
| Data Augmentation | Regularisation via data | Random crops, flips, rotations, colour jitter for images. Mixup, CutMix for advanced augmentation. Effectively multiplies dataset size. |
| Early Stopping | Prevent overfitting | Monitor validation loss; stop training when it starts increasing. Restore best checkpoint. |
| Mixed Precision Training | Speed & memory | Use FP16 for forward/backward pass, FP32 for weight updates. Halves memory; 2-3× faster on modern GPUs. Use torch.cuda.amp. |
from torch.cuda.amp import GradScaler, autocast
scaler = GradScaler() # handles FP16 scaling
for X_batch, y_batch in train_loader:
    optimizer.zero_grad()

    # Forward pass in FP16
    with autocast():
        logits = model(X_batch)
        loss = criterion(logits, y_batch)

    # Backward in FP16, but scaled to avoid underflow
    scaler.scale(loss).backward()

    # Unscale before clipping (required for clipping to work correctly)
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    # Weight update
    scaler.step(optimizer)
    scaler.update()
    scheduler.step()  # update learning rate
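The scheduler stepped above isn't defined in the snippet; one way to get the warm-up-then-cosine-decay schedule mentioned in the table is to chain two built-in schedulers (a sketch stepping per epoch, reusing the optimizer from earlier; the 5/45 split is an arbitrary example):

from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

warmup = LinearLR(optimizer, start_factor=0.01, total_iters=5)   # ramp up over 5 epochs
decay  = CosineAnnealingLR(optimizer, T_max=45)                  # cosine decay for the rest
scheduler = SequentialLR(optimizer, schedulers=[warmup, decay], milestones=[5])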
Training deep networks from scratch requires millions of labelled samples and days of GPU time. Transfer learning solves this: take a model pretrained on a large general dataset, then adapt it to your specific task with far less data and compute.
Feature extraction (freeze the backbone -- sketched after this list):
- Freeze all pretrained layers
- Only train a new output head
- Fastest, fewest parameters to tune
- Best when your data is very similar to pretraining data
- Works with very small datasets (hundreds of examples)

Fine-tuning:
- Unfreeze all (or top) layers
- Train end-to-end on your task
- Better performance, more data needed
- Use a small learning rate (1e-5 to 5e-5) to avoid forgetting
- Standard for BERT/GPT domain adaptation
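A minimal feature-extraction sketch using torchvision's pretrained ResNet-18 (the 5-class head is an arbitrary example):

import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)   # pretrained on ImageNet

for param in model.parameters():
    param.requires_grad = False                    # freeze the entire backbone

model.fc = nn.Linear(model.fc.in_features, 5)      # new head: 5 classes, trainable by default

# Only the new head's parameters are handed to the optimiser
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)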
With billion-parameter LLMs, even fine-tuning is expensive. PEFT methods update a tiny fraction of parameters while keeping the backbone frozen:
- LoRA (Low-Rank Adaptation) -- adds small trainable rank-decomposition matrices to attention layers. Trains <1% of parameters with near-full-fine-tune performance. Used to create domain-specific LLMs cheaply (sketched after this list).
- Adapter Layers -- insert small bottleneck modules between transformer layers. Freeze base model, train adapters only.
- Prompt Tuning -- prepend learnable "soft prompt" tokens to the input. The model backbone is frozen; only the prompt embeddings are updated.
- QLoRA -- quantise the base model to 4-bit, apply LoRA on top. Fine-tune a 70B model on a single 48GB GPU.
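A LoRA sketch using the Hugging Face peft library -- GPT-2 stands in for a larger model, and the hyperparameters are illustrative rather than recommended:

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, TaskType

base = AutoModelForCausalLM.from_pretrained("gpt2")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                          # rank of the low-rank update matrices
    lora_alpha=16,                # scaling factor applied to the update
    lora_dropout=0.05,
    target_modules=["c_attn"],    # GPT-2's fused attention projection
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()   # typically well under 1% of all parameters

For a standard full fine-tune (no PEFT), the Hugging Face Trainer workflow looks like this: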
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    TrainingArguments, Trainer
)
from datasets import load_dataset

# Load a pretrained BERT model
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2
)

# Tokenise your dataset
dataset = load_dataset("imdb")

def tokenise(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")

dataset = dataset.map(tokenise, batched=True)

# Fine-tune
args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

trainer = Trainer(model=model, args=args,
                  train_dataset=dataset["train"],
                  eval_dataset=dataset["test"])
trainer.train()