Deep Learning

Neural networks, backpropagation, and the architectures — CNN, RNN, Transformer, Diffusion — that power image recognition, language models, speech synthesis, and modern generative AI.

Intermediate · Neural Networks · Backpropagation · CNNs · Transformers · Transfer Learning · Regularisation · PyTorch
1 Deep Learning vs. Classical Machine Learning

Deep Learning (DL) is a subset of Machine Learning that uses artificial neural networks with many layers to learn hierarchical representations from raw data. It's not a different field — it's ML taken to extremes of scale and depth.

🧠 Classical ML requires feature engineering — domain experts decide what to measure. Deep Learning performs end-to-end feature learning — given raw pixels, audio waveforms, or text tokens, the network discovers which features matter.
Classical Machine Learning
  • Needs manual feature engineering
  • Works well with small-to-medium datasets
  • Interpretable (decision trees, linear models)
  • Gradient boosting dominates tabular data
  • Training is fast (minutes)
  • Deployed on CPUs without issue
Deep Learning
  • Learns features automatically from raw data
  • Scales with data — better with millions of samples
  • Often a black box (needs XAI tools)
  • Dominates images, text, audio, video
  • Training can take hours to weeks on GPUs
  • Requires GPU/TPU hardware for efficient training

The key driver of DL's dominance was the convergence of ImageNet (millions of labelled images), NVIDIA CUDA (GPU-accelerated matrix math), and architectural innovations (AlexNet in 2012 → ResNet → Transformer → GPT-4). DL now achieves superhuman performance in vision, language, speech, and game playing.

2 Neural Network Fundamentals
The Artificial Neuron

A single neuron takes multiple inputs, multiplies each by a weight, sums them up, adds a bias, then passes the result through an activation function:

Single neuron — mathematical form
output = activation(w1*x1 + w2*x2 + ... + wn*xn + bias)

# In matrix form:
output = activation(W · x + b)

# In PyTorch:
import torch.nn as nn
neuron = nn.Linear(in_features=3, out_features=1)  # 3 inputs, 1 output
Activation Functions

Without activation functions, stacking layers would just be repeated linear transformations — equivalent to a single linear layer. Activation functions introduce non-linearity, enabling networks to learn complex patterns.

Function | Formula | Used Where | Notes
ReLU | max(0, x) | Hidden layers (default) | Fast, simple; suffers from "dying ReLU" for x<0
Leaky ReLU | max(0.01x, x) | Hidden layers | Fixes dying ReLU with a small slope for negatives
GELU | x · Φ(x) | Transformers, BERT, GPT | Smooth version of ReLU; standard in modern LLMs
Sigmoid | 1 / (1 + e⁻ˣ) | Binary output layer | Squashes to (0,1); saturation causes vanishing gradients
Softmax | eˣⁱ / Σeˣʲ | Multi-class output layer | Outputs a probability distribution over classes
Tanh | (eˣ − e⁻ˣ)/(eˣ + e⁻ˣ) | RNNs, hidden layers | Squashes to (−1, 1); zero-centred unlike sigmoid
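A quick way to get a feel for these shapes is to run each activation on the same tensor. A minimal sketch using the standard PyTorch implementations (the sample values are illustrative):

PyTorch — Trying the activations on a sample tensor
import torch
import torch.nn.functional as F

x = torch.linspace(-3, 3, steps=7)    # [-3, -2, -1, 0, 1, 2, 3]

print(F.relu(x))                      # negatives clipped to 0
print(F.leaky_relu(x, 0.01))          # negatives scaled by 0.01 instead of zeroed
print(F.gelu(x))                      # smooth transition around zero
print(torch.sigmoid(x))               # squashed into (0, 1)
print(torch.tanh(x))                  # squashed into (-1, 1)
print(F.softmax(x, dim=0))            # 7 non-negative values that sum to 1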
Layers and Network Depth

A neural network is organised into layers:

  • Input layer — receives raw features (pixels, token embeddings, numerical values).
  • Hidden layers — learn progressively abstract representations. Early layers learn edges; deep layers learn concepts.
  • Output layer — produces the final prediction. One neuron per class for classification; one for regression.

Depth (number of layers) enables learning hierarchical features. A CNN trained on faces learns: edges → curves → eyes/nose → face. The "deep" in Deep Learning refers to networks with many layers — modern LLMs have 96+ layers.

PyTorch — A fully-connected feedforward network
import torch
import torch.nn as nn

class FeedForwardNet(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.3),          # regularisation
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim, output_dim)
        )

    def forward(self, x):
        return self.net(x)

# Instantiate
model = FeedForwardNet(input_dim=784, hidden_dim=256, output_dim=10)
print(model)  # see architecture summary
3 How Networks Learn: Backpropagation

Backpropagation is the algorithm that trains neural networks. It's the chain rule of calculus applied recursively through a computation graph to compute gradients of the loss with respect to every weight.
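A minimal autograd sketch of that chain rule on a single neuron (the toy numbers are illustrative): PyTorch records the forward computation graph, and loss.backward() walks it backwards to fill each parameter's .grad.

PyTorch — Backpropagation on a single neuron with autograd
import torch

w = torch.tensor(2.0, requires_grad=True)   # a single weight
b = torch.tensor(1.0, requires_grad=True)   # a single bias
x = torch.tensor(3.0)                       # one input

y = torch.relu(w * x + b)                   # forward pass: relu(2*3 + 1) = 7
loss = (y - 10.0) ** 2                      # squared-error loss = 9

loss.backward()                             # chain rule applied through the graph
print(w.grad)   # dloss/dw = 2*(y-10) * x = -18  (ReLU is active here, so its gradient is 1)
print(b.grad)   # dloss/db = 2*(y-10)     = -6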

The Training Loop (Forward + Backward Pass)
PyTorch — The complete training loop
import torch
import torch.nn as nn
import torch.optim as optim

model = FeedForwardNet(784, 256, 10)
optimizer = optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

for epoch in range(50):
    model.train()
    for X_batch, y_batch in train_loader:
        # 1. Forward pass — compute predictions
        logits = model(X_batch)

        # 2. Compute loss
        loss = criterion(logits, y_batch)

        # 3. Backward pass — compute gradients via backprop
        optimizer.zero_grad()    # clear previous gradients
        loss.backward()          # backpropagate through the graph

        # 4. Update weights — gradient descent step
        optimizer.step()

    # Validation pass (no gradient tracking needed)
    model.eval()
    with torch.no_grad():
        val_preds = model(X_val)
        val_loss = criterion(val_preds, y_val)
    print(f"Epoch {epoch+1}: train={loss:.4f} val={val_loss:.4f}")
Vanishing & Exploding Gradients

As gradients are multiplied through many layers, they can shrink exponentially (vanishing gradients), making early layers learn very slowly, or grow exponentially (exploding gradients), causing training to diverge. Solutions:

  • Residual connections (ResNets) — skip connections that let gradients flow directly from output to early layers, bypassing the multiplication chain (sketched after this list).
  • Batch Normalisation — normalises layer inputs, keeping activations in a healthy range and stabilising training significantly.
  • Gradient clipping — caps the gradient norm at a maximum value. Standard practice in RNN and Transformer training.
  • Weight initialisation — He init for ReLU networks, Xavier/Glorot for tanh. Good init means gradients start in a stable range.
  • LSTM / GRU gating — designed specifically to mitigate vanishing gradients in sequential models.
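A minimal residual-block sketch (a simplified version of the ResNet idea; the channel count and layer choices are illustrative) — the "+ x" skip path gives gradients a direct route around the convolutions:

PyTorch — A residual block (sketch)
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)   # skip connection: gradients flow through "+ x" unchanged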
Optimisers Beyond SGD
Optimiser | Key Idea | When to Use
SGD + Momentum | Accumulates velocity across steps; smooths noisy gradients | Computer vision, when you want tight control
Adam | Adaptive per-parameter learning rates using first & second moments | Default for most DL tasks — fast convergence
AdamW | Adam + decoupled weight decay (L2 reg applied correctly) | Transformers, LLMs — standard in Hugging Face
RMSProp | Divides learning rate by running average of squared gradients | RNNs; predecessor to Adam
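A minimal sketch constructing each of these in PyTorch, reusing the model defined earlier (the learning rates are common starting points, not tuned values):

PyTorch — Constructing the optimisers
import torch.optim as optim

opt_sgd = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
opt_adam = optim.Adam(model.parameters(), lr=1e-3)
opt_adamw = optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)   # decoupled weight decay
opt_rmsprop = optim.RMSprop(model.parameters(), lr=1e-3)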
4 Key Deep Learning Architectures
🖼️
CNN
Convolutional Neural Network
Uses convolutional filters that slide across the input to detect local patterns (edges, textures, shapes). Translation invariant. Powers image classification, object detection, and medical imaging. Key models: ResNet, EfficientNet, YOLO. A minimal CNN block is sketched after these architecture cards.
➡️
RNN / LSTM
Recurrent & Long Short-Term Memory
Processes sequences step-by-step, maintaining hidden state. LSTMs add gating mechanisms to remember or forget information over long sequences. Used in time-series, speech recognition, and older NLP.
🔄
Transformer
Self-Attention Architecture
Processes all tokens simultaneously using self-attention — every token attends to every other. No recurrence means massively parallel training. Foundation of BERT, GPT, T5, Llama, and all modern LLMs. Also powers ViT, Whisper, and Stable Diffusion.
🎨
Diffusion Model
Denoising Diffusion Probabilistic Model
Learns to reverse a noise-adding process. Training: gradually corrupt images with Gaussian noise. Inference: start from pure noise, denoise iteratively. Powers Stable Diffusion, DALL·E, Sora, and MidJourney.
⚔️
GAN
Generative Adversarial Network
Two networks compete: a Generator creates fake samples; a Discriminator tries to detect fakes. Adversarial training drives the Generator to produce increasingly realistic outputs. Used in image synthesis, deepfakes, data augmentation.
👁️
ViT
Vision Transformer
Applies the Transformer architecture to images by splitting them into patches (like tokens). With enough data, surpasses CNNs. Powers modern multimodal models (GPT-4V, Gemini) that understand both images and text.
💬
Encoder-Decoder
Seq2Seq Architecture
Encoder compresses input into a latent representation; Decoder generates output from it. Powers machine translation, text summarisation, image captioning, and code generation (T5, BART, original Transformer).
🧬
Autoencoder / VAE
Variational Autoencoder
Learns a compressed latent representation by encoding to a bottleneck then decoding back. VAEs add a probabilistic latent space, enabling sampling. Used in anomaly detection, image generation, and drug discovery.
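As a concrete instance of the CNN card above, a minimal convolutional stack in PyTorch (channel counts and the 10-class head are illustrative, and the sketch assumes 32×32 inputs such as CIFAR-10):

PyTorch — A minimal CNN (sketch)
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # 3-channel image -> 16 feature maps
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 32x32 -> 16x16
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 16x16 -> 8x8
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),                    # 10-class classifier head
)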
The Transformer and Self-Attention (Deeper)

The Transformer, introduced in "Attention is All You Need" (2017), replaced recurrence with self-attention. For each token, self-attention computes a weighted sum over all other tokens — allowing the model to relate "it" to "cat" across an entire paragraph in one step.

Self-Attention — the core operation (simplified)
import torch
import torch.nn.functional as F

seq_len, d_model, d_k = 8, 64, 64          # illustrative sizes
X = torch.randn(seq_len, d_model)          # token embeddings
W_q = torch.randn(d_model, d_k)            # learned projection matrices
W_k = torch.randn(d_model, d_k)
W_v = torch.randn(d_model, d_k)

# Each token is projected into Query, Key, and Value vectors
Q = X @ W_q   # "what am I looking for?"
K = X @ W_k   # "what do I contain?"
V = X @ W_v   # "what information do I pass on?"

# Attention scores: how much each token should attend to every other token
scores = Q @ K.T / d_k ** 0.5              # scaled dot product
weights = F.softmax(scores, dim=-1)        # normalise each row into a probability distribution

# Weighted sum of Values
attention_output = weights @ V             # attended representation

# Multi-Head: run h parallel attention heads, then concatenate
# Each head learns different relationship types (syntax, coreference, etc.)

Key advantages of self-attention over RNNs: all tokens are processed in parallel (faster training), gradients do not vanish over long sequences, and the explicit attention weights make the model somewhat interpretable.
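PyTorch packages the whole operation, including the multi-head split, as a module. A minimal sketch with nn.MultiheadAttention (the dimensions are illustrative):

PyTorch — Self-attention as a built-in module
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)
x = torch.randn(2, 10, 64)              # (batch, seq_len, embed_dim)
out, weights = attn(x, x, x)            # self-attention: query = key = value = x
print(out.shape)        # torch.Size([2, 10, 64])
print(weights.shape)    # torch.Size([2, 10, 10]) — attention weights, averaged over heads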

5 Training Techniques & Regularisation
Technique | Purpose | How It Works
Batch Normalisation | Training stability | Normalises layer activations to zero mean/unit variance within each mini-batch. Reduces internal covariate shift, allows higher learning rates.
Layer Normalisation | Training stability (NLP) | Like BatchNorm but normalises across the features of a single example. Standard in Transformers — doesn't depend on batch size.
Dropout | Regularisation | Randomly zeros out a fraction p of neurons during each training step. Forces redundant learning. At inference, all neurons are active and weights are scaled by (1−p).
Weight Decay (L2) | Regularisation | Penalises large weights. In AdamW, applied directly to the weights, not the gradient — fixes a subtle bug in Adam + L2.
Learning Rate Scheduling | Better convergence | Warm-up then cosine decay is the standard in Transformer training. CyclicLR or ReduceLROnPlateau for CNNs.
Gradient Clipping | Prevent exploding gradients | Clips gradient norm to a maximum value (typically 1.0). Standard in RNN and LLM training.
Data Augmentation | Regularisation via data | Random crops, flips, rotations, colour jitter for images; Mixup and CutMix for advanced augmentation. Effectively multiplies dataset size.
Early Stopping | Prevent overfitting | Monitor validation loss; stop training when it starts increasing. Restore the best checkpoint.
Mixed Precision Training | Speed & memory | Use FP16 for the forward/backward pass, FP32 for weight updates. Halves memory; 2–3× faster on modern GPUs. Use torch.cuda.amp.
PyTorch — Mixed precision + gradient clipping
from torch.cuda.amp import GradScaler, autocast

scaler = GradScaler()   # handles FP16 scaling

for X_batch, y_batch in train_loader:
    optimizer.zero_grad()

    # Forward pass in FP16
    with autocast():
        logits = model(X_batch)
        loss = criterion(logits, y_batch)

    # Backward in FP16, but scaled to avoid underflow
    scaler.scale(loss).backward()

    # Unscale before clipping (required for clip to work correctly)
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    # Weight update
    scaler.step(optimizer)
    scaler.update()

scheduler.step()    # update learning rate
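The scheduler.step() call above assumes a learning-rate scheduler was created alongside the optimiser. The table's learning-rate row mentions warm-up followed by cosine decay; a minimal sketch with built-in PyTorch schedulers (the warm-up length and epoch counts are illustrative):

PyTorch — Warm-up + cosine learning-rate schedule (sketch)
import torch.optim as optim
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

optimizer = optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
scheduler = SequentialLR(
    optimizer,
    schedulers=[
        LinearLR(optimizer, start_factor=0.01, total_iters=5),   # 5-epoch linear warm-up
        CosineAnnealingLR(optimizer, T_max=45),                  # cosine decay over the remaining epochs
    ],
    milestones=[5],
)
# Call scheduler.step() once per epoch, as in the training loop above.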
6 Transfer Learning & Fine-Tuning

Training deep networks from scratch requires millions of labelled samples and days of GPU time. Transfer learning solves this: take a model pretrained on a large general dataset, then adapt it to your specific task with far less data and compute.

💡 Transfer learning is why a team of 3 engineers can build a state-of-the-art model for a niche domain in days — by standing on the shoulders of models trained on billions of examples.
Three Strategies
Feature Extraction
  • Freeze all pretrained layers
  • Only train a new output head
  • Fastest, fewest parameters to tune
  • Best when your data is very similar to pretraining data
  • Works with very small datasets (hundreds of examples)
Fine-Tuning
  • Unfreeze all (or top) layers
  • Train end-to-end on your task
  • Better performance, more data needed
  • Use small learning rate (1e-5 to 5e-5) to avoid forgetting
  • Standard for BERT/GPT domain adaptation (both strategies are sketched after this list)
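A minimal torchvision sketch of the two strategies above (the pretrained backbone, the 5-class head, and the learning rate are illustrative choices):

PyTorch — Feature extraction vs. fine-tuning with torchvision (sketch)
import torch
import torch.nn as nn
from torchvision import models

# An ImageNet-pretrained backbone
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Strategy 1 — feature extraction: freeze the backbone, train only a new head
for param in model.parameters():
    param.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 5)   # new 5-class head (trainable by default)

# Strategy 2 — fine-tuning: unfreeze everything, train end-to-end with a small learning rate
for param in model.parameters():
    param.requires_grad = True
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)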
Parameter-Efficient Fine-Tuning (PEFT)

With billion-parameter LLMs, even fine-tuning is expensive. PEFT methods update a tiny fraction of parameters while keeping the backbone frozen:

  • LoRA (Low-Rank Adaptation) — adds small trainable rank-decomposition matrices to attention layers. Trains <1% of parameters with near-full-fine-tune performance. Used to create domain-specific LLMs cheaply (a sketch follows the fine-tuning example below).
  • Adapter Layers — insert small bottleneck modules between transformer layers. Freeze base model, train adapters only.
  • Prompt Tuning — prepend learnable "soft prompt" tokens to the input. The model backbone is frozen; only the prompt embeddings are updated.
  • QLoRA — quantise the base model to 4-bit, apply LoRA on top. Fine-tune a 70B model on a single 48GB GPU.
Python — Fine-tune BERT for text classification (Hugging Face)
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    TrainingArguments, Trainer
)
from datasets import load_dataset

# Load a pretrained BERT model
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2
)

# Tokenise your dataset
dataset = load_dataset("imdb")
def tokenise(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")
dataset = dataset.map(tokenise, batched=True)

# Fine-tune
args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)
trainer = Trainer(model=model, args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"])
trainer.train()
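To turn the full fine-tune above into a LoRA fine-tune, a minimal sketch with the Hugging Face peft library (the target_modules names here match BERT's attention projections; other architectures use different module names):

Python — LoRA on top of the BERT fine-tune (Hugging Face PEFT, sketch)
from peft import LoraConfig, TaskType, get_peft_model

lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,          # sequence-classification head
    r=8,                                 # rank of the low-rank update matrices
    lora_alpha=16,                       # scaling factor for the update
    lora_dropout=0.05,
    target_modules=["query", "value"],   # BERT attention projections to adapt
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()       # typically well under 1% of the base parameters

# Pass this wrapped model to the same Trainer as above — only the LoRA matrices are trained.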
7 Deep Learning Tools & Ecosystem
🔥
PyTorch
Meta AI
The dominant research framework. Dynamic computation graph, Pythonic API, excellent debugging. Standard in academic papers and increasingly in production. PyTorch 2.0 adds torch.compile for 2× speed.
💻
TensorFlow / Keras
Google
TensorFlow with Keras API. Strong production deployment story (TensorFlow Serving, TFLite for mobile, TF.js for browser). Static graph enables heavy optimisation.
🤗
Hugging Face
Transformers + Datasets
The GitHub of ML models. 500k+ pretrained models, easy tokenisers, Trainer API, PEFT library. Essential for NLP, vision, and multimodal work. Saves weeks of engineering.
Lightning / Fabric
PyTorch Lightning
Removes boilerplate from PyTorch training loops. Handles multi-GPU, mixed precision, logging, checkpointing automatically. Lets you focus on the model, not the infrastructure.
🏋️
W&B / MLflow
Experiment Tracking
Log metrics, hyperparameters, gradients, and model artefacts across every training run. Compare experiments visually. Essential for reproducibility and hyperparameter optimisation.
🛠️
ONNX / TensorRT
Deployment & Optimisation
ONNX converts models between frameworks. TensorRT (NVIDIA) optimises models for GPU inference — quantisation, layer fusion, kernel autotuning. Critical for low-latency production serving.
8 Real-World Deep Learning Applications
🖥️
Computer Vision
Images & Video
Medical image segmentation (detecting tumours), autonomous vehicle perception (detecting pedestrians at 60mph), quality control in manufacturing, facial recognition, satellite image analysis for agriculture.
📝
Natural Language Processing
Text Understanding & Generation
ChatGPT, Claude, Gemini. Also: customer support automation, legal document review, medical note summarisation, real-time translation, sentiment analysis at scale, code generation.
🎤
Speech & Audio
Voice AI
Whisper (OpenAI) achieves near-human speech recognition across 100 languages. ElevenLabs and similar systems clone voices from seconds of audio. Real-time transcription in Google Meet, Zoom, and clinical settings.
💊
Drug Discovery
Computational Biology
AlphaFold2 predicted the 3D structure of virtually every known protein — a 50-year unsolved problem. DL is now used to screen billions of molecules for drug candidates, cutting discovery timelines from years to months.
🤖
Robotics & Control
Embodied AI
Deep RL + imitation learning enable robots to learn dexterous manipulation. Boston Dynamics, Tesla Optimus, and DeepMind's RT-2 use transformer-based policies that generalise across tasks.
🎨
Generative Media
Images, Video, 3D
Stable Diffusion, DALL·E, MidJourney generate photorealistic images from text prompts. Sora generates multi-minute videos. NeRF and Gaussian Splatting reconstruct 3D scenes from 2D photos.
Ready to go deeper?
Start the GenAI Engineer Roadmap
Build LLM applications, RAG systems, and AI agents from scratch