Deep Learning
Neural networks, backpropagation, and the architectures — CNN, RNN, Transformer, Diffusion — that power image recognition, language models, speech synthesis, and modern generative AI.
Deep Learning (DL) is a subset of Machine Learning that uses artificial neural networks with many layers to learn hierarchical representations from raw data. It's not a different field -- it's ML taken to extremes of scale and depth.
Classical Machine Learning:
- Needs manual feature engineering
- Works well with small-to-medium datasets
- Interpretable (decision trees, linear models)
- Gradient boosting dominates tabular data
- Training is fast (minutes)
- Deployed on CPUs without issue

Deep Learning:
- Learns features automatically from raw data
- Scales with data -- better with millions of samples
- Often a black box (needs XAI tools)
- Dominates images, text, audio, video
- Training can take hours to weeks on GPUs
- Requires GPU/TPU hardware for efficient training
The key driver of DL's dominance was the convergence of ImageNet (millions of labelled images), NVIDIA CUDA (GPU-accelerated matrix math), and architectural innovations (AlexNet in 2012 -> ResNet -> Transformer -> GPT-4). DL now achieves superhuman performance in vision, language, speech, and game playing.
A single neuron takes multiple inputs, multiplies each by a weight, sums them up, adds a bias, then passes the result through an activation function:
output = activation(w1*x1 + w2*x2 + ... + wn*xn + bias)

# In matrix form:
output = activation(W * x + b)

# In PyTorch:
import torch.nn as nn
neuron = nn.Linear(in_features=3, out_features=1)  # 3 inputs, 1 output
Without activation functions, stacking layers would just be repeated linear transformations -- equivalent to a single linear layer. Activation functions introduce non-linearity, enabling networks to learn complex patterns.
| Function | Formula | Used Where | Notes |
|---|---|---|---|
| ReLU | max(0, x) | Hidden layers (default) | Fast, simple; suffers from "dying ReLU" for x<0 |
| Leaky ReLU | max(0.01x, x) | Hidden layers | Fixes dying ReLU with a small slope for negatives |
| GELU | x * Φ(x) | Transformers, BERT, GPT | Smooth version of ReLU; standard in modern LLMs |
| Sigmoid | 1 / (1 + e⁻ˣ) | Binary output layer | Squashes to (0,1); saturation causes vanishing gradients |
| Softmax | exp(xᵢ) / Σⱼ exp(xⱼ) | Multi-class output layer | Outputs a probability distribution over classes |
| Tanh | (eˣ − e⁻ˣ)/(eˣ + e⁻ˣ) | RNNs, hidden layers | Squashes to (−1, 1); zero-centred unlike sigmoid |
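To see how these behave on a few sample values, here is a quick PyTorch comparison (functional forms; the corresponding modules such as nn.ReLU and nn.GELU behave the same, and the sample values are arbitrary):

import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 1.5, 3.0])

print(F.relu(x))              # negatives clipped to 0
print(F.leaky_relu(x, 0.01))  # small slope for negatives instead of 0
print(F.gelu(x))              # smooth ReLU-like curve used in Transformers
print(torch.sigmoid(x))       # squashed into (0, 1)
print(torch.tanh(x))          # squashed into (-1, 1)
print(F.softmax(x, dim=0))    # all five values turned into one probability distribution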
A neural network is organised into layers:
- Input layer -- receives raw features (pixels, token embeddings, numerical values).
- Hidden layers -- learn progressively abstract representations. Early layers learn edges; deep layers learn concepts.
- Output layer -- produces the final prediction. One neuron per class for classification; one for regression.
Depth (number of layers) enables learning hierarchical features. A CNN trained on faces learns: edges -> curves -> eyes/nose -> face. The "deep" in Deep Learning refers to networks with many layers -- modern LLMs have 96+ layers.
import torch
import torch.nn as nn
class FeedForwardNet(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.3),  # regularisation
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim, output_dim)
        )

    def forward(self, x):
        return self.net(x)
# Instantiate
model = FeedForwardNet(input_dim=784, hidden_dim=256, output_dim=10)
print(model) # see architecture summary
Backpropagation is the algorithm that trains neural networks. It's the chain rule of calculus applied recursively through a computation graph to compute gradients of the loss with respect to every weight.
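To make the chain rule concrete, here is a tiny autograd example (the values are arbitrary):

import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 2          # dy/dx = 2x
z = 3 * y           # dz/dy = 3

z.backward()        # backprop: dz/dx = dz/dy * dy/dx = 3 * 2x
print(x.grad)       # tensor(12.) at x = 2

The full training loop below relies on exactly this mechanism: loss.backward() walks the computation graph and fills every parameter's .grad.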
import torch
import torch.nn as nn
import torch.optim as optim
model = FeedForwardNet(784, 256, 10)
optimizer = optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
for epoch in range(50):
    model.train()
    for X_batch, y_batch in train_loader:
        # 1. Forward pass -- compute predictions
        logits = model(X_batch)

        # 2. Compute loss
        loss = criterion(logits, y_batch)

        # 3. Backward pass -- compute gradients via backprop
        optimizer.zero_grad()  # clear previous gradients
        loss.backward()        # backpropagate through the graph

        # 4. Update weights -- gradient descent step
        optimizer.step()

    # Validation pass (no gradient tracking needed)
    model.eval()
    with torch.no_grad():
        val_preds = model(X_val)
        val_loss = criterion(val_preds, y_val)

    print(f"Epoch {epoch+1}: train={loss:.4f} val={val_loss:.4f}")
As gradients are multiplied through many layers, they can shrink exponentially (vanishing gradients), making early layers learn very slowly, or grow exponentially (exploding gradients), causing training to diverge. Solutions:
- Residual connections (ResNets) -- skip connections that let gradients flow directly from output to early layers, bypassing the multiplication chain (sketched after this list).
- Batch Normalisation -- normalises layer inputs, keeping activations in a healthy range and stabilising training significantly.
- Gradient clipping -- caps the gradient norm at a maximum value. Standard practice in RNN and Transformer training.
- Weight initialisation -- He init for ReLU networks, Xavier/Glorot for tanh. Good init means gradients start in a stable range.
- LSTM / GRU gating -- designed specifically to mitigate vanishing gradients in sequential models.
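A minimal residual block in PyTorch, illustrating the skip connection (fully connected rather than convolutional, purely for brevity):

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.block = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x):
        # Skip connection: the identity term contributes a gradient of 1,
        # so the signal reaching earlier layers never has to pass through
        # the full chain of multiplications.
        return x + self.block(x)

x = torch.randn(8, 64)
print(ResidualBlock(64)(x).shape)  # torch.Size([8, 64])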
| Optimiser | Key Idea | When to Use |
|---|---|---|
| SGD + Momentum | Accumulates velocity across steps; smooths noisy gradients | Computer vision, when you want tight control |
| Adam | Adaptive per-parameter learning rates using first & second moments | Default for most DL tasks -- fast convergence |
| AdamW | Adam + decoupled weight decay (L2 reg applied correctly) | Transformers, LLMs -- standard in Hugging Face |
| RMSProp | Divides learning rate by running average of squared gradients | RNNs; predecessor to Adam |
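In PyTorch each of these is a one-liner; a sketch assuming the FeedForwardNet model defined earlier (the learning rates shown are common starting points, not universal recommendations):

import torch.optim as optim

sgd     = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)            # SGD + Momentum
adam    = optim.Adam(model.parameters(), lr=1e-3)                        # adaptive moments
adamw   = optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)    # decoupled weight decay
rmsprop = optim.RMSprop(model.parameters(), lr=1e-3)                     # running avg of squared grads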
The Transformer, introduced in "Attention is All You Need" (2017), replaced recurrence with self-attention. For each token, self-attention computes a weighted sum over all other tokens -- allowing the model to relate "it" to "cat" across an entire paragraph in one step.
# Each token is projected into Query, Key, and Value vectors
Q = X @ W_q   # "what am I looking for?"
K = X @ W_k   # "what do I contain?"
V = X @ W_v   # "what information do I pass on?"

# Attention scores: how much each token should attend to each other
scores = Q @ K.T / sqrt(d_k)   # scaled dot product
weights = softmax(scores)      # normalise to probabilities

# Weighted sum of Values
attention_output = weights @ V   # attended representation

# Multi-Head: run h parallel attention heads, then concatenate
# Each head learns different relationship types (syntax, coreference, etc.)
Key advantages of self-attention over RNNs: processes all tokens in parallel (faster training), no vanishing gradient over long sequences, and explicit attention weights make the model somewhat interpretable.
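The same computation, written out as runnable single-head PyTorch code (the dimensions are chosen arbitrarily for illustration):

import math
import torch
import torch.nn.functional as F

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))   # (seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)                        # each row sums to 1
    return weights @ V                                         # (seq_len, d_k)

seq_len, d_model, d_k = 5, 16, 8
X = torch.randn(seq_len, d_model)                              # 5 token embeddings
W_q, W_k, W_v = (torch.randn(d_model, d_k) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)                  # torch.Size([5, 8])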
| Technique | Purpose | How It Works |
|---|---|---|
| Batch Normalisation | Training stability | Normalises layer activations to zero mean/unit variance within each mini-batch. Reduces internal covariate shift, allows higher learning rates. |
| Layer Normalisation | Training stability (NLP) | Like BatchNorm but normalises across features of a single example. Standard in Transformers -- doesn't depend on batch size. |
| Dropout | Regularisation | Randomly zeros out a fraction p of neurons during each training step. Forces redundant learning. At inference, all neurons active, weights scaled by (1−p). |
| Weight Decay (L2) | Regularisation | Penalises large weights. In AdamW, applied directly to weights, not the gradient -- fixes a subtle bug in Adam + L2. |
| Learning Rate Scheduling | Better convergence | Warm-up then cosine decay is the standard in Transformer training. CyclicLR or ReduceLROnPlateau for CNNs. |
| Gradient Clipping | Prevent exploding gradients | Clips gradient norm to a maximum value (typically 1.0). Standard in RNN and LLM training. |
| Data Augmentation | Regularisation via data | Random crops, flips, rotations, colour jitter for images. Mixup, CutMix for advanced augmentation. Effectively multiplies dataset size. |
| Early Stopping | Prevent overfitting | Monitor validation loss; stop training when it starts increasing. Restore best checkpoint. |
| Mixed Precision Training | Speed & memory | Use FP16 for forward/backward pass, FP32 for weight updates. Halves memory; 2-3× faster on modern GPUs. Use torch.cuda.amp. |
from torch.cuda.amp import GradScaler, autocast
scaler = GradScaler() # handles FP16 scaling
for X_batch, y_batch in train_loader:
    optimizer.zero_grad()

    # Forward pass in FP16
    with autocast():
        logits = model(X_batch)
        loss = criterion(logits, y_batch)

    # Backward in FP16, but scaled to avoid underflow
    scaler.scale(loss).backward()

    # Unscale before clipping (required for clipping to work correctly)
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    # Weight update
    scaler.step(optimizer)
    scaler.update()
    scheduler.step()  # update learning rate
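The scheduler stepped above isn't defined in the snippet; one way to get the warm-up-then-cosine-decay schedule mentioned in the table is to chain two built-in schedulers (a sketch stepping per epoch, reusing the optimizer from earlier; the 5/45 split is an arbitrary example):

from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

warmup = LinearLR(optimizer, start_factor=0.01, total_iters=5)   # ramp up over 5 epochs
decay  = CosineAnnealingLR(optimizer, T_max=45)                  # cosine decay for the rest
scheduler = SequentialLR(optimizer, schedulers=[warmup, decay], milestones=[5])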
Training deep networks from scratch requires millions of labelled samples and days of GPU time. Transfer learning solves this: take a model pretrained on a large general dataset, then adapt it to your specific task with far less data and compute.
Feature extraction (freeze the backbone -- sketched after this list):
- Freeze all pretrained layers
- Only train a new output head
- Fastest, fewest parameters to tune
- Best when your data is very similar to pretraining data
- Works with very small datasets (hundreds of examples)

Fine-tuning:
- Unfreeze all (or top) layers
- Train end-to-end on your task
- Better performance, more data needed
- Use a small learning rate (1e-5 to 5e-5) to avoid forgetting
- Standard for BERT/GPT domain adaptation
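A minimal feature-extraction sketch using torchvision's pretrained ResNet-18 (the 5-class head is an arbitrary example):

import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)   # pretrained on ImageNet

for param in model.parameters():
    param.requires_grad = False                    # freeze the entire backbone

model.fc = nn.Linear(model.fc.in_features, 5)      # new head: 5 classes, trainable by default

# Only the new head's parameters are handed to the optimiser
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)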
With billion-parameter LLMs, even fine-tuning is expensive. PEFT methods update a tiny fraction of parameters while keeping the backbone frozen:
- LoRA (Low-Rank Adaptation) -- adds small trainable rank-decomposition matrices to attention layers. Trains <1% of parameters with near-full-fine-tune performance. Used to create domain-specific LLMs cheaply (sketched after this list).
- Adapter Layers -- insert small bottleneck modules between transformer layers. Freeze base model, train adapters only.
- Prompt Tuning -- prepend learnable "soft prompt" tokens to the input. The model backbone is frozen; only the prompt embeddings are updated.
- QLoRA -- quantise the base model to 4-bit, apply LoRA on top. Fine-tune a 70B model on a single 48GB GPU.
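A LoRA sketch using the Hugging Face peft library -- GPT-2 stands in for a larger model, and the hyperparameters are illustrative rather than recommended:

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, TaskType

base = AutoModelForCausalLM.from_pretrained("gpt2")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                          # rank of the low-rank update matrices
    lora_alpha=16,                # scaling factor applied to the update
    lora_dropout=0.05,
    target_modules=["c_attn"],    # GPT-2's fused attention projection
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()   # typically well under 1% of all parameters

For a standard full fine-tune (no PEFT), the Hugging Face Trainer workflow looks like this: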
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    TrainingArguments, Trainer
)
from datasets import load_dataset

# Load a pretrained BERT model
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2
)

# Tokenise your dataset
dataset = load_dataset("imdb")

def tokenise(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")

dataset = dataset.map(tokenise, batched=True)

# Fine-tune
args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

trainer = Trainer(model=model, args=args,
                  train_dataset=dataset["train"],
                  eval_dataset=dataset["test"])
trainer.train()