Machine Learning
How machines learn from data — covering types, the math behind learning, feature engineering, algorithms, evaluation, and where ML is applied across every industry.
Machine Learning (ML) is a branch of Artificial Intelligence where systems learn patterns from data rather than being explicitly programmed with rules. Instead of writing if/else logic for every possible scenario, you feed the algorithm labelled examples and it discovers the rules itself.
ML emerged as a formal discipline in the 1950s, but the modern revolution was driven by three converging forces: massive datasets (internet-scale), cheap compute (GPUs), and better algorithms. Today, ML is the engine behind spam filters, Netflix recommendations, fraud detection, voice assistants, protein folding, medical imaging, and much more.
Traditional programming:
- Developer writes explicit rules by hand
- Rules are rigid and fragile
- Breaks on edge cases or new data
- Great for deterministic, rule-based logic
- Auditable: you know exactly why it decided
Machine learning:
- Algorithm learns rules directly from data
- Generalises to new, unseen examples
- Handles messy, real-world complexity
- Great for pattern matching and prediction
- Can be opaque (black box) without interpretability tools
ML is the right tool when: the problem involves complex patterns too hard to code manually, you have enough labelled data, and some prediction error is acceptable. ML is not the right tool when: you need guaranteed correct output (use deterministic code), you have very little data, or when a simple rule suffices (don't over-engineer).
At its core, ML is an optimisation problem. A model starts with random parameters, makes predictions, measures how wrong it is using a loss function, then adjusts its parameters to reduce that loss. This cycle repeats thousands of times.
A loss function measures prediction error. Common choices include:
- Mean Squared Error (MSE) — for regression: average of squared differences between predicted and actual values. Penalises large errors heavily.
- Cross-Entropy Loss — for classification: measures how far a predicted probability distribution is from the true label. Used in logistic regression and neural networks.
- Hinge Loss — used in Support Vector Machines; maximises the margin between classes.
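A minimal NumPy sketch of the first two losses (variable names are illustrative):

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean Squared Error: average of squared differences
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    # Cross-entropy for binary labels (0/1) against predicted probabilities
    p = np.clip(p_pred, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

print(mse(np.array([3.0, 5.0]), np.array([2.5, 6.0])))               # 0.625
print(binary_cross_entropy(np.array([1, 0]), np.array([0.9, 0.2])))  # ≈ 0.164
```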
Gradient descent is the core optimisation algorithm. Think of it as trying to find the bottom of a hilly landscape (the loss surface) by always stepping in the direction of steepest descent.
```python
import numpy as np

# Concrete instance: linear regression with MSE loss, full-batch gradient descent
# w = model weights, lr = learning rate
w = np.zeros(X.shape[1])
lr = 0.01
for epoch in range(num_epochs):
    predictions = X @ w                               # make predictions
    loss = np.mean((predictions - y) ** 2)            # measure error
    gradients = 2 * X.T @ (predictions - y) / len(y)  # direction of steepest increase
    w = w - lr * gradients                            # step in the opposite direction
    # repeat until loss stops decreasing
```
Three variants matter in practice:
- Batch GD — uses all training examples per update. Stable, but slow on large datasets.
- Stochastic GD (SGD) — updates on one example at a time. Fast but noisy.
- Mini-batch GD — the most common: updates on small batches (32–256 samples). Best of both worlds.
The learning rate (lr) controls how big each step is. Too large → overshoot the minimum and diverge. Too small → training takes forever. In practice, use adaptive optimisers like Adam or AdaGrad that automatically tune the learning rate per parameter.
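Continuing the sketch above, mini-batch gradient descent shuffles the data each epoch and takes one step per batch (the batch size here is illustrative):

```python
rng = np.random.default_rng(0)
batch_size = 64
for epoch in range(num_epochs):
    idx = rng.permutation(len(y))                  # reshuffle every epoch
    for start in range(0, len(y), batch_size):
        batch = idx[start:start + batch_size]      # indices of one mini-batch
        Xb, yb = X[batch], y[batch]
        preds = Xb @ w
        grads = 2 * Xb.T @ (preds - yb) / len(yb)
        w = w - lr * grads                         # one update per mini-batch
```

In a framework such as PyTorch or Keras you would normally replace the manual update with an optimiser like Adam.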
Supervised learning: you provide labelled training pairs — e.g. 50,000 emails each tagged as "spam" or "not spam". The algorithm learns a mapping from email features to the label. Two sub-tasks:
- Classification — predict a discrete category. Binary: spam/not-spam, fraud/legit. Multi-class: cat/dog/bird. Multi-label: a photo can be tagged {outdoors, people, sports} simultaneously.
- Regression — predict a continuous number. Examples: house price prediction, temperature forecasting, stock return estimation, patient length-of-stay.
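A minimal scikit-learn sketch of both sub-tasks, assuming a feature matrix X with a categorical target y_label and a continuous target y_price (both names are illustrative):

```python
from sklearn.linear_model import LogisticRegression, LinearRegression

# Classification: predict a discrete label (e.g. spam / not spam)
clf = LogisticRegression(max_iter=1000).fit(X, y_label)
print(clf.predict(X[:5]))      # discrete classes

# Regression: predict a continuous number (e.g. house price)
reg = LinearRegression().fit(X, y_price)
print(reg.predict(X[:5]))      # continuous values
```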
Unsupervised learning: no labels are needed. The algorithm must find structure on its own. Key tasks:
- Clustering (k-Means, DBSCAN, Hierarchical) — group similar items together. Used for customer segmentation, document grouping, image compression.
- Dimensionality Reduction (PCA, t-SNE, UMAP) — compress high-dimensional data into 2–3 dimensions for visualisation or to remove noise before feeding into a classifier.
- Anomaly Detection — identify unusual data points. Used for network intrusion detection, manufacturing quality control, financial fraud.
- Association Rule Mining (Apriori) — find co-occurrence patterns in transaction data. The classic example: "customers who buy diapers also buy beer".
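For example, clustering with k-Means (a sketch; the number of clusters is a choice you must make, here set to 4):

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X)       # distance-based: scale features first
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_scaled)              # cluster id for every row
print(labels[:10])
```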
Reinforcement Learning (RL) is conceptually different from supervised and unsupervised learning. An agent observes a state, takes an action, receives a reward signal, and updates its policy (strategy) to maximise long-term reward. Key concepts:
- Policy — the strategy mapping states to actions.
- Value Function — how good is being in this state?
- Q-Function — how good is taking action A in state S?
- Exploration vs. Exploitation — balance trying new things vs. using known good strategies.
RL is used in: game playing (AlphaGo, OpenAI Five), robotics control, dialogue systems, personalised recommendations, and training LLMs via RLHF (Reinforcement Learning from Human Feedback).
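A minimal sketch of the tabular Q-learning update rule (the state/action sizes and hyperparameters are placeholders, not a full agent or environment):

```python
import numpy as np

n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))       # Q-function as a lookup table
alpha, gamma, epsilon = 0.1, 0.99, 0.1    # learning rate, discount factor, exploration rate
rng = np.random.default_rng(0)

def choose_action(state):
    # Exploration vs. exploitation: occasionally try a random action
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[state]))

def update(state, action, reward, next_state):
    # Move Q(s, a) toward the reward plus the discounted best future value
    target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (target - Q[state, action])
```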
Two modern extensions blur the boundaries: Semi-supervised learning uses a small labelled dataset plus a large unlabelled dataset — common when labelling is expensive. Self-supervised learning (used by GPT, BERT) creates its own labels from the data structure — e.g. mask a word in a sentence and predict it. This allows learning from internet-scale data without human annotation.
Features are the inputs your model uses to make predictions. Feature engineering is the art of transforming raw data into informative representations that algorithms can use effectively. It's one of the highest-leverage skills in practical ML.
- Normalisation / Standardisation — scale features to similar ranges. StandardScaler (z-score), MinMaxScaler (0–1). Critical for distance-based algorithms (kNN, SVM).
- One-Hot Encoding — convert categorical variables like "colour: red/blue/green" into binary columns. Avoids the model inferring false ordinal relationships.
- Ordinal Encoding — for categories with a natural order (low/medium/high → 0/1/2).
- Binning — convert continuous variables into discrete buckets. E.g. age → {child, teen, adult, senior}. Reduces overfitting on noisy numerical features.
- Log Transform — normalise skewed distributions. House prices, income, and click counts often follow power laws; log-transforming brings them closer to normal.
- Interaction Features — multiply or combine features to capture relationships. E.g. price_per_sqft = price / area.
- Date/Time Features — extract day of week, hour, is_weekend, days_since_last_event. Temporal patterns are often highly predictive.
- Handling Missing Values — impute with mean/median/mode, use an indicator column for "was this missing?", or use algorithms that handle NaNs natively (XGBoost).
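Several of these steps can be combined into a single scikit-learn preprocessing pipeline: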
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Separate numerical and categorical columns
num_cols = ['age', 'income', 'hours_per_week']
cat_cols = ['education', 'occupation', 'marital_status']

# Numerical: impute then scale
num_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Categorical: impute then one-hot encode
cat_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

# Combine into one preprocessor
preprocessor = ColumnTransformer([
    ('num', num_pipe, num_cols),
    ('cat', cat_pipe, cat_cols)
])
```
More features isn't always better. Irrelevant features add noise and can hurt model performance (the "curse of dimensionality"). Techniques to select the most important features:
- Correlation analysis — drop features highly correlated with each other.
- Feature importance — tree-based models (Random Forest, XGBoost) provide built-in scores.
- Recursive Feature Elimination (RFE) — iteratively remove least important features.
- L1 Regularisation (Lasso) — shrinks unimportant feature weights to exactly zero, effectively selecting features during training.
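A sketch of two of these approaches, assuming a numeric feature matrix X and a continuous target y:

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.ensemble import RandomForestRegressor

# L1 regularisation: features whose coefficients shrink to exactly zero are dropped
lasso = LassoCV(cv=5).fit(X, y)
kept = np.flatnonzero(lasso.coef_)
print(f"Lasso kept {len(kept)} of {X.shape[1]} features")

# Tree-based importance scores as an alternative ranking
forest = RandomForestRegressor(n_estimators=200, random_state=42).fit(X, y)
ranking = np.argsort(forest.feature_importances_)[::-1]
print("Features ranked by importance:", ranking[:10])
```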
A production ML project follows a structured lifecycle — from raw data to a monitored, deployed model:
Before writing a single line of code, define: What decision are we trying to automate? What does "good" look like — higher revenue, fewer fraud cases, faster diagnoses? What data exists and what data could be collected? What would a simple baseline look like (heuristic rules, human average)?
Understand your data before modelling. Check: class imbalance (if 99% of transactions are legitimate, a model predicting "always legit" scores 99% accuracy but is useless), data types, distributions, missing values, outliers, correlations. Tools: Pandas, Matplotlib, Seaborn, or Jupyter notebooks.
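A few pandas one-liners cover most of these checks (assuming the data sits in a DataFrame df with a target column; the file name is illustrative):

```python
import pandas as pd

df = pd.read_csv('transactions.csv')

print(df.dtypes)                                              # data types
print(df['target'].value_counts(normalize=True))              # class imbalance
print(df.isna().mean().sort_values(ascending=False).head())   # missing values
print(df.describe())                                          # distributions, outliers
print(df.corr(numeric_only=True)['target'].sort_values())     # correlations with the target
```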
Hyperparameters are settings you choose before training (number of trees, max depth, learning rate) — they aren't learned from data. Tuning strategies:
- Grid Search — exhaustively try all combinations. Thorough but slow.
- Random Search — sample random combinations. Often finds good results faster.
- Bayesian Optimisation (Optuna, Hyperopt) — model the search space and intelligently focus on promising regions.
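A sketch of random search with scikit-learn (the parameter ranges are illustrative):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    'n_estimators': [100, 200, 400],
    'max_depth': [4, 8, 16, None],
    'min_samples_leaf': [1, 5, 20],
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=20,               # sample 20 random combinations
    cv=5,
    scoring='roc_auc',
    random_state=42,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```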
Models degrade over time because the real world changes. Data drift means input feature distributions shift (e.g. user behaviour changes post-pandemic). Concept drift means the relationship between features and target changes (e.g. "what makes a good credit risk" changes during a recession). You need monitoring pipelines that alert you when prediction quality degrades and trigger retraining.
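One simple drift check is to compare each feature's training distribution against recent production data, for example with a two-sample Kolmogorov–Smirnov test (a sketch; the DataFrame names, feature list, and threshold are assumptions):

```python
from scipy.stats import ks_2samp

def check_drift(train_df, recent_df, features, alpha=0.01):
    drifted = []
    for col in features:
        # KS test: are the two samples plausibly from the same distribution?
        stat, p_value = ks_2samp(train_df[col].dropna(), recent_df[col].dropna())
        if p_value < alpha:
            drifted.append((col, stat))
    return drifted   # feed this into your alerting / retraining trigger

print(check_drift(train_df, recent_df, ['age', 'income', 'hours_per_week']))
```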
| Algorithm | Type | Strengths | Weaknesses |
|---|---|---|---|
| Linear / Logistic Regression | Supervised | Interpretable, fast, great baseline | Can't capture non-linear patterns |
| Decision Tree | Supervised | Interpretable, handles mixed data types | Prone to overfitting, unstable |
| Random Forest | Supervised | Robust, handles missing data, feature importance | Slower inference, less interpretable than a single tree |
| Gradient Boosting (XGBoost, LightGBM, CatBoost) | Supervised | State-of-the-art on tabular data, Kaggle default | Many hyperparameters, can overfit |
| Support Vector Machine (SVM) | Supervised | Excellent on high-dimensional data, kernel trick | Slow on large datasets, needs feature scaling |
| k-Nearest Neighbours (kNN) | Supervised | Simple, no training phase, naturally handles multi-class | Slow at inference, sensitive to irrelevant features |
| Naive Bayes | Supervised | Very fast, good for text classification, works with small data | Strong independence assumption often violated |
| k-Means Clustering | Unsupervised | Fast, scalable, interpretable clusters | Must specify k, assumes spherical clusters |
| DBSCAN | Unsupervised | Finds arbitrary shapes, identifies outliers | Struggles with varying density |
| PCA | Unsupervised | Removes noise, speeds up downstream models, enables visualisation | Linear only, hard to interpret components |
Ensemble methods combine multiple weak learners into one strong learner. Two key strategies:
- Bagging (Bootstrap Aggregating) — train many models on random subsets of data in parallel, average predictions. Reduces variance. Random Forest is the canonical bagging algorithm.
- Boosting — train models sequentially, each correcting the previous model's mistakes. Reduces bias. XGBoost, LightGBM, and CatBoost are boosting algorithms and dominate structured data competitions.
```python
import xgboost as xgb
from sklearn.model_selection import cross_val_score

# XGBoost handles unscaled features and missing values natively
model = xgb.XGBClassifier(
    n_estimators=300,
    max_depth=6,
    learning_rate=0.05,
    subsample=0.8,
    colsample_bytree=0.8,
    eval_metric='logloss',
    random_state=42
)

# 5-fold cross-validation — more reliable than a single train/test split
scores = cross_val_score(model, X, y, cv=5, scoring='roc_auc')
print(f"AUC: {scores.mean():.4f} ± {scores.std():.4f}")
```
Overfitting versus underfitting is the most critical concept in practical ML. A model that works perfectly on training data but fails on new data is useless. Understanding this tradeoff determines whether your model actually generalises.
Underfitting:
- Model is too simple for the data
- High training error AND high test error
- Cause: too few features, model too constrained
- Fix: use more complex model, add features
- Example: fitting a straight line to curved data
Overfitting:
- Model memorises training noise
- Very low training error, high test error
- Cause: too complex, too little data
- Fix: regularisation, more data, dropout, pruning
- Example: 100-degree polynomial fitting 10 data points
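A quick sketch of both failure modes, using polynomial degree as the complexity knob on synthetic curved data (purely illustrative):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=60)    # curved data plus noise
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):    # too simple, about right, too complex
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_tr, y_tr)
    print(degree,
          mean_squared_error(y_tr, model.predict(X_tr)),   # training error
          mean_squared_error(y_te, model.predict(X_te)))   # test error
```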
Regularisation deliberately constrains model complexity to prevent overfitting:
- L2 Regularisation (Ridge) — adds a penalty proportional to the sum of squared weights. Shrinks all weights toward zero but keeps all features. Default choice for most models.
- L1 Regularisation (Lasso) — adds a penalty proportional to the sum of absolute weights. Drives unimportant feature weights to exactly zero — built-in feature selection.
- Elastic Net — combination of L1 + L2. Best of both worlds.
- Dropout — randomly deactivate a fraction of neurons during each training step (neural networks only). Forces the network to learn redundant representations.
- Early Stopping — monitor validation loss during training; stop when it starts increasing. Simple and very effective.
- Cross-Validation — split data into k folds; train on k-1, validate on the remaining 1, rotate. More reliable estimate of real-world performance than a single train/test split.
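Ridge, Lasso, and Elastic Net are one-liners in scikit-learn (the alpha values controlling penalty strength are illustrative):

```python
from sklearn.linear_model import Ridge, Lasso, ElasticNet

ridge = Ridge(alpha=1.0)                      # L2: shrink all weights toward zero
lasso = Lasso(alpha=0.1)                      # L1: drive some weights to exactly zero
enet = ElasticNet(alpha=0.1, l1_ratio=0.5)    # mix of L1 and L2

for model in (ridge, lasso, enet):
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))   # R² on held-out data
```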
Choosing the right metric is critical. Accuracy is often misleading — a model predicting "no cancer" 100% of the time achieves 99% accuracy on a dataset where only 1% of patients have cancer, yet it's catastrophically wrong.
Starting from the confusion matrix (True Positives, True Negatives, False Positives, False Negatives):
| Metric | Formula | When to Use |
|---|---|---|
| Accuracy | (TP + TN) / Total | Balanced classes, equal cost of errors |
| Precision | TP / (TP + FP) | Cost of false positives is high (spam filter: don't mark legit as spam) |
| Recall (Sensitivity) | TP / (TP + FN) | Cost of false negatives is high (cancer detection: don't miss a case) |
| F1 Score | 2 × (P × R) / (P + R) | Imbalanced classes, when you care about both precision and recall |
| AUC-ROC | Area under ROC curve | Ranking quality; works well with imbalanced data |
| AUC-PR | Area under Precision-Recall curve | Highly imbalanced data (rare events: fraud, disease) |
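The classification metrics above, computed directly from confusion-matrix counts (a worked example with made-up counts for a fraud model):

```python
# Hypothetical counts: 1,000 transactions, 100 of them actually fraudulent
TP, FP, FN, TN = 80, 40, 20, 860

accuracy = (TP + TN) / (TP + TN + FP + FN)            # 0.94
precision = TP / (TP + FP)                            # ≈ 0.667
recall = TP / (TP + FN)                               # 0.80
f1 = 2 * precision * recall / (precision + recall)    # ≈ 0.727

print(accuracy, precision, recall, f1)
```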
For regression tasks:
| Metric | Notes |
|---|---|
| Mean Absolute Error (MAE) | Average absolute difference. Interpretable, robust to outliers. Same unit as target. |
| Root Mean Squared Error (RMSE) | Penalises large errors more heavily. Most commonly reported metric. |
| R² (R-squared) | Proportion of variance explained by the model. 1.0 = perfect, 0.0 = same as predicting the mean. |
| MAPE | Mean Absolute Percentage Error. Useful when relative error matters (e.g. sales forecasting). |
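A sketch computing these, assuming arrays y_test (actual values) and y_pred (model predictions):

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error, mean_squared_error,
    r2_score, mean_absolute_percentage_error
)

mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))     # RMSE = sqrt(MSE)
r2 = r2_score(y_test, y_pred)
mape = mean_absolute_percentage_error(y_test, y_pred)  # unreliable if y_test contains zeros

print(f"MAE: {mae:.3f}  RMSE: {rmse:.3f}  R²: {r2:.3f}  MAPE: {mape:.1%}")
```

For classification, scikit-learn also provides a detailed per-class report and ROC curve: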
```python
from sklearn.metrics import (
    classification_report, confusion_matrix,
    roc_auc_score, RocCurveDisplay
)
import matplotlib.pyplot as plt

# Predict probabilities for AUC
y_prob = model.predict_proba(X_test)[:, 1]
y_pred = model.predict(X_test)

# Detailed per-class report
print(classification_report(y_test, y_pred,
                            target_names=['Not Fraud', 'Fraud']))

# AUC score
auc = roc_auc_score(y_test, y_prob)
print(f"AUC-ROC: {auc:.4f}")

# Plot ROC curve
RocCurveDisplay.from_predictions(y_test, y_prob)
plt.title(f'ROC Curve (AUC = {auc:.3f})')
plt.show()
```