Machine Learning
How machines learn from data — covering types, the math behind learning, feature engineering, algorithms, evaluation, and where ML is applied across every industry.
Machine Learning (ML) is a branch of Artificial Intelligence where systems learn patterns from data rather than being explicitly programmed with rules. Instead of writing if/else logic for every possible scenario, you feed the algorithm labelled examples and it discovers the rules itself.
ML emerged as a formal discipline in the 1950s, but the modern revolution was driven by three converging forces: massive datasets (internet-scale), cheap compute (GPUs), and better algorithms. Today, ML is the engine behind spam filters, Netflix recommendations, fraud detection, voice assistants, protein folding, medical imaging, and much more.
Traditional programming:
- Developer writes explicit rules by hand
- Rules are rigid and fragile
- Breaks on edge cases or new data
- Great for deterministic, rule-based logic
- Auditable: you know exactly why it decided
Machine learning:
- Algorithm learns rules directly from data
- Generalises to new, unseen examples
- Handles messy, real-world complexity
- Great for pattern matching and prediction
- Can be opaque (black box) without interpretability tools
ML is the right tool when: the problem involves complex patterns too hard to code manually, you have enough labelled data, and some prediction error is acceptable. ML is not the right tool when: you need guaranteed correct output (use deterministic code), you have very little data, or when a simple rule suffices (don't over-engineer).
At its core, ML is an optimisation problem. A model starts with random parameters, makes predictions, measures how wrong it is using a loss function, then adjusts its parameters to reduce that loss. This cycle repeats thousands of times.
A loss function measures prediction error. Common choices include:
- Mean Squared Error (MSE) — for regression: average of squared differences between predicted and actual values. Penalises large errors heavily.
- Cross-Entropy Loss — for classification: measures how far a predicted probability distribution is from the true label. Used in logistic regression and neural networks.
- Hinge Loss — used in Support Vector Machines; maximises the margin between classes.
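A minimal NumPy sketch of the first two losses (variable names are illustrative):

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean Squared Error: average of squared differences
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    # Cross-entropy for binary labels (0/1) against predicted probabilities
    p = np.clip(p_pred, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

print(mse(np.array([3.0, 5.0]), np.array([2.5, 6.0])))               # 0.625
print(binary_cross_entropy(np.array([1, 0]), np.array([0.9, 0.2])))  # ≈ 0.164
```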
Gradient descent is the core optimisation algorithm. Think of it as trying to find the bottom of a hilly landscape (the loss surface) by always stepping in the direction of steepest descent.
```python
import numpy as np

# Concrete instance: linear regression with MSE loss, full-batch gradient descent
# w = model weights, lr = learning rate
w = np.zeros(X.shape[1])
lr = 0.01
for epoch in range(num_epochs):
    predictions = X @ w                               # make predictions
    loss = np.mean((predictions - y) ** 2)            # measure error
    gradients = 2 * X.T @ (predictions - y) / len(y)  # direction of steepest increase
    w = w - lr * gradients                            # step in the opposite direction
    # repeat until loss stops decreasing
```
Three variants matter in practice:
- Batch GD — uses all training examples per update. Stable, but slow on large datasets.
- Stochastic GD (SGD) — updates on one example at a time. Fast but noisy.
- Mini-batch GD — the most common: updates on small batches (32–256 samples). Best of both worlds.
The learning rate (lr) controls how big each step is. Too large → overshoot the minimum and diverge. Too small → training takes forever. In practice, use adaptive optimisers like Adam or AdaGrad that automatically tune the learning rate per parameter.
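Continuing the sketch above, mini-batch gradient descent shuffles the data each epoch and takes one step per batch (the batch size here is illustrative):

```python
rng = np.random.default_rng(0)
batch_size = 64
for epoch in range(num_epochs):
    idx = rng.permutation(len(y))                  # reshuffle every epoch
    for start in range(0, len(y), batch_size):
        batch = idx[start:start + batch_size]      # indices of one mini-batch
        Xb, yb = X[batch], y[batch]
        preds = Xb @ w
        grads = 2 * Xb.T @ (preds - yb) / len(yb)
        w = w - lr * grads                         # one update per mini-batch
```

In a framework such as PyTorch or Keras you would normally replace the manual update with an optimiser like Adam.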
Supervised learning: you provide labelled training pairs — e.g. 50,000 emails each tagged as "spam" or "not spam". The algorithm learns a mapping from email features to the label. Two sub-tasks:
- Classification — predict a discrete category. Binary: spam/not-spam, fraud/legit. Multi-class: cat/dog/bird. Multi-label: a photo can be tagged {outdoors, people, sports} simultaneously.
- Regression — predict a continuous number. Examples: house price prediction, temperature forecasting, stock return estimation, patient length-of-stay.
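A minimal scikit-learn sketch of both sub-tasks, assuming a feature matrix X with a categorical target y_label and a continuous target y_price (both names are illustrative):

```python
from sklearn.linear_model import LogisticRegression, LinearRegression

# Classification: predict a discrete label (e.g. spam / not spam)
clf = LogisticRegression(max_iter=1000).fit(X, y_label)
print(clf.predict(X[:5]))      # discrete classes

# Regression: predict a continuous number (e.g. house price)
reg = LinearRegression().fit(X, y_price)
print(reg.predict(X[:5]))      # continuous values
```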
Unsupervised learning: no labels are needed. The algorithm must find structure on its own. Key tasks:
- Clustering (k-Means, DBSCAN, Hierarchical) — group similar items together. Used for customer segmentation, document grouping, image compression.
- Dimensionality Reduction (PCA, t-SNE, UMAP) — compress high-dimensional data into 2–3 dimensions for visualisation or to remove noise before feeding into a classifier.
- Anomaly Detection — identify unusual data points. Used for network intrusion detection, manufacturing quality control, financial fraud.
- Association Rule Mining (Apriori) — find co-occurrence patterns in transaction data. The classic example: "customers who buy diapers also buy beer".
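For example, clustering with k-Means (a sketch; the number of clusters is a choice you must make, here set to 4):

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X)       # distance-based: scale features first
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_scaled)              # cluster id for every row
print(labels[:10])
```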
Reinforcement Learning (RL) is conceptually different from supervised and unsupervised learning. An agent observes a state, takes an action, receives a reward signal, and updates its policy (strategy) to maximise long-term reward. Key concepts:
- Policy — the strategy mapping states to actions.
- Value Function — how good is being in this state?
- Q-Function — how good is taking action A in state S?
- Exploration vs. Exploitation — balance trying new things vs. using known good strategies.
RL is used in: game playing (AlphaGo, OpenAI Five), robotics control, dialogue systems, personalised recommendations, and training LLMs via RLHF (Reinforcement Learning from Human Feedback).
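A minimal sketch of the tabular Q-learning update rule (the state/action sizes and hyperparameters are placeholders, not a full agent or environment):

```python
import numpy as np

n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))       # Q-function as a lookup table
alpha, gamma, epsilon = 0.1, 0.99, 0.1    # learning rate, discount factor, exploration rate
rng = np.random.default_rng(0)

def choose_action(state):
    # Exploration vs. exploitation: occasionally try a random action
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[state]))

def update(state, action, reward, next_state):
    # Move Q(s, a) toward the reward plus the discounted best future value
    target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (target - Q[state, action])
```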
Two modern extensions blur the boundaries: Semi-supervised learning uses a small labelled dataset plus a large unlabelled dataset — common when labelling is expensive. Self-supervised learning (used by GPT, BERT) creates its own labels from the data structure — e.g. mask a word in a sentence and predict it. This allows learning from internet-scale data without human annotation.
Features are the inputs your model uses to make predictions. Feature engineering is the art of transforming raw data into informative representations that algorithms can use effectively. It's one of the highest-leverage skills in practical ML.
- Normalisation / Standardisation — scale features to similar ranges. StandardScaler (z-score), MinMaxScaler (0–1). Critical for distance-based algorithms (kNN, SVM).
- One-Hot Encoding — convert categorical variables like "colour: red/blue/green" into binary columns. Avoids the model inferring false ordinal relationships.
- Ordinal Encoding — for categories with a natural order (low/medium/high → 0/1/2).
- Binning — convert continuous variables into discrete buckets. E.g. age → {child, teen, adult, senior}. Reduces overfitting on noisy numerical features.
- Log Transform — normalise skewed distributions. House prices, income, and click counts often follow power laws; log-transforming brings them closer to normal.
- Interaction Features — multiply or combine features to capture relationships. E.g. price_per_sqft = price / area.
- Date/Time Features — extract day of week, hour, is_weekend, days_since_last_event. Temporal patterns are often highly predictive.
- Handling Missing Values — impute with mean/median/mode, use an indicator column for "was this missing?", or use algorithms that handle NaNs natively (XGBoost).
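Several of these steps can be combined into a single scikit-learn preprocessing pipeline: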
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Separate numerical and categorical columns
num_cols = ['age', 'income', 'hours_per_week']
cat_cols = ['education', 'occupation', 'marital_status']

# Numerical: impute then scale
num_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Categorical: impute then one-hot encode
cat_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

# Combine into one preprocessor
preprocessor = ColumnTransformer([
    ('num', num_pipe, num_cols),
    ('cat', cat_pipe, cat_cols)
])
```
More features isn't always better. Irrelevant features add noise and can hurt model performance (the "curse of dimensionality"). Techniques to select the most important features:
- Correlation analysis — drop features highly correlated with each other.
- Feature importance — tree-based models (Random Forest, XGBoost) provide built-in scores.
- Recursive Feature Elimination (RFE) — iteratively remove least important features.
- L1 Regularisation (Lasso) — shrinks unimportant feature weights to exactly zero, effectively selecting features during training.
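A sketch of two of these approaches, assuming a numeric feature matrix X and a continuous target y:

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.ensemble import RandomForestRegressor

# L1 regularisation: features whose coefficients shrink to exactly zero are dropped
lasso = LassoCV(cv=5).fit(X, y)
kept = np.flatnonzero(lasso.coef_)
print(f"Lasso kept {len(kept)} of {X.shape[1]} features")

# Tree-based importance scores as an alternative ranking
forest = RandomForestRegressor(n_estimators=200, random_state=42).fit(X, y)
ranking = np.argsort(forest.feature_importances_)[::-1]
print("Features ranked by importance:", ranking[:10])
```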
A production ML project follows a structured lifecycle — from raw data to a monitored, deployed model:
Before writing a single line of code, define: What decision are we trying to automate? What does "good" look like — higher revenue, fewer fraud cases, faster diagnoses? What data exists and what data could be collected? What would a simple baseline look like (heuristic rules, human average)?
Understand your data before modelling. Check: class imbalance (if 99% of transactions are legitimate, a model predicting "always legit" scores 99% accuracy but is useless), data types, distributions, missing values, outliers, correlations. Tools: Pandas, Matplotlib, Seaborn, or Jupyter notebooks.
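A few pandas one-liners cover most of these checks (assuming the data sits in a DataFrame df with a target column; the file name is illustrative):

```python
import pandas as pd

df = pd.read_csv('transactions.csv')

print(df.dtypes)                                              # data types
print(df['target'].value_counts(normalize=True))              # class imbalance
print(df.isna().mean().sort_values(ascending=False).head())   # missing values
print(df.describe())                                          # distributions, outliers
print(df.corr(numeric_only=True)['target'].sort_values())     # correlations with the target
```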
Hyperparameters are settings you choose before training (number of trees, max depth, learning rate) — they aren't learned from data. Tuning strategies:
- Grid Search — exhaustively try all combinations. Thorough but slow.
- Random Search — sample random combinations. Often finds good results faster.
- Bayesian Optimisation (Optuna, Hyperopt) — model the search space and intelligently focus on promising regions.
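A sketch of random search with scikit-learn (the parameter ranges are illustrative):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    'n_estimators': [100, 200, 400],
    'max_depth': [4, 8, 16, None],
    'min_samples_leaf': [1, 5, 20],
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=20,               # sample 20 random combinations
    cv=5,
    scoring='roc_auc',
    random_state=42,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```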
Models degrade over time because the real world changes. Data drift means input feature distributions shift (e.g. user behaviour changes post-pandemic). Concept drift means the relationship between features and target changes (e.g. "what makes a good credit risk" changes during a recession). You need monitoring pipelines that alert you when prediction quality degrades and trigger retraining.
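One simple drift check is to compare each feature's training distribution against recent production data, for example with a two-sample Kolmogorov–Smirnov test (a sketch; the DataFrame names, feature list, and threshold are assumptions):

```python
from scipy.stats import ks_2samp

def check_drift(train_df, recent_df, features, alpha=0.01):
    drifted = []
    for col in features:
        # KS test: are the two samples plausibly from the same distribution?
        stat, p_value = ks_2samp(train_df[col].dropna(), recent_df[col].dropna())
        if p_value < alpha:
            drifted.append((col, stat))
    return drifted   # feed this into your alerting / retraining trigger

print(check_drift(train_df, recent_df, ['age', 'income', 'hours_per_week']))
```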
| Algorithm | Type | Strengths | Weaknesses |
|---|---|---|---|
| Linear / Logistic Regression | Supervised | Interpretable, fast, great baseline | Can't capture non-linear patterns |
| Decision Tree | Supervised | Interpretable, handles mixed data types | Prone to overfitting, unstable |
| Random Forest | Supervised | Robust, handles missing data, feature importance | Slower inference, less interpretable than a single tree |
| Gradient Boosting (XGBoost, LightGBM, CatBoost) | Supervised | State-of-the-art on tabular data, Kaggle default | Many hyperparameters, can overfit |
| Support Vector Machine (SVM) | Supervised | Excellent on high-dimensional data, kernel trick | Slow on large datasets, needs feature scaling |
| k-Nearest Neighbours (kNN) | Supervised | Simple, no training phase, naturally handles multi-class | Slow at inference, sensitive to irrelevant features |
| Naive Bayes | Supervised | Very fast, good for text classification, works with small data | Strong independence assumption often violated |
| k-Means Clustering | Unsupervised | Fast, scalable, interpretable clusters | Must specify k, assumes spherical clusters |
| DBSCAN | Unsupervised | Finds arbitrary shapes, identifies outliers | Struggles with varying density |
| PCA | Unsupervised | Removes noise, speeds up downstream models, enables visualisation | Linear only, hard to interpret components |
Ensemble methods combine multiple weak learners into one strong learner. Two key strategies:
- Bagging (Bootstrap Aggregating) — train many models on random subsets of data in parallel, average predictions. Reduces variance. Random Forest is the canonical bagging algorithm.
- Boosting — train models sequentially, each correcting the previous model's mistakes. Reduces bias. XGBoost, LightGBM, and CatBoost are boosting algorithms and dominate structured data competitions.
```python
import xgboost as xgb
from sklearn.model_selection import cross_val_score

# XGBoost handles unscaled features and missing values natively
model = xgb.XGBClassifier(
    n_estimators=300,
    max_depth=6,
    learning_rate=0.05,
    subsample=0.8,
    colsample_bytree=0.8,
    eval_metric='logloss',
    random_state=42
)

# 5-fold cross-validation — more reliable than a single train/test split
scores = cross_val_score(model, X, y, cv=5, scoring='roc_auc')
print(f"AUC: {scores.mean():.4f} ± {scores.std():.4f}")
```
Overfitting versus underfitting is the most critical concept in practical ML. A model that works perfectly on training data but fails on new data is useless. Understanding this tradeoff determines whether your model actually generalises.
Underfitting:
- Model is too simple for the data
- High training error AND high test error
- Cause: too few features, model too constrained
- Fix: use more complex model, add features
- Example: fitting a straight line to curved data
Overfitting:
- Model memorises training noise
- Very low training error, high test error
- Cause: too complex, too little data
- Fix: regularisation, more data, dropout, pruning
- Example: 100-degree polynomial fitting 10 data points
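A quick sketch of both failure modes, using polynomial degree as the complexity knob on synthetic curved data (purely illustrative):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=60)    # curved data plus noise
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):    # too simple, about right, too complex
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_tr, y_tr)
    print(degree,
          mean_squared_error(y_tr, model.predict(X_tr)),   # training error
          mean_squared_error(y_te, model.predict(X_te)))   # test error
```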
Regularisation deliberately constrains model complexity to prevent overfitting:
- L2 Regularisation (Ridge) — adds a penalty proportional to the sum of squared weights. Shrinks all weights toward zero but keeps all features. Default choice for most models.
- L1 Regularisation (Lasso) — adds a penalty proportional to the sum of absolute weights. Drives unimportant feature weights to exactly zero — built-in feature selection.
- Elastic Net — combination of L1 + L2. Best of both worlds.
- Dropout — randomly deactivate a fraction of neurons during each training step (neural networks only). Forces the network to learn redundant representations.
- Early Stopping — monitor validation loss during training; stop when it starts increasing. Simple and very effective.
- Cross-Validation — split data into k folds; train on k-1, validate on the remaining 1, rotate. More reliable estimate of real-world performance than a single train/test split.
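Ridge, Lasso, and Elastic Net are one-liners in scikit-learn (the alpha values controlling penalty strength are illustrative):

```python
from sklearn.linear_model import Ridge, Lasso, ElasticNet

ridge = Ridge(alpha=1.0)                      # L2: shrink all weights toward zero
lasso = Lasso(alpha=0.1)                      # L1: drive some weights to exactly zero
enet = ElasticNet(alpha=0.1, l1_ratio=0.5)    # mix of L1 and L2

for model in (ridge, lasso, enet):
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))   # R² on held-out data
```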
Choosing the right metric is critical. Accuracy is often misleading — a model predicting "no cancer" 100% of the time achieves 99% accuracy on a dataset where only 1% of patients have cancer, yet it's catastrophically wrong.
Starting from the confusion matrix (True Positives, True Negatives, False Positives, False Negatives):
| Metric | Formula | When to Use |
|---|---|---|
| Accuracy | (TP + TN) / Total | Balanced classes, equal cost of errors |
| Precision | TP / (TP + FP) | Cost of false positives is high (spam filter: don't mark legit as spam) |
| Recall (Sensitivity) | TP / (TP + FN) | Cost of false negatives is high (cancer detection: don't miss a case) |
| F1 Score | 2 × (P × R) / (P + R) | Imbalanced classes, when you care about both precision and recall |
| AUC-ROC | Area under ROC curve | Ranking quality; works well with imbalanced data |
| AUC-PR | Area under Precision-Recall curve | Highly imbalanced data (rare events: fraud, disease) |
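The classification metrics above, computed directly from confusion-matrix counts (a worked example with made-up counts for a fraud model):

```python
# Hypothetical counts: 1,000 transactions, 100 of them actually fraudulent
TP, FP, FN, TN = 80, 40, 20, 860

accuracy = (TP + TN) / (TP + TN + FP + FN)            # 0.94
precision = TP / (TP + FP)                            # ≈ 0.667
recall = TP / (TP + FN)                               # 0.80
f1 = 2 * precision * recall / (precision + recall)    # ≈ 0.727

print(accuracy, precision, recall, f1)
```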
For regression tasks:
| Metric | Notes |
|---|---|
| Mean Absolute Error (MAE) | Average absolute difference. Interpretable, robust to outliers. Same unit as target. |
| Root Mean Squared Error (RMSE) | Penalises large errors more heavily. Most commonly reported metric. |
| R² (R-squared) | Proportion of variance explained by the model. 1.0 = perfect, 0.0 = same as predicting the mean. |
| MAPE | Mean Absolute Percentage Error. Useful when relative error matters (e.g. sales forecasting). |
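A sketch computing these, assuming arrays y_test (actual values) and y_pred (model predictions):

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error, mean_squared_error,
    r2_score, mean_absolute_percentage_error
)

mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))     # RMSE = sqrt(MSE)
r2 = r2_score(y_test, y_pred)
mape = mean_absolute_percentage_error(y_test, y_pred)  # unreliable if y_test contains zeros

print(f"MAE: {mae:.3f}  RMSE: {rmse:.3f}  R²: {r2:.3f}  MAPE: {mape:.1%}")
```

For classification, scikit-learn also provides a detailed per-class report and ROC curve: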
```python
from sklearn.metrics import (
    classification_report, confusion_matrix,
    roc_auc_score, RocCurveDisplay
)
import matplotlib.pyplot as plt

# Predict probabilities for AUC
y_prob = model.predict_proba(X_test)[:, 1]
y_pred = model.predict(X_test)

# Detailed per-class report
print(classification_report(y_test, y_pred,
                            target_names=['Not Fraud', 'Fraud']))

# AUC score
auc = roc_auc_score(y_test, y_prob)
print(f"AUC-ROC: {auc:.4f}")

# Plot ROC curve
RocCurveDisplay.from_predictions(y_test, y_prob)
plt.title(f'ROC Curve (AUC = {auc:.3f})')
plt.show()
```