Build a Fraud Detection Classifier
Learning Outcomes
- Generate and explore a synthetic bank-style payment transaction dataset
- Handle extreme class imbalance (fraud is roughly 1% of transactions in this lab's synthetic data)
- Train a production-style Random Forest fraud classifier
- Interpret precision, recall, F1 and why accuracy is misleading for fraud
- Tune the decision threshold to minimise false negatives (missed fraud)
- Save and reload the trained fraud model with joblib
Prerequisites
No API key needed
This lab uses only open-source libraries. No external API keys required.
Build the DataFrame, then inspect it with df.head() and df.describe() and plot a class distribution bar chart. Split with train_test_split using stratify; the test data must stay unseen until the final evaluation.

# ── SETUP ──────────────────────────────────────────
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, precision_recall_curve
import joblib

# ════════════════════════════════════════════════════════════════
# STEP 1: SHARED BANK DATA GENERATOR (run this cell first)
# Labs 1, 2 and 3 all use the same 10,000 synthetic banking
# customers so your work is consistent across the ML track.
# seed=42 → same results every run, no downloads needed.
# ════════════════════════════════════════════════════════════════
np.random.seed(42)
N = 10_000

customer_id = [f"CUS_{i:05d}" for i in range(N)]
age = np.random.randint(22, 72, N)
annual_income = np.random.lognormal(11, 0.55, N).clip(18_000, 480_000)
years_employed = np.clip(np.random.exponential(6, N), 0, 40)
dti_ratio = np.random.uniform(0.05, 0.65, N)  # customer debt-to-income burden
num_accounts = np.random.randint(1, 9, N)
total_debt = annual_income * dti_ratio * np.random.uniform(3, 8, N)

# Credit score: income + tenure + low DTI → higher score (spans 300–850)
income_pct = np.log1p(annual_income - 18_000) / np.log1p(462_000)
employ_pct = np.minimum(years_employed / 25, 1.0)
credit_score = (
    300
    + income_pct * 180
    + employ_pct * 120
    + (1 - np.minimum(dti_ratio / 0.65, 1.0)) * 200
    + np.random.normal(0, 25, N)
).clip(300, 850).astype(int)

customers = pd.DataFrame({
    'customer_id': customer_id,
    'age': age,
    'annual_income': annual_income.round(2),
    'years_employed': years_employed.round(1),
    'dti_ratio': dti_ratio.round(4),
    'num_accounts': num_accounts,
    'total_debt': total_debt.round(2),
    'credit_score': credit_score,
})
print(f"OK {len(customers):,} customers | income ${customers.annual_income.median():,.0f} median | score {customers.credit_score.mean():.0f} avg (range {customers.credit_score.min()}-{customers.credit_score.max()})")

# ── STEP 2: LAB 1 VIEW, derive payment transactions from shared customers ──
# Each customer makes 1-5 transactions; low credit score skews toward fraud patterns
tx_rows, tx_types_pool = [], ['CASH_OUT', 'PAYMENT', 'CASH_IN', 'TRANSFER', 'DEBIT']
for _, cust in customers.iterrows():
    for _ in range(np.random.randint(1, 6)):
        tx_type = np.random.choice(tx_types_pool, p=[0.35, 0.34, 0.22, 0.06, 0.03])
        amount = float(np.random.lognormal(5.5, 1.8))
        old_orig = cust['annual_income'] / 12 * np.random.uniform(0.5, 3)
        new_orig = max(0, old_orig - amount)
        old_dest = float(np.random.lognormal(6, 2))
        # Risk rises with a low credit score, a risky transaction type,
        # and an origin account that is almost emptied by the transaction.
        risk_score = ((1 - cust['credit_score'] / 850) * 0.5
                      + (0.3 if tx_type in ['TRANSFER', 'CASH_OUT'] else 0)
                      + (0.2 if new_orig < old_orig * 0.05 else 0))
        fraud = int(np.random.rand() < risk_score * 0.04)  # ~1.3% fraud rate
        tx_rows.append({
            'customer_id': cust['customer_id'],
            'type': tx_type,
            'amount': round(amount, 2),
            'oldbalanceOrg': round(old_orig, 2),
            'newbalanceOrig': round(new_orig, 2),
            'oldbalanceDest': round(old_dest, 2),
            'newbalanceDest': round(old_dest + amount, 2),
            'isFraud': fraud,
        })

df = pd.DataFrame(tx_rows)
print(f"Transactions: {len(df):,} | Fraud rate: {df['isFraud'].mean():.4%}")

df['type'].value_counts().plot(
    kind='bar',
    title='Transaction Types: retail-bank style (same customers as Labs 2 & 3)')
plt.show()
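# Features used in real fraud scoring systems
features = ['amount', 'oldbalanceOrg', 'newbalanceOrig',
            'oldbalanceDest', 'newbalanceDest', 'step']

# The generator above does not create a 'step' column, so df[features] would
# raise a KeyError. Assumption: treat 'step' as a PaySim-style time step (hour
# of a simulated month) and fill it with placeholder values. It carries no
# fraud signal in this synthetic data.
if 'step' not in df.columns:
    df['step'] = np.random.randint(1, 744, len(df))

# Encode transaction type
df['type_encoded'] = df['type'].map(
    {'CASH_OUT': 0, 'PAYMENT': 1, 'CASH_IN': 2, 'TRANSFER': 3, 'DEBIT': 4})
features.append('type_encoded')

X = df[features]
y = df['isFraud']

# Stratify to keep fraud ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
print(f"Train fraud cases: {y_train.sum()} / {len(y_train):,}")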
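To see what stratify buys you, here is a minimal sketch on toy labels (the arrays and the `_toy` names are illustrative assumptions, not the lab's data). It compares a stratified split with an unstratified one on 1,000 rows that contain exactly 10 positives.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data, illustrative only: 1,000 rows with exactly 10 positives (1% "fraud").
y_toy = np.zeros(1000, dtype=int)
y_toy[:10] = 1
X_toy = np.arange(1000).reshape(-1, 1)

# Stratified split: train and test both keep the 1% positive rate.
_, _, y_tr_s, y_te_s = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy)

# Unstratified split: the number of positives in the test set depends on the shuffle.
_, _, y_tr_u, y_te_u = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0)

print("stratified   test positives:", int(y_te_s.sum()), "of", len(y_te_s))
print("unstratified test positives:", int(y_te_u.sum()), "of", len(y_te_u))
```

With so few positives, an unlucky unstratified shuffle can leave the test set with almost no fraud cases, which makes recall estimates very noisy.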
# Baseline: always predict "not fraud" (deceptively high accuracy, zero fraud caught)
baseline_acc = 1 - y_test.mean()
print(f"Naive baseline accuracy: {baseline_acc:.4%}")

# Random Forest with class_weight to penalise missed fraud
rf = RandomForestClassifier(
    n_estimators=100,
    class_weight='balanced',  # critical for imbalanced data
    random_state=42,
    n_jobs=-1
)
rf.fit(X_train, y_train)

preds = rf.predict(X_test)
print(classification_report(y_test, preds, target_names=['Legit', 'Fraud']))
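The two ideas in the cell above can be checked by hand. The sketch below uses toy labels (an assumption, 990 legitimate and 10 fraudulent) to show that an always-legit predictor scores 99% accuracy while catching no fraud, and to reproduce the weights that scikit-learn documents for class_weight='balanced': n_samples / (n_classes * np.bincount(y)).

```python
import numpy as np

# Toy labels, illustrative only: 990 legitimate (0) and 10 fraudulent (1).
y_toy = np.array([0] * 990 + [1] * 10)

# A "model" that always predicts legit.
always_legit = np.zeros_like(y_toy)
accuracy = (always_legit == y_toy).mean()
fraud_caught = int(((always_legit == 1) & (y_toy == 1)).sum())
recall = fraud_caught / int(y_toy.sum())

# 'balanced' weights: n_samples / (n_classes * count of each class).
counts = np.bincount(y_toy)
balanced_weights = len(y_toy) / (len(counts) * counts)

print(f"accuracy={accuracy:.2%}  fraud recall={recall:.0%}")
print(f"weights: legit={balanced_weights[0]:.3f}, fraud={balanced_weights[1]:.1f}")
```

Each fraud row ends up weighted roughly 100 times more than a legitimate row, so misclassifying fraud costs the trees far more during training.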
# Default threshold (0.5) vs lower threshold (catches more fraud)
probs = rf.predict_proba(X_test)[:, 1]
for threshold in [0.5, 0.3, 0.2]:
    preds_t = (probs >= threshold).astype(int)
    cm = confusion_matrix(y_test, preds_t)
    fn = cm[1, 0]  # false negatives = missed fraud
    print(f"Threshold {threshold} → Missed fraud: {fn} | False alerts: {cm[0, 1]}")

# What signals matter most to the fraud model?
feat_imp = pd.Series(rf.feature_importances_, index=features)
feat_imp.sort_values().plot(kind='barh', title='Fraud Signal Importance (bank-style)')
plt.tight_layout()
plt.show()
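Instead of trying a few fixed thresholds, you can start from a recall target and let precision_recall_curve (imported in the setup cell) find the cut-off. This is a minimal sketch on toy scores; the arrays and the 0.95 recall target are illustrative assumptions, not output from the lab's model.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Toy labels and scores, illustrative only.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
scores = np.array([0.05, 0.10, 0.15, 0.20, 0.25, 0.35, 0.30, 0.45, 0.60, 0.90])

precision, recall, thresholds = precision_recall_curve(y_true, scores)

# precision and recall carry one extra trailing point (recall 0) that has no
# threshold, so drop it to line the three arrays up.
target_recall = 0.95  # hypothetical business requirement
meets_target = recall[:-1] >= target_recall

# The highest threshold that still meets the target gives the fewest false alerts.
chosen = thresholds[meets_target].max()

preds = (scores >= chosen).astype(int)
missed_fraud = int(((preds == 0) & (y_true == 1)).sum())
false_alerts = int(((preds == 1) & (y_true == 0)).sum())
print(f"threshold={chosen:.2f}  missed fraud={missed_fraud}  false alerts={false_alerts}")
```

On the lab's data you would pass y_test and probs in place of the toy arrays. Choosing the threshold on the test set leaks information, so a separate validation split is the safer place to do it.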
joblib.dump(rf, 'fraud_detector.joblib')

# Score a new transaction (like a bank's real-time fraud API)
loaded_model = joblib.load('fraud_detector.joblib')
new_txn = pd.DataFrame([{
    'amount': 9999.99,
    'oldbalanceOrg': 10000,
    'newbalanceOrig': 0,
    'oldbalanceDest': 0,
    'newbalanceDest': 9999.99,
    'step': 1,
    'type_encoded': 3  # TRANSFER
}])
prob = loaded_model.predict_proba(new_txn)[0, 1]
print(f"Fraud probability: {prob:.2%} → {'🚨 FLAG FOR REVIEW' if prob > 0.3 else '✅ APPROVE'}")
TODO Tasks (Complete before moving on)
- Filter the dataset to only TRANSFER and CASH_OUT transactions (where fraud is concentrated). How does recall change?
- Try thresholds of 0.1, 0.2, 0.3, 0.4, 0.5. Plot precision vs recall at each. Which threshold would a bank use to minimise chargebacks?
- Add two engineered features: balance_diff_orig and balance_diff_dest. Does fraud F1 improve?
- Use cross_val_score with scoring='f1' (5 folds). Why is accuracy the wrong metric here?
- Apply SMOTE (imblearn.over_sampling) to oversample fraud cases. Compare F1 vs class_weight='balanced'.
- Train an XGBoost classifier (xgboost). XGBoost is widely used by major banks for production fraud models; compare its recall to Random Forest.
- Build a precision-recall AUC comparison chart across Logistic Regression, Random Forest, and XGBoost.
- Advanced: Simulate a real-time scoring API — wrap the model in a function that takes a transaction dict, returns risk score + recommended action (approve/review/decline).
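As one possible starting point for the advanced task, here is a hedged sketch of a scoring wrapper. The names score_transaction, FixedRiskModel, REVIEW_AT and DECLINE_AT are hypothetical. The 0.3 review cut-off mirrors the threshold used earlier in this lab, while the 0.7 decline cut-off is an assumption you should tune against real costs. FixedRiskModel is only a stand-in so the sketch runs without training; in the lab you would pass the model loaded with joblib.

```python
import numpy as np
import pandas as pd

REVIEW_AT = 0.3   # mirrors the review threshold used earlier in the lab
DECLINE_AT = 0.7  # assumption: hypothetical cut-off for an outright decline


def score_transaction(model, txn: dict, feature_order: list) -> dict:
    """Return the fraud probability and a recommended action for one transaction."""
    # Build a one-row frame and enforce the column order used at training time.
    row = pd.DataFrame([txn])[feature_order]
    risk = float(model.predict_proba(row)[0, 1])
    if risk >= DECLINE_AT:
        action = "decline"
    elif risk >= REVIEW_AT:
        action = "review"
    else:
        action = "approve"
    return {"risk_score": risk, "action": action}


class FixedRiskModel:
    """Stand-in for a fitted classifier: always returns the same fraud probability."""

    def __init__(self, risk: float):
        self.risk = risk

    def predict_proba(self, X):
        return np.array([[1 - self.risk, self.risk]] * len(X))


features = ['amount', 'oldbalanceOrg', 'newbalanceOrig',
            'oldbalanceDest', 'newbalanceDest', 'step', 'type_encoded']
txn = {'amount': 9999.99, 'oldbalanceOrg': 10000, 'newbalanceOrig': 0,
       'oldbalanceDest': 0, 'newbalanceDest': 9999.99, 'step': 1,
       'type_encoded': 3}

for risk in (0.05, 0.45, 0.90):
    print(score_transaction(FixedRiskModel(risk), txn, features))
```

Keeping the cut-offs as named constants makes it easy to revisit them once you have compared the cost of a missed fraud with the cost of a false alert.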
🗺️ This lab maps to Step 3 — Model Training Fundamentals on the ML Engineer Roadmap.