Build ML Projects That Get You Hired
10 real-world ML projects with curated datasets, step-by-step implementation plans, and portfolio tips. Each one is designed around the skills hiring managers actually look for — from your first regression to production-grade XGBoost pipelines.
Customer Churn Prediction
Predict which telecom customers will cancel their subscription using behavioural and account data. A classic binary classification problem that appears in nearly every ML interview.
Classification
Beginner
Logistic Regression · Random Forest · Feature Engineering
Dataset
Telco Customer Churn — Kaggle
Skills you'll practise
pandas · scikit-learn · Logistic Regression · Random Forest · EDA · Feature Engineering · ROC-AUC · Confusion Matrix
What to showcase
Class imbalance handling with SMOTE or class_weight
Feature importance chart from Random Forest
ROC curve comparison: Logistic vs RF vs XGBoost
Implementation Steps
1. EDA — visualise churn rate, plot distributions, identify missing values and outliers
2. Feature Engineering — encode categoricals, create tenure buckets, scale numerics
3. Baseline Model — fit Logistic Regression, evaluate with accuracy, precision, recall, F1
4. Handle Imbalance — apply SMOTE oversampling or class_weight='balanced' and compare results (see the sketch below)
5. Improve with RF — train Random Forest, tune n_estimators and max_depth via GridSearchCV
6. Summarise — plot the ROC curve and feature importances, write up the business impact of each churn segment
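To make steps 3 and 4 concrete, here is a minimal sketch of the class_weight route. The file path is hypothetical, and the Churn and customerID column names assume the standard Telco CSV; adjust both to your download.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Hypothetical path and column names: match them to the Kaggle file
df = pd.read_csv("data/telco_churn.csv")
y = (df["Churn"] == "Yes").astype(int)
X = pd.get_dummies(df.drop(columns=["Churn", "customerID"]), drop_first=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# class_weight='balanced' re-weights the loss so the minority (churn) class
# counts as much as the majority class: the simplest alternative to SMOTE
clf = LogisticRegression(max_iter=1000, class_weight="balanced")
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```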
Project Structure
churn-prediction/
├── data/                 # raw + processed CSVs
├── notebooks/
│   ├── 01_eda.ipynb
│   ├── 02_features.ipynb
│   └── 03_models.ipynb
├── src/
│   ├── features.py
│   └── model.py
├── reports/              # charts, ROC curves
└── README.md
House Price Prediction
Predict residential property prices from 80+ features including size, neighbourhood, quality, and age. The definitive regression benchmark — a must-have in any ML portfolio.
Regression
XGBoost
Beginner
Linear Regression · Ridge/Lasso · XGBoost · Feature Engineering
Dataset
Ames Housing Dataset — Kaggle
Skills you'll practise
pandas · scikit-learn · XGBoost · Ridge/Lasso · Log Transform · Missing Value Imputation · RMSE / R² · Cross-validation
What to showcase
Log-transform target to fix skewness — explain why this helps RMSE
Lasso feature selection: show which features it eliminates and why
XGBoost vs Ridge comparison with 5-fold CV leaderboard
Implementation Steps
1. EDA — plot SalePrice distribution, identify skewness, correlation heatmap
2. Missing Values — impute numerics with median, categoricals with mode or "None"
3. Feature Engineering — create TotalSF, HouseAge, RemodelAge; one-hot encode categoricals
4. Baseline — Linear Regression, evaluate RMSE on log(SalePrice)
5. Regularisation — Ridge and Lasso with alpha tuning; compare feature selection (see the sketch below)
6. XGBoost — tune n_estimators, learning_rate, max_depth; plot feature importances
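A minimal sketch of the log-target plus Ridge cross-validation from steps 4 and 5, restricted to numeric columns for brevity. The file path is hypothetical and alpha=10.0 is only a starting point for tuning.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

df = pd.read_csv("data/train.csv")        # hypothetical path
y = np.log1p(df["SalePrice"])             # log1p tames the right skew

X = df.drop(columns=["SalePrice"]).select_dtypes("number")
X = X.fillna(X.median())                  # step 2: median imputation

# sklearn reports negative RMSE, so flip the sign when printing
scores = cross_val_score(Ridge(alpha=10.0), X, y,
                         scoring="neg_root_mean_squared_error", cv=5)
print(f"CV RMSE (log scale): {-scores.mean():.4f} ± {scores.std():.4f}")
```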
Project Structure
house-prices/
├── data/                 # train.csv, test.csv
├── notebooks/
│   ├── 01_eda.ipynb
│   ├── 02_preprocessing.ipynb
│   └── 03_modelling.ipynb
├── src/
│   ├── preprocess.py
│   └── train.py
└── README.md
Credit Card Fraud Detection
Identify fraudulent transactions from 284,000+ real credit card records. Tackle severe class imbalance (0.17% fraud rate) using SMOTE, cost-sensitive learning, and ensemble methods.
Classification
Random Forest
Intermediate
Random Forest · XGBoost · SMOTE · Precision-Recall Tradeoff
Dataset
Credit Card Fraud Detection — Kaggle (ULB)
Skills you'll practise
imbalanced-learn · SMOTE · Random Forest · XGBoost · Precision-Recall Curve · Threshold Tuning · Business Cost Analysis
What to showcase
Why accuracy is the wrong metric — pivot to F1/AUC-PR
Threshold optimisation: show the precision-recall tradeoff curve
Business framing: cost of false positives vs missed fraud
Implementation Steps
1. EDA — visualise class imbalance, transaction amount/time distributions
2. Baseline — fit Logistic Regression, show why accuracy is misleading (99.8% accuracy from predicting every transaction as legitimate)
3. Handle Imbalance — compare SMOTE oversampling vs undersampling vs class_weight
4. Ensemble Models — train Random Forest and XGBoost; compare AUC-PR
5. Threshold Tuning — plot the precision-recall curve, choose a threshold based on business cost (see the sketch below)
6. Explainability — SHAP values to show which PCA components drive fraud predictions
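A sketch of the step 5 threshold sweep. It assumes model, X_test, and y_test from step 4; the F1 criterion here is a stand-in, and for step 5 proper you would swap in a cost function weighing false positives against missed fraud.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Assumed from step 4: a fitted classifier and a held-out test split
proba = model.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, proba)

# Pick the threshold maximising F1 (the last P/R pair has no threshold,
# hence the [:-1] slice); replace F1 with expected business cost in practice
f1 = 2 * precision * recall / (precision + recall + 1e-12)
best = np.argmax(f1[:-1])
print(f"threshold={thresholds[best]:.3f} "
      f"precision={precision[best]:.3f} recall={recall[best]:.3f}")
```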
Employee Attrition Predictor
Use IBM's HR dataset to predict which employees are likely to quit — and explain why with SHAP. A standout project that combines predictive ML with explainability, a hot topic in industry.
Classification
XGBoost
Intermediate
XGBoost · SHAP Explainability · HR Analytics
Dataset
IBM HR Analytics Employee Attrition — Kaggle
Skills you'll practise
XGBoost · SHAP · Feature Importance · GridSearchCV · LabelEncoder · Cross-validation · Business Storytelling
What to showcase
SHAP beeswarm plot — the most impressive ML chart in any portfolio
Per-employee explanations: "why does John have 72% attrition risk?"
Actionable HR recommendations from model insights
Implementation Steps
1. EDA — attrition rate by department, job role, overtime, salary band
2. Preprocessing — encode ordinal and nominal features, drop constant columns (EmployeeCount, Over18)
3. Train XGBoost — with scale_pos_weight to handle imbalance, 5-fold stratified CV
4. Tune Hyperparameters — GridSearchCV on max_depth, n_estimators, learning_rate, subsample
5. SHAP Analysis — generate summary plot, bar plot, and waterfall for individual predictions (see the sketch below)
6. Report — write 5 data-driven HR recommendations with SHAP evidence
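A minimal end-to-end sketch of steps 3 and 5. The file path and hyperparameters are illustrative; Attrition is the label column in the IBM CSV.

```python
import pandas as pd
import shap
import xgboost as xgb

df = pd.read_csv("data/ibm_hr_attrition.csv")   # hypothetical path
y = (df["Attrition"] == "Yes").astype(int)
X = pd.get_dummies(df.drop(columns=["Attrition"]), drop_first=True)

model = xgb.XGBClassifier(
    n_estimators=300, max_depth=4,
    scale_pos_weight=(y == 0).sum() / (y == 1).sum(),  # imbalance ratio
)
model.fit(X, y)

# TreeExplainer computes exact SHAP values for tree ensembles
explainer = shap.TreeExplainer(model)
shap_values = explainer(X)
shap.plots.beeswarm(shap_values)       # global summary (the beeswarm plot)
shap.plots.waterfall(shap_values[0])   # one employee's risk, explained
```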
Retail Sales Forecasting
Forecast weekly store sales for 45 Walmart locations using historical sales, holiday flags, fuel prices, and economic indicators. Real time-series forecasting with XGBoost feature engineering.
Regression
XGBoost
Time Series
Intermediate
XGBoost · LightGBM · Lag Features · Time-Series CV
Dataset
Walmart Store Sales Forecasting — Kaggle
Skills you'll practise
Lag Features · Rolling Statistics · XGBoost · LightGBM · TimeSeriesSplit · WMAE metric · Holiday Engineering
What to showcase
Time-series cross-validation (no data leakage) — a common interview question
Lag and rolling window feature engineering explained in a notebook
Holiday effect visualisation: Thanksgiving spike, markdown impact
Implementation Steps
1. EDA — plot weekly sales trends, seasonality, holiday effects, store variance
2. Lag Features — create lag_1w, lag_4w, lag_52w (same week last year)
3. Rolling Features — 4-week and 12-week rolling mean and std, trend indicators (see the sketch below)
4. Date Features — week_of_year, month, is_holiday, days_to_holiday
5. Model — XGBoost with TimeSeriesSplit CV, tune with Optuna or a manual grid
6. Compare — benchmark XGBoost vs LightGBM vs a simple moving-average baseline
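A sketch of steps 2 and 3, assuming the Kaggle train.csv with Store, Dept, Date, and Weekly_Sales columns. Note the shift before each rolling window: the window must never include the week being predicted, or you leak the target.

```python
import pandas as pd

# Assumed Kaggle schema: Store, Dept, Date, Weekly_Sales
df = pd.read_csv("data/train.csv", parse_dates=["Date"])
df = df.sort_values(["Store", "Dept", "Date"])

g = df.groupby(["Store", "Dept"])["Weekly_Sales"]
df["lag_1w"] = g.shift(1)     # last week, same store/dept
df["lag_52w"] = g.shift(52)   # same week last year

# shift(1) first so the rolling window only sees past weeks (no leakage)
df["roll_4w_mean"] = g.transform(lambda s: s.shift(1).rolling(4).mean())
df["roll_4w_std"] = g.transform(lambda s: s.shift(1).rolling(4).std())
```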
Spam Email Classifier
Build a text classifier that distinguishes spam from legitimate email using TF-IDF and Naive Bayes. The go-to NLP entry point for ML learners — fast to train, easy to interpret, and a classic interview topic.
Classification
NLP
Beginner
Naive Bayes · SVM · TF-IDF · Text Preprocessing
Dataset
SMS Spam Collection — UCI ML Repository
Skills you'll practise
NLTK / spaCy · TF-IDF · Naive Bayes · SVM · Text Preprocessing · Pipeline · WordCloud
What to showcase
WordCloud of most frequent spam vs ham terms
sklearn Pipeline combining TF-IDF + classifier — production-ready pattern
Why Naive Bayes works so well for text despite its independence assumption
Implementation Steps
1. Preprocessing — lowercase, remove punctuation, stopwords, stemming/lemmatisation
2. Vectorise — CountVectorizer vs TF-IDF; explain the difference with examples
3. Train — MultinomialNB as baseline; compare with LinearSVC and Logistic Regression
4. Evaluate — weigh the precision-recall tradeoff: high recall keeps spam out of the inbox, high precision keeps legitimate mail out of the spam folder
5. Pipeline — wrap TF-IDF + classifier in an sklearn Pipeline; show how to predict new messages (see the sketch below)
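A minimal sketch of the step 5 pipeline. The two training messages are toy placeholders for the parsed UCI corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy stand-ins for the parsed SMS corpus (1 = spam, 0 = ham)
texts = ["WINNER!! Claim your free prize now", "Are we still on for lunch?"]
labels = [1, 0]

spam_clf = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english")),
    ("nb", MultinomialNB()),
])
spam_clf.fit(texts, labels)

# One call vectorises and classifies a new message
print(spam_clf.predict(["Free entry: reply WIN to claim"]))
```

Because the vectoriser lives inside the Pipeline, the exact same preprocessing is applied at train and predict time, which is the production-ready pattern mentioned above.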
Titanic Survival — Ensemble Deep Dive
Go beyond the basic Titanic submission. Combine Gradient Boosting, Random Forest, and Logistic Regression into a stacked ensemble. Ideal for showing advanced feature engineering and model stacking skills.
Classification
Ensemble
Beginner
Random Forest · Gradient Boosting · Stacking · Advanced Feature Engineering
Dataset
Titanic — Machine Learning from Disaster (Kaggle)
Skills you'll practise
Random Forest · GradientBoosting · StackingClassifier · Title Extraction · Family Size · CabinDeck · Cross-val Score
What to showcase
Creative feature engineering: extract Title from Name, Deck from Cabin
Stacking with StackingClassifier — show each base model's CV score vs the meta-model
Learning curves to diagnose over/underfitting per model
Implementation Steps
1. Feature Engineering — extract Title, FamilySize, IsAlone, CabinDeck, FarePerPerson
2. Imputation — median Age by Title group, "Missing" cabin category
3. Base Models — Logistic Regression, Random Forest, GradientBoostingClassifier
4. Stacking — use sklearn's StackingClassifier with an LR meta-learner; evaluate with stratified CV (see the sketch below)
5. Ablation Study — measure the accuracy impact of each engineered feature added and removed
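A sketch of step 4, assuming X and y are the engineered feature matrix and survival labels from steps 1 and 2. With a classifier and cv=5, cross_val_score stratifies by default.

```python
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# X, y assumed from steps 1-2 (engineered Titanic features, Survived labels)
base_models = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("rf", RandomForestClassifier(n_estimators=300, random_state=42)),
    ("gb", GradientBoostingClassifier(random_state=42)),
]
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(),  # LR meta-learner
    cv=5,  # base models feed out-of-fold predictions to the meta-learner
)

# CV leaderboard: each base model vs the stacked ensemble
for name, model in base_models + [("stack", stack)]:
    print(f"{name}: {cross_val_score(model, X, y, cv=5).mean():.4f}")
```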
Loan Default Risk Scorer
Score credit applicants by default probability using XGBoost with careful feature selection and business-aware threshold optimisation. Mirrors a real FinTech ML system — high hiring signal.
Classification
XGBoost
Intermediate
XGBoost · Feature Selection · AUC-ROC · Threshold Optimisation
Dataset
Give Me Some Credit — Kaggle
Skills you'll practise
XGBoost · Feature Selection · Mutual Information · AUC-ROC · Threshold Tuning · Calibration · SHAP
What to showcase
Probability calibration — why raw XGBoost scores aren't true probabilities
Business-aware threshold: optimise for expected monetary loss, not F1
SHAP force plot: explain a single applicant's risk score
Implementation Steps
1. EDA — distribution of delinquencies, debt ratio, income; identify outliers and heavy tails
2. Imputation — median for monthly income, 0 for missing NumberOfDependents
3. Feature Selection — mutual information, XGBoost built-in importance, drop multicollinear features
4. Train XGBoost — stratified k-fold, early stopping, scale_pos_weight for imbalance
5. Calibration — apply Platt scaling (CalibratedClassifierCV) and compare reliability diagrams (see the sketch below)
6. Business Threshold — sweep thresholds, computing expected loss = P(default) × loan_amount at each one
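A sketch of step 5, assuming the train/test split from step 4; the XGBoost hyperparameters are illustrative. method="sigmoid" is Platt scaling.

```python
from sklearn.calibration import CalibratedClassifierCV, CalibrationDisplay
from xgboost import XGBClassifier

# X_train, y_train, X_test, y_test assumed from the stratified split in step 4
raw = XGBClassifier(n_estimators=300, max_depth=4)
calibrated = CalibratedClassifierCV(raw, method="sigmoid", cv=5)  # Platt scaling
calibrated.fit(X_train, y_train)

# Reliability diagrams: a well-calibrated model hugs the diagonal
raw.fit(X_train, y_train)
CalibrationDisplay.from_estimator(raw, X_test, y_test, n_bins=10, name="raw")
CalibrationDisplay.from_estimator(calibrated, X_test, y_test, n_bins=10,
                                  name="calibrated")
```

Calibrated probabilities matter here because step 6's expected-loss sweep multiplies P(default) by loan amount, so the scores have to behave like real probabilities.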
Data Science Salary Predictor
Predict salaries across AI/ML/Data roles globally using job title, experience, company size, and remote work ratio. A highly relatable project that resonates with every interviewer in tech.
Regression
XGBoost
Beginner
XGBoost · Gradient Boosting · OrdinalEncoder · Cross-validation
Dataset
Data Science Salaries 2024 — Kaggle
Skills you'll practise
XGBoost · OrdinalEncoder · TargetEncoder · Cross-validation · RMSE / MAE · Feature Importance · Plotly
What to showcase
Interactive Plotly chart: salary by role and experience — eye-catching in a portfolio
Target encoding for job_title (hundreds of categories) — explain vs one-hot
Link a live "predict my salary" widget (e.g. Streamlit) from the README
Implementation Steps
1. EDA — salary distributions by role, seniority, country; outlier inspection
2. Encoding — ordinal for experience_level, target encoding for job_title and company_location (see the sketch below)
3. Baseline — Linear Regression with one-hot encoding; set an RMSE benchmark
4. XGBoost — train with 5-fold CV, tune learning_rate and max_depth, evaluate MAE
5. Visualise — Plotly bar chart of median salary by title; feature importance chart
6. Predict Tool — simple input() loop or Streamlit widget to predict salary from user inputs
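A sketch of the step 2 encoding wired into step 4's model, using sklearn's TargetEncoder (available from sklearn 1.3). The column names and the EN/MI/SE/EX experience levels assume the Kaggle schema; X and y are the feature frame and salary_in_usd target.

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, TargetEncoder  # sklearn >= 1.3
from xgboost import XGBRegressor

# Column names and category levels assume the Kaggle CSV; adjust as needed
encode = ColumnTransformer([
    ("exp", OrdinalEncoder(categories=[["EN", "MI", "SE", "EX"]]),
     ["experience_level"]),
    ("target", TargetEncoder(), ["job_title", "company_location"]),
], remainder="drop")

model = Pipeline([("encode", encode), ("xgb", XGBRegressor(n_estimators=300))])

# TargetEncoder cross-fits internally, so fitting inside a Pipeline keeps the
# encoding leakage-free; X, y assumed from the loaded dataset
model.fit(X, y)
```

Unlike one-hot encoding, target encoding collapses hundreds of job titles into a single informative column, which is exactly the talking point suggested above.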
Household Energy Consumption Forecaster
Forecast electricity demand from 2 million+ hourly readings using advanced feature engineering, LightGBM, and a stacked ensemble. An advanced project that demonstrates production-level thinking.
Regression
Ensemble
Time Series
Advanced
LightGBM · XGBoost · Stacked Ensemble · Advanced Feature Engineering
Dataset
Individual Household Electric Power Consumption — UCI ML Repository
Skills you'll practise
LightGBM · XGBoost · Stacking · Fourier Features · Large Dataset Handling · TimeSeriesSplit · MAE / MAPE
What to showcase
Handling 2M+ rows efficiently with chunked reading and dtype optimisation
Fourier terms to capture daily/weekly/yearly seasonality without SARIMA
Stacked ensemble: show that blending LightGBM + XGBoost beats each alone
Implementation Steps
1. Load Efficiently — parse with chunked read_csv, downcast dtypes, resample to hourly
2. EDA — daily/weekly/seasonal decomposition, missing value patterns, outlier removal
3. Feature Engineering — hour, dayofweek, month, is_weekend, lag_1h, lag_24h, lag_168h, rolling stats
4. Fourier Features — sin/cos transforms of hour and day_of_year to encode cyclical patterns (see the sketch below)
5. Models — LightGBM and XGBoost with TimeSeriesSplit; compare MAPE per hour-of-day
6. Stack — average predictions of the best LightGBM + XGBoost; show the ensemble outperforms both
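A sketch of step 4's Fourier terms, assuming df from step 1 has an hourly DatetimeIndex. The helper name and the orders (3 daily harmonics, 2 yearly) are illustrative.

```python
import numpy as np
import pandas as pd

def fourier_features(ts: pd.Series, period: float, order: int,
                     prefix: str) -> pd.DataFrame:
    """sin/cos pairs encoding a cyclical variable ts with the given period."""
    out = {}
    for k in range(1, order + 1):
        out[f"{prefix}_sin{k}"] = np.sin(2 * np.pi * k * ts / period)
        out[f"{prefix}_cos{k}"] = np.cos(2 * np.pi * k * ts / period)
    return pd.DataFrame(out, index=ts.index)

# df assumed from step 1: hourly-resampled data with a DatetimeIndex
daily = fourier_features(pd.Series(df.index.hour, index=df.index),
                         period=24, order=3, prefix="day")
yearly = fourier_features(pd.Series(df.index.dayofyear, index=df.index),
                          period=365.25, order=2, prefix="year")
X = pd.concat([df, daily, yearly], axis=1)
```

The sin/cos pairs let the tree models see that hour 23 and hour 0 are neighbours, capturing seasonality without fitting a separate SARIMA model.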
Project Structure
energy-forecasting/
├── data/                 # raw UCI data + hourly resampled
├── notebooks/
│   ├── 01_eda.ipynb
│   ├── 02_features.ipynb
│   ├── 03_lgbm.ipynb
│   └── 04_ensemble.ipynb
├── src/
│   ├── features.py       # all feature engineering
│   ├── train.py          # train & evaluate pipeline
│   └── ensemble.py       # stacking logic
└── README.md