Build ML Projects That Get You Hired
10 real-world ML projects with curated datasets, step-by-step implementation plans, and portfolio tips. Each one is designed around the skills hiring managers actually look for — from your first regression to production-grade XGBoost pipelines.
Customer Churn Prediction
Predict which telecom customers will cancel their subscription using behavioural and account data. A classic binary classification problem that appears in nearly every ML interview.
Classification
Beginner
Logistic Regression · Random Forest · Feature Engineering
Dataset
Telco Customer Churn — Kaggle
Skills you'll practise
pandas · scikit-learn · Logistic Regression · Random Forest · EDA · Feature Engineering · ROC-AUC · Confusion Matrix
What to showcase
Class imbalance handling with SMOTE or class_weight
Feature importance chart from Random Forest
ROC curve comparison: Logistic vs RF vs XGBoost
Implementation Steps
1. EDA — visualise churn rate, plot distributions, identify missing values and outliers
2. Feature Engineering — encode categoricals, create tenure buckets, scale numerics
3. Baseline Model — fit Logistic Regression, evaluate with accuracy, precision, recall, F1
4. Handle Imbalance — apply SMOTE oversampling or class_weight='balanced' and compare results (see the sketch below)
5. Improve with RF — train Random Forest, tune n_estimators and max_depth via GridSearchCV
6. Summarise — plot the ROC curve and feature importances, write up the business impact of each churn segment
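To make steps 3 and 4 concrete, here is a minimal sketch of the class_weight route. The file path is hypothetical, and the Churn and customerID column names assume the standard Telco CSV; adjust both to your download.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Hypothetical path and column names: match them to the Kaggle file
df = pd.read_csv("data/telco_churn.csv")
y = (df["Churn"] == "Yes").astype(int)
X = pd.get_dummies(df.drop(columns=["Churn", "customerID"]), drop_first=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# class_weight='balanced' re-weights the loss so the minority (churn) class
# counts as much as the majority class: the simplest alternative to SMOTE
clf = LogisticRegression(max_iter=1000, class_weight="balanced")
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```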
Project Structure
churn-prediction/
├── data/                 # raw + processed CSVs
├── notebooks/
│   ├── 01_eda.ipynb
│   ├── 02_features.ipynb
│   └── 03_models.ipynb
├── src/
│   ├── features.py
│   └── model.py
├── reports/              # charts, ROC curves
└── README.md
House Price Prediction
Predict residential property prices from 80+ features including size, neighbourhood, quality, and age. The definitive regression benchmark — a must-have in any ML portfolio.
Regression
XGBoost
Beginner
Linear Regression · Ridge/Lasso · XGBoost · Feature Engineering
Dataset
Ames Housing Dataset — Kaggle
Skills you'll practise
pandas · scikit-learn · XGBoost · Ridge/Lasso · Log Transform · Missing Value Imputation · RMSE / R² · Cross-validation
What to showcase
Log-transform target to fix skewness — explain why this helps RMSE
Lasso feature selection: show which features it eliminates and why
XGBoost vs Ridge comparison with 5-fold CV leaderboard
Implementation Steps
1. EDA — plot SalePrice distribution, identify skewness, correlation heatmap
2. Missing Values — impute numerics with median, categoricals with mode or "None"
3. Feature Engineering — create TotalSF, HouseAge, RemodelAge; one-hot encode categoricals
4. Baseline — Linear Regression, evaluate RMSE on log(SalePrice)
5. Regularisation — Ridge and Lasso with alpha tuning; compare feature selection (see the sketch below)
6. XGBoost — tune n_estimators, learning_rate, max_depth; plot feature importances
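A minimal sketch of the log-target plus Ridge cross-validation from steps 4 and 5, restricted to numeric columns for brevity. The file path is hypothetical and alpha=10.0 is only a starting point for tuning.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

df = pd.read_csv("data/train.csv")        # hypothetical path
y = np.log1p(df["SalePrice"])             # log1p tames the right skew

X = df.drop(columns=["SalePrice"]).select_dtypes("number")
X = X.fillna(X.median())                  # step 2: median imputation

# sklearn reports negative RMSE, so flip the sign when printing
scores = cross_val_score(Ridge(alpha=10.0), X, y,
                         scoring="neg_root_mean_squared_error", cv=5)
print(f"CV RMSE (log scale): {-scores.mean():.4f} ± {scores.std():.4f}")
```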
Project Structure
house-prices/
├── data/                 # train.csv, test.csv
├── notebooks/
│   ├── 01_eda.ipynb
│   ├── 02_preprocessing.ipynb
│   └── 03_modelling.ipynb
├── src/
│   ├── preprocess.py
│   └── train.py
└── README.md
Credit Card Fraud Detection
Identify fraudulent transactions from 284,000+ real credit card records. Tackle severe class imbalance (0.17% fraud rate) using SMOTE, cost-sensitive learning, and ensemble methods.
Classification
Random Forest
Intermediate
Random Forest · XGBoost · SMOTE · Precision-Recall Tradeoff
Dataset
Credit Card Fraud Detection — Kaggle (ULB)
Skills you'll practise
imbalanced-learn · SMOTE · Random Forest · XGBoost · Precision-Recall Curve · Threshold Tuning · Business Cost Analysis
What to showcase
Why accuracy is the wrong metric — pivot to F1/AUC-PR
Threshold optimisation: show the precision-recall tradeoff curve
Business framing: cost of false positives vs missed fraud
Implementation Steps
1. EDA — visualise class imbalance, transaction amount/time distributions
2. Baseline — fit Logistic Regression, show why accuracy is misleading (99.8% accuracy from predicting every transaction as legitimate)
3. Handle Imbalance — compare SMOTE oversampling vs undersampling vs class_weight
4. Ensemble Models — train Random Forest and XGBoost; compare AUC-PR
5. Threshold Tuning — plot the precision-recall curve, choose a threshold based on business cost (see the sketch below)
6. Explainability — SHAP values to show which PCA components drive fraud predictions
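A sketch of the step 5 threshold sweep. It assumes model, X_test, and y_test from step 4; the F1 criterion here is a stand-in, and for step 5 proper you would swap in a cost function weighing false positives against missed fraud.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Assumed from step 4: a fitted classifier and a held-out test split
proba = model.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, proba)

# Pick the threshold maximising F1 (the last P/R pair has no threshold,
# hence the [:-1] slice); replace F1 with expected business cost in practice
f1 = 2 * precision * recall / (precision + recall + 1e-12)
best = np.argmax(f1[:-1])
print(f"threshold={thresholds[best]:.3f} "
      f"precision={precision[best]:.3f} recall={recall[best]:.3f}")
```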
Employee Attrition Predictor
Use IBM's HR dataset to predict which employees are likely to quit — and explain why with SHAP. A standout project that combines predictive ML with explainability, a hot topic in industry.
Classification
XGBoost
Intermediate
XGBoost · SHAP Explainability · HR Analytics
Dataset
IBM HR Analytics Employee Attrition — Kaggle
Skills you'll practise
XGBoost · SHAP · Feature Importance · GridSearchCV · LabelEncoder · Cross-validation · Business Storytelling
What to showcase
SHAP beeswarm plot — the most impressive ML chart in any portfolio
Per-employee explanations: "why does John have 72% attrition risk?"
Actionable HR recommendations from model insights
Implementation Steps
1. EDA — attrition rate by department, job role, overtime, salary band
2. Preprocessing — encode ordinal and nominal features, drop constant columns (EmployeeCount, Over18)
3. Train XGBoost — with scale_pos_weight to handle imbalance, 5-fold stratified CV
4. Tune Hyperparameters — GridSearchCV on max_depth, n_estimators, learning_rate, subsample
5. SHAP Analysis — generate summary plot, bar plot, and waterfall for individual predictions (see the sketch below)
6. Report — write 5 data-driven HR recommendations with SHAP evidence
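A minimal end-to-end sketch of steps 3 and 5. The file path and hyperparameters are illustrative; Attrition is the label column in the IBM CSV.

```python
import pandas as pd
import shap
import xgboost as xgb

df = pd.read_csv("data/ibm_hr_attrition.csv")   # hypothetical path
y = (df["Attrition"] == "Yes").astype(int)
X = pd.get_dummies(df.drop(columns=["Attrition"]), drop_first=True)

model = xgb.XGBClassifier(
    n_estimators=300, max_depth=4,
    scale_pos_weight=(y == 0).sum() / (y == 1).sum(),  # imbalance ratio
)
model.fit(X, y)

# TreeExplainer computes exact SHAP values for tree ensembles
explainer = shap.TreeExplainer(model)
shap_values = explainer(X)
shap.plots.beeswarm(shap_values)       # global summary (the beeswarm plot)
shap.plots.waterfall(shap_values[0])   # one employee's risk, explained
```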
Retail Sales Forecasting
Forecast weekly store sales for 45 Walmart locations using historical sales, holiday flags, fuel prices, and economic indicators. Real time-series forecasting with XGBoost feature engineering.
Regression
XGBoost
Time Series
Intermediate
XGBoost · LightGBM · Lag Features · Time-Series CV
Dataset
Walmart Store Sales Forecasting — Kaggle
Skills you'll practise
Lag Features · Rolling Statistics · XGBoost · LightGBM · TimeSeriesSplit · WMAE metric · Holiday Engineering
What to showcase
Time-series cross-validation (no data leakage) — a common interview question
Lag and rolling window feature engineering explained in a notebook
Holiday effect visualisation: Thanksgiving spike, markdown impact
Implementation Steps
1. EDA — plot weekly sales trends, seasonality, holiday effects, store variance
2. Lag Features — create lag_1w, lag_4w, lag_52w (same week last year)
3. Rolling Features — 4-week and 12-week rolling mean and std, trend indicators (see the sketch below)
4. Date Features — week_of_year, month, is_holiday, days_to_holiday
5. Model — XGBoost with TimeSeriesSplit CV, tune with Optuna or a manual grid
6. Compare — benchmark XGBoost vs LightGBM vs a simple moving-average baseline
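A sketch of steps 2 and 3, assuming the Kaggle train.csv with Store, Dept, Date, and Weekly_Sales columns. Note the shift before each rolling window: the window must never include the week being predicted, or you leak the target.

```python
import pandas as pd

# Assumed Kaggle schema: Store, Dept, Date, Weekly_Sales
df = pd.read_csv("data/train.csv", parse_dates=["Date"])
df = df.sort_values(["Store", "Dept", "Date"])

g = df.groupby(["Store", "Dept"])["Weekly_Sales"]
df["lag_1w"] = g.shift(1)     # last week, same store/dept
df["lag_52w"] = g.shift(52)   # same week last year

# shift(1) first so the rolling window only sees past weeks (no leakage)
df["roll_4w_mean"] = g.transform(lambda s: s.shift(1).rolling(4).mean())
df["roll_4w_std"] = g.transform(lambda s: s.shift(1).rolling(4).std())
```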
Spam Email Classifier
Build a text classifier that distinguishes spam from legitimate email using TF-IDF and Naive Bayes. The go-to NLP entry point for ML learners — fast to train, easy to interpret, and a classic interview topic.
Classification
NLP
Beginner
Naive Bayes · SVM · TF-IDF · Text Preprocessing
Dataset
SMS Spam Collection — UCI ML Repository
Skills you'll practise
NLTK / spaCy · TF-IDF · Naive Bayes · SVM · Text Preprocessing · Pipeline · WordCloud
What to showcase
WordCloud of most frequent spam vs ham terms
sklearn Pipeline combining TF-IDF + classifier — production-ready pattern
Why Naive Bayes works so well for text despite its independence assumption
Implementation Steps
1. Preprocessing — lowercase, remove punctuation, stopwords, stemming/lemmatisation
2. Vectorise — CountVectorizer vs TF-IDF; explain the difference with examples
3. Train — MultinomialNB as baseline; compare with LinearSVC and Logistic Regression
4. Evaluate — weigh the precision-recall tradeoff: high recall keeps spam out of the inbox, high precision keeps legitimate mail out of the spam folder
5. Pipeline — wrap TF-IDF + classifier in an sklearn Pipeline; show how to predict new messages (see the sketch below)
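A minimal sketch of the step 5 pipeline. The two training messages are toy placeholders for the parsed UCI corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy stand-ins for the parsed SMS corpus (1 = spam, 0 = ham)
texts = ["WINNER!! Claim your free prize now", "Are we still on for lunch?"]
labels = [1, 0]

spam_clf = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english")),
    ("nb", MultinomialNB()),
])
spam_clf.fit(texts, labels)

# One call vectorises and classifies a new message
print(spam_clf.predict(["Free entry: reply WIN to claim"]))
```

Because the vectoriser lives inside the Pipeline, the exact same preprocessing is applied at train and predict time, which is the production-ready pattern mentioned above.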
Titanic Survival — Ensemble Deep Dive
Go beyond the basic Titanic submission. Combine Gradient Boosting, Random Forest, and Logistic Regression into a stacked ensemble. Ideal for showing advanced feature engineering and model stacking skills.
Classification
Ensemble
Beginner
Random Forest · Gradient Boosting · Stacking · Advanced Feature Engineering
Dataset
Titanic — Machine Learning from Disaster (Kaggle)
Skills you'll practise
Random Forest · GradientBoosting · StackingClassifier · Title Extraction · Family Size · CabinDeck · Cross-val Score
What to showcase
Creative feature engineering: extract Title from Name, Deck from Cabin
Stacking with StackingClassifier — show each base model's CV score vs the meta-model
Learning curves to diagnose over/underfitting per model
Implementation Steps
1. Feature Engineering — extract Title, FamilySize, IsAlone, CabinDeck, FarePerPerson
2. Imputation — median Age by Title group, "Missing" cabin category
3. Base Models — Logistic Regression, Random Forest, GradientBoostingClassifier
4. Stacking — use sklearn's StackingClassifier with an LR meta-learner; evaluate with stratified CV (see the sketch below)
5. Ablation Study — measure the accuracy impact of each engineered feature added and removed
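A sketch of step 4, assuming X and y are the engineered feature matrix and survival labels from steps 1 and 2. With a classifier and cv=5, cross_val_score stratifies by default.

```python
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# X, y assumed from steps 1-2 (engineered Titanic features, Survived labels)
base_models = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("rf", RandomForestClassifier(n_estimators=300, random_state=42)),
    ("gb", GradientBoostingClassifier(random_state=42)),
]
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(),  # LR meta-learner
    cv=5,  # base models feed out-of-fold predictions to the meta-learner
)

# CV leaderboard: each base model vs the stacked ensemble
for name, model in base_models + [("stack", stack)]:
    print(f"{name}: {cross_val_score(model, X, y, cv=5).mean():.4f}")
```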
Loan Default Risk Scorer
Score credit applicants by default probability using XGBoost with careful feature selection and business-aware threshold optimisation. Mirrors a real FinTech ML system — high hiring signal.
Classification
XGBoost
Intermediate
XGBoost · Feature Selection · AUC-ROC · Threshold Optimisation
Dataset
Give Me Some Credit — Kaggle
Skills you'll practise
XGBoost · Feature Selection · Mutual Information · AUC-ROC · Threshold Tuning · Calibration · SHAP
What to showcase
Probability calibration — why raw XGBoost scores aren't true probabilities
Business-aware threshold: optimise for expected monetary loss, not F1
SHAP force plot: explain a single applicant's risk score
Implementation Steps
1. EDA — distribution of delinquencies, debt ratio, income; identify outliers and heavy tails
2. Imputation — median for monthly income, 0 for missing NumberOfDependents
3. Feature Selection — mutual information, XGBoost built-in importance, drop multicollinear features
4. Train XGBoost — stratified k-fold, early stopping, scale_pos_weight for imbalance
5. Calibration — apply Platt scaling (CalibratedClassifierCV) and compare reliability diagrams (see the sketch below)
6. Business Threshold — sweep thresholds, computing expected loss = P(default) × loan_amount at each one
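A sketch of step 5, assuming the train/test split from step 4; the XGBoost hyperparameters are illustrative. method="sigmoid" is Platt scaling.

```python
from sklearn.calibration import CalibratedClassifierCV, CalibrationDisplay
from xgboost import XGBClassifier

# X_train, y_train, X_test, y_test assumed from the stratified split in step 4
raw = XGBClassifier(n_estimators=300, max_depth=4)
calibrated = CalibratedClassifierCV(raw, method="sigmoid", cv=5)  # Platt scaling
calibrated.fit(X_train, y_train)

# Reliability diagrams: a well-calibrated model hugs the diagonal
raw.fit(X_train, y_train)
CalibrationDisplay.from_estimator(raw, X_test, y_test, n_bins=10, name="raw")
CalibrationDisplay.from_estimator(calibrated, X_test, y_test, n_bins=10,
                                  name="calibrated")
```

Calibrated probabilities matter here because step 6's expected-loss sweep multiplies P(default) by loan amount, so the scores have to behave like real probabilities.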
Data Science Salary Predictor
Predict salaries across AI/ML/Data roles globally using job title, experience, company size, and remote work ratio. A highly relatable project that resonates with every interviewer in tech.
Regression
XGBoost
Beginner
XGBoost · Gradient Boosting · OrdinalEncoder · Cross-validation
Dataset
Data Science Salaries 2024 — Kaggle
Skills you'll practise
XGBoost · OrdinalEncoder · TargetEncoder · Cross-validation · RMSE / MAE · Feature Importance · Plotly
What to showcase
Interactive Plotly chart: salary by role and experience — eye-catching in a portfolio
Target encoding for job_title (hundreds of categories) — explain vs one-hot
Link a live "predict my salary" widget (e.g. Streamlit) from the README
Implementation Steps
1. EDA — salary distributions by role, seniority, country; outlier inspection
2. Encoding — ordinal for experience_level, target encoding for job_title and company_location (see the sketch below)
3. Baseline — Linear Regression with one-hot encoding; set an RMSE benchmark
4. XGBoost — train with 5-fold CV, tune learning_rate and max_depth, evaluate MAE
5. Visualise — Plotly bar chart of median salary by title; feature importance chart
6. Predict Tool — simple input() loop or Streamlit widget to predict salary from user inputs
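A sketch of the step 2 encoding wired into step 4's model, using sklearn's TargetEncoder (available from sklearn 1.3). The column names and the EN/MI/SE/EX experience levels assume the Kaggle schema; X and y are the feature frame and salary_in_usd target.

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, TargetEncoder  # sklearn >= 1.3
from xgboost import XGBRegressor

# Column names and category levels assume the Kaggle CSV; adjust as needed
encode = ColumnTransformer([
    ("exp", OrdinalEncoder(categories=[["EN", "MI", "SE", "EX"]]),
     ["experience_level"]),
    ("target", TargetEncoder(), ["job_title", "company_location"]),
], remainder="drop")

model = Pipeline([("encode", encode), ("xgb", XGBRegressor(n_estimators=300))])

# TargetEncoder cross-fits internally, so fitting inside a Pipeline keeps the
# encoding leakage-free; X, y assumed from the loaded dataset
model.fit(X, y)
```

Unlike one-hot encoding, target encoding collapses hundreds of job titles into a single informative column, which is exactly the talking point suggested above.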
Household Energy Consumption Forecaster
Forecast electricity demand from 2 million+ hourly readings using advanced feature engineering, LightGBM, and a stacked ensemble. An advanced project that demonstrates production-level thinking.
Regression
Ensemble
Time Series
Advanced
LightGBM · XGBoost · Stacked Ensemble · Advanced Feature Engineering
Dataset
Individual Household Electric Power Consumption — UCI ML Repository
Skills you'll practise
LightGBM · XGBoost · Stacking · Fourier Features · Large Dataset Handling · TimeSeriesSplit · MAE / MAPE
What to showcase
Handling 2M+ rows efficiently with chunked reading and dtype optimisation
Fourier terms to capture daily/weekly/yearly seasonality without SARIMA
Stacked ensemble: show that blending LightGBM + XGBoost beats each alone
Implementation Steps
1. Load Efficiently — parse with chunked read_csv, downcast dtypes, resample to hourly
2. EDA — daily/weekly/seasonal decomposition, missing value patterns, outlier removal
3. Feature Engineering — hour, dayofweek, month, is_weekend, lag_1h, lag_24h, lag_168h, rolling stats
4. Fourier Features — sin/cos transforms of hour and day_of_year to encode cyclical patterns (see the sketch below)
5. Models — LightGBM and XGBoost with TimeSeriesSplit; compare MAPE per hour-of-day
6. Stack — average predictions of the best LightGBM + XGBoost; show the ensemble outperforms both
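A sketch of step 4's Fourier terms, assuming df from step 1 has an hourly DatetimeIndex. The helper name and the orders (3 daily harmonics, 2 yearly) are illustrative.

```python
import numpy as np
import pandas as pd

def fourier_features(ts: pd.Series, period: float, order: int,
                     prefix: str) -> pd.DataFrame:
    """sin/cos pairs encoding a cyclical variable ts with the given period."""
    out = {}
    for k in range(1, order + 1):
        out[f"{prefix}_sin{k}"] = np.sin(2 * np.pi * k * ts / period)
        out[f"{prefix}_cos{k}"] = np.cos(2 * np.pi * k * ts / period)
    return pd.DataFrame(out, index=ts.index)

# df assumed from step 1: hourly-resampled data with a DatetimeIndex
daily = fourier_features(pd.Series(df.index.hour, index=df.index),
                         period=24, order=3, prefix="day")
yearly = fourier_features(pd.Series(df.index.dayofyear, index=df.index),
                          period=365.25, order=2, prefix="year")
X = pd.concat([df, daily, yearly], axis=1)
```

The sin/cos pairs let the tree models see that hour 23 and hour 0 are neighbours, capturing seasonality without fitting a separate SARIMA model.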
Project Structure
energy-forecasting/
├── data/                 # raw UCI data + hourly resampled
├── notebooks/
│   ├── 01_eda.ipynb
│   ├── 02_features.ipynb
│   ├── 03_lgbm.ipynb
│   └── 04_ensemble.ipynb
├── src/
│   ├── features.py       # all feature engineering
│   ├── train.py          # train & evaluate pipeline
│   └── ensemble.py       # stacking logic
└── README.md