Interview Prep * ML * Deep Learning * MLOps

Crack Your Next ML Interview

400+ questions that actually get asked at top AI/ML companies -- with model answers, follow-ups, and a self-score rubric. Practice like the role is already yours.

Machine Learning Fundamentals
Core ML concepts, metrics, generalization * beginner -> intermediate
10+ questions
Q1 * Explain the bias-variance tradeoff.
Level: Beginner
Expected answer
Bias-variance tradeoff describes how model complexity affects generalization:
  • High bias -> underfitting (model too simple, misses patterns).
  • High variance -> overfitting (model too complex, memorizes noise).
  • Goal is to find a balance that minimizes total error on unseen data.
Regularization, model choice, and data size all influence this tradeoff.
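A quick way to see the tradeoff empirically is to fit models of increasing flexibility and compare train vs validation error. A minimal sketch, assuming scikit-learn and synthetic data:

```python
# Minimal sketch: under- vs overfitting with polynomial degree (assumes scikit-learn).
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.3, size=200)  # noisy sine wave
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):  # too simple, about right, too flexible
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    print(degree,
          mean_squared_error(y_tr, model.predict(X_tr)),   # train error
          mean_squared_error(y_va, model.predict(X_va)))   # validation error
```

Degree 1 shows high bias (both errors high); degree 15 shows high variance (low train error, higher validation error).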
Follow‑up questions
  • How does regularization affect bias and variance?
  • Give an example of a high‑bias model and a high‑variance model.
  • How would you detect overfitting in practice?
Evaluation rubric
Strong
Clearly defines bias and variance, explains under/overfitting, and connects to regularization.
OK
Mentions under/overfitting but not how to control the tradeoff.
Weak
Vague explanation; confuses bias with data bias or fairness only.
Q2 * What is regularization and why is it used?
Level: Beginner
Expected answer
Regularization adds a penalty term to the loss function to discourage overly complex models:
  • L2 (Ridge): penalizes squared weights, encourages small but non‑zero weights.
  • L1 (Lasso): penalizes absolute weights, encourages sparsity and feature selection.
  • Reduces overfitting by controlling model capacity.
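To make the L1-vs-L2 contrast tangible, a minimal sketch (assuming scikit-learn; the data and alpha values are arbitrary):

```python
# Minimal sketch of L2 vs L1 shrinkage (assumes scikit-learn).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=10.0, random_state=0)
ridge = Ridge(alpha=1.0).fit(X, y)  # L2: shrinks all weights, rarely to exactly zero
lasso = Lasso(alpha=5.0).fit(X, y)  # L1: drives uninformative weights to exactly zero
print("ridge nonzero:", np.sum(ridge.coef_ != 0))  # typically all 10
print("lasso nonzero:", np.sum(lasso.coef_ != 0))  # typically close to 3
```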
Follow‑up questions
  • When would you prefer L1 over L2?
  • How does regularization interact with feature scaling?
  • What happens if the regularization strength is too high?
Evaluation rubric
Strong
Explains L1 vs L2, connects to overfitting and model complexity, mentions trade‑offs.
OK
Knows it "prevents overfitting" but not how or why different types matter.
Weak
Treats regularization as a generic "tuning trick" with no detail.
Q3 * What is cross‑validation and why is it useful?
Level: Beginner
Expected answer
Cross‑validation splits data into multiple folds to estimate how a model generalizes:
  • Train on k‑1 folds, validate on the remaining fold; repeat for all folds.
  • Gives a lower-variance estimate of generalization performance than a single train/validation split.
  • Stratified CV is important for classification with imbalanced classes.
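A minimal sketch of stratified k-fold in practice (assuming scikit-learn):

```python
# Stratified 5-fold CV on an imbalanced dataset (assumes scikit-learn).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # preserves class ratios per fold
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="f1")
print(scores.mean(), scores.std())  # mean performance and spread across folds
```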
Follow‑up questions
  • When would you avoid k‑fold CV (e.g., time series)?
  • How do you adapt CV for time‑dependent data?
  • How does CV interact with hyperparameter tuning?
Evaluation rubric
Strong
Describes k‑fold clearly, mentions stratification and limitations (e.g., time series).
OK
Knows the basic idea but not when to use different variants.
Weak
Confuses CV with simple train/test split.
Q4 * What is the curse of dimensionality?
Level: Intermediate
Expected answer
The curse of dimensionality refers to phenomena that arise in high‑dimensional spaces:
  • Distances between points become less meaningful.
  • Data becomes sparse; models need exponentially more data.
  • Many algorithms (KNN, clustering) degrade as dimensions grow.
Dimensionality reduction (PCA, feature selection) helps mitigate this.
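The distance-concentration effect is easy to demonstrate (a NumPy-only sketch):

```python
# Sketch of distance concentration: in high dimensions, the nearest and farthest
# neighbors become almost equally far away (NumPy only).
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 100, 10_000):
    points = rng.uniform(size=(500, d))
    dists = np.linalg.norm(points - points[0], axis=1)[1:]  # distances from one point
    print(d, dists.min() / dists.max())  # ratio approaches 1 as d grows
```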
Follow‑up questions
  • How does PCA help with high‑dimensional data?
  • When would you prefer feature selection over PCA?
  • How does the curse of dimensionality affect KNN?
Evaluation rubric
Strong
Explains sparsity, distance issues, and connects to specific algorithms and mitigation methods.
OK
Gives a high‑level definition without concrete implications or examples.
Weak
Very vague; no connection to model performance.
Q5 * Why can accuracy be misleading for imbalanced datasets?
Level: Intermediate
Expected answer
In imbalanced datasets, a model can achieve high accuracy by predicting the majority class only:
  • Accuracy ignores class distribution and costs of different errors.
  • Metrics like precision, recall, F1, and ROC/PR curves are more informative.
  • For rare events (fraud, disease), recall and precision are critical.
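A toy demonstration of the majority-class trap (assuming scikit-learn):

```python
# A majority-class "model" looks great on accuracy, terrible on recall/F1.
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, f1_score

y_true = np.array([0] * 990 + [1] * 10)  # 1% positive class (e.g., fraud)
y_pred = np.zeros_like(y_true)           # always predict the majority class
print(accuracy_score(y_true, y_pred))                 # 0.99 -- looks impressive
print(recall_score(y_true, y_pred))                   # 0.0 -- misses every fraud case
print(f1_score(y_true, y_pred, zero_division=0))      # 0.0
```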
Follow‑up questions
  • When would you optimize for recall over precision?
  • How do you choose a decision threshold?
  • How do ROC and PR curves differ in interpretation?
Evaluation rubric
Strong
Explains majority‑class issue, suggests better metrics, and ties to real‑world examples.
OK
Knows accuracy is "not good" but doesn't articulate why or what to use instead.
Weak
Treats accuracy as always sufficient.
Deep Learning
Neural networks, training dynamics, architectures * intermediate -> advanced
8+ questions
Q1 * Explain backpropagation in neural networks.
Level: Intermediate
Expected answer
Backpropagation computes gradients of the loss with respect to weights using the chain rule:
  • Forward pass: compute outputs and loss.
  • Backward pass: propagate gradients layer by layer.
  • Optimizer (SGD, Adam) updates weights using these gradients.
It enables efficient training of deep networks by reusing intermediate activations.
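A strong candidate can sketch the mechanics by hand. A teaching-only example for a one-hidden-layer network (NumPy; real code would use an autograd framework):

```python
# Hand-rolled forward/backward pass for a tiny 1-hidden-layer network (NumPy only).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4)); y = rng.normal(size=(32, 1))
W1 = rng.normal(size=(4, 8)) * 0.1; W2 = rng.normal(size=(8, 1)) * 0.1

for step in range(100):
    # Forward pass: cache intermediate activations for reuse in the backward pass.
    h = np.maximum(0, X @ W1)           # ReLU hidden layer
    y_hat = h @ W2
    loss = np.mean((y_hat - y) ** 2)    # MSE loss
    # Backward pass: chain rule, layer by layer.
    d_yhat = 2 * (y_hat - y) / len(X)
    dW2 = h.T @ d_yhat
    dh = d_yhat @ W2.T
    dW1 = X.T @ (dh * (h > 0))          # ReLU gradient mask
    # SGD update.
    W1 -= 0.1 * dW1; W2 -= 0.1 * dW2
print(loss)  # should have decreased from the first iteration
```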
Follow‑up questions
  • Why do we need non‑linear activation functions?
  • What happens if activations saturate?
  • How does batch size affect gradient estimates?
Evaluation rubric
Strong
Describes forward/backward passes, gradients, and optimizer roles clearly.
OK
High‑level idea only; lacks detail on chain rule or gradient flow.
Weak
Confuses backprop with generic "feedback" or trial‑and‑error.
Q2 * What causes vanishing and exploding gradients?
Level: Intermediate
Expected answer
In deep networks, repeated multiplication of gradients through layers can:
  • Shrink towards zero (vanishing) with certain activations (sigmoid, tanh).
  • Grow very large (exploding) with poor initialization or deep stacks.
Mitigations include ReLU‑like activations, careful initialization, residual connections, and normalization.
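The core intuition fits in a few lines (NumPy-only sketch):

```python
# Multiplying many small Jacobian factors shrinks the gradient exponentially.
import numpy as np

def sigmoid_grad(x):
    s = 1 / (1 + np.exp(-x))
    return s * (1 - s)  # maxes out at 0.25

grad = 1.0
for layer in range(30):
    grad *= sigmoid_grad(0.0)  # best case: 0.25 per layer
print(grad)  # ~8.7e-19 after 30 layers -- effectively zero
```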
Follow‑up questions
  • How do residual connections help?
  • Why are LSTMs more robust than vanilla RNNs?
  • What role does gradient clipping play?
Evaluation rubric
Strong
Explains the math intuition and lists multiple mitigation strategies with examples.
OK
Knows it "happens in deep networks" but not why or how to fix it properly.
Weak
No clear understanding of gradient behavior in deep nets.
Q3 * Compare CNNs, RNNs, and Transformers.
Level: Intermediate
Expected answer
  • CNNs: local receptive fields, weight sharing; great for images and spatial data.
  • RNNs: sequential processing with hidden state; good for sequences but hard to parallelize.
  • Transformers: self‑attention, fully parallel, capture long‑range dependencies efficiently.
Transformers have largely replaced RNNs in NLP due to better scaling and performance.
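To ground the comparison, a minimal single-head self-attention in NumPy: every position attends to every other position in one parallel step, unlike an RNN's sequential loop.

```python
# Minimal single-head self-attention (NumPy only; no masking or multi-head).
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
    return weights @ V                               # weighted mix of values

rng = np.random.default_rng(0)
seq_len, d = 5, 8
X = rng.normal(size=(seq_len, d))
out = self_attention(X, *(rng.normal(size=(d, d)) for _ in range(3)))
print(out.shape)  # (5, 8): one context-aware vector per position
```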
Follow‑up questions
  • Why is self‑attention more flexible than fixed convolution kernels?
  • When might you still use CNNs today?
  • How do Transformers handle very long sequences?
Evaluation rubric
Strong
Clear comparison with use cases and trade‑offs; mentions parallelism and long‑range context.
OK
Knows basic differences but not why Transformers dominate modern NLP/CV tasks.
Weak
Confuses architectures or gives very shallow distinctions.
Q4 * What is batch normalization and why is it used?
Level: Intermediate
Expected answer
Batch normalization normalizes activations within a mini‑batch:
  • Originally motivated as reducing internal covariate shift (the exact mechanism is still debated).
  • Stabilizes and speeds up training.
  • Allows higher learning rates and can have a regularization effect.
In Transformers, LayerNorm is more common because it normalizes each token independently and does not depend on batch statistics.
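A sketch of the training-time forward pass (NumPy only; at inference, running statistics replace the per-batch statistics):

```python
# Batch-norm forward pass at training time (NumPy only).
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)                    # per-feature mean over the mini-batch
    var = x.var(axis=0)                    # per-feature variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize to zero mean, unit variance
    return gamma * x_hat + beta            # learnable scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 10))  # shifted, scaled activations
out = batch_norm(x, gamma=np.ones(10), beta=np.zeros(10))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))  # ~0 and ~1
```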
Follow‑up questions
  • Why is LayerNorm preferred in Transformers?
  • What issues arise with very small batch sizes?
  • How does batch norm interact with dropout?
Evaluation rubric
Strong
Explains normalization, training stability, and mentions alternatives like LayerNorm/GroupNorm.
OK
Knows it "helps training" but not the mechanics or trade‑offs.
Weak
No clear understanding of normalization layers.
MLOps
Production ML, pipelines, monitoring, CI/CD * intermediate -> advanced
10+ questions
Q1 * What is MLOps and how is it different from traditional ML?
Level: Intermediate
Expected answer
MLOps focuses on operationalizing ML models:
  • End‑to‑end lifecycle: data, training, deployment, monitoring, retraining.
  • Emphasizes reliability, reproducibility, automation, and collaboration.
  • Bridges ML with DevOps practices (CI/CD, infra as code, observability).
Traditional ML often stops at model training and offline evaluation.
Follow‑up questions
  • What are the main challenges when moving from notebook to production?
  • How do you structure teams around MLOps?
  • What tools have you used for MLOps?
Evaluation rubric
Strong
Clear distinction between experimentation and production; mentions lifecycle, automation, and tooling.
OK
High‑level definition without concrete practices or examples.
Weak
Treats MLOps as just "deploying models with Docker".
Q2 * What is a feature store and why is it important?
Level: Intermediate
Expected answer
A feature store is a centralized system for managing ML features:
  • Stores feature definitions, values, and metadata.
  • Serves features consistently for training and online inference.
  • Helps prevent training/serving skew and duplicate feature logic.
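The skew-prevention idea can be illustrated without any specific product. A hypothetical sketch (`days_since_signup` is an invented feature): a feature store centralizes definitions like this so the offline and online paths cannot diverge.

```python
# Hypothetical sketch, not a real feature-store API: training and serving
# call the *same* feature definition, so the logic cannot drift apart.
from datetime import datetime, timezone

def days_since_signup(signup_ts: datetime, now: datetime) -> float:
    """Single source of truth for this feature."""
    return (now - signup_ts).total_seconds() / 86_400

# Offline: build training rows from historical timestamps (no leakage from "now").
train_row = {"days_since_signup": days_since_signup(
    datetime(2024, 1, 1, tzinfo=timezone.utc),
    datetime(2024, 6, 1, tzinfo=timezone.utc))}

# Online: the serving path reuses the identical function at request time.
serve_row = {"days_since_signup": days_since_signup(
    datetime(2024, 1, 1, tzinfo=timezone.utc),
    datetime.now(timezone.utc))}
```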
Follow‑up questions
  • How does a feature store integrate with batch and streaming data?
  • What problems arise without a feature store?
  • Have you used any feature store tools (Feast, Tecton, etc.)?
Evaluation rubric
Strong
Explains consistency, reuse, and skew prevention with concrete examples of usage.
OK
Knows it "stores features" but not why it matters for production ML.
Weak
No clear understanding of feature management challenges.
Q3 * Explain data drift vs model drift. How do you detect them?
Level: Intermediate
Expected answer
  • Data drift: input distribution changes over time (e.g., new user behavior).
  • Model drift: model performance degrades, even if inputs look similar.
  • Detection: monitor feature distributions, PSI, performance metrics, and business KPIs.
Retraining, recalibration, or model replacement may be needed depending on the cause.
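A minimal PSI check for a single numeric feature (NumPy-only sketch; the 0.2 threshold is a common rule of thumb, not a standard):

```python
# Population Stability Index (PSI) for one feature (NumPy only).
import numpy as np

def psi(expected, actual, bins=10, eps=1e-4):
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))  # bin on reference data
    edges[0], edges[-1] = -np.inf, np.inf   # catch values outside the reference range
    e_pct = np.histogram(expected, edges)[0] / len(expected) + eps
    a_pct = np.histogram(actual, edges)[0] / len(actual) + eps
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
train_feature = rng.normal(0, 1, 10_000)     # distribution at training time
live_feature = rng.normal(0.5, 1.2, 10_000)  # shifted production traffic
print(psi(train_feature, live_feature))  # rule of thumb: > 0.2 suggests significant drift
```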
Follow‑up questions
  • How would you set thresholds for drift alerts?
  • What's your strategy for safe retraining?
  • How do you handle concept drift in streaming systems?
Evaluation rubric
Strong
Distinguishes data vs model drift clearly and proposes concrete monitoring strategies.
OK
Knows drift exists but not how to detect or respond systematically.
Weak
No clear concept of drift or its impact on production systems.
Q4 * How would you design CI/CD for ML models?
Level: Advanced
Expected answer
CI/CD for ML extends software CI/CD with ML‑specific steps:
  • CI: unit tests, data validation, training pipeline tests, reproducibility checks.
  • CD: automated deployment to staging, canary or shadow deployments, rollback strategies.
  • Model registry, versioning, and approval workflows for promotion to production.
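Illustrative examples of ML-specific CI gates (the names, data, and thresholds are hypothetical; a real pipeline would run these before promoting a model):

```python
# Hypothetical promotion gates for an ML CI pipeline (assumes scikit-learn).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def check_data_quality(X, y):
    assert not np.isnan(X).any(), "nulls in training data"        # data validation gate
    assert set(np.unique(y)) == {0, 1}, "unexpected label values"

def check_model_quality(model, X_val, y_val, min_auc=0.75):
    auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])  # quality gate vs baseline
    assert auc >= min_auc, f"AUC {auc:.3f} below promotion threshold"

def check_reproducibility(X, y):
    m1 = LogisticRegression(random_state=42).fit(X, y)  # same seed twice
    m2 = LogisticRegression(random_state=42).fit(X, y)  # should give identical weights
    assert np.allclose(m1.coef_, m2.coef_)

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5)); y = (X[:, 0] + rng.normal(0, 0.5, 400) > 0).astype(int)
check_data_quality(X, y)
model = LogisticRegression().fit(X[:300], y[:300])
check_model_quality(model, X[300:], y[300:])
check_reproducibility(X, y)
print("all promotion gates passed")
```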
Follow‑up questions
  • What tests are unique to ML compared to standard software?
  • How do you handle model rollbacks?
  • How would you integrate data quality checks into the pipeline?
Evaluation rubric
Strong
Describes a full pipeline with tests, staging, canary/shadow, registry, and monitoring hooks.
OK
Talks about "deploying models with CI/CD" but lacks ML‑specific steps or safeguards.
Weak
No understanding of how CI/CD changes for ML workloads.
ML Coding Tasks (Python)
Hands‑on ML and DL exercises * intermediate -> advanced
6+ tasks
Q1 * Implement logistic regression from scratch.
Level: Intermediate
Expected answer
Candidate should outline:
  • Sigmoid function for probabilities.
  • Binary cross‑entropy loss.
  • Gradient descent or mini‑batch gradient descent updates.
  • Convergence criteria and evaluation on a validation set.
Exact syntax is less important than a correct mathematical and implementation flow.
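One possible reference solution (NumPy only; `fit_logreg` and its defaults are illustrative):

```python
# Logistic regression via gradient descent (NumPy only).
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -30, 30)))  # clip to avoid overflow

def fit_logreg(X, y, lr=0.1, epochs=500, l2=0.0):
    w = np.zeros(X.shape[1]); b = 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)                      # predicted probabilities
        grad_w = X.T @ (p - y) / len(y) + l2 * w    # gradient of BCE (+ optional L2)
        grad_b = np.mean(p - y)
        w -= lr * grad_w; b -= lr * grad_b
    return w, b

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.5, 500) > 0).astype(float)
w, b = fit_logreg(X, y)
print(np.mean((sigmoid(X @ w + b) > 0.5) == y))  # accuracy, should be well above chance
```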
Follow‑up questions
  • How would you add L2 regularization?
  • How do you handle numerical stability in the sigmoid?
  • How would you extend this to multi‑class classification?
Evaluation rubric
Strong
Correct loss, gradients, and update loop; mentions regularization and stability concerns.
OK
Understands high‑level idea but struggles with gradient derivation or implementation details.
Weak
Cannot outline a working training loop or loss function.
Q2 * Write a function to compute F1 score given y_true and y_pred.
Level: Intermediate
Expected answer
Candidate should:
  • Compute TP, FP, FN from predictions and labels.
  • Compute precision = TP / (TP + FP), recall = TP / (TP + FN).
  • Compute F1 = 2 * precision * recall / (precision + recall).
  • Handle edge cases (division by zero).
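One possible reference solution (NumPy for convenience; pure Python works just as well):

```python
# Binary F1 from scratch, with division-by-zero guards (NumPy only).
import numpy as np

def f1_score(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # guard: no positive predictions
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # guard: no positive labels
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1_score([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))  # 2 TP, 1 FP, 1 FN -> F1 ≈ 0.667
```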
Follow‑up questions
  • How would you extend this to multi‑class (macro vs micro F1)?
  • When is F1 more useful than accuracy?
  • How would you choose a threshold to maximize F1?
Evaluation rubric
Strong
Correct implementation and understanding of precision/recall trade‑offs and edge cases.
OK
Knows formula but may miss edge cases or multi‑class extensions.
Weak
Confuses F1 with accuracy or other metrics.