Interview Prep * ML * Deep Learning * MLOps

Crack Your Next ML Interview

400+ questions that actually get asked at top AI/ML companies -- with model answers, follow-ups, and a self-score rubric. Practice like the role is already yours.


Machine Learning Fundamentals

Core ML concepts, metrics, generalization * beginner -> intermediate

10+ questions

Q1 * Explain the bias-variance tradeoff.

Level: Beginner

Expected answer

The bias-variance tradeoff describes how model complexity affects generalization:

High bias -> underfitting (model too simple, misses patterns).
High variance -> overfitting (model too complex, memorizes noise).
The goal is to find the balance that minimizes total error on unseen data.

Regularization, model choice, and data size all influence this tradeoff.
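
For squared-error loss the tradeoff can be written down explicitly. The decomposition below is the standard textbook form, assuming data generated as y = f(x) + ε with zero-mean noise of variance σ² and a fitted model f̂ (this notation is introduced here for illustration and is not part of the original answer):

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

The expectation is taken over training sets and noise. Making the model more flexible typically shrinks the first term and grows the second.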

Follow‑up questions

How does regularization affect bias and variance?
Give an example of a high‑bias model and a high‑variance model.
How would you detect overfitting in practice?

Evaluation rubric

Strong

Clearly defines bias and variance, explains under/overfitting, and connects to regularization.

Mentions under/overfitting but not how to control the tradeoff.

Weak

Vague explanation; confuses bias with data bias or fairness only.

Q2 * What is regularization and why is it used?

Level: Beginner

Expected answer

Regularization adds a penalty term to the loss function to discourage overly complex models:

L2 (Ridge): penalizes squared weights, encourages small but non‑zero weights.
L1 (Lasso): penalizes absolute weights, encourages sparsity and feature selection.
Reduces overfitting by controlling model capacity.
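
A minimal sketch of how the two penalties enter a loss, assuming a linear model without intercept and mean squared error as the data term. The function names and the strength parameter `lam` are illustrative choices, not part of the original answer.

```python
def mse(weights, rows, targets):
    """Mean squared error of a linear model without an intercept."""
    total = 0.0
    for features, target in zip(rows, targets):
        prediction = sum(w * x for w, x in zip(weights, features))
        total += (prediction - target) ** 2
    return total / len(rows)


def l2_penalty(weights, lam):
    """Ridge term: lam * sum of squared weights."""
    return lam * sum(w * w for w in weights)


def l1_penalty(weights, lam):
    """Lasso term: lam * sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)


def regularized_loss(weights, rows, targets, lam, kind="l2"):
    """Data loss plus the chosen penalty; lam = 0 recovers plain MSE."""
    penalty = l2_penalty if kind == "l2" else l1_penalty
    return mse(weights, rows, targets) + penalty(weights, lam)


rows = [[1.0, 2.0], [2.0, 0.0]]
targets = [5.0, 2.0]
weights = [1.0, 2.0]  # fits the toy data exactly, so the MSE term is 0
ridge = regularized_loss(weights, rows, targets, lam=0.1, kind="l2")
lasso = regularized_loss(weights, rows, targets, lam=0.1, kind="l1")
```

Because the data term is zero here, `ridge` and `lasso` show the penalties in isolation: even a perfect fit is charged for large weights.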

Follow‑up questions

When would you prefer L1 over L2?
How does regularization interact with feature scaling?
What happens if the regularization strength is too high?

Evaluation rubric

Strong

Explains L1 vs L2, connects to overfitting and model complexity, mentions trade‑offs.

Knows it "prevents overfitting" but not how or why different types matter.

Weak

Treats regularization as a generic "tuning trick" with no detail.

Q3 * What is cross‑validation and why is it useful?

Level: Beginner

Expected answer

Cross‑validation splits data into multiple folds to estimate how a model generalizes:

Train on k‑1 folds, validate on the remaining fold; repeat for all folds.
Gives a lower-variance estimate of generalization performance than a single train/validation split.
Stratified CV is important for classification with imbalanced classes.
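
The fold rotation described above can be sketched in a few lines. This is a simplified illustration that assumes contiguous, unshuffled folds and no stratification; `k_fold_indices` is a hypothetical helper, not a library function.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous, near-equal folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
```

Every sample lands in exactly one validation fold, and the k validation scores are then averaged.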

Follow‑up questions

When would you avoid k‑fold CV (e.g., time series)?
How do you adapt CV for time‑dependent data?
How does CV interact with hyperparameter tuning?

Evaluation rubric

Strong

Describes k‑fold clearly, mentions stratification and limitations (e.g., time series).

Knows the basic idea but not when to use different variants.

Weak

Confuses CV with simple train/test split.

Q4 * What is the curse of dimensionality?

Level: Intermediate

Expected answer

The curse of dimensionality refers to phenomena that arise in high‑dimensional spaces:

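Distances between points become less meaningful: nearest and farthest neighbours end up almost equally far away.
Data becomes sparse; the amount of data needed to cover the space grows exponentially with the number of dimensions.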
Many algorithms (KNN, clustering) degrade as dimensions grow.

Dimensionality reduction (PCA, feature selection) helps mitigate this.
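
The distance effect is easy to observe empirically. The sketch below assumes points drawn uniformly from the unit hypercube and measures the relative gap between the farthest and nearest neighbour of one query point; the function name, sample size, and seed are arbitrary illustrative choices.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from one query to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.dist(query, point))
    return (max(distances) - min(distances)) / min(distances)


low_dim = relative_contrast(dim=2)
high_dim = relative_contrast(dim=500)
```

In two dimensions the nearest neighbour is far closer than the farthest one, so the contrast is large. In 500 dimensions the distances concentrate and the contrast collapses, which is what hurts KNN and clustering.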

Follow‑up questions

How does PCA help with high‑dimensional data?
When would you prefer feature selection over PCA?
How does the curse of dimensionality affect KNN?

Evaluation rubric

Strong

Explains sparsity, distance issues, and connects to specific algorithms and mitigation methods.

Gives a high‑level definition without concrete implications or examples.

Weak

Very vague; no connection to model performance.

Q5 * Why can accuracy be misleading for imbalanced datasets?

Level: Intermediate

Expected answer

In imbalanced datasets, a model can achieve high accuracy by predicting the majority class only:

Accuracy ignores class distribution and costs of different errors.
Metrics like precision, recall, F1, and ROC/PR curves are more informative.
For rare events (fraud, disease), recall and precision are critical.
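
A small worked example makes the majority-class problem concrete. The dataset below is hypothetical (990 negatives, 10 positives), and the "model" simply predicts the majority class every time.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that were predicted positive."""
    true_pos = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == positive for t in y_true)
    return true_pos / actual_pos if actual_pos else 0.0


# Hypothetical dataset: 990 negatives, 10 positives (1% prevalence).
y_true = [0] * 990 + [1] * 10
y_majority = [0] * 1000  # always predicts the majority class

acc = accuracy(y_true, y_majority)
rec = recall(y_true, y_majority)
```

Here `acc` is 0.99 while `rec` is 0.0: the classifier looks excellent by accuracy and never finds a single positive case.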

Follow‑up questions

When would you optimize for recall over precision?
How do you choose a decision threshold?
How do ROC and PR curves differ in interpretation?

Evaluation rubric

Strong

Explains majority‑class issue, suggests better metrics, and ties to real‑world examples.

Knows accuracy is "not good" but doesn't articulate why or what to use instead.

Weak

Treats accuracy as always sufficient.

Q6 · Explain gradient descent and its main variants.

Level: Beginner

Expected answer

Gradient descent minimises a loss function by iteratively updating parameters in the direction of steepest descent:

Batch GD: uses all training examples per update; stable but slow on large datasets.
Stochastic GD (SGD): one example per update; cheap, noisy updates whose noise can help escape shallow local minima and saddle points.
Mini-batch GD: a compromise using small batches (typically 32–256); the most common choice in practice.
Adam: combines momentum with per-parameter adaptive learning rates; often converges faster and is less sensitive to the initial learning rate, though it still needs tuning.

The learning rate controls step size: too large causes divergence, too small causes slow convergence.
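
The effect of step size can be shown on a one-dimensional toy problem. This sketch assumes the loss f(w) = (w - 3)^2, chosen only because its gradient and minimum are obvious; it is plain gradient descent, not any of the stochastic variants.

```python
def gradient_descent(lr, steps=50, w0=0.0):
    """Minimise f(w) = (w - 3)^2 with plain gradient descent."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)  # df/dw
        w -= lr * grad
    return w


w_good = gradient_descent(lr=0.1)  # error shrinks by a factor of 0.8 per step
w_bad = gradient_descent(lr=1.1)   # error grows by a factor of 1.2 per step
```

Each update multiplies the distance to the minimum by (1 - 2 * lr), so any learning rate above 1.0 makes this particular problem diverge.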

Follow‑up questions

What is the intuition behind momentum in optimisers?
When would you choose SGD over Adam?
How do learning rate schedules (cosine decay, warmup) help training?

Evaluation rubric

Strong

Explains all variants, discusses learning rate sensitivity, mentions Adam internals (m/v moments), compares when each is preferred.

Knows batch vs SGD vs mini-batch and that Adam adapts learning rates.

Weak

Only knows 'gradient descent updates weights to minimise loss' with no variant knowledge.

Q7 · What is feature engineering and which techniques do you commonly use?

Level: Intermediate

Expected answer

Feature engineering transforms raw data into representations that make models more effective:

Encoding: one-hot, ordinal, or target encoding for categoricals.
Scaling: standardisation (z-score) and min-max normalisation; critical for distance-based and gradient-based models.
Binning / discretisation: converts continuous features to ordinal buckets; can reduce noise at the cost of resolution.
Interaction features: products or ratios of existing features to capture non-linear relationships.
Temporal features: hour, day-of-week, and lag features extracted from timestamps.
Dimensionality reduction: PCA or UMAP to reduce redundancy; t-SNE is mainly a visualisation tool because it does not provide a transform for unseen data.

Andrew Ng: “Coming up with features is difficult, time-consuming, requires expert knowledge.”
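
Two of the techniques above, scaling and one-hot encoding, can be sketched with the standard library alone. The helper names and toy values are illustrative assumptions. The point worth noticing is that the scaler's statistics come from the training split only, which is one simple guard against leakage.

```python
import statistics


def fit_standardizer(train_values):
    """Learn mean and standard deviation from the training split only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values) or 1.0  # guard against constant features
    return mean, std


def standardize(values, mean, std):
    """Apply a previously fitted z-score transform."""
    return [(v - mean) / std for v in values]


def one_hot(values, categories):
    """Encode each value as a 0/1 vector; unseen categories become all zeros."""
    return [[1 if v == c else 0 for c in categories] for v in values]


train_age, test_age = [20.0, 30.0, 40.0], [50.0]
mean, std = fit_standardizer(train_age)
test_scaled = standardize(test_age, mean, std)

colours = one_hot(["red", "blue", "green"], categories=["red", "blue"])
```

Mapping an unseen category ("green") to an all-zero vector is one possible design choice; a dedicated "other" bucket is a common alternative.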

Follow‑up questions

How do you handle high-cardinality categorical variables?
When is feature scaling not necessary?
What is target leakage and how do you prevent it?

Evaluation rubric

Strong

Covers encoding, scaling, interactions, temporal features, and discusses leakage prevention. Mentions AutoML/feature stores.

Names a few techniques but can't discuss trade-offs or pitfalls like leakage.

Weak

Only mentions 'one-hot encoding' with no broader understanding.

Q8 · Compare bagging and boosting. Give examples of each.

Level: Intermediate

Expected answer

Both are ensemble methods that combine many base learners:

Bagging (Bootstrap Aggregating): trains learners independently on bootstrap samples and combines them by averaging or voting. Mainly reduces variance, so it suits low-bias, high-variance learners such as deep trees. Example: Random Forest.
Boosting: trains weak learners sequentially, each focusing on the errors of the ensemble so far. Mainly reduces bias. Examples: AdaBoost, XGBoost, LightGBM.

Key trade-offs:

Bagging: parallelisable, fairly robust to outliers, good when high variance is the problem.
Boosting: often better accuracy, but sensitive to noisy data and can overfit.
XGBoost adds L1/L2 regularisation to gradient boosting, which helps control that overfitting.

Follow‑up questions

Why does Random Forest reduce variance but not bias?
What is gradient boosting and how does it differ from AdaBoost?
When would you prefer a single decision tree over an ensemble?

Evaluation rubric

Strong

Clearly distinguishes bias vs variance reduction, explains gradient boosting mechanics, discusses XGBoost regularisation.

Knows RF = bagging and XGBoost = boosting but can't explain the variance/bias mechanism.

Weak

Can name examples but cannot explain why they work differently.

Q9 · How do you handle class imbalance in a classification problem?

Level: Intermediate

Expected answer

Class imbalance (e.g., fraud detection: 0.1% positive) degrades model performance on the minority class:

Resampling: oversample the minority class (e.g., SMOTE) or undersample the majority class; apply it to the training split only.
Class weights: pass class_weight='balanced' to scikit-learn estimators that support it; the effect is similar to oversampling by duplication, though not identical for every algorithm.
Threshold tuning: lower the decision threshold to increase recall on the minority class, at the cost of precision.
Evaluation metrics: use F1, PR-AUC, or MCC instead of accuracy, which is misleading under imbalance.
Algorithm choice: gradient-boosted trees (e.g., XGBoost with scale_pos_weight) offer built-in weighting, but no model family is immune; logistic regression can be weighted in the same way.
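
Class weighting and threshold tuning can both be illustrated without a trained model. The sketch below reproduces the "balanced" heuristic, n_samples / (n_classes * class_count), and applies two thresholds to a handful of made-up probabilities. The helper names and the scores are assumptions for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def predict_with_threshold(probabilities, threshold=0.5):
    """Turn positive-class probabilities into 0/1 labels."""
    return [1 if p >= threshold else 0 for p in probabilities]


labels = [0] * 990 + [1] * 10
weights = balanced_class_weights(labels)

probs = [0.05, 0.30, 0.45, 0.80]  # hypothetical model scores
default_preds = predict_with_threshold(probs)
lowered_preds = predict_with_threshold(probs, threshold=0.25)
```

With 1% positives the minority class receives a weight of 50 against roughly 0.5 for the majority. Lowering the threshold flags more cases as positive, which raises recall and usually lowers precision.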

Follow‑up questions

When would you use SMOTE vs class weighting?
Why is accuracy a misleading metric for imbalanced datasets?
How do you evaluate a model when positive class prevalence is 0.1%?

Evaluation rubric

Strong

Covers multiple strategies, knows when to use each, explains why accuracy fails, and discusses PR-AUC vs ROC-AUC for imbalanced data.

Knows SMOTE and class weights but can't explain when to use which.

Weak

Only says 'use more data' or 'use F1 score' without deeper explanation.

Q10 · Explain cross-validation strategies and when to deviate from k-fold.

Level: Intermediate

Expected answer

Cross-validation estimates generalisation error by splitting data into train/validation folds:

k-fold: the standard choice; splits data into k roughly equal folds and trains k times. Typical k = 5 or 10.
Stratified k-fold: preserves class distribution per fold; essential for imbalanced classification.
Time-series CV: always train on the past and validate on the future (expanding or rolling window). Shuffled k-fold on time series leaks future information into training.
Group k-fold: ensures samples from the same group (e.g., patient) are not split across folds; prevents leakage in grouped data.
Leave-One-Out (LOO): k = n; low bias but high variance and computationally expensive; use only on very small datasets.
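
The time-series variant is the one candidates most often get wrong, so a sketch may help. It assumes samples are already sorted by time and uses an expanding training window; `expanding_window_splits` and its parameters are hypothetical names, not a library API.

```python
def expanding_window_splits(n_samples, n_splits, val_size):
    """Yield (train_idx, val_idx) where every validation block follows its training data."""
    first_val_start = n_samples - n_splits * val_size
    if first_val_start <= 0:
        raise ValueError("not enough samples for the requested splits")
    for i in range(n_splits):
        val_start = first_val_start + i * val_size
        train_idx = list(range(0, val_start))  # everything before the validation block
        val_idx = list(range(val_start, val_start + val_size))
        yield train_idx, val_idx


splits = list(expanding_window_splits(n_samples=10, n_splits=3, val_size=2))
```

Each successive split trains on more history and validates on the block that immediately follows, so no future observation ever reaches the training set.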

Follow‑up questions

Why does standard k-fold fail for time series data?
What is nested cross-validation and when is it needed?
How do you choose k in k-fold cross-validation?

Evaluation rubric

Strong

Knows all variants, correctly identifies failure modes (time series, grouped data), explains nested CV for hyperparameter tuning.

Knows k-fold and stratified k-fold but unsure about time-series or grouped CV.

Weak

Thinks cross-validation is just 'splitting data into train and test'.

Deep Learning

Neural networks, training dynamics, architectures * intermediate -> advanced

10 questions

Q1 * Explain backpropagation in neural networks.

Level: Intermediate

Expected answer

Backpropagation computes gradients of the loss with respect to weights using the chain rule:

Forward pass: compute outputs and loss.
Backward pass: propagate gradients layer by layer.
Optimizer (SGD, Adam) updates weights using these gradients.

It enables efficient training of deep networks by reusing intermediate activations.

Follow‑up questions

Why do we need non‑linear activation functions?
What happens if activations saturate?
How does batch size affect gradient estimates?

Evaluation rubric

Strong

Describes forward/backward passes, gradients, and optimizer roles clearly.

High‑level idea only; lacks detail on chain rule or gradient flow.

Weak

Confuses backprop with generic "feedback" or trial‑and‑error.

Q2 * What causes vanishing and exploding gradients?

Level: Intermediate

Expected answer

In deep networks, repeated multiplication of gradients through layers can:

Shrink towards zero (vanishing) with certain activations (sigmoid, tanh).
Grow very large (exploding) with poor initialization or deep stacks.

Mitigations include ReLU‑like activations, careful initialization, residual connections, and normalization.
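
The vanishing case can be quantified for sigmoid. The sketch below deliberately ignores the weight matrices and looks only at the activation derivatives along a chain of layers, which is a simplifying assumption; the function names are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # peaks at 0.25 when z == 0


def activation_gradient_product(pre_activations):
    """Product of sigmoid derivatives along a chain of layers (weights ignored)."""
    product = 1.0
    for z in pre_activations:
        product *= sigmoid_grad(z)
    return product


best_case_10_layers = activation_gradient_product([0.0] * 10)  # equals 0.25 ** 10
saturated_10_layers = activation_gradient_product([4.0] * 10)
```

Even in the best case the activation term scales the gradient by 0.25 per layer, which is below one in a million after ten layers. Saturated units make it far smaller still.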

Follow‑up questions

How do residual connections help?
Why are LSTMs more robust than vanilla RNNs?
What role does gradient clipping play?

Evaluation rubric

Strong

Explains the math intuition and lists multiple mitigation strategies with examples.

Knows it "happens in deep networks" but not why or how to fix it properly.

Weak

No clear understanding of gradient behavior in deep nets.

Q3 * Compare CNNs, RNNs, and Transformers.

Level: Intermediate

Expected answer

CNNs: local receptive fields, weight sharing; great for images and spatial data.
RNNs: sequential processing with hidden state; good for sequences but hard to parallelize.
Transformers: self-attention, parallel across positions during training; capture long-range dependencies directly, at a cost quadratic in sequence length.

Transformers have largely replaced RNNs in NLP due to better scaling and performance.

Follow‑up questions

Why is self‑attention more flexible than fixed convolution kernels?
When might you still use CNNs today?
How do Transformers handle very long sequences?

Evaluation rubric

Strong

Clear comparison with use cases and trade‑offs; mentions parallelism and long‑range context.

Knows basic differences but not why Transformers dominate modern NLP/CV tasks.

Weak

Confuses architectures or gives very shallow distinctions.

Q4 * What is batch normalization and why is it used?

Level: Intermediate

Expected answer

Batch normalization normalizes activations within a mini‑batch:

Originally motivated as reducing internal covariate shift; later work attributes much of the benefit to a smoother loss landscape.
Stabilizes and speeds up training.
Allows higher learning rates and can have a regularization effect.

In Transformers, LayerNorm is more common because it does not depend on batch statistics, which suits variable-length sequences and small batches.

Follow‑up questions

Why is LayerNorm preferred in Transformers?
What issues arise with very small batch sizes?
How does batch norm interact with dropout?

Evaluation rubric

Strong

Explains normalization, training stability, and mentions alternatives like LayerNorm/GroupNorm.

Knows it "helps training" but not the mechanics or trade‑offs.

Weak

No clear understanding of normalization layers.

Q6 · Explain backpropagation. How are gradients computed?

Level: Intermediate

Expected answer

Backpropagation applies the chain rule to compute gradients of the loss with respect to each parameter:

Forward pass: compute predictions and loss, caching intermediate activations.
Backward pass: propagate gradients from the output layer back to the input, layer by layer.
Chain rule: ∂L/∂w = (∂L/∂a) × (∂a/∂z) × (∂z/∂w), where z is a layer's pre-activation and a its activation.
Vanishing gradients: gradients shrink as they propagate back through many layers (especially with sigmoid). Mitigated by ReLU, batch norm, and residual connections.
Exploding gradients: gradients grow uncontrollably. Mitigated by gradient clipping and careful initialisation.
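
The chain rule above can be checked numerically on the smallest possible network. This sketch assumes a single sigmoid neuron with z = w * x + b and the loss L = 0.5 * (a - y)^2, both chosen for illustration, and compares the analytic gradient with a finite-difference estimate.

```python
import math


def forward(w, b, x, y):
    z = w * x + b                   # pre-activation
    a = 1.0 / (1.0 + math.exp(-z))  # sigmoid activation
    loss = 0.5 * (a - y) ** 2       # squared-error loss
    return z, a, loss


def backward(w, b, x, y):
    """Apply dL/dw = dL/da * da/dz * dz/dw for a single sigmoid neuron."""
    _, a, _ = forward(w, b, x, y)
    dL_da = a - y
    da_dz = a * (1.0 - a)
    dz_dw = x
    return dL_da * da_dz * dz_dw


def numerical_grad(w, b, x, y, eps=1e-6):
    """Central finite difference, used only to sanity-check the analytic gradient."""
    loss_plus = forward(w + eps, b, x, y)[2]
    loss_minus = forward(w - eps, b, x, y)[2]
    return (loss_plus - loss_minus) / (2 * eps)


analytic = backward(w=0.5, b=0.1, x=2.0, y=1.0)
numeric = numerical_grad(w=0.5, b=0.1, x=2.0, y=1.0)
```

The two values agree to several decimal places. This kind of gradient check is a standard way to debug a hand-written backward pass.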

Follow‑up questions

Why does ReLU help with vanishing gradients compared to sigmoid?
What is gradient clipping and when do you use it?
How do residual connections (ResNet) ease backpropagation in deep networks?

Evaluation rubric

Strong

Explains chain rule correctly, identifies vanishing/exploding gradient problems, knows the fixes (ReLU, BatchNorm, residuals, clipping).

Understands the forward-backward structure but struggles with the chain rule mechanics or gradient pathologies.

Weak

Describes backprop as 'adjusting weights based on error' without any mathematical intuition.

Q7 · What is batch normalisation and why does it help?

Level: Intermediate

Expected answer

Batch normalisation normalises layer inputs across a mini-batch to have zero mean and unit variance, then applies learned scale (γ) and shift (β):

Reduces internal covariate shift: the original motivation, i.e. the distribution of layer inputs changes less during training. Later work attributes much of the benefit to a smoother loss landscape.
Allows higher learning rates and reduces sensitivity to weight initialisation.
Acts as a mild regulariser, because batch statistics add noise; this can slightly reduce the need for dropout.
Placement: applied before the activation in the original paper; some implementations place it after.
Layer norm: normalises across features within each example instead of across the batch; preferred for transformers, where batch size is small or variable.
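
The training-time computation can be sketched for a single feature. The block assumes a plain Python list as the mini-batch and fixed values for the learned parameters; it covers training mode only, not the running statistics used at inference.

```python
import math


def batch_norm_1d(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise one feature across a mini-batch, then scale and shift."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)  # biased variance, training mode
    normalised = [(x - mean) / math.sqrt(var + eps) for x in batch]
    return [gamma * x_hat + beta for x_hat in normalised]


activations = [2.0, 4.0, 6.0, 8.0]
out = batch_norm_1d(activations, gamma=2.0, beta=1.0)
out_mean = sum(out) / len(out)
```

After normalisation the output mean equals beta and the spread is set by gamma, which is how the layer can still represent whatever scale and shift the network finds useful.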

Follow‑up questions

What are the learned parameters in batch normalisation?
Why is layer normalisation preferred over batch normalisation in transformers?
What happens to batch normalisation at inference time?

Evaluation rubric

Strong

Explains normalisation mechanics, learned parameters, difference from layer norm, and inference-time behaviour (running statistics).

Knows BatchNorm reduces training instability but can't explain the mechanism or inference behaviour.

Weak

Says 'it normalises inputs' without knowing why or the learned parameters.

Q8 · When should you use a CNN vs RNN vs Transformer for sequence data?

Level: Advanced

Expected answer

Each architecture has different strengths for sequential data:

CNN: captures local patterns via convolutional filters; fast, parallelisable, good for sequences with local structure (e.g., character-level text, time series with local periodicity). Long-range context requires deep stacks or dilated convolutions.
RNN/LSTM/GRU: maintains hidden state across timesteps; natural for variable-length sequences. Struggles with very long-range dependencies (vanishing gradients). Sequential, so hard to parallelise.
Transformer: self-attention models all pairwise relationships simultaneously; handles long-range dependencies well; parallelisable across positions during training. Quadratic complexity in sequence length, so costly for very long sequences.

In 2026, transformers dominate most sequence tasks due to scale and parallelism.
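
The quadratic cost of self-attention is visible in even a toy implementation. The sketch below is single-head scaled dot-product attention in pure Python, assuming no masking, no learned projections, and the same toy embeddings used as queries, keys, and values.

```python
import math


def softmax(scores):
    shifted = [s - max(scores) for s in scores]  # subtract max for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention for one head, with no masking or learned projections."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # One score per key: this nested loop is what makes the cost O(n^2) in sequence length.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        attn = softmax(scores)
        weights.append(attn)
        outputs.append([sum(a * v[j] for a, v in zip(attn, values))
                        for j in range(len(values[0]))])
    return outputs, weights


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy embeddings used as Q, K and V
outputs, weights = self_attention(tokens, tokens, tokens)
```

`weights` is an n-by-n matrix, one row of attention weights per token. That matrix is the term that grows quadratically with sequence length.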

Follow‑up questions

How do positional encodings solve the transformer's lack of sequence order?
What is the complexity of self-attention and how does it scale with sequence length?
When would you still use an LSTM over a transformer today?

Evaluation rubric

Strong

Clear comparison of trade-offs (local vs global, sequential vs parallel, complexity), knows when each still applies, mentions attention complexity O(n²).

Knows CNNs are local and transformers use attention but can't reason through trade-offs clearly.

Weak

Says 'use transformers for NLP' without understanding the architectural distinctions.

Q9 · What is transfer learning? Explain fine-tuning vs feature extraction.

Level: Intermediate

Expected answer

Transfer learning reuses a model trained on a large dataset (source domain) for a new task (target domain):

Feature extraction: freeze the pretrained layers and train only a new task head. Fast, little data needed, but limited adaptation.
Fine-tuning: unfreeze some or all pretrained layers and continue training on target data, usually with a small learning rate. More flexible, but requires more data and care to avoid catastrophic forgetting.
Full fine-tuning: unfreeze all layers; needs substantial data and compute.
PEFT methods (LoRA, QLoRA): freeze the pretrained weights and train small adapter matrices; often approaches full fine-tuning quality at a fraction of the cost.

Key rule: the more similar the source and target domains are, the better transfer tends to work.
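
The cost argument for LoRA comes down to simple arithmetic. The sketch assumes a single weight matrix W of shape d_out x d_in adapted by a low-rank update B @ A; the layer size and rank below are hypothetical values chosen for illustration.

```python
def full_finetune_params(d_out, d_in):
    """Trainable parameters when the whole weight matrix W (d_out x d_in) is updated."""
    return d_out * d_in


def lora_params(d_out, d_in, rank):
    """Trainable parameters for a rank-r update B @ A, with B: d_out x r and A: r x d_in."""
    return rank * (d_out + d_in)


# Hypothetical layer size and rank, chosen only for illustration.
full = full_finetune_params(4096, 4096)
lora = lora_params(4096, 4096, rank=8)
fraction = lora / full
```

For this layer the adapter trains 65,536 parameters instead of 16,777,216, which is under half a percent of the full matrix.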

Follow‑up questions

When would transfer learning NOT work well?
What is catastrophic forgetting and how can you mitigate it?
Explain LoRA and why it is efficient for fine-tuning large models.

Evaluation rubric

Strong

Distinguishes feature extraction from fine-tuning, knows PEFT/LoRA, discusses catastrophic forgetting and domain similarity conditions.

Understands the concept but can't explain PEFT methods or when transfer fails.

Weak

Says 'use a pretrained model and train on your data' without understanding the mechanics.

MLOps

Production ML, pipelines, monitoring, CI/CD * intermediate -> advanced

8 questions

Q1 * What is MLOps and how is it different from traditional ML?

Level: Intermediate

Expected answer

MLOps focuses on operationalizing ML models:

End‑to‑end lifecycle: data, training, deployment, monitoring, retraining.
Emphasizes reliability, reproducibility, automation, and collaboration.
Bridges ML with DevOps practices (CI/CD, infra as code, observability).

Traditional ML often stops at model training and offline evaluation.

Follow‑up questions

What are the main challenges when moving from notebook to production?
How do you structure teams around MLOps?
What tools have you used for MLOps?

Evaluation rubric

Strong

Clear distinction between experimentation and production; mentions lifecycle, automation, and tooling.

High‑level definition without concrete practices or examples.

Weak

Treats MLOps as just "deploying models with Docker".

Q2 * What is a feature store and why is it important?

Level: Intermediate

Expected answer

A feature store is a centralized system for managing ML features:

Stores feature definitions, values, and metadata.
Serves features consistently for training and online inference.
Helps prevent training/serving skew and duplicate feature logic.

Follow‑up questions

How does a feature store integrate with batch and streaming data?
What problems arise without a feature store?
Have you used any feature store tools (Feast, Tecton, etc.)?

Evaluation rubric

Strong

Explains consistency, reuse, and skew prevention with concrete examples of usage.

Knows it "stores features" but not why it matters for production ML.

Weak

No clear understanding of feature management challenges.

Q3 * Explain data drift vs model drift. How do you detect them?

Level: Intermediate

Expected answer

Data drift: input distribution changes over time (e.g., new user behavior).
Model drift: model performance degrades, even if inputs look similar.
Detection: monitor feature distributions, PSI, performance metrics, and business KPIs.

Retraining, recalibration, or model replacement may be needed depending on the cause.

Follow‑up questions

How would you set thresholds for drift alerts?
What's your strategy for safe retraining?
How do you handle concept drift in streaming systems?

Evaluation rubric

Strong

Distinguishes data vs model drift clearly and proposes concrete monitoring strategies.

Knows drift exists but not how to detect or respond systematically.

Weak

No clear concept of drift or its impact on production systems.

Q4 * How would you design CI/CD for ML models?

Level: Advanced

Expected answer

CI/CD for ML extends software CI/CD with ML‑specific steps:

CI: unit tests, data validation, training pipeline tests, reproducibility checks.
CD: automated deployment to staging, canary or shadow deployments, rollback strategies.
Model registry, versioning, and approval workflows for promotion to production.

Follow‑up questions

What tests are unique to ML compared to standard software?
How do you handle model rollbacks?
How would you integrate data quality checks into the pipeline?

Evaluation rubric

Strong

Describes a full pipeline with tests, staging, canary/shadow, registry, and monitoring hooks.

Talks about "deploying models with CI/CD" but lacks ML‑specific steps or safeguards.

Weak

No understanding of how CI/CD changes for ML workloads.

Q5 · What is the difference between data drift and concept drift? How do you detect each?

Level: Advanced

Expected answer

Both indicate model degradation but from different causes:

Data drift (covariate shift) — the distribution of input features changes (e.g., user demographics shift). The label relationship P(y|x) remains the same, but the model sees unfamiliar inputs. Detection: statistical tests (KS test, PSI) comparing training vs serving feature distributions.
Concept drift — the relationship between features and labels changes (e.g., customer behaviour changes post-pandemic). P(y|x) itself changes. Detection: monitor model predictions and outcomes over time; requires ground truth labels (often delayed).

Monitoring stack: Evidently AI, WhyLabs, Grafana + custom metrics. Response: retrain on recent data, adjust feature engineering, or flag for human review.
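
PSI, mentioned above as a data drift test, is short enough to sketch directly. The block assumes the feature has already been binned with the same bin edges for training and serving data; the counts are made up, and the small `eps` floor is one common way to handle empty bins.

```python
import math


def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index over pre-binned counts (same bins for both samples)."""
    expected_total = sum(expected_counts)
    actual_total = sum(actual_counts)
    score = 0.0
    for e_count, a_count in zip(expected_counts, actual_counts):
        e = max(e_count / expected_total, eps)  # floor avoids log(0) for empty bins
        a = max(a_count / actual_total, eps)
        score += (a - e) * math.log(a / e)
    return score


training_bins = [100, 300, 400, 200]    # hypothetical histogram at training time
serving_same = [50, 150, 200, 100]      # same proportions, different volume
serving_shifted = [300, 300, 300, 100]  # mass moved towards the first bin

psi_same = psi(training_bins, serving_same)
psi_shifted = psi(training_bins, serving_shifted)
```

Identical proportions give a PSI of zero regardless of volume, and the score grows as the serving distribution moves away from the training one. Alert thresholds are a judgment call per feature.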

Follow‑up questions

How do you handle concept drift when labels are delayed by 30 days?
What is Population Stability Index (PSI) and when is it used?
How often should you retrain a model in production?

Evaluation rubric

Strong

Clearly distinguishes both types, knows statistical tests for data drift, discusses delayed labels problem for concept drift, mentions monitoring tools.

Knows both terms but conflates them or can't describe detection approaches.

Weak

Only knows 'the model gets worse over time' without knowing the types or detection methods.

Q6 · Describe a CI/CD pipeline for ML models. How does it differ from software CI/CD?

Level: Advanced

Expected answer

ML CI/CD automates the train-evaluate-deploy lifecycle:

Continuous Integration — test data pipelines, validate schemas, run unit tests on feature transformations and model code.
Continuous Training — retrain models on new data when triggered (schedule, drift detection, or new data threshold). Log to MLflow/W&B.
Continuous Delivery — evaluate challenger model against champion on a held-out test set; promote only if metrics improve beyond threshold.
Deployment — canary/shadow/A-B deployment; monitor serving metrics before full rollout.

Key difference from software CI/CD: model quality depends on data, not just code. Data versioning (DVC), experiment tracking, and model registry are required. A green test suite does not guarantee a good model.
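
The champion/challenger promotion rule in the Continuous Delivery step can be expressed as a small gate. This is a hypothetical sketch: the function names, the metric dictionary, and the default improvement margin are assumptions, and a real pipeline would read these values from its experiment tracker or model registry.

```python
def should_promote(champion_metric, challenger_metric, min_improvement=0.01):
    """Promote only if the challenger beats the champion by at least min_improvement."""
    return challenger_metric - champion_metric >= min_improvement


def run_gate(metrics, min_improvement=0.01):
    """Return the pipeline decision for one evaluation run on the held-out test set."""
    if should_promote(metrics["champion"], metrics["challenger"], min_improvement):
        return "promote-to-canary"
    return "keep-champion"


decision = run_gate({"champion": 0.80, "challenger": 0.85})
rejected = run_gate({"champion": 0.80, "challenger": 0.805})
```

Requiring a margin rather than any improvement at all keeps evaluation noise from triggering a deployment.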

Follow‑up questions

What tools would you use for each stage of an ML CI/CD pipeline?
How do you test a data pipeline in a CI step?
What triggers a model retrain in your pipeline?

Evaluation rubric

Strong

Covers all 4 stages, explains data vs code testing, knows MLflow/DVC/model registry, discusses shadow deployment and rollback.

Knows the general concept but can't detail the model evaluation or deployment stages.

Weak

Describes it as 'like software CI/CD but with model training added'.

Q7 · What is a feature store? When would you use one?

Level: Advanced

Expected answer

A feature store is a centralised data layer for storing, versioning and serving ML features:

Online store: low-latency key-value store (e.g., Redis, DynamoDB) for real-time model serving.
Offline store: column-oriented storage (e.g., Parquet, BigQuery) for training data retrieval.
Feature definitions: written once (in Python, for many tools) and computed consistently for both training and serving, which reduces training/serving skew.

Use when:

Multiple models share the same features.
Feature computation is expensive and reuse saves cost.
Training/serving skew is a recurring production problem.

Popular options: Feast (open-source), Tecton, Hopsworks, Vertex AI Feature Store.

Follow‑up questions

What is training/serving skew and how does a feature store prevent it?
How does a feature store handle point-in-time correctness for training data?
When is a feature store overkill?

Evaluation rubric

Strong

Explains online vs offline stores, training/serving skew, point-in-time joins, names real tools, and knows when NOT to use one.

Knows it centralises features but can't explain training/serving skew or point-in-time correctness.

Weak

Describes it as 'a database for features' without knowing the online/offline split.

Q8 · How do you design an A/B test for a new ML model in production?

Level: Advanced

Expected answer

A/B testing validates whether a new model improves real business metrics:

Randomised traffic split: route x% of users to the challenger and the rest to the champion. Randomise at user level (not request level) so that one user never sees both models.
Metrics: define primary (business KPI) and secondary (model quality) metrics upfront. Avoid HARKing (hypothesising after the results are known).
Power analysis: before starting, calculate the minimum sample size needed to detect the expected effect, e.g. with α = 0.05 and β = 0.2 (80% power).
Duration: run long enough to capture weekly cycles; at least 1–2 weeks.
Guardrails: monitor secondary metrics and auto-rollback if latency, error rate, or engagement degrades.
Shadow mode: run the challenger in parallel without serving its results; validates predictions before any traffic split.
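
The power analysis step can be sketched with the standard normal-approximation formula for comparing two proportions. The conversion rates below are hypothetical, and the formula assumes equal-sized arms and a two-sided test; a real experiment platform may use a different method.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(p_control, p_treatment, alpha=0.05, power=0.8):
    """Approximate users per arm for a two-sided test of two proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control
    return math.ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


# Hypothetical baseline conversion of 10% and hoped-for lifts to 12% and 15%.
n_small_lift = sample_size_per_arm(0.10, 0.12)
n_large_lift = sample_size_per_arm(0.10, 0.15)
```

Smaller expected effects need far more users per arm, which is why the minimum detectable effect should be agreed before the test starts.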

Follow‑up questions

How do you handle novelty effect bias in an A/B test?
What is the multiple comparisons problem and how does it affect ML A/B tests?
When would you use multi-armed bandit instead of A/B testing?

Evaluation rubric

Strong

Covers randomisation, power analysis, business vs model metrics, guardrails, and mentions shadow mode and multi-armed bandits as alternatives.

Knows traffic splitting and statistical significance but misses power analysis, guardrails, or novelty effects.

Weak

Says 'split traffic 50/50 and see which performs better' with no statistical rigour.

ML Coding Tasks (Python)

Hands‑on ML and DL exercises * intermediate -> advanced

8 questions

Q1 * Implement logistic regression from scratch.

Level: Intermediate

Expected answer

Candidate should outline:

Sigmoid function for probabilities.
Binary cross‑entropy loss.
Gradient descent or mini‑batch gradient descent updates.
Convergence criteria and evaluation on a validation set.

Exact syntax is less important than a correct mathematical and implementation flow.
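
One possible reference shape for a candidate's answer is sketched below. It assumes full-batch gradient descent, no regularization, and a fixed number of epochs in place of a convergence check; the toy data and hyperparameters are illustrative.

```python
import math


def sigmoid(z):
    """Numerically stable sigmoid: avoids overflow in exp for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_logistic_regression(rows, labels, lr=0.1, epochs=1000):
    """Batch gradient descent on mean binary cross-entropy."""
    n, d = len(rows), len(rows[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(rows, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            error = p - y  # gradient of cross-entropy with respect to the logit
            for j in range(d):
                grad_w[j] += error * x[j] / n
            grad_b += error / n
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias


def predict(rows, weights, bias, threshold=0.5):
    """Label a row 1 when its predicted probability reaches the threshold."""
    return [1 if sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) >= threshold else 0
            for x in rows]


rows = [[0.0], [1.0], [2.0], [3.0]]  # toy, linearly separable data
labels = [0, 0, 1, 1]
weights, bias = train_logistic_regression(rows, labels)
preds = predict(rows, weights, bias)
```

The key step to listen for is `error = p - y`: with sigmoid plus cross-entropy, the gradient with respect to the logit simplifies to prediction minus label.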

Follow‑up questions

How would you add L2 regularization?
How do you handle numerical stability in the sigmoid?
How would you extend this to multi‑class classification?

Evaluation rubric

Strong

Correct loss, gradients, and update loop; mentions regularization and stability concerns.

Understands high‑level idea but struggles with gradient derivation or implementation details.

Weak

Cannot outline a working training loop or loss function.

Q2 * Write a function to compute F1 score given y_true and y_pred.

Level: Intermediate

Expected answer

Candidate should:

Compute TP, FP, FN from predictions and labels.
Compute precision = TP / (TP + FP), recall = TP / (TP + FN).
Compute F1 = 2 * precision * recall / (precision + recall).
Handle edge cases (division by zero).
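
A compact version of the expected function, assuming binary labels with 1 as the positive class. Returning 0.0 when precision or recall is undefined is one common convention; the function and variable names are illustrative.

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1 from true and predicted labels; returns 0.0 when undefined."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


balanced_case = f1_score([1, 1, 0, 0], [1, 0, 1, 0])
no_positives_predicted = f1_score([1, 0, 0], [0, 0, 0])
```

In the first call TP, FP, and FN are all 1, so precision and recall are both 0.5. The second call exercises the division-by-zero guard.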

Follow‑up questions

How would you extend this to multi‑class (macro vs micro F1)?
When is F1 more useful than accuracy?
How would you choose a threshold to maximize F1?

Evaluation rubric

Strong

Correct implementation and understanding of precision/recall trade‑offs and edge cases.

Knows formula but may miss edge cases or multi‑class extensions.

Weak

Confuses F1 with accuracy or other metrics.
