Practice 30 data science interview questions for 2026, with answer frameworks for statistics, ML, SQL, Python, behavioral rounds, and senior tooling.
Data Science Interview Questions: 30 Most Asked (2026)
Data science interview questions in 2026 test more than textbook definitions. A typical hiring loop runs from recruiter screen through online assessment, technical rounds, a case study or take-home, and a behavioral panel. Each stage filters for something different, and the questions below reflect what actually shows up across all of them.
This guide covers 30 questions organized by category: statistics, machine learning, SQL and Python, behavioral, and advanced/modern tooling. Each answer is written the way a strong candidate would deliver it — direct, evidence-backed, and structured for speaking out loud. Whether you're preparing for your first data science role or your fifth, the goal is the same: understand what each question is testing and answer with reasoning, not memorization.
What interviewers are actually testing
Before getting into specific questions, it helps to know the three signals every interviewer is listening for:
- Technical depth — Can you reason through statistics, ML, and SQL under pressure? Not recite definitions, but explain trade-offs and edge cases.
- Business judgment — Can you connect a model to a real outcome? A churn model that saves $1M matters more than a churn model with 92% accuracy.
- Communication — Can you explain a trade-off to a non-technical stakeholder without losing precision?
Hiring managers consistently weight adaptability and willingness to learn over narrow tool expertise. Knowing Tableau inside out matters less than showing you can pick up a new tool when the team needs it.
One more thing: the STAR framework (Situation, Task, Action, Result) is useful beyond behavioral rounds. It structures project answers too, and interviewers notice when a candidate can walk through a technical project with the same narrative clarity they'd use for a conflict story.
Data science interview questions — statistics and probability
These come up in almost every technical screen. Get comfortable with the reasoning, not just the formulas.
What is a p-value, and what does it not tell you?
A p-value is the probability of observing results at least as extreme as the ones you got, assuming the null hypothesis is true. It does not tell you the probability that the null hypothesis is correct. It also doesn't tell you the size of the effect or whether the result matters in practice. A p-value of 0.03 on a conversion rate change of 0.001% is statistically significant and practically useless.
Explain Type I vs. Type II errors with a real example.
Type I: you reject the null when it's actually true (false positive). Type II: you fail to reject the null when it's actually false (false negative). In a fraud detection system, a Type I error flags a legitimate transaction as fraud — annoying for the customer. A Type II error lets actual fraud through — costly for the business. Which one you optimize against depends on the cost structure, not on convention.
What is the Central Limit Theorem and why does it matter in practice?
The CLT says that the sampling distribution of the mean approaches a normal distribution as sample size increases, regardless of the population's original distribution. In practice, this is why you can use z-tests and confidence intervals on non-normal data, as long as your sample is large enough. It's the foundation of most A/B testing frameworks.
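You can demonstrate this in a few lines; a minimal simulation sketch, where the sample size, seed, and exponential population are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 50  # sample size

# Draw 5,000 samples of size n from a heavily skewed (exponential) population
# and keep each sample's mean.
sample_means = rng.exponential(scale=1.0, size=(5_000, n)).mean(axis=1)

# Despite the skewed source, the sample means are roughly normal, centered on
# the population mean (1.0) with spread close to sigma / sqrt(n).
print(round(sample_means.mean(), 3))                             # ~1.0
print(round(sample_means.std(), 3), round(1.0 / np.sqrt(n), 3))  # both ~0.14
```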
How do you handle missing data when it's spread within one standard deviation of the median?
In a roughly normal distribution, about 68% of values fall within one standard deviation of the median, so missingness confined to that range sits in the densest part of the distribution. Imputing with the median or mean is therefore reasonable: the fill values land where most of the data already is, and you don't distort the tails. But check whether the missingness is random first. If it's systematic (e.g., users who churned never filled in a field), imputation masks the signal you actually need.
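If imputation is the right call, a minimal pandas sketch (the column name and values are made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"session_minutes": [12.0, np.nan, 15.0, 14.0, np.nan, 80.0, 13.0]})

# How much is missing, and does missingness line up with anything systematic?
print(df["session_minutes"].isna().mean())

# Median imputation keeps the center of the distribution intact and is robust
# to the outlier (80.0) that would pull a mean-based fill upward.
df["session_minutes"] = df["session_minutes"].fillna(df["session_minutes"].median())
```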
What's the difference between correlation and causation? Give a scenario where confusing them would be costly.
Correlation measures linear association. Causation requires a mechanism. Ice cream sales and drowning deaths are correlated — both rise in summer. If a city cut ice cream sales to reduce drownings, they'd waste money and save no one. In a business context: if you observe that users who complete onboarding have higher retention, that doesn't mean forcing everyone through onboarding will improve retention. The completers might just be more motivated users.
Walk me through how you'd design an A/B test from scratch.
Define the metric you're optimizing (primary KPI) and a guardrail metric you don't want to hurt. Calculate the sample size needed for your minimum detectable effect at your chosen significance level and power. Randomize users into control and treatment, then check for balance on key covariates. Run the test for the pre-determined duration; don't peek and stop early. Analyze with the pre-registered test. Report the effect size and confidence interval, not just the p-value.
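The sample-size step is the part most candidates hand-wave, so it helps to have a concrete sketch ready. One way to do it with statsmodels, where the baseline rate, target lift, significance level, and power are all assumptions to replace with your own:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Assumed baseline conversion of 4% and a minimum detectable lift to 5%.
effect_size = proportion_effectsize(0.05, 0.04)

# Users needed per arm for a two-sided test at alpha = 0.05 and 80% power.
n_per_arm = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,
    power=0.80,
    ratio=1.0,
    alternative="two-sided",
)
print(round(n_per_arm))  # roughly 3,400 per arm for these inputs
```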
Machine learning interview questions
Expect both concept checks and scenario traps. The best answers show reasoning, not just the right label.
Explain the bias-variance tradeoff. What would you do if your model has low bias and high variance?
Low bias means the model captures the underlying pattern well. High variance means it's too sensitive to the training data — it overfits. To reduce variance without increasing bias much, use bagging (e.g., Random Forest), regularization (L1/L2), or increase training data. Ensemble methods average out the noise from individual models, which directly targets variance.
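To make the variance-reduction point concrete, a small scikit-learn sketch on synthetic data (seeds and sizes are arbitrary) comparing a single deep tree to a bagged ensemble of trees:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2_000, n_features=20, random_state=0)

# A single deep tree: low bias, high variance (scores swing across folds).
tree_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)

# Bagging many such trees (a random forest) averages away much of that variance.
forest_scores = cross_val_score(
    RandomForestClassifier(n_estimators=200, random_state=0), X, y, cv=5
)

print(tree_scores.mean(), tree_scores.std())
print(forest_scores.mean(), forest_scores.std())  # typically higher mean, lower spread
```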
Supervised vs. unsupervised learning — when would you use each?
Supervised learning requires labeled data and predicts a known target (classification, regression). Unsupervised learning finds structure without labels (clustering, dimensionality reduction). Use supervised when you have a clear outcome to predict — churn, fraud, conversion. Use unsupervised when you're exploring — customer segmentation, anomaly detection where you don't have labeled anomalies, or reducing feature dimensionality before feeding into a supervised model.
Why is accuracy a misleading metric for imbalanced classification?
If 96% of your cancer detection dataset is negative, a model that predicts "no cancer" every time achieves 96% accuracy and catches zero actual cases. Use precision, recall, F1, or AUC-ROC instead. Which one depends on the cost of false positives vs. false negatives — in cancer screening, recall (catching true positives) matters more than precision.
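A few lines of scikit-learn make the trap obvious (class balance chosen to mirror the example above):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

# 96% negative, 4% positive, and a "model" that always predicts negative.
y_true = np.array([0] * 960 + [1] * 40)
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))             # 0.96 -- looks great
print(recall_score(y_true, y_pred))               # 0.0  -- catches no actual cases
print(f1_score(y_true, y_pred, zero_division=0))  # 0.0
```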
Is rotation necessary in PCA? What happens if you skip it?
PCA extracts principal components that are orthogonal by definition, so rotation isn't strictly necessary for dimensionality reduction. But if you want interpretable components — where each one loads heavily on a few original features — rotation (like varimax) helps. Without rotation, components can load on many features simultaneously, making them harder to name or explain to stakeholders.
Why is Naive Bayes called "naive"?
It assumes all features are conditionally independent given the class label. In reality, features are almost always correlated. Despite this, Naive Bayes often performs surprisingly well in text classification and spam filtering because the independence assumption, while wrong, still produces useful posterior rankings. It's "naive" about the data's structure but pragmatic in practice.
You have 1,000 features and 1 million rows and your machine is running out of memory. Walk me through your approach.
Start with dimensionality reduction: drop highly correlated features, apply PCA, or use feature importance from a quick tree-based model to select the top N features. If the data still doesn't fit, sample rows for initial exploration — train on a subset, validate on the full set. Consider sparse representations if many features are zero-heavy. Use chunked processing (pandas `chunksize`, Dask, or Spark) if you need all rows. The goal is to reduce the problem to something that fits in memory without losing the signal that matters.
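One possible chunked-processing sketch with pandas; the file name, 100k chunk size, and 5% exploration sample are all placeholder assumptions:

```python
import pandas as pd

keep = []

# Stream the file in manageable chunks instead of loading all rows at once.
for chunk in pd.read_csv("events.csv", chunksize=100_000):  # hypothetical file
    # Downcast wide float columns to shrink memory per chunk.
    for col in chunk.select_dtypes("float64"):
        chunk[col] = pd.to_numeric(chunk[col], downcast="float")
    # Keep a 5% sample of each chunk for exploration and a first model.
    keep.append(chunk.sample(frac=0.05, random_state=0))

sample = pd.concat(keep, ignore_index=True)
print(sample.shape)
```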
When is ML the wrong tool?
When the problem has a known, deterministic solution. Route optimization, for example, is a well-studied operations research problem — algorithms like Dijkstra's or linear programming solve it more reliably and interpretably than a neural network. ML is the right tool when the relationship between inputs and outputs is complex, non-linear, and hard to specify with rules. If you can write the rules, write the rules.
What's the difference between bagging and boosting?
Bagging trains multiple models in parallel on random subsets of the data and averages their predictions — it reduces variance (Random Forest). Boosting trains models sequentially, where each new model focuses on the errors of the previous one — it reduces bias (XGBoost, AdaBoost). Bagging is more robust to overfitting. Boosting can achieve lower error but is more sensitive to noisy data and hyperparameter tuning.
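A compact way to contrast the two in scikit-learn; the synthetic data is deliberately a little noisy (flip_y), and the hyperparameters are illustrative rather than tuned:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2_000, n_features=20, flip_y=0.1, random_state=1)

# Bagging: independent trees on bootstrap samples, predictions averaged.
bagged = RandomForestClassifier(n_estimators=300, random_state=1)

# Boosting: shallow trees fit sequentially, each focusing on the last one's errors.
boosted = GradientBoostingClassifier(n_estimators=300, learning_rate=0.05, random_state=1)

for name, model in [("bagging", bagged), ("boosting", boosted)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(name, round(scores.mean(), 3), round(scores.std(), 3))
```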
SQL and Python interview questions
These are tested even for non-engineering DS roles. If you can't query and clean data, nothing else matters.
Write a query to find the second highest salary in a table.
Use a window function: `DENSE_RANK() OVER (ORDER BY salary DESC)` and filter for rank = 2. Or use a subquery: `SELECT MAX(salary) FROM employees WHERE salary < (SELECT MAX(salary) FROM employees)`. The window function approach is cleaner and extends to Nth-highest without rewriting.
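If the screen is pandas-based rather than SQL-based, the same dense-rank logic translates directly; a toy DataFrame for illustration:

```python
import pandas as pd

employees = pd.DataFrame({
    "name": ["Ann", "Bo", "Cy", "Di"],
    "salary": [90_000, 120_000, 120_000, 105_000],
})

# Dense-rank salaries descending, then filter for rank 2 -- the same idea as
# DENSE_RANK() OVER (ORDER BY salary DESC) with WHERE rnk = 2 in SQL.
employees["rnk"] = employees["salary"].rank(method="dense", ascending=False)
print(employees.loc[employees["rnk"] == 2, ["name", "salary"]])  # Di, 105000
```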
Explain the difference between JOIN types with a practical scenario.
INNER JOIN returns only matching rows from both tables — use it when you need users who have both a profile and at least one order. LEFT JOIN returns all rows from the left table and matched rows from the right — use it when you want all users, including those with zero orders (NULLs for order columns). FULL OUTER JOIN returns all rows from both — useful for reconciliation. CROSS JOIN returns every combination — rarely what you want, but useful for generating date-user grids.
You have a pandas DataFrame with duplicates and nulls — what's your cleaning sequence?
First, understand the data: `.info()`, `.describe()`, `.isnull().sum()`. Drop exact duplicate rows with `.drop_duplicates()`. For nulls, decide per column: drop if the column is non-essential and >50% null, impute with median/mode if the missingness is random, or flag with an indicator column if the missingness itself is informative. Then validate: check row counts, distributions, and key relationships haven't broken.
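A condensed version of that sequence, assuming a hypothetical raw_users.csv with an age column:

```python
import pandas as pd

df = pd.read_csv("raw_users.csv")  # hypothetical file

# 1. Understand the data before touching it.
df.info()
print(df.isnull().sum())

# 2. Drop exact duplicate rows.
before = len(df)
df = df.drop_duplicates()
print(f"dropped {before - len(df)} duplicate rows")

# 3. Handle nulls column by column: drop mostly-empty columns, flag and impute the rest.
df = df.drop(columns=[c for c in df.columns if df[c].isnull().mean() > 0.5])
if "age" in df.columns:
    df["age_missing"] = df["age"].isnull()            # keep the fact that it was missing
    df["age"] = df["age"].fillna(df["age"].median())  # only if missingness looks random

# 4. Validate: row counts and key distributions should still make sense.
print(df.describe())
```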
How would you investigate a sudden 20% drop in a key metric?
Segment first. Is the drop global or isolated to a specific platform, region, user cohort, or traffic source? Check for data pipeline issues — did a logging change or ETL failure cause missing data? If the data is clean, look at recent product changes, deployments, or external events. Compare the affected period to the same period last week and last year. Quantify the contribution of each segment to the total drop. Present the root cause with evidence, not speculation.
What's the difference between groupby + agg and a window function in SQL?
`GROUP BY` collapses rows — you get one row per group. A window function computes a value across a set of rows but keeps every row in the result. Use `GROUP BY` when you want summary statistics (total revenue per region). Use a window function when you want a running total, a rank, or a comparison to the group average while preserving row-level detail.
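The same contrast in pandas terms, where `groupby().agg()` collapses rows and `groupby().transform()` behaves like a window function (toy data):

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["NA", "NA", "EU", "EU", "EU"],
    "revenue": [100, 250, 80, 120, 200],
})

# GROUP BY-style: collapses to one row per region.
summary = sales.groupby("region", as_index=False)["revenue"].sum()

# Window-style: per-region total attached to every original row, so each sale
# can be compared to its group's aggregate without losing row-level detail.
sales["region_total"] = sales.groupby("region")["revenue"].transform("sum")
sales["share_of_region"] = sales["revenue"] / sales["region_total"]

print(summary)
print(sales)
```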
Behavioral and scenario-based data science interview questions
Behavioral rounds are not soft. They're signal for how you handle ambiguity, failure, and stakeholder conflict.
Use the STAR framework: Situation (set the scene in one sentence), Task (what was your responsibility), Action (what you specifically did), Result (quantified outcome). Every project story should have a number in the Result.
Tell me about a model you built that didn't perform as expected. What did you do?
A strong answer names the model, the metric it fell short on, and the specific debugging steps: checked data quality, examined feature distributions, tested for leakage, tried alternative architectures. The result should include what you learned and how you applied it next time, not just "I fixed it."
Describe a time you had to explain a complex result to a non-technical audience.
Focus on how you translated — what analogy or visualization you used, what you left out, and how you confirmed understanding. The best answers show you adjusted your communication based on the audience's reaction, not just that you simplified.
You disagree with a stakeholder about which metric to optimize. How do you handle it?
Show that you listened first, then used data to frame the trade-off. "I showed them that optimizing for click-through rate increased clicks by 12% but reduced conversion by 8%, which meant net revenue dropped. We agreed on a composite metric that balanced both." Disagreement is fine. Unresolved disagreement without data is not.
Walk me through a project end to end — from problem framing to deployment.
Use STAR. A strong example: "We built a churn prediction model (Situation). My task was to own the pipeline from feature engineering through deployment (Task). I tested logistic regression and gradient boosting, selected the model with 92% accuracy and 20% fewer false positives (Action). The result was 15% lower churn and an estimated $1M in retained revenue over six months (Result)." Include what you'd do differently next time.
How do you stay current with new tools and techniques?
Name specific sources: papers you've read recently, a tool you adopted, a course you completed. "I stay current" is not an answer. "I implemented a feature store using Feast after reading about it in a case study from Uber's ML platform team" is an answer.
Tell me about a time you pushed back on a data request. What was the outcome?
Show judgment. "A product manager asked for a dashboard tracking 40 metrics. I proposed we focus on the five that directly tied to the team's OKRs and build the rest on request. We shipped in two days instead of two weeks, and the team actually used it."
Advanced and modern tooling questions (experienced candidates)
Senior loops increasingly include MLOps, cloud, and generative AI topics. If you're interviewing at the senior or staff level, expect these.
How do you monitor a model in production for drift?
Track input feature distributions (data drift) and prediction distributions (prediction drift) over time; when fresh labels arrive, check whether the relationship between features and the target has shifted (concept drift). Use statistical tests like PSI (Population Stability Index) or KS tests on key features. Set up alerts when drift exceeds a threshold. Retrain on a schedule or trigger retraining when performance degrades on a holdout set that refreshes with new labeled data.
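Implementing PSI from scratch is a common follow-up. One possible version; the 0.1 and 0.25 thresholds below are widely cited rules of thumb, not fixed standards:

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline sample and a recent sample.

    Rough convention: < 0.1 stable, 0.1-0.25 worth investigating, > 0.25 major shift.
    """
    expected, actual = np.asarray(expected), np.asarray(actual)

    # Bin edges come from the baseline (training-time) distribution, widened so
    # recent values outside the baseline range still land in an end bin.
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0] = min(edges[0], actual.min()) - 1e-9
    edges[-1] = max(edges[-1], actual.max()) + 1e-9

    exp_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    act_pct = np.histogram(actual, bins=edges)[0] / len(actual)

    # Guard against log(0) on empty bins.
    exp_pct = np.clip(exp_pct, 1e-6, None)
    act_pct = np.clip(act_pct, 1e-6, None)
    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))

# Example: a feature at training time vs. the same feature a month later.
rng = np.random.default_rng(0)
print(round(psi(rng.normal(0, 1, 50_000), rng.normal(0.3, 1, 50_000)), 3))
```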
What's your approach to feature stores and reproducible pipelines?
A feature store centralizes feature computation so training and serving use the same logic — no training/serving skew. Tools like Feast or Tecton handle this. Reproducible pipelines mean version-controlling data, code, and model artifacts together. If you can't re-run a training job from six months ago and get the same model, your pipeline isn't reproducible.
How have you used or evaluated LLMs or generative AI in a data science workflow?
Be specific. "I used an LLM to generate synthetic training data for a low-resource text classification task and validated that model performance improved by X% on the held-out set." Or: "I evaluated GPT-4 for automated feature description generation in our data catalog and found it reduced manual documentation time by 60% but required human review for accuracy." Vague enthusiasm about AI is not an answer.
Walk me through how you'd set up an experiment in a low-traffic environment where classical A/B testing won't work.
Consider Bayesian methods with informative priors, multi-armed bandits for adaptive allocation, or synthetic control methods. If traffic is truly too low for any statistical test, run a pre/post analysis with a matched control group and be transparent about the limitations. The key is showing you know when standard A/B testing assumptions break and what alternatives exist.
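For the Bayesian option, a small Beta-Binomial sketch; the counts and the flat Beta(1, 1) prior are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(7)

# Small-sample results from a low-traffic test (assumed numbers).
control_conv, control_n = 18, 400
variant_conv, variant_n = 27, 410

# With a Beta(1, 1) prior, the posterior is Beta(successes + 1, failures + 1).
control_post = rng.beta(control_conv + 1, control_n - control_conv + 1, 100_000)
variant_post = rng.beta(variant_conv + 1, variant_n - variant_conv + 1, 100_000)

# Compare posterior draws directly instead of relying on a p-value.
prob_better = (variant_post > control_post).mean()
expected_lift = (variant_post - control_post).mean()
print(f"P(variant > control) = {prob_better:.2f}, expected lift = {expected_lift:.4f}")
```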
What cloud services have you used for model training and deployment? What trade-offs did you encounter?
Name specific services: SageMaker, Vertex AI, Databricks, or equivalent. Discuss trade-offs honestly — managed services reduce infrastructure overhead but limit customization; spot instances cut cost but introduce interruption risk; serverless inference scales to zero but has cold-start latency. Interviewers want to see that you've made real decisions, not that you've read the documentation.
Fresher vs. experienced — what changes in the interview
If you're interviewing for your first DS role
Interviewers expect fundamentals, not production war stories. Focus on portfolio projects with clear problem statements, clean code, and quantified results. Show you can reason through a statistics problem on a whiteboard. A data science candidate may go through three to five interviews — use each one to get sharper. The biggest differentiator at this level is showing you can learn fast and explain your thinking clearly.
If you have 3+ years of experience
Expect system design, trade-off discussions, and questions about owning outcomes, not just building models. "I trained a model" is junior. "I identified the business problem, scoped the solution, built the pipeline, deployed it, monitored it, and it saved $X" is senior. Hiring managers care more about adaptability than mastery of any single tool — the candidate who switched from Tableau to Looker without complaint and delivered on time signals more than the one who insists on their preferred stack.
How to practice data science interview questions
- Mock interviews — Practice out loud, not just in your head. Reading an answer silently and delivering it under pressure are completely different skills. Verve AI's Interview Copilot lets you run mock data science interviews and gives you real-time feedback on your answers — so you can tighten your delivery before the real thing.
- Framework over memorization — Learn the reasoning pattern behind each question type. If you understand why accuracy fails on imbalanced data, you can answer any variant of that question. If you memorized "use F1 score," you'll freeze when the interviewer asks "why not AUC-ROC?"
- Use STAR for every project story — Even in technical rounds, structured storytelling lands better than a stream of consciousness. Situation, Task, Action, Result. Include a number in the Result.
The 30 questions above cover the majority of what you'll face in a data science interview loop in 2026. The goal isn't to memorize answers — it's to understand what each question is testing and respond with evidence. Practice out loud, build your STAR stories around real metrics, and treat every mock session as a chance to tighten your reasoning.