Data engineer interview questions, organized by seniority and round type, with answer frameworks for SQL, Spark, ETL, modeling, system design, and the tradeoffs
Most candidates preparing for a data engineer role treat the process like a vocabulary test. They memorize data engineer interview questions the way you'd study flashcards — question on one side, answer on the other — then walk into a phone screen expecting to recite definitions and walk out with a next-round invite. What actually happens is the interviewer asks a follow-up, the candidate's answer collapses, and the feedback says "weak fundamentals" when the real problem was that the candidate prepared for the wrong version of the test.
The questions themselves are not the hard part. The hard part is that the same question — "how does partitioning work in Spark?" — tests completely different things depending on whether the interviewer is screening a new grad, probing a mid-level candidate's production judgment, or pressure-testing a senior engineer's architecture instincts. Preparing without knowing which version of the question you're about to face is how smart people get filtered out in rounds they should have cleared.
This playbook maps the full landscape: which questions show up in which rounds, what seniority signals interviewers are actually watching for, how to structure project walkthroughs and tradeoff answers, and where follow-up probes expose the difference between someone who read about a concept and someone who's operated it under pressure.
Data engineer interview questions change fast once seniority enters the room
The same pipeline question lands differently depending on who's asking it and what they're trying to confirm. Understanding that shift is the first thing that separates candidates who prep strategically from candidates who just prep more.
What do entry-level questions really test?
Junior rounds are mostly checking whether you can reason cleanly through fundamentals without pretending to know everything. The interviewer is not expecting you to have designed a production-grade streaming system. They want to see whether you can write a correct SQL join, explain why a GROUP BY behaves the way it does, and describe the difference between a fact table and a dimension table without sounding like you're reading from a textbook.
What catches entry-level candidates off guard is not the difficulty of the questions — it's the follow-up. An interviewer who hears "I'd use an index on that column" immediately wants to know: what kind of index, and why? Candidates who've memorized the answer without understanding the mechanics give a confident first answer and then go silent. The honest response — "I know a B-tree index is common here, but I'd want to check the query pattern before committing to that" — is almost always better received than a confident wrong second answer.
Cleanliness of thought matters more than depth at this level. Interviewers are checking whether you can hold a concept together under light pressure, not whether you've seen every edge case.
What do mid-level questions usually assume you already know?
By mid-level, the interviewer expects you to connect tools and decisions without narrating every basic definition. If you're explaining an ETL pipeline and you stop to define what ETL stands for, you've already signaled that you're pitching the answer below the level you're interviewing for. The question isn't "what is ETL?"; it's "given these constraints, which transformation approach would you use and why?"
Mid-level rounds tend to focus on the seams between components: how your SQL query interacts with the warehouse optimizer, how your Spark job handles skewed data, how your schema choices affect downstream query performance. Candidates who've worked in production environments answer these questions with specifics — actual tool names, actual numbers, actual failure modes they've encountered. Candidates who've only studied in theory answer them with correct but generic statements that don't survive a single follow-up.
What makes senior answers sound senior instead of just longer?
The difference is not vocabulary. A senior answer shows tradeoffs, failure modes, and operating constraints — and it does this without being asked.
Consider a pipeline redesign question. A junior candidate describes what they'd build. A mid-level candidate describes what they'd build and why. A senior candidate describes what they'd build, what would break first, how they'd know it broke, what the recovery path looks like, and what they'd have done differently if the constraint had been cost instead of latency. That's not a longer answer — it's a different kind of thinking.
When a senior engineer talks about an incident response, they don't start with "we had a problem." They start with the system state that made the problem possible, the detection gap that let it run for two hours before anyone noticed, the fix, and the architectural change that made the same failure mode impossible the next time. That's what "senior" sounds like — not better vocabulary, but ownership of the full operating lifecycle.
What hiring managers are really probing in each interview round
Interview rounds are not just progressive difficulty levels. Each round is looking for a different type of signal, and understanding that changes how you prepare.
Why the phone screen is usually a filter, not a deep dive
Phone screens exist to confirm that the candidate can communicate coherently, knows the basic vocabulary of data engineering, and won't be a surprise in the technical round. Interviewers at this stage are checking breadth: can you describe the difference between OLAP and OLTP without stumbling? Can you explain what a data warehouse does without defaulting to marketing language? Can you talk about a project you've worked on without losing the thread?
The trap here is theory theater — candidates who've studied hard and want to prove it by front-loading every caveat and edge case into a screen question. Screeners are not impressed by comprehensiveness at this stage. They're looking for signal clarity. Short, accurate answers that invite follow-up perform better than exhaustive monologues.
Why technical rounds keep circling back to one thing: can you reason under pressure?
Technical rounds use follow-ups as the actual instrument. The first answer tells the interviewer whether you know the concept. The follow-up tells them whether you understand it. A candidate who answers "I'd use a hash join here" and then explains why — what the optimizer is doing, when a hash join becomes expensive, what they'd check in the execution plan — is demonstrating something qualitatively different from a candidate who just names the join type.
The scenarios that expose the most gaps are SQL optimization, schema choices, and pipeline failure handling. These topics have clean textbook answers and messy production realities. Interviewers who've worked in production know the gap, and they probe it deliberately. "What happens if the table statistics are stale?" "What if the upstream feed arrives four hours late?" "What does your pipeline do when the schema changes without warning?" These are not gotcha questions — they're the actual job.
Why the onsite or final loop is where system thinking takes over
Final rounds care less about isolated facts and more about whether you can design something that survives scale, cost, and messy data. System design questions at this stage are not looking for the perfect answer — they're looking for how you navigate tradeoffs under constraints.
Candidates who've worked on real systems bring constraints into their answers naturally: "at that volume, the cost of full reprocessing becomes significant, so I'd want an incremental pattern with checkpointing." Candidates who haven't tend to design for the happy path and go quiet when the interviewer introduces a wrinkle. The wrinkle is always coming.
The table stakes are not optional: SQL, Spark, ETL, modeling, and warehouses
These topics appear in nearly every data engineer interview at every level. The question is not whether they'll come up — it's whether your answers show understanding or just familiarity.
What SQL optimization questions are really asking
SQL optimization questions are not asking you to recite index types. They're checking whether you can make a query cheaper without guessing. The interviewer wants to see a mental model: you look at the query shape, you think about what the optimizer is likely to do, you identify where the cost is coming from, and you propose a targeted change.
The concepts that come up most: join order and selectivity, index types and when they help versus hurt, execution plan reading, predicate pushdown, and the difference between a query that's logically correct and one that's operationally viable at scale. Candidates who've run `EXPLAIN ANALYZE` on a slow query in production have a distinct advantage here — they've seen what the optimizer actually does versus what they expected it to do. Resources like the PostgreSQL documentation on query planning are worth understanding at a mechanical level, not just as reference material.
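To make the "expected versus actual plan" point concrete, here is a minimal sketch. It uses SQLite's `EXPLAIN QUERY PLAN` through Python's standard library as a stand-in for PostgreSQL's `EXPLAIN ANALYZE`, so the output format differs from what you'd see in Postgres. The `orders` table, the index name, and the helper `plan_for` are all hypothetical, chosen only for illustration.

```python
import sqlite3

def plan_for(conn, sql):
    """Return the planner's human-readable plan lines for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # Each row has four columns; the last one is the readable description.
    return [row[3] for row in rows]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 1000, float(i)) for i in range(10_000)],
)

query = "SELECT SUM(total) FROM orders WHERE customer_id = 42"

# Plan without an index on the filter column: expect a full table scan.
before = plan_for(conn, query)

# Hypothetical index on the filter column, then ask for the plan again.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
after = plan_for(conn, query)

print("before:", before)
print("after: ", after)
```

The habit worth showing in an interview is the comparison itself: state what you expect the planner to do, read the plan, and only then propose a change.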
Why Spark questions expose shallow understanding fast
The failure case that separates Spark users from Spark practitioners is data skew. A candidate who knows Spark syntax can describe transformations. A candidate who understands Spark can explain what happens when 80% of your data maps to one partition key — the executor handling that partition runs out of memory, the job either fails or crawls, and the fix is not obvious unless you understand how shuffles work.
OOM errors, broadcast join thresholds, partition count tuning, and the difference between narrow and wide transformations are the questions that expose shallow understanding fast. If you've only used Spark through a notebook interface without ever looking at the Spark UI, you'll struggle to answer "why is this job taking four times longer than expected?" with anything specific. The Apache Spark documentation on performance tuning is not exciting reading, but the candidates who've internalized it answer follow-ups with specifics instead of generalities.
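The skew scenario above can be illustrated without a cluster. The sketch below is plain Python, not Spark: `partition_of` is a hypothetical stand-in for the hash partitioning a shuffle performs, and the key names and bucket counts are assumptions chosen for illustration. It shows how one hot key overloads a single partition, and how salting that key spreads the rows out.

```python
from collections import Counter
import zlib

NUM_PARTITIONS = 8

def partition_of(key: str) -> int:
    """Deterministic hash partitioner (stand-in for a shuffle's hash partitioning)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS

def partition_sizes(keys):
    """Count how many rows land in each partition."""
    counts = Counter(partition_of(k) for k in keys)
    return [counts.get(p, 0) for p in range(NUM_PARTITIONS)]

# 80% of rows share one hot key; the rest are spread over many keys.
rows = ["hot_customer"] * 8_000 + [f"customer_{i}" for i in range(2_000)]
skewed = partition_sizes(rows)

# Salting: append a row-dependent suffix to the hot key so its rows
# hash to several partitions instead of one.
SALT_BUCKETS = 16
salted_keys = [
    f"{k}#{i % SALT_BUCKETS}" if k == "hot_customer" else k
    for i, k in enumerate(rows)
]
salted = partition_sizes(salted_keys)

print("largest partition before salting:", max(skewed))
print("largest partition after salting: ", max(salted))
```

In a real join, the other side has to be expanded with the same salt values so that matching rows still meet, and a salted aggregation needs a second pass to combine the partial results. That extra cost is the tradeoff an interviewer is likely to probe next.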
How data modeling and warehouse questions separate analysts from engineers
The difference between describing a table and designing a model shows up immediately when an interviewer asks about a schema that needs to handle historical changes. An analyst describes what the data looks like today. An engineer designs for what the data will look like when a customer changes their address, a product gets reclassified, or a business rule shifts retroactively.
Slowly changing dimensions, surrogate keys, grain definition, and the tradeoffs between star schema and normalized models are the concepts that separate the two. The practical question underneath all of it is: when this data gets bigger, messier, and harder to query, will this model hold? Candidates who've watched a poorly designed schema collapse under reporting load answer this question with conviction. Candidates who've only designed schemas in theory tend to describe the ideal case.
Data engineer interview questions about project walkthroughs fall apart without a real structure
Project walkthroughs are the most underestimated part of a data engineer interview. Candidates spend most of their prep time on technical questions and then improvise the walkthrough — and it shows.
What should a strong project walkthrough answer actually contain?
The structure that works is STAR-plus-impact: context (what was the system, what was the constraint), your specific role (not "we built," but "I was responsible for"), the messy constraint (the thing that made this hard), the decision you made and why, and the quantified result. Not a project tour. Not a feature list. A narrative with a decision at its center.
"We built a pipeline to ingest 40 million events per day from three upstream sources with inconsistent schemas. I owned the transformation layer. The constraint was that the downstream team needed data available by 6 AM for their reporting jobs, and the upstream feeds were arriving late about 20% of the time. I implemented a partial-load pattern with quality checks that allowed the downstream jobs to run on available data while flagging incomplete partitions for reprocessing. Latency SLA compliance went from 78% to 96% over the following quarter." That answer has a decision, a constraint, and a number. It invites follow-up. It sounds like something that actually happened.
How do you keep a walkthrough from sounding rehearsed?
The best project answers sound lived-in because they include one tradeoff, one mistake, and one thing the candidate would do differently. Rehearsed answers describe what worked. Real answers describe what you chose not to do and why, and what you'd change with hindsight.
"I chose Airflow for orchestration because the team already had it running, but in retrospect the dependency graph got complicated enough that I'd probably evaluate Prefect or Dagster for a greenfield version of the same system — the dynamic task mapping would have simplified a lot of the retry logic." That sentence is worth more than three minutes of describing what the pipeline did. It shows judgment, not just execution.
What does a strong project answer look like when the interviewer pushes on details?
The follow-up probe trail usually goes: why that ETL tool, why that schema, why that warehouse, why that orchestration choice. The candidates who handle this well have a reason for each decision that connects to a constraint — cost, team skill, latency requirement, data volume, existing infrastructure. The candidates who struggle give answers that sound like "it was the standard choice" or "that's what the team used."
You don't need a perfect decision to give a strong answer. You need a real reason. "We used Redshift because the rest of the analytics infrastructure was already there and the migration cost wasn't justified for this project's scope" is a better answer than "Redshift is a great warehouse for analytical workloads" — even though the second statement is also true.
The tradeoff questions are where good candidates stop and senior ones start talking
Tradeoff questions are the point in an interview where preparation diverges most visibly from experience. Candidates who've studied know the definitions. Candidates who've operated systems know which choice costs you more at 3 AM.
ETL vs ELT: which answer sounds thoughtful instead of memorized?
The thoughtful answer depends on where the transformations live, who owns them, and how much raw data you need to preserve. ETL made sense when warehouse storage and compute were both expensive: you transformed before loading because you couldn't afford to store the raw mess. ELT makes sense in modern cloud warehouses where storage is cheap, compute scales horizontally, and you want to preserve raw data for reprocessing when business logic changes.
The real separator is the ownership question: if the transformation logic lives in the warehouse, the analytics engineers own it. If it lives upstream, the data engineers own it. That's not a technical question — it's an organizational one, and the best answers acknowledge it. A warehouse migration example makes this concrete: "we moved from ETL to ELT when we migrated to BigQuery because we wanted raw data available for reprocessing when our attribution model changed, and the dbt layer gave analytics engineers direct ownership of business logic without going through a pipeline deployment."
Warehouse vs lakehouse vs operational database: what are they actually optimizing for?
The structural difference is between analytical and operational access patterns. Operational databases optimize for transactional throughput and point lookups. Warehouses optimize for analytical queries across large datasets with known schemas. Lakehouses try to combine the schema flexibility of a data lake with the query performance of a warehouse — with real tradeoffs in both directions.
The wrong storage choice shows up in reporting latency. A team that runs analytical queries against an operational database will eventually hit contention issues — the reporting jobs compete with transactional writes, and neither performs well. A team that tries to use a lakehouse for low-latency operational queries will find the query engine overhead painful. The answer to "which one would you use?" is always grounded in the access pattern, the team's tooling, and the latency requirement — not in which option sounds most modern.
Batch vs streaming: why the right answer is usually 'it depends' — but better
"It depends" is only a good answer if you can immediately say what it depends on. The real variables are freshness requirement, failure recovery cost, and business urgency. Batch is cheaper to operate, easier to debug, and simpler to reprocess when something goes wrong. Streaming is necessary when the latency between event and action matters — fraud detection, real-time inventory, live dashboards.
The late-arriving-data problem is where streaming answers get interesting. If your fraud alert system uses a 5-minute event window and a transaction arrives 8 minutes late, what happens? Watermarking handles some of this, but the answer requires knowing what "exactly-once" means in practice versus in theory, and what you're willing to accept in terms of late-event handling. Candidates who've built streaming systems have opinions here. Candidates who've only studied them describe the architecture without engaging with the failure modes.
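The late-event question can be walked through with a toy model. This is a simplified sketch, not how any particular streaming engine implements it: it folds the watermark delay and allowed lateness into a single `allowed_lateness` parameter, and the class and field names are hypothetical. Times are event-time seconds.

```python
from dataclasses import dataclass, field

WINDOW_SECONDS = 5 * 60  # 5-minute tumbling windows

@dataclass
class WindowedCounter:
    allowed_lateness: int  # seconds the watermark trails the max event time seen
    max_event_time: int = 0
    counts: dict = field(default_factory=dict)        # window start -> event count
    late_events: list = field(default_factory=list)   # side output for late events

    @property
    def watermark(self) -> int:
        return self.max_event_time - self.allowed_lateness

    def process(self, event_time: int) -> str:
        window_start = event_time - (event_time % WINDOW_SECONDS)
        window_end = window_start + WINDOW_SECONDS
        # A window is treated as closed once the watermark has passed its end.
        if window_end <= self.watermark:
            self.late_events.append(event_time)  # keep it visible, don't drop silently
            return "late"
        self.counts[window_start] = self.counts.get(window_start, 0) + 1
        self.max_event_time = max(self.max_event_time, event_time)
        return "accepted"

# The last event (t=240) shows up after the stream has advanced to t=720,
# so it is 8 minutes behind the newest event seen.
arrivals = (60, 400, 720, 240)

strict = WindowedCounter(allowed_lateness=120)    # 2 minutes of tolerance
strict_results = [strict.process(t) for t in arrivals]

lenient = WindowedCounter(allowed_lateness=600)   # 10 minutes of tolerance
lenient_results = [lenient.process(t) for t in arrivals]

print("strict: ", strict_results, strict.late_events)
print("lenient:", lenient_results, lenient.counts)
```

The same late event is rejected under one setting and counted under the other. Choosing that tolerance, and deciding what happens to the events that miss it, is the opinion the interviewer is listening for.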
The follow-up questions are where interviewers check if your first answer was real
Follow-ups are not harder versions of the original question. They're a different instrument entirely — designed to check whether the answer you just gave was based on understanding or pattern-matching.
What follow-up usually comes after a correct SQL answer?
Once the base answer is right, the follow-up almost always goes to performance, edge cases, or scale behavior. "Great — now what happens if that table has 500 million rows and the statistics haven't been updated in a week?" or "How does that query behave if the join key has a high null rate?" These questions have no clean textbook answer. They require the candidate to reason through what the optimizer is likely to do, where the cost accumulates, and what they'd check first.
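The null-rate follow-up has a correctness side as well as a performance side. The sketch below uses SQLite through Python's standard library with hypothetical `orders` and `customers` tables. It shows that rows with a NULL join key never match in an inner join, even when the other table also contains a NULL key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER);
    CREATE TABLE customers (customer_id INTEGER, name TEXT);
    INSERT INTO orders VALUES (1, 10), (2, NULL), (3, NULL), (4, 20);
    INSERT INTO customers VALUES (10, 'Ada'), (20, 'Lin'), (NULL, 'Unknown');
""")

# NULL = NULL evaluates to NULL, not true, so NULL keys never match.
inner_rows = conn.execute("""
    SELECT COUNT(*) FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
""").fetchone()[0]

# A left join keeps every order, including those with NULL keys.
left_rows = conn.execute("""
    SELECT COUNT(*) FROM orders o
    LEFT JOIN customers c ON o.customer_id = c.customer_id
""").fetchone()[0]

# Orders that found no customer: the ones the inner join silently dropped.
unmatched = conn.execute("""
    SELECT COUNT(*) FROM orders o
    LEFT JOIN customers c ON o.customer_id = c.customer_id
    WHERE c.name IS NULL
""").fetchone()[0]

print(inner_rows, left_rows, unmatched)
```

On a distributed engine, a high null rate can also turn into a skew problem in outer joins, because the NULL-keyed rows can end up in the same partition. Naming both effects is what turns a correct answer into a specific one.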
Candidates who've read execution plans answer these with specifics. Candidates who haven't tend to give general statements about indexes and query structure that don't engage with the actual scenario.
What follow-up usually comes after a pipeline design answer?
The standard probe trail after a pipeline design answer covers: retry logic and idempotency, behavior when upstream data arrives late, schema drift handling, and monitoring. These are not edge cases — they're the normal operating conditions of any production pipeline. An interviewer who hears a clean pipeline diagram immediately wants to know what happens when the upstream feed sends a duplicate record, or when a column gets renamed without warning.
The candidates who answer this well have a specific answer for each probe: "idempotency is handled by writing to a staging table and doing an upsert on the primary key before promoting to the final table" is a real answer. "We'd handle duplicates in the transformation layer" is not.
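The staging-plus-upsert answer quoted above can be sketched in a few lines. This example uses SQLite's `INSERT ... ON CONFLICT DO UPDATE` through Python's standard library. The table names, columns, and the `load_batch` helper are hypothetical, and a warehouse would typically express the same idea with its own `MERGE` or upsert syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staging_events (event_id TEXT, amount REAL, loaded_at TEXT);
    CREATE TABLE events (event_id TEXT PRIMARY KEY, amount REAL, loaded_at TEXT);
""")

def load_batch(conn, rows):
    """Stage a batch, then upsert it into the final table keyed on event_id."""
    with conn:  # one transaction: the batch is promoted fully or not at all
        conn.execute("DELETE FROM staging_events")
        conn.executemany("INSERT INTO staging_events VALUES (?, ?, ?)", rows)
        # "WHERE true" avoids a parsing ambiguity SQLite documents for
        # INSERT ... SELECT combined with ON CONFLICT.
        conn.execute("""
            INSERT INTO events (event_id, amount, loaded_at)
            SELECT event_id, amount, loaded_at FROM staging_events WHERE true
            ON CONFLICT(event_id) DO UPDATE SET
                amount = excluded.amount,
                loaded_at = excluded.loaded_at
        """)

batch = [("e1", 10.0, "2024-01-01"), ("e2", 20.0, "2024-01-01")]
load_batch(conn, batch)
load_batch(conn, batch)  # a retry of the same batch must not create duplicates

# A later batch carrying a corrected value for an existing key updates in place.
load_batch(conn, [("e2", 25.0, "2024-01-02")])

row_count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
e2_amount = conn.execute(
    "SELECT amount FROM events WHERE event_id = 'e2'"
).fetchone()[0]
print(row_count, e2_amount)
```

The sketch assumes each batch holds at most one row per key. If a batch can contain duplicates of its own, deduplicate in staging first, and say which row wins.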
What follow-up usually exposes bluffing in Spark or modeling questions?
Skewed partitions and broken dimension tables are the two scenarios that most reliably expose shallow understanding. If a candidate says they'd "repartition the data" to fix a skew problem, the follow-up is: repartition on what key, and why would that help? If the answer is "the primary key," the interviewer knows the candidate doesn't understand why skew happens in the first place: the join or aggregation still shuffles on the skewed key, so an even layout beforehand doesn't change which executor receives the hot key's rows.
For modeling questions, the broken dimension table scenario — a customer record that changes mid-period and now makes historical reporting inconsistent — forces the candidate to engage with SCD design at a mechanical level. Candidates who've dealt with this in production have a specific answer about type 2 SCDs, surrogate keys, and effective date ranges. Candidates who haven't tend to describe the problem without proposing a solution.
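A type 2 SCD answer is easier to defend if you can describe the mechanics row by row. The sketch below is a minimal in-memory illustration, not a warehouse implementation. The `CustomerDim` fields, the helper names, and the convention that `valid_to` is exclusive and `None` marks the current row are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

@dataclass
class CustomerDim:
    surrogate_key: int
    customer_id: str          # natural (business) key
    address: str
    valid_from: date          # inclusive
    valid_to: Optional[date]  # exclusive; None means "current version"

def apply_address_change(dim: List[CustomerDim], customer_id: str,
                         new_address: str, change_date: date) -> None:
    """Close the current row and append a new version (type 2 SCD)."""
    current = next(
        (r for r in dim if r.customer_id == customer_id and r.valid_to is None),
        None,
    )
    if current is not None:
        if current.address == new_address:
            return  # nothing changed, so a replayed update adds no row
        current.valid_to = change_date
    next_key = max((r.surrogate_key for r in dim), default=0) + 1
    dim.append(CustomerDim(next_key, customer_id, new_address, change_date, None))

def version_as_of(dim: List[CustomerDim], customer_id: str,
                  as_of: date) -> Optional[CustomerDim]:
    """Return the row that was in effect on the given date."""
    for r in dim:
        if (r.customer_id == customer_id and r.valid_from <= as_of
                and (r.valid_to is None or as_of < r.valid_to)):
            return r
    return None

dim = [CustomerDim(1, "C-100", "12 Oak St", date(2023, 1, 1), None)]
apply_address_change(dim, "C-100", "9 Elm Ave", date(2024, 3, 15))

before_move = version_as_of(dim, "C-100", date(2024, 2, 1))
after_move = version_as_of(dim, "C-100", date(2024, 4, 1))
print(before_move.surrogate_key, before_move.address)
print(after_move.surrogate_key, after_move.address)
```

Fact rows that store the surrogate key keep pointing at the version that was true when they were recorded, which is what keeps historical reports consistent after the change.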
Behavioral data engineer interview questions are really about collaboration under friction
Behavioral questions in data engineering interviews are not really about feelings. They're about how you operate when the technical problem intersects with an organizational constraint.
How do you answer conflict questions without sounding passive?
The best conflict answers show judgment without turning into a hero story. A real example: disagreement over schema ownership between the data engineering team and the analytics team, where both teams had legitimate claims on the transformation logic. The strong answer describes the constraint, the competing priorities, the conversation that happened, and the outcome — including what you conceded and why.
What interviewers are watching for is whether the candidate can hold a position under pressure, update it when the evidence changes, and separate the technical argument from the interpersonal one. Passive answers ("I just deferred to the team lead") signal low agency. Hero answers ("I convinced everyone I was right") signal low self-awareness. The answer that lands shows a real disagreement, a real process, and a real resolution that involved tradeoffs on both sides.
How do you talk about a hard incident without turning it into a confession?
The failure answer has a specific structure: what the system state was before the incident, how the failure was detected, what the diagnosis process looked like, what the fix was, and what changed architecturally afterward. The goal is accountability without catastrophizing, and learning without over-explaining.
"A pipeline I owned dropped a partition silently for three days before a downstream analyst caught it in a dashboard discrepancy. The monitoring wasn't checking for partition completeness — only for job success. We added row-count validation and partition freshness checks to the alerting layer, and I wrote a postmortem that became the template for the team's incident process." That answer shows ownership, a specific failure mode, and a concrete improvement. It doesn't sound like a confession because it ends with what changed.
How do you answer decision-making questions when there was no clean answer?
The best answers to ambiguous decision questions make room for imperfection. The interviewer is not looking for a case where everything worked out perfectly — they're looking for evidence that you can reason through a constraint, consider multiple options, and commit to a path even when the information is incomplete.
"We had to choose between a six-week refactor that would have solved the root problem or a two-week patch that would hold for the next quarter. The business needed the reporting capability before the board presentation. I chose the patch, documented the technical debt, and scheduled the refactor for the following sprint cycle. It wasn't the ideal technical decision, but it was the right operational one given the constraint." That answer is honest about the tradeoff and confident about the reasoning.
Senior data engineer interview questions go straight to system design and failure mode thinking
Senior rounds assume you can answer the fundamentals. What they're testing is whether you can own a system end to end — including the parts that break, drift, and surprise you.
How would you design a batch pipeline that can survive bad data and reprocessing?
The diagram is the easy part. The hard part is the operational layer: orchestration, backfills, idempotency, data quality checks, and what happens when upstream data changes retroactively.
A strong answer covers: how the pipeline handles duplicate records (idempotent writes, staging tables, upsert patterns), how it handles late-arriving data (configurable lookback windows, partition-level reprocessing), what triggers a backfill and how long it takes, and what data quality checks run before data is promoted to the final layer. The upstream schema change scenario is where most candidates go quiet — a strong answer has a specific strategy for schema drift detection and a process for communicating breaking changes to downstream consumers before they happen.
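One of the pre-promotion checks mentioned above, partition completeness, is small enough to sketch. This assumes daily partitions and a mapping of partition date to loaded row count. The function name, the `min_rows` threshold, and the sample numbers are hypothetical.

```python
from datetime import date, timedelta

def missing_partitions(loaded: dict, start: date, end: date, min_rows: int = 1):
    """Return partition dates in [start, end] that are absent or below min_rows."""
    problems = []
    day = start
    while day <= end:
        if loaded.get(day, 0) < min_rows:
            problems.append(day)
        day += timedelta(days=1)
    return problems

# Row counts per daily partition, as a job might report them after a load.
loaded_counts = {
    date(2024, 5, 1): 1200,
    date(2024, 5, 2): 0,      # the job succeeded but wrote nothing
    date(2024, 5, 4): 1150,   # 2024-05-03 never arrived at all
}

gaps = missing_partitions(loaded_counts, date(2024, 5, 1), date(2024, 5, 4))
print(gaps)
```

A check like this fails the promotion step when a partition is empty or missing, even though every job reported success. The same list of dates can drive partition-level reprocessing.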
How would you design a streaming system without pretending it never breaks?
Streaming design questions at the senior level are really about failure recovery. State management, exactly-once semantics, watermarking, and late-arriving event handling are the concepts that matter — not the choice of Kafka versus Kinesis.
A concrete event-driven use case makes the answer real: a fraud detection system that needs to evaluate transaction patterns within a 10-minute window, where late-arriving events are common due to mobile network delays. The answer has to address what happens when a late event arrives after the window has closed, how the system recovers from a consumer failure mid-window, and how you'd alert on processing lag before it becomes a customer-facing issue. The Apache Flink documentation on event time and watermarks is one of the cleaner technical references for understanding how streaming systems actually handle time.
How do you explain observability, lineage, and schema evolution like an owner?
Senior answers connect monitoring and lineage to actual on-call pain — not abstract governance. The broken downstream report is the example that makes this concrete: a dashboard that started showing incorrect revenue numbers because a column in an upstream table was renamed three weeks ago, the transformation layer silently continued running, and nobody noticed until a business stakeholder flagged the discrepancy in a quarterly review.
An owner's answer describes the monitoring that would have caught this: column-level lineage tracking that alerts when a source schema changes, row-count validation between pipeline stages, and a data contract between the upstream team and the pipeline that requires explicit versioning for breaking changes. Schema evolution tooling like dbt's model versioning is worth understanding at a practical level — it's the kind of reference that comes up naturally in senior system design conversations.
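The renamed-column incident is the kind of failure a simple schema comparison would surface. The sketch below is an illustration under stated assumptions: schemas are plain dictionaries of column name to type string, `diff_schema` is a hypothetical helper, and a real setup would read the observed schema from the warehouse's catalog or a data contract file.

```python
def diff_schema(expected: dict, observed: dict) -> dict:
    """Compare column-name -> type mappings and classify the differences."""
    removed = sorted(set(expected) - set(observed))
    added = sorted(set(observed) - set(expected))
    retyped = sorted(
        col for col in set(expected) & set(observed)
        if expected[col] != observed[col]
    )
    return {
        "removed": removed,
        "added": added,
        "retyped": retyped,
        # Removed or retyped columns can break consumers; additions usually do not.
        "breaking": bool(removed or retyped),
    }

# The contract the pipeline was built against.
expected = {"order_id": "int", "revenue": "decimal", "region": "string"}

# What the upstream table looks like after an unannounced rename.
observed = {"order_id": "int", "gross_revenue": "decimal", "region": "string"}

drift = diff_schema(expected, observed)
print(drift)
```

A rename shows up as one removed column plus one added column. Alerting on the `breaking` flag before the transformation runs is what shortens detection from weeks to a single pipeline run.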
How Verve AI Can Help You Prepare for Your Data Engineer Job Interview
The hardest part of data engineer interview prep isn't finding more questions — it's practicing the follow-up chains that turn a correct first answer into a senior-grade conversation. Most practice tools give you a question and a sample answer. What they can't do is respond to what you actually said, probe the specific part you glossed over, and push back when your pipeline design skips the failure recovery story.
That's the structural problem Verve AI Interview Copilot is built to solve. It listens in real time to your answer, not a canned prompt, and responds to what you actually said. If you described a batch pipeline without mentioning idempotency, Verve AI Interview Copilot asks about idempotency. If your tradeoff answer landed on "it depends" without explaining what it depends on, it probes that gap directly. The practice sessions feel like the real follow-up chains that expose shallow answers in actual interviews, because the tool is reacting to your specific response rather than running a fixed script.
For data engineering specifically, Verve AI Interview Copilot covers the full loop: SQL optimization probes, Spark failure scenarios, system design walkthroughs, behavioral questions about incidents and conflict, and the architecture defense conversations that show up in senior rounds. It stays invisible during live sessions at the OS level, so candidates who use it for real-time support during remote interviews don't have to worry about detection. The practice mode alone — running mock follow-up chains on your own project walkthroughs — is where most candidates find the gaps they didn't know they had.
Conclusion
Data engineer interview questions only look like a list until you map them to level, round, and tradeoff depth. The same question about partitioning tests vocabulary in a phone screen, production judgment in a technical round, and architectural ownership in a senior loop. Preparing for all three versions of every question is what separates candidates who clear every round from candidates who clear the first one and stall.
The move is not to memorize more questions. It's to build your own study map: which topics are table stakes at your target level, which rounds will probe tradeoffs versus fundamentals, and where your current answers would collapse under a single follow-up. Start there, and the question list becomes a tool instead of a target.
Quinn Okafor
Interview Guidance

