Interview questions

PySpark Interview Questions: 30 Answers, Follow-Up Traps, and Simulation Notes

April 30, 2026Updated May 5, 202620 min read
pexels yankrukov 7693241

Master PySpark interview questions with 30 model answers, follow-up traps, and real workload examples to rehearse the full exchange under pressure.

Knowing PySpark concepts cold is not the same as explaining them cleanly under pressure. These PySpark interview questions are the ones that separate candidates who've read the docs from candidates who've actually shipped data — and the difference almost always shows up in the follow-up, not the first answer. This article pairs every common question with a model answer, the follow-up trap interviewers actually use, and a concrete example from a real workload so you can rehearse the full exchange, not just the opening line.

The pattern that trips up mid-level candidates specifically is this: they know the right answer but give the textbook version when the interviewer wants the production version. "Lazy evaluation means transformations don't execute until an action is called" is technically correct and completely forgettable. What interviewers remember is the candidate who immediately followed that with "which is why a chain of five filter and select calls costs almost nothing until you hit a write or a count — Spark collapses them into one optimized plan before touching the data." That's the level this guide is written for.

Start Here: The PySpark Questions Everyone Gets Asked First

These are the questions that open almost every PySpark interview, and they're dangerous precisely because they feel easy. Candidates rush through them, give a definition, and move on — then get stuck when the interviewer says "okay, but when would you actually pick one over the other?"

What is PySpark, and when should you choose it over pandas or pure Python?

PySpark is the Python API for Apache Spark, a distributed processing engine that runs computations across a cluster of machines. The clean answer interviewers want: use PySpark when your data doesn't fit comfortably in memory on a single machine, when you need fault tolerance across long-running jobs, or when you're already operating inside a Spark ecosystem like Databricks or EMR.

The boundary that matters: pandas is faster and simpler for anything under a few gigabytes that fits in local memory. A notebook job that reads a 500MB CSV, cleans it, and writes a summary is a pandas job. A daily ingestion pipeline that processes 200GB of event logs partitioned across S3, joins them against a 50GB fact table, and writes results back to a data warehouse is a PySpark job. The follow-up interviewers use here is "what's the cost of choosing PySpark when you didn't need to?" — the answer they want is overhead: cluster spin-up time, serialization cost, and the complexity of distributed debugging for a job that a pandas script would finish in 30 seconds.

How do you explain SparkSession, the driver, executors, and the DAG in one concise answer?

The 30-second version: SparkSession is your entry point — it's the object you create to talk to the Spark cluster. The driver is the process that runs your application code, builds the execution plan, and coordinates work. Executors are the worker processes on each node that actually run the tasks and store data in memory or on disk. The DAG — directed acyclic graph — is the logical plan Spark builds from your transformations before it runs anything, which is what makes optimization possible.

The follow-up trap: "Who decides how many tasks to run per stage?" The answer is the driver, based on the number of partitions in the data at that stage. Interviewers use this to check whether you understand that the DAG is a plan, not a guarantee — the actual parallelism is driven by partition count, not by how many executors you have.

What is lazy evaluation, and how do transformations differ from actions?

Transformations — `filter()`, `select()`, `groupBy()`, `join()` — are instructions that build the DAG. They don't touch the data. Actions — `count()`, `collect()`, `write()`, `show()` — trigger execution. Spark waits until an action is called, then looks at the entire chain of transformations and uses the Catalyst optimizer to rewrite and optimize the plan before running a single task.

The ETL example that makes this concrete: you have a pipeline that reads a 100GB log file, filters to errors only, selects three columns, and renames one of them. Those four transformation calls execute in microseconds — they're just building a plan. The moment you call `.write()`, Spark compiles the whole chain, pushes the filter as early as possible, and only then touches the data. Without lazy evaluation, each step would force a full pass over the data. With it, Spark often collapses multiple steps into one physical scan.

When should you use DataFrames versus RDDs in PySpark, and what tradeoffs matter most?

DataFrames win in almost every production scenario because they sit on top of Spark SQL, which means Catalyst can optimize them. They have schema awareness, work natively with structured data, and integrate cleanly with Parquet, Delta, and SQL queries. RDDs are the lower-level API — they give you full control over how data is structured and processed, but you lose Catalyst optimization and you're responsible for serialization yourself.

The steelman for RDDs: if you're working with unstructured data where a schema genuinely doesn't apply — say, a custom binary format or a graph computation — RDDs give you the flexibility DataFrames can't. A text-cleaning pipeline that processes raw HTML and applies a custom tokenizer in Python is a legitimate RDD use case. The follow-up interviewers use: "What happens to performance when you use Python UDFs on a DataFrame?" The trap is that Python UDFs break out of the JVM, which means Spark can't optimize them with Catalyst and you pay serialization cost on every row. That's the moment RDDs or Pandas UDFs might be worth reconsidering.

Answer Spark Performance Questions Without Hand-Waving

PySpark performance tuning questions are where interviews separate candidates who've read about Spark from candidates who've actually watched a job die in production. The questions below are the ones that test whether you understand the system or just the syntax.

How do repartition() and coalesce() differ, and when should each be used?

`repartition(n)` triggers a full shuffle — it redistributes data evenly across n partitions regardless of where it currently lives. `coalesce(n)` reduces partition count without a full shuffle by combining existing partitions on the same node. The practical rule: use `coalesce()` after a heavy filter that leaves you with a lot of nearly-empty partitions, because it's cheaper. Use `repartition()` when you need to increase partition count or when you need even distribution before a join or aggregation.

The follow-up trap: "If you coalesce down to one partition, what happens?" The answer is that all data gets pulled to one executor — you've effectively serialized a distributed job and will likely OOM or create a massive bottleneck. Interviewers use this to check whether you understand that coalesce is a performance tool, not a cleanup step you apply blindly at the end of a job.

How do joins, broadcast joins, and caching improve performance in real workloads?

Standard joins require a shuffle — both sides of the join get redistributed across the cluster so matching keys end up on the same executor. Broadcast joins avoid the shuffle entirely by sending a copy of the smaller table to every executor. The canonical use case: a 500GB fact table joined against a 5MB dimension table. Broadcasting the dimension table means the fact table never moves.

The bad habit interviewers look for: broadcasting tables that are too large. The default broadcast threshold in Spark is 10MB, and pushing it higher with `spark.sql.autoBroadcastJoinThreshold` works until the table is large enough to fill executor memory. Caching — `df.cache()` or `df.persist()` — only helps when you're reusing the same DataFrame multiple times in the same job. If you cache a DataFrame that's only read once, you've paid the memory cost for nothing. The follow-up: "When would you unpersist a cached DataFrame?" — the answer is as soon as you know you won't need it again, to free executor memory for the next stage.

What does the Catalyst optimizer actually do for you?

Catalyst is Spark's query optimizer for DataFrames and Spark SQL. It takes your logical plan — the chain of transformations you wrote — and rewrites it into a more efficient physical plan before execution. In practice, this means Catalyst will push filters earlier in the plan, eliminate columns you don't need before a join, and choose join strategies based on table statistics.

The concrete example: write a query that joins two tables and then filters the result. Catalyst will often push that filter before the join so fewer rows participate in the expensive shuffle. You didn't write that optimization — Catalyst inferred it. The follow-up: "Does Catalyst help with Python UDFs?" No — Python UDFs are a black box to Catalyst. It can't rewrite or optimize them. This is why minimizing Python UDFs in hot paths matters for production jobs.

How do you handle data skew when one partition gets hammered?

Data skew happens when one key — often a popular customer ID, a null value, or a high-cardinality outlier — ends up concentrating a disproportionate share of rows in one partition. The symptom in Spark UI: one task takes 10 minutes while the other 199 finish in 30 seconds. The stage never finishes because it's waiting for the slow partition.

The fix options, in order of complexity: broadcast the smaller side of the join if it fits in memory; use Adaptive Query Execution (AQE), which Spark 3.x enables by default and which can split skewed partitions automatically; or salt the skewed key by appending a random suffix to distribute it across multiple partitions, then aggregate the results afterward. The follow-up: "What's the downside of salting?" It adds complexity to your aggregation logic and makes the join key non-deterministic, so you have to strip the salt before any downstream join on that key.

Handle Nulls, Schemas, and Pipeline Scenarios Like Someone Who Ships Data

Spark DataFrame interview questions about nulls and schemas feel like syntax tests but they're actually pipeline reliability tests. Interviewers want to know whether you'll build something that breaks silently or fails loudly with a useful error.

How do you clean nulls and missing values without breaking the pipeline?

The one-line answer — `df.dropna()` or `df.fillna(0)` — is what the interviewer expects you to say first. The production answer is what they're actually testing for: the decision about whether to drop or impute depends on what the null means and what's downstream.

The customer-record example: a null in `email_address` might be fine to drop if you're building an email campaign list, but catastrophic to drop if that record needs to join against a transaction table on `customer_id`. Dropping it silently breaks the join and produces a count discrepancy nobody notices until a business report is wrong. The right answer includes: log null counts before and after, validate that null rates are within expected bounds, and fail the pipeline explicitly if nulls exceed a threshold rather than silently dropping rows that should have been there.

How do you deal with schema changes when upstream data keeps moving?

Schema drift — a new column appearing, a type changing from `int` to `string`, a field being renamed — is one of the most common causes of silent pipeline failures. The interview answer that sounds senior: distinguish between additive changes (new columns) and breaking changes (type changes or renames), and handle them differently.

For a daily ingestion feed, a good answer covers: read with `inferSchema=False` and an explicit schema definition so unexpected types fail fast; use `mergeSchema=True` in Delta Lake or Parquet reads to handle additive changes gracefully; and emit a schema validation check at the start of the job that compares the incoming schema against the expected one and raises an alert — not just an exception — when they diverge. The follow-up: "What if you can't control the upstream schema?" Then you need a schema registry or a contract with the upstream team, and your pipeline should treat schema drift as a first-class observable event, not an edge case.

How would you answer a scenario question about building an incremental pipeline?

The structure interviewers want: detect new data (by watermark, by max timestamp, or by partition date), process only the delta, write in an idempotent way so reruns don't produce duplicates, and checkpoint your progress so restarts don't reprocess everything.

The date-partitioned sales feed example: each day's data lands in `s3://bucket/sales/date=2024-01-15/`. Your pipeline reads the latest partition, processes it, and appends to the output table. The failure mode to mention: if your write isn't idempotent and the job fails halfway through, a rerun will double-count that day's sales. The fix is either a `MERGE` operation (Delta Lake handles this cleanly) or writing to a staging partition and atomically swapping it in. Mentioning watermarking or checkpointing in Structured Streaming shows you've thought about the streaming version of the same problem.

How do you explain a join that suddenly starts returning fewer rows than expected?

Walk through the likely causes in order: null keys on either side of the join (null never matches null in a standard join); the wrong join type (inner join drops non-matching rows, left join keeps them); schema drift that changed a key column's type so the join condition silently fails; or duplicate keys that were deduplicated somewhere upstream without anyone noticing.

The troubleshooting path that sounds senior: check row counts before and after the join, inspect null rates on the join keys, verify the join type matches the business logic, and look at whether the key column's data type is consistent on both sides. A string `"123"` and an integer `123` will not match in a Spark join without an explicit cast. Interviewers use this question to see whether you debug systematically or guess.

Debug the Ugly Stuff: Slow Jobs, OOMs, Spills, and Stage Failures

PySpark troubleshooting questions are the ones that make senior candidates look senior. If you can describe a real debugging path through Spark UI, you've already separated yourself from most of the field.

What causes slow jobs, and how do you diagnose them in Spark UI?

Start in the Stages tab. Look for stages with one task that's dramatically slower than the rest — that's skew. Look for high shuffle read/write sizes — that's a sign your join or aggregation is moving too much data. Look for GC time eating into task time — that's memory pressure. Look for task retries — that's executor instability, often from OOMs or network issues.

The Spark UI gives you median and max task duration per stage. If the max is 50x the median, you have a skew problem, not a cluster-sizing problem. If every task is slow and GC time is high, you have a memory problem. If shuffle read is enormous relative to input size, you have a join or aggregation that's moving data inefficiently. Interviewers want to hear you name the metric and connect it to a cause — not describe the UI in general terms.

Why do executor OOMs happen, and what would you change first?

OOMs happen when a single executor is asked to hold more data in memory than it has available — usually because of a wide join, a large groupBy that materializes a huge intermediate result, or a cached DataFrame that's bigger than the executor's memory fraction allows.

The first things to check: partition count (too few partitions means each partition is too large for one executor to handle), whether you're caching something you don't need to, and whether your join is shuffling the wrong side. The fix sequence: increase partition count with `repartition()` before the expensive operation, drop unnecessary caches, and check whether AQE can handle it automatically. If you're still OOMing, look at `spark.executor.memory` and `spark.memory.fraction` — but those are tuning knobs, not first responses.

What does memory spill mean, and why does it matter?

Spill happens when Spark can't fit an operation's intermediate data in executor memory and writes the overflow to disk. It's not a job failure — the job continues — but it's expensive because disk I/O is orders of magnitude slower than memory access. Shuffle-heavy operations like `groupBy`, `join`, and `orderBy` are the most common spill triggers.

The follow-up interviewers use: "How would you tell if spill is hurting your job?" Spark UI shows spill metrics in the Stages tab — both memory spill and disk spill are reported per stage. If you see a stage with high disk spill and long task times, increasing partition count to reduce the per-partition data size is usually the first fix. The deeper fix is avoiding the shuffle-heavy operation entirely — for example, replacing a full sort with a partition-aware aggregation.

How do you troubleshoot skew, stage retries, or failed tasks without panic?

The debugging path: read the error message first — Spark's error messages are often specific enough to point at the executor, the task, and the operation. Then go to Spark UI and find the failed stage. Look at the task list and sort by duration — the outlier task is almost always the problem. Check whether the outlier is on one executor consistently (executor instability) or on different executors with the same key (data skew).

For stage retries, check executor logs for the actual exception. OOM exceptions and network timeout exceptions look similar in the stage view but have completely different fixes. Connecting the symptom — "stage 4 retried three times" — to the cause — "executor on node 3 ran out of memory processing the skewed customer_id partition" — is the answer that sounds like someone who's actually fixed a broken job at 2am.

The Follow-Up Traps That Separate Real Answers from Rehearsed Ones

Spark interview questions at the follow-up level are where most candidates stall. The first answer is rehearsed. The follow-up tests whether there's anything behind it.

What follow-up would you expect after a good DataFrame answer?

After you explain DataFrames and Catalyst, the interviewer will often ask: "Can you show me what an execution plan looks like?" or "When would you use `explain()` and what would you look for?" The answer: `.explain(True)` prints the logical and physical plan, and what you're looking for is whether filters are pushed down early, whether joins are using broadcast or sort-merge, and whether there are any unnecessary shuffles in the physical plan.

The deeper follow-up: "When would you drop to the RDD API from a DataFrame?" This is testing whether you know the limits of Catalyst — specifically, that complex custom transformations in Python that can't be expressed as Spark SQL operations are candidates for RDD processing or Pandas UDFs, with the explicit acknowledgment that you're trading optimization for flexibility.

What do interviewers usually ask after you mention broadcast joins?

The trap is the size threshold. Spark's default `autoBroadcastJoinThreshold` is 10MB. If you mention broadcast joins as a general performance technique without acknowledging the limit, the follow-up will be: "What happens if you broadcast a 2GB table?" The answer: executor OOM, because every executor gets a full copy. Stale table statistics can also cause Spark to choose a broadcast join when it shouldn't — if Spark's estimate of table size is wrong, it may broadcast a table that's actually much larger than it thinks.

The answer that shows depth: "I use broadcast joins deliberately with `broadcast()` hints when I know the table is small and the statistics might be stale, rather than relying on the auto-threshold." That one sentence shows you've been burned by the auto-threshold before.

How do you answer the "what would you do next?" version of a performance question?

The structure that works: diagnosis first, one sentence; proposed fix, one sentence; expected outcome, one sentence. "The stage is slow because one partition is processing 80% of the rows due to skew on customer_id. I'd salt the key with a random suffix to distribute it across 10 partitions and then aggregate the results. That should bring the max task time down from 12 minutes to roughly the median of 45 seconds."

For a slow aggregation job specifically, the candidate should move from diagnosis — high shuffle read, long task times — to action: filter earlier to reduce the data before the aggregation, increase partition count to reduce per-partition size, or check whether AQE's skew join optimization is enabled and whether it's actually firing.

How do you keep your answer short without sounding shallow?

The shape of a strong 30-second answer: one sentence of what, one sentence of why, one sentence of when or tradeoff. "Broadcast joins send a copy of the small table to every executor, which eliminates the shuffle on the large table. They're the right choice when the small table fits comfortably in executor memory — under the broadcast threshold. When the table is too large or statistics are stale, they'll cause OOMs instead of saving time."

The follow-up that tests depth is always about the edge case or the failure mode. If you've given a clean answer and the interviewer says "and when would that go wrong?" — that's not a challenge, it's an invitation. The candidates who answer "it always goes wrong when X" with a specific example are the ones who get the next round.

How Verve AI Can Help You Prepare for Your Interview With PySpark

The gap between knowing PySpark and explaining it cleanly under live pressure is a rehearsal problem, not a knowledge problem. You can read every section of this article and still fumble the answer when an interviewer asks "okay, but what would you check first in Spark UI?" — because you've never had to say it out loud while someone is watching and waiting. That's the specific gap Verve AI Interview Copilot is built to close. It listens in real-time to the actual conversation — not a canned prompt — and responds to what you actually said, which means when you give a half-answer about broadcast joins, Verve AI Interview Copilot surfaces the follow-up the interviewer would actually use next. You're not practicing against a static question list. You're practicing against a system that pushes back the way a real interviewer does. Verve AI Interview Copilot stays invisible while it does this, so the rehearsal feels like the real thing, not a safety net. If the section on skew, spills, and OOMs felt like territory you'd struggle to explain out loud in 30 seconds, that's exactly where to run a live session before your next interview.

Conclusion

The goal in a PySpark interview is not to sound encyclopedic — it's to answer cleanly, survive the follow-up, and keep moving. The candidates who do that aren't the ones who memorized the most definitions. They're the ones who practiced saying the answer out loud, including the part where the interviewer says "okay, and when would that go wrong?"

Take the questions from this guide that felt shaky and practice the 30-second version of each answer before your next interview. Say it out loud. Time it. Then say it again with the follow-up. That's the preparation that actually transfers to the room.

JE

Jordan Ellis

Interview Guidance

Ace your live interviews with AI support!

Get Started For Free

Available on Mac, Windows and iPhone