Use these 25 PyTorch interview questions and answers for junior, mid-level, and senior roles, covering training loops, GPU usage, and autograd.
Most candidates prepping for PyTorch interviews don't fail because they don't know PyTorch. They fail because they studied the wrong layer. They memorized tensor definitions when the interviewer wanted a training loop walkthrough. They rehearsed autograd explanations when the follow-up was about gradient accumulation bugs. The PyTorch interview questions that actually appear in screens vary sharply by role — and treating them as one flat list is the fastest way to over-prepare on basics and under-prepare on everything that actually gets you hired.
This guide answers 25 of the most commonly asked PyTorch interview questions organized by role level — junior, mid-level, and senior — so you can study the right depth instead of guessing.
What interviewers expect at junior, mid-level, and senior levels
What do junior PyTorch answers need to prove?
At the junior level, the bar is not sophistication — it's clarity. Interviewers want to know whether you can define tensors, explain autograd without hand-waving, write a basic training loop, and use `nn.Module` without copying blindly from a tutorial. The questions sound simple because they are, but a surprising number of candidates fail them by over-explaining the wrong thing.
What a hiring manager is listening for is whether the candidate's explanation connects to why. Not just "a tensor is like a NumPy array" but "a tensor is like a NumPy array, except it can live on a GPU and track gradients for backpropagation." That second sentence signals that the candidate has actually used PyTorch, not just read about it. Confidence without that connection is the main tell that someone is reciting rather than reasoning.
What changes at mid-level?
At mid-level, the interview stops testing whether you know the API and starts testing whether you can reason about behavior. The bar shifts to questions like: why does your training loop converge slowly? What happens if you forget `optimizer.zero_grad()`? Why isn't your model improving even though the loss is going down?
A real mid-level screening conversation often starts with something like: "Walk me through your training loop." The candidate who says "forward pass, compute loss, backward, optimizer step" passes the vocabulary test. The candidate who adds "and I make sure to zero gradients before backward, because PyTorch accumulates them by default and that will silently corrupt your updates" passes the reasoning test. Mid-level interviews are looking for the second answer.
What separates senior answers from polished memorization?
Senior PyTorch answers need tradeoffs and system-level judgment. When a hiring manager asks about mixed precision training, the answer they want isn't "use `torch.cuda.amp.autocast()`" — it's an explanation of when AMP helps (large batch, memory-constrained jobs), when it doesn't (certain loss landscapes, small models where the overhead isn't worth it), and what loss scaling is actually doing.
The same pattern applies to distributed training. A candidate who says "DDP is faster than DataParallel" has the right answer. A candidate who explains why — each process gets its own model copy and gradients are synchronized via all-reduce, which eliminates the GIL bottleneck and the single-GPU parameter server that makes DataParallel the wrong choice at scale — is the one who gets the offer. PyTorch's official distributed training documentation is worth reading carefully before any senior screen.
The screening signal hiring managers use: ask a follow-up. "Why did you make that choice?" or "What would break if you didn't do that?" Strong candidates answer from experience. Polished memorizers go quiet.
The 25 PyTorch interview questions that come up most often
These PyTorch interview questions are organized by level. Junior questions test clarity and working knowledge. Mid-level questions test reasoning and debugging. Senior questions test tradeoffs and production judgment.
Junior: What is a tensor in PyTorch, and how is it different from a NumPy array?
A tensor is PyTorch's fundamental data structure — a multi-dimensional array that can live on a GPU and participate in automatic differentiation. NumPy arrays do neither. A `torch.Tensor` tracks the computation graph when `requires_grad=True`, which means PyTorch knows how to compute gradients through it. The follow-up interviewers love: "What happens when you move a tensor to GPU?" The answer is `.to('cuda')` or `.cuda()`, and the critical point is that the model and all its inputs need to be on the same device — a mismatch raises a runtime error that trips up beginners more than almost anything else.
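The device rule above can be sketched in a few lines. This is a minimal illustration, assuming a machine that may or may not have a GPU; the names `weights` and `inputs` are made up for the example.

```python
import torch

# Use a GPU when one is visible, otherwise stay on CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

weights = torch.ones(3, requires_grad=True)   # tracked by autograd
inputs = torch.tensor([1.0, 2.0, 3.0])        # plain data, not tracked

# Both operands are moved to the same device before they are combined.
weights_dev = weights.to(device)
inputs_dev = inputs.to(device)
output = (weights_dev * inputs_dev).sum()

print(output.device.type == device.type)  # True
print(output.requires_grad)               # True
```

Leaving one operand on CPU while the other sits on CUDA is what triggers the device mismatch error described above.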
Junior: What does autograd actually do?
Autograd builds a dynamic computation graph as your code runs forward, then traverses it in reverse during `loss.backward()` to compute gradients. Every operation on a tensor that requires gradients adds a node to this graph. When you call `.backward()`, PyTorch walks the graph from the loss back to each parameter, applying the chain rule at each node. The follow-up: "When does PyTorch stop tracking?" Answer: when no input to an operation has `requires_grad=True`, when a tensor has been cut from the graph with `.detach()`, or inside a `torch.no_grad()` block. Interviewers want to know you understand that gradient tracking has a memory cost and that you turn it off deliberately during inference.
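A small worked case may help make the chain-rule walk concrete. The sketch below uses a single scalar so the gradient can be checked by hand; the function y = x^2 + 2x is an arbitrary choice for illustration.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)

y = x ** 2 + 2 * x      # the forward pass records the graph
y.backward()            # dy/dx = 2x + 2, evaluated at x = 3
print(x.grad)           # tensor(8.)

with torch.no_grad():
    z = x ** 2          # nothing is recorded inside this block
print(z.requires_grad)  # False
```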
Junior: Why do we call optimizer.zero_grad() before backward()?
PyTorch accumulates gradients by default — calling `.backward()` adds to whatever gradients already exist in `.grad`, it does not overwrite them. If you forget `zero_grad()`, your gradient updates compound across iterations. A small training loop makes this concrete: after step 1, `param.grad` holds the gradient from batch 1. After step 2 without zeroing, it holds the sum of batch 1 and batch 2. Your optimizer is now updating on the wrong signal, and the model will either diverge or learn much more slowly. The bug is silent — no error, just wrong behavior.
Junior: What is nn.Module responsible for?
`nn.Module` is the base class for all neural network components in PyTorch — it manages parameters, submodules, device placement, and serialization. It is not just a container. When you subclass `nn.Module` and define `forward()`, you get parameter registration, `.parameters()` iteration, `.to(device)` propagation, and `.state_dict()` for free. Interviewers care about this because custom model classes that don't subclass `nn.Module` correctly — for example, storing layers as plain Python lists instead of `nn.ModuleList` — silently break parameter registration, and the optimizer never sees those parameters.
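The registration failure is easy to reproduce. The two classes below are hypothetical and differ only in how they store their layers, which is enough to show what the optimizer would and would not see.

```python
import torch
from torch import nn

class PlainListNet(nn.Module):
    def __init__(self):
        super().__init__()
        # A plain Python list: these layers are NOT registered as submodules.
        self.layers = [nn.Linear(4, 4), nn.Linear(4, 2)]

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

class ModuleListNet(nn.Module):
    def __init__(self):
        super().__init__()
        # nn.ModuleList registers each layer, so its parameters are visible.
        self.layers = nn.ModuleList([nn.Linear(4, 4), nn.Linear(4, 2)])

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

broken_count = len(list(PlainListNet().parameters()))
working_count = len(list(ModuleListNet().parameters()))
print(broken_count, working_count)  # 0 4
```

Both models run a forward pass without complaint, which is why the bug goes unnoticed: only the second one hands any parameters to the optimizer.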
Junior: What's the difference between nn.Module and nn.Sequential?
`nn.Sequential` is a convenience wrapper for a linear stack of layers; `nn.Module` is the foundation you need when your architecture has any logic at all. Sequential is genuinely useful — for a simple MLP or a convolutional feature extractor with no branching, it's clean and readable. But the moment you need skip connections, conditional logic, or multiple inputs, Sequential can't express it. A ResNet block, for example, adds the input to the output of the residual stack — that addition requires a custom `forward()` method inside a proper `nn.Module` subclass. The question is really about design judgment: do you know when the shortcut is enough and when it isn't?
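As a rough sketch of that residual pattern, the block below uses `nn.Sequential` for the inner stack and a custom `forward()` for the skip connection. It is a simplified, fully connected stand-in rather than an actual ResNet block, and the class name is an assumption.

```python
import torch
from torch import nn

class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # The straight-line part still fits in nn.Sequential.
        self.body = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x):
        # The skip connection is the piece nn.Sequential alone cannot express.
        return x + self.body(x)

block = ResidualBlock(8)
x = torch.randn(4, 8)
out = block(x)
print(out.shape)  # torch.Size([4, 8])
```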
Junior: What is torch.no_grad() for?
`torch.no_grad()` disables gradient tracking for all operations inside its block, reducing memory usage and speeding up computation during inference. During training, PyTorch stores intermediate values in the computation graph so it can compute gradients. That storage costs memory. When you're running a validation loop or serving predictions, you don't need gradients — wrapping the loop in `with torch.no_grad():` tells PyTorch to skip that bookkeeping entirely. Forgetting it during a long validation epoch won't crash your code, but it will silently eat GPU memory and slow down your evaluation.
Junior: Why do parameters have requires_grad=True?
`requires_grad=True` tells PyTorch to include a parameter in the computation graph so gradients flow through it during backpropagation. By default, `nn.Module` parameters have it set to `True`. The practical follow-up is about transfer learning: when you load a pretrained model and want to freeze the backbone, you set `requires_grad=False` on those parameters so the optimizer doesn't update them. Only the replacement head gets trained. Interviewers use this question to check whether the candidate understands that gradient tracking is opt-in per-tensor, not a global switch.
Mid-level: How do autograd, backward(), and optimizer.zero_grad() work together in a real training loop?
The correct sequence is: zero gradients, forward pass, compute loss, backward pass, optimizer step. The order is not arbitrary. Zeroing before `backward()` ensures `.grad` holds gradients for the current batch only. `loss.backward()` populates `.grad` on every parameter that requires gradients. `optimizer.step()` uses those gradients to update weights. The common failures: calling `zero_grad()` between `backward()` and `step()`, which wipes the gradients before the optimizer can use them, or skipping it entirely. In a loop over 100 batches without zeroing, every step applies the running sum of all gradients computed so far, so by the last batch the update reflects 100 batches instead of one. Interviewers ask this because it's the single most common training bug in code they see from candidates.
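One way to write that sequence is shown below. It is a minimal sketch on synthetic data, assuming a linear model and plain SGD; the data, learning rate, and step count are arbitrary choices for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(3, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

# Synthetic regression task: the target is the sum of the three inputs.
inputs = torch.randn(64, 3)
targets = inputs.sum(dim=1, keepdim=True)

losses = []
for step in range(50):
    batch_x = inputs.to(device)      # inputs and labels follow the model's device
    batch_y = targets.to(device)

    optimizer.zero_grad()            # 1. clear gradients from the previous step
    pred = model(batch_x)            # 2. forward pass
    loss = loss_fn(pred, batch_y)    # 3. compute loss
    loss.backward()                  # 4. populate .grad on each parameter
    optimizer.step()                 # 5. update weights from .grad
    losses.append(loss.item())

print(losses[0] > losses[-1])  # True
```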
Mid-level: How do Dataset, DataLoader, batching, shuffling, and collate_fn fit together?
`Dataset` defines how to get one sample; `DataLoader` wraps it to handle batching, shuffling, parallelism, and collation. The default `collate_fn` works fine when all samples have the same shape — it just stacks tensors. It breaks the moment you have variable-length sequences or images of different sizes. A text classification dataset where sentences have different lengths needs a custom `collate_fn` that pads sequences to the batch maximum before stacking. Candidates who have actually built input pipelines know this because they've hit the error. Candidates who haven't will give you the docs answer and stop there.
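A padding `collate_fn` along those lines might look like the sketch below. The toy dataset, the padding value of 0, and the function name `pad_collate` are assumptions made for the example.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset

class ToyTextDataset(Dataset):
    """Hypothetical dataset of (token_ids, label) pairs with varying lengths."""

    def __init__(self):
        self.samples = [([5, 2, 9], 0), ([7, 1], 1), ([3, 8, 4, 6, 2], 0), ([1], 1)]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        token_ids, label = self.samples[idx]
        return torch.tensor(token_ids), label

def pad_collate(batch):
    sequences, labels = zip(*batch)
    # Pad every sequence to the longest one in this batch (0 is the pad id here).
    input_ids = pad_sequence(list(sequences), batch_first=True, padding_value=0)
    lengths = torch.tensor([len(seq) for seq in sequences])
    # Mask is 1 for real tokens and 0 for padding.
    positions = torch.arange(input_ids.size(1))
    attention_mask = (positions[None, :] < lengths[:, None]).long()
    return input_ids, attention_mask, torch.tensor(labels)

loader = DataLoader(ToyTextDataset(), batch_size=2, shuffle=False, collate_fn=pad_collate)
first_ids, first_mask, first_labels = next(iter(loader))
print(first_ids)   # tensor([[5, 2, 9], [7, 1, 0]])
print(first_mask)  # tensor([[1, 1, 1], [1, 1, 0]])
```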
Mid-level: How do you move a model and tensors to GPU correctly?
The sequence is: move the model first with `model.to(device)`, then move each input batch and label tensor to the same device inside the training loop. The classic bug is moving the model but forgetting to move the inputs — PyTorch raises a `RuntimeError` about tensors being on different devices. A subtler version: the model is on CUDA, inputs are on CUDA, but labels are still on CPU. Same error. Candidates who have trained real models on CUDA have hit this exact mistake. The tell in an interview is whether they mention the device mismatch by name or just say "make sure everything is on the same device" without the specifics.
Mid-level: What is the difference between state_dict and saving the full model?
`state_dict()` saves only the parameter tensors and buffers; saving the full model pickles the whole object, and `pickle` stores a reference to the model class by import path rather than the class code itself. The portability problem with full-model saving: if you rename the class, move the file, or change the architecture, `torch.load()` fails because it can't find the original class definition. `state_dict` is safer: you instantiate the model yourself, then load the weights into it. For checkpointing during training, `state_dict` is almost always the right choice. PyTorch's serialization documentation is explicit about this recommendation.
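The `state_dict` round trip can be sketched as follows. An in-memory buffer stands in for a checkpoint file so the example leaves nothing on disk, and `make_model` is a hypothetical factory function.

```python
import io
import torch
from torch import nn

def make_model():
    # The architecture lives in code; only the tensors are saved.
    return nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

trained = make_model()

buffer = io.BytesIO()                      # stands in for a file such as a .pt checkpoint
torch.save(trained.state_dict(), buffer)

buffer.seek(0)
restored = make_model()                    # instantiate first, then load weights into it
restored.load_state_dict(torch.load(buffer))

same = all(
    torch.equal(trained.state_dict()[key], restored.state_dict()[key])
    for key in trained.state_dict()
)
print(same)  # True
```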
Mid-level: How do train() and eval() change model behavior?
`model.train()` and `model.eval()` switch behavioral modes for layers like dropout and batch normalization that behave differently during training versus inference. Dropout is active in train mode and disabled in eval mode. Batch norm uses running statistics in eval mode instead of batch statistics. The mistake that costs people in interviews: forgetting to call `model.eval()` before the validation loop. Your validation loss will be artificially noisy because dropout is randomly zeroing activations, and batch norm is using batch statistics instead of the stable running estimates. The model looks like it's not converging when the loop logic is just wrong.
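The effect of the mode switch is visible on a small model with dropout. The sketch below is illustrative only; the layer sizes and dropout rate are arbitrary.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 16), nn.Dropout(p=0.5), nn.Linear(16, 1))
x = torch.randn(8, 4)

model.train()
train_out_a = model(x)
train_out_b = model(x)          # a different dropout mask, so a different output

model.eval()                    # dropout becomes a no-op
with torch.no_grad():           # and no graph is recorded for validation
    eval_out_a = model(x)
    eval_out_b = model(x)

print(torch.equal(train_out_a, train_out_b))  # False
print(torch.equal(eval_out_a, eval_out_b))    # True
```

Remember to call `model.train()` again before the next training epoch, since the mode persists until it is changed.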
Mid-level: When would you write a custom loss function in PyTorch?
You write a custom loss when the built-in losses don't match your problem's objective — class imbalance, contrastive learning, and multi-task setups are the common cases. A focal loss for imbalanced detection tasks, for example, down-weights easy examples so the model focuses on hard ones — that's not in `torch.nn`. The follow-up interviewers ask: "Is your custom loss differentiable?" The answer needs to show you understand that every operation in the loss must be expressible through PyTorch's autograd graph. If you call a NumPy function inside the loss, gradients stop there. Reduction choices — `mean` vs `sum` — also matter because they affect the effective learning rate.
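A simplified binary focal loss could be written as below. It omits the class-balancing alpha term found in some formulations, and the class name and default `gamma` are assumptions for this sketch. Every step is a differentiable tensor operation, so autograd can trace it end to end.

```python
import torch
import torch.nn.functional as F
from torch import nn

class BinaryFocalLoss(nn.Module):
    """Simplified focal loss for binary targets (no alpha weighting)."""

    def __init__(self, gamma=2.0, reduction="mean"):
        super().__init__()
        self.gamma = gamma
        self.reduction = reduction

    def forward(self, logits, targets):
        # Per-element cross entropy, which equals -log(p_t).
        bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
        p_t = torch.exp(-bce)                       # probability given to the true class
        loss = (1 - p_t) ** self.gamma * bce        # easy examples are down-weighted
        if self.reduction == "mean":
            return loss.mean()
        if self.reduction == "sum":
            return loss.sum()
        return loss

torch.manual_seed(0)
logits = torch.randn(6, requires_grad=True)
targets = torch.tensor([1.0, 0.0, 1.0, 1.0, 0.0, 0.0])

focal = BinaryFocalLoss(gamma=2.0)(logits, targets)
focal.backward()
print(logits.grad is not None)  # True
```

With `gamma=0` the weighting factor becomes 1 and the loss reduces to ordinary binary cross entropy, which is a convenient sanity check.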
Mid-level: How do you explain transfer learning in PyTorch without sounding vague?
Transfer learning in PyTorch means loading a pretrained model, freezing the backbone layers by setting `requires_grad=False`, replacing the classification head with one that matches your task, and fine-tuning. The distinction that matters: feature extraction (backbone frozen, only head trained) versus fine-tuning (backbone unfrozen after initial head training). For a pretrained ResNet50 on a medical imaging task, you'd typically train the head for a few epochs first, then unfreeze the last few residual blocks and fine-tune at a lower learning rate. Interviewers want to hear that you know the difference and have a reason for choosing one over the other.
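The freeze-then-unfreeze pattern can be sketched without downloading any weights. Here a small `nn.Sequential` is a hypothetical stand-in for a pretrained backbone, and the learning rates are arbitrary.

```python
import torch
from torch import nn

# Hypothetical stand-in for a pretrained backbone; a real one would be loaded with weights.
backbone = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 32), nn.ReLU())
head = nn.Linear(32, 3)                      # new head sized for the target task
model = nn.Sequential(backbone, head)

# Stage 1, feature extraction: freeze the backbone and train only the head.
for param in backbone.parameters():
    param.requires_grad = False
optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad], lr=1e-2
)
stage1_trainable = sum(p.requires_grad for p in model.parameters())

# Stage 2, fine-tuning: unfreeze the last backbone layer at a lower learning rate.
last_layer_params = list(backbone[2].parameters())
for param in last_layer_params:
    param.requires_grad = True
optimizer.add_param_group({"params": last_layer_params, "lr": 1e-3})
stage2_trainable = sum(p.requires_grad for p in model.parameters())

print(stage1_trainable, stage2_trainable)  # 2 4
```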
Senior: How do you debug exploding or vanishing gradients in PyTorch?
Start by inspecting gradient norms during training. `torch.nn.utils.clip_grad_norm_` returns the total gradient norm measured before clipping, so logging its return value shows whether norms are exploding. The diagnosis path: check gradient norms layer by layer, review the learning rate (too high causes explosion; too low makes already-small gradients produce negligible updates), check weight initialization (Xavier or He initialization matters for deep networks), confirm batch normalization or layer normalization is in place, and look at activation functions (ReLU does not saturate for positive inputs; sigmoid and tanh saturate at both ends). A training run that diverges after 300 steps with a loss spike usually points to a learning rate or initialization problem. One that plateaus immediately usually points to vanishing gradients in deep layers.
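A per-layer norm check followed by clipping might look like this. The loss is deliberately inflated so the gradients are large; that scaling factor and the model shape are assumptions made purely to trigger the behavior.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 8), nn.Tanh(), nn.Linear(8, 1))

# Inflate the loss on purpose so the gradients come out large.
loss = model(torch.randn(32, 8)).pow(2).mean() * 1e4
loss.backward()

# Layer-by-layer view: which parameters carry the large gradients?
grad_norms = {name: p.grad.norm().item() for name, p in model.named_parameters()}

# clip_grad_norm_ returns the total norm measured BEFORE clipping.
total_norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
total_norm_after = torch.sqrt(sum(p.grad.norm() ** 2 for p in model.parameters()))

print(sorted(grad_norms))
print(float(total_norm_before) > 1.0, float(total_norm_after) <= 1.0 + 1e-4)
```

Logging `total_norm_before` every step gives a cheap early warning: a sudden jump in that value typically precedes the loss spike.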
Senior: What are the tradeoffs between saving checkpoints with state_dict, optimizer state, and scheduler state?
A production checkpoint needs model weights, optimizer state, scheduler state, and the current epoch. Saving only weights means a resumed job loads the right parameters but restarts the optimizer and the learning rate schedule from scratch. The optimizer state contains momentum buffers and adaptive learning rate estimates (in Adam, the first and second moment estimates). If you resume without them, the optimizer resets and you lose the warmup benefit. The scheduler state tells you where you are in the LR schedule. A job that resumes from epoch 50 but starts the scheduler at epoch 0 can use a learning rate that's off by an order of magnitude or more. This is the answer that separates candidates who have actually resumed a multi-day training job from those who haven't.
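A checkpoint carrying all four pieces can be sketched as below. The dictionary keys, the `build` helper, and the choice of Adam with `StepLR` are assumptions for illustration, and an in-memory buffer replaces the checkpoint file.

```python
import io
import torch
from torch import nn

def build():
    model = nn.Linear(4, 1)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.1)
    return model, optimizer, scheduler

torch.manual_seed(0)
model, optimizer, scheduler = build()
for epoch in range(3):
    optimizer.zero_grad()
    model(torch.randn(8, 4)).pow(2).mean().backward()
    optimizer.step()
    scheduler.step()

checkpoint = {
    "epoch": epoch + 1,                       # next epoch to run on resume
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),      # Adam moment estimates live here
    "scheduler": scheduler.state_dict(),      # position in the LR schedule
}
buffer = io.BytesIO()                         # stands in for a checkpoint file
torch.save(checkpoint, buffer)

# Resume: rebuild the objects, then restore every piece of state.
buffer.seek(0)
state = torch.load(buffer)
resumed_model, resumed_optimizer, resumed_scheduler = build()
resumed_model.load_state_dict(state["model"])
resumed_optimizer.load_state_dict(state["optimizer"])
resumed_scheduler.load_state_dict(state["scheduler"])
start_epoch = state["epoch"]

print(start_epoch, resumed_scheduler.get_last_lr())
```

After three epochs with `step_size=2`, the schedule has decayed once, so the resumed learning rate is about 1e-3 rather than the initial 1e-2 a fresh scheduler would report.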
Senior: How do you use PyTorch Profiler to find bottlenecks?
PyTorch Profiler captures CPU and GPU activity, kernel execution times, and memory allocation — the goal is separating data-loader stalls from compute bottlenecks. The common scenario: GPU utilization is 30% but the model architecture looks fine. The profiler shows that the GPU is idle between batches because the data loader is single-threaded and preprocessing is slow. The fix is increasing `num_workers` in DataLoader, adding prefetching, or moving preprocessing to GPU. PyTorch Profiler documentation covers the `torch.profiler.profile()` context manager and the TensorBoard integration that makes the flame graphs readable.
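For orientation, a CPU-only profiling run can be as short as the sketch below. The model, batch size, and the `forward_pass` label are assumptions; on a GPU machine you would typically add `ProfilerActivity.CUDA` to the activities list.

```python
import torch
from torch import nn
from torch.profiler import ProfilerActivity, profile, record_function

model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 10))
batch = torch.randn(64, 256)

with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    # record_function adds a named region so this span is easy to find in the report.
    with record_function("forward_pass"):
        model(batch)

# Aggregate by operator and show the most expensive entries.
report = prof.key_averages().table(sort_by="cpu_time_total", row_limit=5)
print(report)
```

Wrapping the data-loading step and the compute step in separate named regions is one way to see at a glance which of the two dominates wall time.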
Senior: When does mixed precision help, and when does it hurt?
Mixed precision (AMP) reduces memory usage and increases throughput by computing in float16 where safe and float32 where precision matters. The speedup is real on Tensor Core hardware, with gains in the 2-3x range often cited for modern NVIDIA GPUs, though the actual figure depends on the model and batch size. It hurts when the loss landscape is sensitive to numerical precision: some models, especially those with custom loss functions or unusual activation patterns, produce NaN gradients in float16 even with loss scaling. The failure mode: `GradScaler` skips the update step when it detects overflow, and if this happens repeatedly, training stalls without an obvious error. The answer interviewers want includes loss scaling, not just "use autocast."
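The autocast-plus-scaler pattern is sketched below in a form that also runs on a CPU-only machine, where both pieces are simply disabled. The model and data are placeholders. Recent PyTorch releases also expose the scaler as `torch.amp.GradScaler`, and the `torch.cuda.amp` spelling used here may emit a deprecation warning there.

```python
import torch
from torch import nn

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")
amp_dtype = torch.float16 if use_cuda else torch.bfloat16

model = nn.Linear(16, 4).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)   # pass-through when disabled

x = torch.randn(32, 16, device=device)
y = torch.randn(32, 4, device=device)

optimizer.zero_grad()
with torch.autocast(device_type=device.type, dtype=amp_dtype, enabled=use_cuda):
    loss = nn.functional.mse_loss(model(x), y)   # eligible ops run in reduced precision

scaler.scale(loss).backward()   # scale the loss so small gradients survive float16
scaler.step(optimizer)          # unscales gradients; skips the step on inf/NaN
scaler.update()                 # adjusts the scale factor for the next iteration

print(torch.isfinite(loss).item())  # True
```

Watching `scaler.get_scale()` over time is one way to notice the stalled-training failure mode: a scale that keeps shrinking suggests repeated overflow and skipped steps.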
Senior: What is Distributed Data Parallel doing that DataParallel does not?
DDP gives each process its own model replica and its own optimizer, synchronizing gradients via all-reduce during the backward pass. DataParallel uses a single process with multiple threads, which hits the Python GIL and creates a parameter server bottleneck on GPU 0. DataParallel is easy to add (one line), but it scales poorly because all gradient aggregation happens on one GPU, which becomes the bottleneck. DDP requires more setup (process groups, `init_process_group`, launching with `torchrun`) but scales close to linearly with GPU count in practice because communication is peer-to-peer and overlaps with the backward computation. For any serious multi-GPU training job, DDP is the correct answer. Saying "DDP is better" without explaining why is the polished memorization answer.
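The setup steps can be seen in a degenerate single-process run, sketched below with the CPU `gloo` backend and a world size of 1. This is only meant to show the API shape: a real job would be launched with `torchrun`, use one process per GPU, and usually the `nccl` backend. The file-based rendezvous path and the function name are assumptions.

```python
import os
import tempfile
import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP

def run_ddp_step(rank=0, world_size=1):
    # Hypothetical rendezvous file; torchrun normally provides this via env variables.
    init_path = os.path.join(tempfile.mkdtemp(), "ddp_init")
    dist.init_process_group(
        backend="gloo", init_method=f"file://{init_path}",
        rank=rank, world_size=world_size,
    )
    try:
        torch.manual_seed(0)
        model = DDP(nn.Linear(4, 2))          # each process wraps its own replica
        optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
        before = model.module.weight.detach().clone()

        optimizer.zero_grad()
        loss = model(torch.randn(8, 4)).pow(2).mean()
        loss.backward()                       # gradients are all-reduced across ranks here
        optimizer.step()                      # every rank applies the same averaged gradient

        changed = not torch.equal(before, model.module.weight.detach())
        return dist.get_world_size(), changed
    finally:
        dist.destroy_process_group()

world, weights_changed = run_ddp_step()
print(world, weights_changed)  # 1 True
```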
Senior: How would you explain checkpointing, resuming, and reproducibility?
Full reproducibility requires saving model weights, optimizer state, scheduler state, RNG states for Python, NumPy, and PyTorch, the current epoch, and the dataloader state. The scenario where this breaks: you save everything except the RNG states, resume at epoch 50, and the model produces different results than a continuous run would have. The data shuffling order changes because the random seed is reset. For debugging purposes, you want the resumed run to match a continuous run as closely as possible; bit-identical results additionally depend on deterministic kernels, which `torch.use_deterministic_algorithms(True)` can enforce. Setting seeds at the start of training and saving `torch.get_rng_state()` (plus `torch.cuda.get_rng_state_all()` on GPU) alongside the checkpoint is the practice that makes this reliable.
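Saving and restoring generator state can be sketched as follows. Only the Python and PyTorch CPU generators are shown; NumPy (`np.random.get_state()`) and CUDA states would be handled the same way. The dictionary layout is an assumption.

```python
import random
import torch

random.seed(0)
torch.manual_seed(0)

# Capture generator state at checkpoint time.
rng_checkpoint = {
    "python": random.getstate(),
    "torch": torch.get_rng_state(),
}

# What a continuous run would draw next.
expected = (random.random(), torch.rand(3))

# On resume, restore the state and draw again.
random.setstate(rng_checkpoint["python"])
torch.set_rng_state(rng_checkpoint["torch"])
replayed = (random.random(), torch.rand(3))

print(expected[0] == replayed[0], torch.equal(expected[1], replayed[1]))  # True True
```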
Senior: How do you compare PyTorch and TensorFlow in an interview without sounding tribal?
PyTorch wins on flexibility and debuggability — eager execution by default means you can inspect tensors, set breakpoints, and use standard Python debugging tools anywhere in the forward pass. TensorFlow's strength is production deployment: TensorFlow Serving, TFLite, and the TF ecosystem around edge deployment are more mature. For a research-heavy team building novel architectures, PyTorch is almost always the better choice because the dynamic graph makes experimentation faster. For a team deploying to mobile or embedded hardware at scale, TensorFlow's deployment tooling is harder to match. The follow-up interviewers ask: "Which would you choose for a team doing both?" The honest answer is PyTorch with TorchScript or ONNX export for deployment.
Senior: How do you structure a reliable training loop for production?
A production training loop handles device setup, gradient zeroing, forward/backward/step sequence, validation with `model.eval()` and `torch.no_grad()`, checkpoint saving, logging, and graceful failure recovery. The gap between a notebook loop and a production loop is mostly around the edges: what happens when the job is preempted at epoch 47? Does it resume cleanly or restart from scratch? A durable loop saves a checkpoint every N epochs, logs metrics to a tracking system (Weights & Biases, MLflow), validates on a held-out set with mode switching, and handles CUDA out-of-memory errors without silently corrupting the run. The notebook demo loop has none of this. Interviewers asking this question are ruling out candidates who have only ever trained on a laptop.
Mid-level: Why does my model train but validation never improves?
The first thing to check is mode switching: if you forgot `model.eval()` before the validation loop, dropout is active and batch norm is using batch statistics, making validation metrics noisy and unreliable. If mode switching is correct, check for overfitting and for a mismatch between training and validation preprocessing, label issues (off-by-one in class indices), and optimizer settings (a learning rate that's too high causes training loss to oscillate and validation to lag). Also check for data leakage, keeping in mind that validation samples appearing in training produce the opposite symptom: validation scores that look too good. The debugging story that hiring managers recognize: a model that shows 95% training accuracy and 52% validation accuracy, which turns out to trace back to a label-mapping mismatch between the training and validation splits. Reasoning from symptoms instead of guessing is what the question is testing.
Senior: What would you inspect first if training is slow but GPU usage is low?
Low GPU utilization almost always means the GPU is waiting — start with the data pipeline. Check `num_workers` in your DataLoader (the default of 0 means single-threaded, synchronous loading), check whether preprocessing is on CPU and can be moved to GPU, check batch size (too small means frequent small kernel launches with high overhead per sample), and use PyTorch Profiler to confirm where wall time is actually going. The concrete scenario: a model that achieves 30% GPU utilization because each batch requires loading high-resolution images from disk, resizing them on CPU, and then transferring to GPU. The fix is prefetching, parallel workers, and caching preprocessed tensors — not touching the model architecture at all.
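The DataLoader knobs mentioned above can be collected in one helper, sketched below. The function name and the specific values (`prefetch_factor=4`, batch size 32) are illustrative assumptions; the right settings depend on the machine and the dataset.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def make_loader(dataset, batch_size=32, num_workers=0):
    kwargs = {
        "batch_size": batch_size,
        "shuffle": True,
        "num_workers": num_workers,               # 0 loads in the main process
        "pin_memory": torch.cuda.is_available(),  # faster host-to-GPU copies
    }
    if num_workers > 0:
        # These two options are only valid when worker processes are used.
        kwargs["persistent_workers"] = True       # keep workers alive across epochs
        kwargs["prefetch_factor"] = 4             # batches queued ahead per worker
    return DataLoader(dataset, **kwargs)

dataset = TensorDataset(torch.randn(128, 8), torch.randint(0, 2, (128,)))
loader = make_loader(dataset, num_workers=0)
features, labels = next(iter(loader))
print(features.shape, len(loader))  # torch.Size([32, 8]) 4
```

Calling `make_loader(dataset, num_workers=4)` is the first experiment to run when the profiler shows the GPU waiting on input; if utilization rises, the pipeline was the bottleneck.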
Senior: What advanced PyTorch topics should a strong candidate know beyond the basics?
The topics that separate interview-ready from tutorial-ready are: DDP for multi-GPU training, AMP for memory and throughput, proper checkpointing with full state, PyTorch Profiler for bottleneck diagnosis, custom `nn.Module` design, and deployment awareness via TorchScript or ONNX. Senior screening questions are designed to rule out candidates who know the API but have never shipped a model. The tell is whether the candidate can describe a failure mode — a training job that diverged, a checkpoint that couldn't resume, a data pipeline that bottlenecked GPU utilization — and explain how they diagnosed and fixed it. Definitions are table stakes. Production experience is the differentiator.
The training loop answers that separate real PyTorch users from tourists
The PyTorch interview answers that most reliably separate experienced practitioners from well-prepared beginners are the ones about training loop behavior. Not because the questions are tricky, but because the correct answer requires understanding why the sequence matters, not just what the sequence is.
How do autograd, backward(), and optimizer.zero_grad() work together in a real training loop?
The mental model interviewers want: PyTorch accumulates gradients into `.grad` tensors. `backward()` adds to them. `zero_grad()` resets them. The correct loop order is zero → forward → loss → backward → step. A minimal code path makes the failure mode concrete:
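As one illustration of that failure mode, the sketch below runs ten backward passes on a single scalar weight without zeroing, then one pass after zeroing. The values are chosen so the gradient of each pass is exactly 2.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)
x = torch.tensor(2.0)

# Ten backward passes with no zeroing in between.
for _ in range(10):
    loss = w * x          # d(loss)/dw = x = 2
    loss.backward()       # adds 2 to w.grad every time
accumulated = w.grad.clone()

# Reset, then a single backward pass for comparison.
w.grad.zero_()            # what zeroing does for one parameter
(w * x).backward()
single = w.grad.clone()

print(accumulated, single)  # tensor(20.) tensor(2.)
```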
After 10 batches without zeroing, the gradient in `.grad` is the sum of 10 backward passes. The optimizer step uses that sum, which is roughly 10x larger than intended. The model either diverges or oscillates. The fix is one line before the forward pass, but the damage from skipping it is invisible until the loss curve looks wrong.
How do you explain Dataset, DataLoader, batching, shuffling, and collate_fn clearly in an interview?
The answer that sounds like real experience: "I had a text classification dataset where sentence lengths ranged from 4 to 512 tokens. The default collate function tried to stack tensors of different sizes and raised a shape mismatch error. I wrote a custom `collate_fn` that padded each sequence to the batch maximum and returned an attention mask alongside the input IDs." That answer shows the candidate has fixed a broken input pipeline, not just read the docs. The pipeline logic — Dataset returns one sample, DataLoader batches and shuffles, collate_fn assembles the batch tensor — is table stakes. The variable-length example is what proves you've used it.
How do you debug exploding or vanishing gradients in PyTorch code?
In order: check gradient norms with `param.grad.norm()` across layers, confirm learning rate isn't too high, verify weight initialization matches activation function (He for ReLU, Xavier for tanh), confirm normalization layers are present and correctly placed, add gradient clipping with `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)` if norms are spiking. A real training run that diverges at step 300 with a sudden loss spike almost always traces back to a learning rate that's too aggressive or an initialization that produces activations in the saturation zone. The PyTorch autograd documentation covers gradient computation mechanics; gradient clipping is the practical first line of defense once you've confirmed the norm is the problem.
The model, device, and checkpoint questions everyone gets wrong at least once
When should you use nn.Module instead of nn.Sequential, and what are interviewers testing with that question?
Sequential is fine — genuinely fine — for a linear stack of layers where input flows through each layer in order and nothing else happens. A simple feature extractor, a basic MLP, a convolutional block without branching: Sequential is cleaner and more readable than a custom Module for these cases. The design judgment question is what happens when the architecture needs to do something Sequential can't express. A residual block adds the input to the output of a two-layer stack. That requires a custom `forward()` method. An encoder-decoder with attention needs to pass the encoder's output to multiple decoder layers. Sequential has no mechanism for that. The question is testing whether the candidate knows the tool's limits, not whether they can define both classes.
How do you move models and tensors to GPU correctly, and what mistakes do candidates often make?
The mistake candidates make: moving the model but not the labels. PyTorch raises `RuntimeError: Expected all tensors to be on the same device`. The subtler version: moving inputs and labels but computing a loss that internally creates a tensor on CPU. The rule is that every tensor that participates in a computation must be on the same device. Candidates who have trained on CUDA mention the device mismatch error by name. Candidates who haven't give you the correct sequence without the failure case.
What is the difference between saving a state_dict and saving a full PyTorch model?
The portability tradeoff is the answer hiring managers want: `state_dict` is just a dictionary of tensors, so it has no dependency on your class definition. Full model saving uses `pickle` to serialize the object, which means the class must exist at the same import path when you load it. Rename the file, refactor the package, or change the class name, and `torch.load()` raises an `AttributeError` or `ModuleNotFoundError`. For checkpointing during a multi-day training run or sharing a model with a collaborator, `state_dict` is the right answer. PyTorch's checkpointing best practices make this recommendation explicit.
Transfer learning and PyTorch vs TensorFlow: the two comparison questions that still matter
PyTorch interview prep for mid and senior roles almost always includes at least one of these two questions. They're comparison questions, which means the interviewer is testing whether you have a framework for making decisions, not just a preference.
How do you explain transfer learning in PyTorch without sounding vague?
The answer that hiring managers recognize: load a pretrained model (`torchvision.models.resnet50(weights=torchvision.models.ResNet50_Weights.DEFAULT)`; the older `pretrained=True` argument is deprecated in current torchvision), freeze the backbone by iterating over parameters and setting `requires_grad=False`, replace `model.fc` with a new linear layer that matches your number of classes, and train only the head for the first few epochs. Then, if the validation metric plateaus, unfreeze the last residual block and fine-tune at a learning rate one order of magnitude smaller than the head's. The distinction between feature extraction and fine-tuning is what separates a vague answer from a specific one. A strong mid-level answer names the layer being replaced and explains why the learning rate changes during unfreezing.
How do you compare PyTorch and TensorFlow in an interview without sounding tribal?
The clean comparison: PyTorch uses eager execution by default, which makes debugging straightforward — you can inspect any tensor at any point in the forward pass with a print statement or a debugger. TensorFlow historically used static graphs (though TF2 added eager mode), which made debugging harder but enabled better ahead-of-time optimization. For deployment, TensorFlow's ecosystem — TF Serving, TFLite, TF.js — is more mature and better supported across hardware targets. For research and experimentation, PyTorch's dynamic graph and Python-native feel make iteration faster. The follow-up question — "which would you choose for a team doing novel architecture research?" — has a clear answer: PyTorch, because the debugging experience during development is significantly better.
What are the key PyTorch topics hiring managers expect beyond basic definitions?
The real screening signals, in rough order of weight: training loop correctness (zero_grad, mode switching, device alignment), debugging ability (gradient diagnosis, validation behavior, data pipeline issues), saving and loading (state_dict vs full model, full checkpoint state), data loading (Dataset, DataLoader, custom collate), transfer learning mechanics, and — for senior roles — DDP, AMP, profiling, and deployment awareness. The gap between tutorial-ready and interview-ready is the debugging and production sections. A candidate who can define every class in `torch.nn` but can't explain why a model trains but doesn't improve has prepared for the wrong test. A hiring manager who has reviewed 50 PyTorch candidates will test the debugging questions first, because those answers are the hardest to fake.
How Verve AI Can Help You Prepare for Your Interview With PyTorch
The structural problem with preparing for PyTorch interviews isn't access to information — it's that the questions that actually matter are the ones requiring live reasoning under follow-up pressure. You can read every answer in this guide and still blank when an interviewer asks "why did you make that choice?" or "what would break if you didn't do that?" Those follow-ups only get easier with practice that responds to what you actually say, not a canned script.
Verve AI Interview Copilot is built for exactly this. It listens in real-time to your answers during mock sessions and responds to what you actually said — not a generic prompt. If you give a shallow answer about `state_dict`, Verve AI Interview Copilot surfaces the follow-up about portability and code dependencies. If you explain DDP without mentioning all-reduce, it pushes on the communication mechanism. The tool runs in the background and stays invisible during live sessions, so you can use it for real-time support without breaking your focus. For PyTorch interview prep specifically, Verve AI Interview Copilot lets you drill the debugging and production questions — the ones that separate real experience from polished memorization — until the reasoning is genuinely yours.
Closing the loop on the tier problem
Not every PyTorch interview wants the same depth. A junior screen is testing whether you understand tensors, autograd, and a basic training loop without hand-waving. A mid-level screen is testing whether you can reason about why a loop behaves the way it does and debug it when it doesn't. A senior screen is testing whether you have the system-level judgment to make tradeoffs about DDP, AMP, checkpointing, and profiling — and explain your reasoning under follow-up.
The candidates who over-prepare on definitions and under-prepare on debugging fail mid-level screens. The candidates who know the API but can't explain a production training loop fail senior screens. Now you know which layer to study.
Before your interview, practice one junior answer out loud (try the autograd question — explain it to someone who hasn't used PyTorch), one mid-level answer (walk through a complete training loop and name every failure mode), and one senior answer (explain DDP vs DataParallel with the communication mechanism). Those three answers, done well, will tell an interviewer more about your actual PyTorch experience than any list of definitions.
Casey Rivera
Interview Guidance

