Practice 30 Kubernetes troubleshooting interview questions for 2026, with concise answers, command examples, and the diagnostic sequence interviewers expect.
Kubernetes Troubleshooting Interview Questions: 30 Most Asked (2026)
Kubernetes troubleshooting interviews have moved on. A few years ago, "what is a Pod?" was a reasonable opener. In 2026, interviewers hand you a broken cluster and ask you to talk through the fix — out loud, under time pressure, with follow-ups designed to see whether you actually understand the system or just memorized a blog post.
This list covers 30 questions split across fresher, mid-level, and senior tiers. Each answer is short on purpose — the diagnostic reasoning and the concrete command matter more than a paragraph of context-setting. Bookmark the tier that matches where you are right now.
What interviewers actually test
They're not testing whether you can recite the Kubernetes docs. They're testing whether you have a repeatable diagnostic process and can apply it under pressure.
The mental checklist most experienced interviewers expect to hear, in roughly this order: recent changes and rollout history, pod status and events, service and ingress configuration, application logs, resource metrics, and environment variables. If your first instinct on any troubleshooting question is to walk through that sequence — adapting it to the specific failure domain — you're already ahead of most candidates.
One more shift worth noting: 2026 interviews lean heavily on scenario prompts. You won't just be asked "what is a liveness probe?" You'll be asked what happens when one is misconfigured and how you'd find it in production.
Fresher level questions
These target candidates with 0–1 years of experience — CKA-studying, bootcamp background, or early in a DevOps role. Interviewers want to see that you know the basic diagnostic tools and can reason about common failure states.
1. What does CrashLoopBackOff mean and how do you start debugging it?
The container started, crashed, and Kubernetes is restarting it with exponential backoff. Start with `kubectl describe pod <name>` to check events and exit codes, then `kubectl logs <pod> --previous` to see the last crash output. Common causes: missing config, bad entrypoint, unresolvable dependencies.
2. How do liveness and readiness probes differ, and what happens when each fails?
A failed liveness probe restarts the container. A failed readiness probe removes the pod from Service endpoints — the container keeps running, but it stops receiving traffic. Confusing the two is one of the most common misconfigurations in production.
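A minimal manifest sketch showing the two probes side by side (the image name, endpoint paths, and port are illustrative, not from the original):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0   # placeholder image
      livenessProbe:          # failure here restarts the container
        httpGet:
          path: /healthz
          port: 8080
        initialDelaySeconds: 10
        periodSeconds: 10
      readinessProbe:         # failure here removes the pod from Service endpoints
        httpGet:
          path: /ready
          port: 8080
        periodSeconds: 5
```

A liveness probe that is stricter than the readiness probe is a classic production footgun: the container gets restarted while it was merely slow, not dead.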
3. A pod is stuck in Pending — what are the first three things you check?
Node resource availability (CPU/memory), taints and tolerations that might prevent scheduling, and PVC binding status if the pod requests persistent storage. `kubectl describe pod` will usually tell you which one it is.
4. What is the difference between kubectl logs and kubectl describe?
`kubectl logs` shows container stdout/stderr — your application output. `kubectl describe` shows the resource's spec, status, conditions, and event history. Use `describe` to understand why something happened; use `logs` to understand what the application did.
5. How do you check if a node is healthy?
`kubectl get nodes` shows high-level status. `kubectl describe node <name>` shows conditions — MemoryPressure, DiskPressure, PIDPressure, Ready. If any condition other than Ready is True, that node has a problem worth investigating.
6. What does OOMKilled mean and what signal is sent to the process?
The container exceeded its memory limit. The kernel sends SIGKILL (signal 9) — no graceful shutdown, no cleanup. Fix it by raising the memory limit, fixing the memory leak, or both.
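A hedged fragment showing where the limit lives (the values are illustrative):

```yaml
resources:
  requests:
    memory: "256Mi"   # what the scheduler reserves for the pod
  limits:
    memory: "512Mi"   # exceeding this gets the process SIGKILLed (OOMKilled)
```

In `kubectl describe pod` output, the telltale signature is a last state of `Terminated`, reason `OOMKilled`, exit code 137 (128 + signal 9).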
7. How do ConfigMaps and Secrets differ, and what breaks if you misconfigure one?
Secrets are base64-encoded (encoded, not encrypted) and intended for sensitive data; ConfigMaps hold plaintext configuration. Common breakage: referencing a key that doesn't exist prevents the pod from starting when it's a required env var, or leaves the mount path empty when it's a volume mount. `kubectl describe pod` will show the mount error in events.
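A sketch of the env-var reference that fails when the key is missing (the ConfigMap name and key are hypothetical):

```yaml
env:
  - name: DB_HOST
    valueFrom:
      configMapKeyRef:
        name: app-config   # hypothetical ConfigMap
        key: db_host       # if this key is absent, the container won't start
        # optional: true   # uncomment to tolerate a missing key
```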
8. What is a DaemonSet and when would a missing DaemonSet cause a cluster issue?
A DaemonSet runs one pod per node. CNI plugins and log collectors typically run as DaemonSets. If the CNI DaemonSet is missing or broken on a new node, that node has no pod networking. If the log collector DaemonSet is gone, you lose observability on every affected node.
Mid level questions
These target candidates with 2–4 years of experience who own deployments in production and are expected to debug issues without escalating every time.
9. A rolling update is stuck — how do you diagnose and roll back?
`kubectl rollout status deployment/<name>` shows whether it's progressing. `kubectl rollout history` shows revisions. If new pods aren't passing readiness probes, the rollout stalls. `kubectl rollout undo deployment/<name>` reverts to the previous revision.
10. Pods are running but the Service isn't routing traffic — where do you look?
Check that the Service's label selector matches the pod labels exactly. `kubectl get endpoints <service>` shows whether pods are registered. If the endpoint list is empty, it's a selector mismatch. If endpoints exist but traffic fails, check kube-proxy logs and whether the target port matches the container port.
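The three fields that have to line up, sketched in one Service (names and ports are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web          # must match the pod template's labels exactly
  ports:
    - port: 80        # the port clients connect to
      targetPort: 8080  # must match the port the container actually listens on
```

A selector typo produces an empty endpoint list; a wrong `targetPort` produces endpoints that exist but refuse connections.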
11. CoreDNS is returning NXDOMAIN — how do you debug DNS in the cluster?
`kubectl exec` into a pod and run `nslookup <service>.<namespace>.svc.cluster.local`. If that fails, check CoreDNS pod logs (`kubectl logs -n kube-system -l k8s-app=kube-dns`) and the CoreDNS ConfigMap for misconfigured forwarding rules or missing zone entries.
12. HPA isn't scaling — what are the likely causes?
If metrics-server is down, HPA decisions freeze entirely — it can't scale without metrics. Other causes: resource requests not set on the target deployment (HPA needs requests to calculate utilization), or the stabilization window (five minutes by default for scale-down) hasn't elapsed since the last scaling event.
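A minimal `autoscaling/v2` manifest for reference (the name and thresholds are illustrative); note that the utilization target is computed against the pods' CPU requests, which is why missing requests break scaling:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web            # target Deployment must set CPU requests
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # percent of requested CPU
```

`kubectl describe hpa <name>` shows the current metrics and any conditions explaining why scaling is blocked.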
13. A PVC is stuck in Pending — what do you check?
StorageClass existence and provisioner health, access mode mismatch (ReadWriteOnce vs ReadWriteOncePod — the latter, added in Kubernetes 1.22, restricts to a single pod on a single node), and whether the underlying cloud provider has capacity in the requested zone.
14. How do you debug a pod that's running but returning 502s?
Start with application logs inside the pod. Then check the readiness probe — if it's passing but the app isn't actually ready, the pod receives traffic it can't handle. Check the upstream service the pod depends on. If an Ingress controller sits in front, check its logs for backend connection errors.
15. A node goes NotReady — what is the default eviction timeline and what triggers it?
The kubelet sends heartbeats to the API server. If heartbeats stop, the node controller marks it NotReady. The default eviction timeout is 300 seconds — after that, pods on the node are scheduled for eviction. Check node conditions with `kubectl describe node` and kubelet logs on the node itself.
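The 300-second window is visible on the pods themselves: the admission machinery injects default tolerations like the following, and shrinking `tolerationSeconds` per workload shortens the eviction delay for that workload:

```yaml
tolerations:
  - key: node.kubernetes.io/not-ready
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 300   # pod is evicted 5 minutes after the node goes NotReady
  - key: node.kubernetes.io/unreachable
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 300
```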
16. How do you investigate resource pressure causing pod evictions?
`kubectl top nodes` and `kubectl describe node` show resource usage and conditions. Kubernetes evicts pods based on QoS class: BestEffort first, then Burstable, then Guaranteed last. If you're seeing unexpected evictions, check whether pods have resource requests and limits set correctly.
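The QoS class is derived entirely from the resource stanza, so eviction order is something you control per container (values are illustrative):

```yaml
# Guaranteed (evicted last): requests equal limits for every container
resources:
  requests: {cpu: 500m, memory: 256Mi}
  limits:   {cpu: 500m, memory: 256Mi}
# Burstable: requests set but lower than limits (or limits omitted)
# BestEffort (evicted first): no requests and no limits at all
```

`kubectl get pod <name> -o jsonpath='{.status.qosClass}'` shows the class Kubernetes assigned.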
17. An init container is failing — how does that affect the main container?
The main container never starts. Init containers run sequentially before the main container, and if any init container fails, the pod restarts from the first init container. Debug with `kubectl logs <pod> -c <init-container-name>` — the main container logs will be empty because it never ran.
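A common wait-for-dependency pattern, sketched with placeholder names (`db`, port `5432` are hypothetical): if the dependency never comes up, the init container loops forever and the pod sits in `Init:0/1`.

```yaml
spec:
  initContainers:
    - name: wait-for-db
      image: busybox:1.36
      command: ["sh", "-c", "until nc -z db 5432; do sleep 2; done"]
  containers:
    - name: app
      image: registry.example.com/app:1.0   # never starts until the init container exits 0
```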
18. How do you use kubectl debug to troubleshoot a running pod?
`kubectl debug -it <pod> --image=busybox --target=<container>` attaches an ephemeral container to a running pod. This is especially useful for distroless images where you can't exec into the container because there's no shell. The ephemeral container shares the pod's network and process namespace.
Senior level questions
These target 5+ year candidates — SREs, platform engineers, people who own cluster reliability and get paged when things break at scale.
19. etcd is showing high fsync latency — what is the acceptable threshold and what do you do?
WAL fsync latency should stay under 10 ms. Above that, etcd performance degrades and API server responsiveness suffers. Investigate disk I/O on the etcd nodes — check for noisy neighbors, insufficient IOPS, or network-attached storage with high latency. Monitor with etcd's built-in Prometheus metrics (`etcd_disk_wal_fsync_duration_seconds`).
20. The Cluster Autoscaler isn't adding nodes — what constraints do you check?
Node group maximum limits, taints on the node group that pending pods don't tolerate, pod anti-affinity rules that can't be satisfied with the available node types, and whether the pending pods have annotations the autoscaler can't interpret. `kubectl describe pod` on the pending pods usually reveals the scheduling constraint.
21. A StatefulSet pod is stuck after a node failure — how do you recover it?
StatefulSet pods have stable identities, and their volumes may still be attached to the failed node. Kubernetes won't delete the pod automatically — it assumes the node might come back, and running two copies of the same identity would be worse than staying stuck. You need to manually delete the pod (`kubectl delete pod <name> --force --grace-period=0`) and verify the volume can be detached and reattached on the new node.
22. NetworkPolicy is blocking traffic you expect to be allowed — how do you trace it?
The most common mistake: forgetting that once you apply any NetworkPolicy selecting a pod, all traffic not explicitly allowed is denied. Check ingress vs egress direction — a policy allowing ingress doesn't automatically allow egress. Missing namespace selectors are another frequent cause. `kubectl describe networkpolicy` shows the rules; test with `kubectl exec` and `curl` or `wget` from the affected pod.
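A sketch of the namespace-selector trap (labels and namespaces are illustrative): combining `namespaceSelector` and `podSelector` in one `from` entry means both must match, so forgetting the namespace selector silently blocks cross-namespace callers.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend
  namespace: backend
spec:
  podSelector:
    matchLabels:
      app: api            # policy applies to these pods
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - namespaceSelector:      # AND-ed with the podSelector below
            matchLabels:
              kubernetes.io/metadata.name: frontend
          podSelector:
            matchLabels:
              app: web
      ports:
        - protocol: TCP
          port: 8080
```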
23. How do you implement zero downtime deployments and what can still go wrong?
Set `maxSurge` and `maxUnavailable` appropriately, use PodDisruptionBudgets to prevent too many pods going down simultaneously, and configure readiness probes with realistic timing. What still breaks: preStop hooks that are too short (connections drain before the pod is removed from endpoints), or load balancers that cache old endpoints.
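The three pieces in manifest form, with illustrative values; the `sleep` in the preStop hook is a deliberately blunt way to keep the container alive while endpoint removal propagates:

```yaml
# Deployment: never drop below the desired replica count during a rollout
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1
    maxUnavailable: 0
---
# Container: give load balancers time to stop sending traffic before SIGTERM
lifecycle:
  preStop:
    exec:
      command: ["sleep", "10"]
---
# PDB: protect the same pods from voluntary disruptions (drains, upgrades)
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web
```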
24. How do you enforce image trust and admission control at scale?
Use a policy engine — OPA Gatekeeper or Kyverno — with policies that reject pods using unsigned images or images from untrusted registries. These run as validating (or mutating) admission webhooks. The key operational concern: if the webhook itself goes down, decide in advance whether the cluster should fail-open or fail-closed.
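One hedged sketch of the registry-restriction half, assuming Kyverno is installed (the registry name is a placeholder; signature verification would use Kyverno's `verifyImages` rules instead):

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-registries
spec:
  validationFailureAction: Enforce   # reject, rather than just audit
  rules:
    - name: trusted-registry-only
      match:
        any:
          - resources:
              kinds: ["Pod"]
      validate:
        message: "Images must come from registry.example.com"
        pattern:
          spec:
            containers:
              - image: "registry.example.com/*"
```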
25. An application can't reach an external database — what is your diagnostic path?
Start inside the pod: can it resolve the database hostname (`nslookup`)? Can it reach the host and port (`nc`, `curl`)? If DNS works but the connection times out, check NetworkPolicy egress rules, then VPN or peering configuration between the cluster network and the database network. If using an ExternalName or Endpoints service, verify the external IP is correct and the port matches.
26. How do you back up and restore etcd, and when would you need to?
`etcdctl snapshot save <file>` creates a backup. `etcdctl snapshot restore <file>` restores it. You need this when etcd data is corrupted, when you lose quorum (majority of etcd members down), or before a risky cluster upgrade. Always test the restore process in a non-production environment — a restore replaces all cluster state.
27. How do you handle multi tenancy isolation failures in a shared cluster?
Layer the defenses: namespace-scoped RBAC so tenants can't see each other's resources, ResourceQuotas to prevent one tenant from starving others, NetworkPolicies to isolate tenant traffic, and node-level isolation via taints and dedicated node pools for sensitive workloads. When isolation fails, it's usually because one of these layers was skipped or misconfigured.
28. How do you diagnose a slow application in Kubernetes without application level access?
`kubectl top pods` and `kubectl top nodes` show resource consumption. Check whether the pod is being throttled (CPU limits too low). Look at the metrics pipeline — Prometheus, if available — for request latency and error rates. If a service mesh is in place, use its observability layer (distributed traces, per-service latency dashboards) to isolate which hop is slow.
29. How do you safely upgrade a Kubernetes cluster?
Upgrade the control plane first (API server, controller manager, scheduler), then the nodes. Drain nodes one at a time (`kubectl drain --ignore-daemonsets`), upgrade the kubelet, and uncordon. Back up etcd before starting. Verify PodDisruptionBudgets won't block the drain. Test the upgrade in a staging cluster first — always.
30. Kubelet is not starting on a node — what do you check?
Check `systemctl status kubelet` and `journalctl -u kubelet` for error output. Common causes: expired certificates, misconfigured kubelet config file, container runtime (containerd/CRI-O) not running, or the node can't reach the API server. If certificates expired, regenerate them with `kubeadm certs renew` and restart the kubelet.
How to practice
Reading answers is useful. Saying them out loud under time pressure is different.
The best prep: spin up a local cluster with kind or minikube and deliberately break things. Kill a node. Exhaust memory on a pod. Misconfigure a NetworkPolicy. Then talk through your diagnostic process as if someone is listening — because in the interview, someone will be.
If you want to practice verbalizing these scenarios with real-time feedback, Verve AI's Interview Copilot lets you run mock Kubernetes troubleshooting sessions and get structured feedback on your reasoning — not just whether your answer was correct, but how you communicated it. You can try it free at vervecopilot.com.
Quick reference: the diagnostic sequence
When you don't know where to start, start here: pod status and events → container logs → node health and conditions → networking (DNS, Services, NetworkPolicy) → storage (PVC binding, access modes) → control plane (API server, etcd, kubelet). That sequence covers the majority of Kubernetes failures, and walking through it out loud in an interview shows the interviewer you have a system — not just a collection of memorized commands.
