Pod Troubleshooting

2026-05-27·CKA k8s Practice

CKA Exam Domain 5 — Common Pod troubleshooting, CrashLoopBackOff, ImagePullBackOff, Pending state

Pods are the smallest scheduling unit in Kubernetes. Pod failures are the most common troubleshooting scenario in the CKA exam.

1. Pod Status Quick Reference

Status	Meaning
`Pending`	Pod not yet scheduled, or image being pulled
`Running`	Pod running normally
`CrashLoopBackOff`	Container repeatedly crashes and restarts
`ImagePullBackOff`	Image pull failed; the kubelet is waiting (backing off) before retrying
`ErrImagePull`	The most recent image pull attempt returned an error
`OOMKilled`	Container killed for exceeding its memory limit
`CreateContainerConfigError`	Container configuration error (e.g., ConfigMap does not exist)
`Init:Error` / `Init:CrashLoopBackOff`	Init container failed
`Terminating`	Pod is terminating (may be stuck)
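As a memory aid, the table above can be turned into a small lookup from status to the first command worth running. This is only a sketch: the function name `first_step` is made up for illustration, and the mapping is one reasonable reading of this guide rather than an official rule.

```shell
# Map a STATUS value from `kubectl get pods` to a first diagnostic command.
# first_step is a hypothetical helper; <pod-name> is a placeholder to fill in.
first_step() {
  case "$1" in
    Pending|ImagePullBackOff|ErrImagePull|CreateContainerConfigError)
      echo "kubectl describe pod <pod-name>" ;;
    CrashLoopBackOff|OOMKilled)
      echo "kubectl logs <pod-name> --previous" ;;
    Init:*)
      echo "kubectl logs <pod-name> -c <init-container-name>" ;;
    Terminating)
      echo "kubectl get pod <pod-name> -o yaml" ;;
    *)
      echo "kubectl get pod <pod-name> -o wide" ;;
  esac
}

first_step CrashLoopBackOff
first_step Init:Error
```

Whatever the status, `kubectl describe pod` remains a sensible second step because its Events section usually names the cause.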

2. CrashLoopBackOff Troubleshooting

# 1. Check Pod status
kubectl get pods

# 2. View container logs
kubectl logs <pod-name>

# 3. View logs from the previous crashed instance
kubectl logs <pod-name> --previous

# 4. View Pod details (look for error reasons in the Events section)
kubectl describe pod <pod-name>

# 5. Enter the container for inspection
kubectl exec -it <pod-name> -- /bin/sh

Common Causes:

Cause	Troubleshooting Method
Application code error	`kubectl logs` to check errors
Startup command failure	Check Dockerfile ENTRYPOINT / CMD
Configuration error	Check ConfigMap / Secret mounts
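Health check failure	Check liveness / startup probe configuration (a failing readiness probe marks the Pod unready but does not restart it)
Port conflict	Check that containers in the same Pod do not listen on the same port, and check any hostPort settings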
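The "BackOff" part of the status refers to the growing delay the kubelet inserts between restarts. The sketch below assumes the documented default schedule (10s, doubling on each restart, capped at 300s); the function name `backoff_delay` is illustrative, and clusters with non-default kubelet settings may behave differently.

```shell
# Approximate delay before the Nth restart (N starts at 1),
# assuming the default schedule: 10s, doubling each time, capped at 300s.
backoff_delay() {
  n=$1
  delay=10
  i=1
  while [ "$i" -lt "$n" ] && [ "$delay" -lt 300 ]; do
    delay=$((delay * 2))
    i=$((i + 1))
  done
  if [ "$delay" -gt 300 ]; then
    delay=300
  fi
  echo "$delay"
}

for k in 1 2 3 4 5 6 7; do
  echo "restart $k: wait $(backoff_delay "$k")s"
done
```

This is why a Pod with several restarts can sit in CrashLoopBackOff for minutes after a fix: the next restart attempt may be up to five minutes away.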

3. ImagePullBackOff / ErrImagePull Troubleshooting

# 1. View Pod details
kubectl describe pod <pod-name>

# Output will show something like:
# Failed to pull image "nginx:latst": rpc error: ...
# Error: ErrImagePull
# Back-off pulling image "nginx:latst"

Common Causes and Solutions:

Cause	Solution
Image name typo	Check the image field, e.g., `nginx:latst` should be `nginx:latest`
Image tag does not exist	Use `kubectl edit pod` to modify the tag
Private registry not authenticated	Create an ImagePullSecret
Registry unreachable	Check network connectivity
Image does not exist	Confirm image has been pushed to the registry

Private Registry Authentication:

# Create Docker registry Secret
kubectl create secret docker-registry regcred \
  --docker-server=<registry> \
  --docker-username=<user> \
  --docker-password=<pass> \
  --docker-email=<email>

# Reference in Pod
# spec:
#   imagePullSecrets:
#     - name: regcred
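To see where `imagePullSecrets` sits in a full manifest, the commented fragment above can be expanded into a complete Pod. The helper below is a sketch: `render_private_pod`, the Pod name, and the image `registry.example.com/team/app:1.0` are all placeholders invented for illustration.

```shell
# Print a Pod manifest that pulls its image using a registry Secret.
# Arguments: pod name, image, Secret name (all illustrative placeholders).
render_private_pod() {
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: $1
spec:
  imagePullSecrets:
    - name: $3
  containers:
    - name: app
      image: $2
EOF
}

render_private_pod private-app registry.example.com/team/app:1.0 regcred
```

The output can be piped into `kubectl apply -f -`. Note that the Secret must exist in the same namespace as the Pod.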

4. Pending State Troubleshooting

kubectl describe pod <pod-name>

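The Events section shows the reason for the scheduling failure (the exact wording varies by Kubernetes version):

Event message	Meaning and fix
`0/1 nodes are available: Insufficient cpu`	No node has enough unreserved CPU; lower the Pod's CPU request or add capacity
`0/1 nodes are available: Insufficient memory`	No node has enough unreserved memory; lower the Pod's memory request or add capacity
`0/1 nodes are available: node(s) had taint`	The node has a taint the Pod does not tolerate; add a toleration or remove the taint
`0/1 nodes are available: pod has unbound PVC`	The Pod's PVC is not bound or does not exist; check it with `kubectl get pvc`
`0/1 nodes are available: node(s) didn't match node selector`	No node carries the labels the Pod requires; fix the nodeSelector or label a node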
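The table above amounts to a keyword match on the FailedScheduling message. A minimal sketch of that lookup follows; `suggest_fix` is a hypothetical helper, and the sample message is illustrative rather than copied from a real cluster.

```shell
# Suggest a next step from the text of a FailedScheduling event.
# suggest_fix is an illustrative helper, not part of kubectl.
suggest_fix() {
  case "$1" in
    *"Insufficient cpu"*|*"Insufficient memory"*)
      echo "lower the Pod's resource requests or add node capacity" ;;
    *taint*)
      echo "add a matching toleration or remove the taint" ;;
    *PersistentVolumeClaim*|*PVC*)
      echo "check the claim with: kubectl get pvc" ;;
    *selector*|*affinity*)
      echo "compare labels with: kubectl get nodes --show-labels" ;;
    *)
      echo "read the full FailedScheduling event" ;;
  esac
}

suggest_fix "0/1 nodes are available: 1 Insufficient cpu."
```

Matching on substrings rather than whole messages keeps the lookup usable even when the scheduler's wording changes slightly between releases.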

Check Resources:

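# View node capacity, allocatable resources, and the "Allocated resources" summary of requests and limits
kubectl describe node <node-name>

# View actual node resource usage (requires metrics-server)
kubectl top node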

# View Pod resource requests
kubectl get pod <pod-name> -o yaml | grep -A 5 resources
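The scheduler decides "Insufficient cpu" from requests, not from live usage: a Pod fits only if its request is no larger than the node's allocatable CPU minus the requests already placed there. The arithmetic can be sketched as below; `to_millicores` and `fits_cpu` are illustrative helpers, and the numbers are invented.

```shell
# Convert a Kubernetes CPU quantity ("500m", "2", "0.5") to millicores.
to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;
    *)  awk -v c="$1" 'BEGIN { printf "%d\n", c * 1000 + 0.5 }' ;;
  esac
}

# Succeed when a new request fits into what is left on the node.
# Arguments: allocatable CPU, CPU already requested, new Pod's CPU request.
fits_cpu() {
  free=$(( $(to_millicores "$1") - $(to_millicores "$2") ))
  [ "$(to_millicores "$3")" -le "$free" ]
}

# Illustrative numbers: 2 CPUs allocatable, 1700m already requested.
if fits_cpu 2 1700m 500m; then
  echo "fits"
else
  echo "Insufficient cpu"
fi
```

With these numbers only 300m remains, so a 500m request stays Pending even if `kubectl top node` shows the node almost idle.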

5. OOMKilled (Memory Limit Exceeded)

# Status is OOMKilled
kubectl get pod
# NAME    STATUS     RESTARTS
# my-pod  OOMKilled  5

# View logs (logs may be lost after container is OOM killed)
kubectl logs <pod-name> --previous

# View container exit reason
kubectl describe pod <pod-name>
# Last State: Terminated
#   Reason: OOMKilled
#   Exit Code: 137

Solutions:

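# Increase the memory limit on the owning workload
# (the resources of an existing Pod generally cannot be changed in place)
kubectl set resources deployment <deployment-name> --limits=memory=512Mi
# Or edit the Deployment's Pod template
kubectl edit deployment <deployment-name>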

resources:
  requests:
    memory: "256Mi"
  limits:
    memory: "512Mi"
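When choosing a new limit, watch the unit suffix: `Mi` is a power-of-two unit and `M` is a power-of-ten unit, so `512M` is smaller than `512Mi`. The conversion below is a sketch that handles only integer values and a few common suffixes; `to_bytes` is an illustrative helper and the usage figure is invented.

```shell
# Convert a Kubernetes memory quantity to bytes (integer values, common suffixes only).
to_bytes() {
  case "$1" in
    *Ki) echo $(( ${1%Ki} * 1024 )) ;;
    *Mi) echo $(( ${1%Mi} * 1024 * 1024 )) ;;
    *Gi) echo $(( ${1%Gi} * 1024 * 1024 * 1024 )) ;;
    *k)  echo $(( ${1%k} * 1000 )) ;;
    *M)  echo $(( ${1%M} * 1000 * 1000 )) ;;
    *G)  echo $(( ${1%G} * 1000 * 1000 * 1000 )) ;;
    *)   echo "$1" ;;
  esac
}

echo "512Mi = $(to_bytes 512Mi) bytes"
echo "512M  = $(to_bytes 512M) bytes"

# Illustrative check: would a container using 600Mi survive a 512Mi limit?
if [ "$(to_bytes 600Mi)" -gt "$(to_bytes 512Mi)" ]; then
  echo "usage exceeds the limit: expect OOMKilled"
fi
```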

6. Init Container Failure

# View Init container status
kubectl describe pod <pod-name>

# View Init container logs
kubectl logs <pod-name> -c <init-container-name>

# View previous Init container logs
kubectl logs <pod-name> -c <init-container-name> --previous

Example:

spec:
  initContainers:
    - name: init-setup
      image: busybox
      command: ["sh", "-c", "echo 'init done'"]
  containers:
    - name: app
      image: nginx

7. Readiness / Liveness Probe Failure

kubectl describe pod <pod-name>

The Events section will show:

Warning  Unhealthy  3s (x5 over 30s)  kubelet  Liveness probe failed: HTTP probe failed with statuscode: 500
Warning  Unhealthy  10s (x3 over 50s)  kubelet  Readiness probe failed: Get "http://10.244.1.2:8080/healthz": dial tcp 10.244.1.2:8080: connect: connection refused

Troubleshooting Steps:

# 1. Confirm the application port (netstat may be missing from minimal images)
kubectl exec <pod-name> -- netstat -tlnp

# 2. Test the probe path
kubectl exec <pod-name> -- wget -qO- http://localhost:8080/healthz

# 3. Check probe configuration
kubectl get pod <pod-name> -o yaml | grep -A 15 livenessProbe
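When checking the probe configuration, it helps to estimate how long a container gets before the liveness probe restarts it. A rough upper estimate is initialDelaySeconds + periodSeconds × failureThreshold, ignoring timeouts and jitter. The sketch below uses the Kubernetes probe defaults (delay 0, period 10, threshold 3) when arguments are omitted; `restart_after` is an illustrative helper.

```shell
# Rough seconds from container start until a never-healthy container is restarted.
# Arguments (all optional): initialDelaySeconds, periodSeconds, failureThreshold.
restart_after() {
  initial_delay=${1:-0}
  period=${2:-10}
  failure_threshold=${3:-3}
  echo $(( initial_delay + period * failure_threshold ))
}

echo "defaults:                 $(restart_after)s"
echo "initialDelaySeconds=30:   $(restart_after 30 10 3)s"
```

If the application needs longer than this budget to start, the liveness probe will keep killing it; raising initialDelaySeconds or adding a startupProbe is the usual remedy.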

8. kubectl exec for Container Diagnostics

# Enter container shell
kubectl exec -it <pod-name> -- /bin/sh
kubectl exec -it <pod-name> -- /bin/bash

# Execute commands in the container
kubectl exec <pod-name> -- ls /app
kubectl exec <pod-name> -- env
kubectl exec <pod-name> -- cat /etc/config/config.yaml

# Specify container (multi-container Pod)
kubectl exec -it <pod-name> -c <container-name> -- /bin/sh

9. kubectl debug for Temporary Debug Containers

Ephemeral containers, which kubectl debug uses to attach a temporary debug container to a running Pod, have been stable since Kubernetes v1.25.

# Add a debug container to a running Pod
kubectl debug <pod-name> -it --image=busybox

# Copy a Pod and replace the image for debugging
kubectl debug <pod-name> -it --copy-to=<debug-name> --container=<container> --image=busybox

# Create a debug Pod for a node
kubectl debug node/<node-name> -it --image=busybox

10. General Troubleshooting Command Quick Reference

# Pod status overview
kubectl get pods -o wide
kubectl get pods --all-namespaces | grep -v Running

# View events
kubectl get events --sort-by='.lastTimestamp'
kubectl get events -w

# View full YAML
kubectl get pod <pod-name> -o yaml

# View events across all namespaces
kubectl get events --all-namespaces
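One limitation of `grep -v Running` is that it hides Pods that are Running but not ready (for example `0/1`) and still lists Completed Pods. A slightly stricter filter can be sketched with awk; `unhealthy_pods` is a hypothetical helper that reads `kubectl get pods` output on stdin, shown here on sample data.

```shell
# Print the header plus every Pod that needs attention:
# any status other than Running/Completed, and Running Pods with unready containers.
unhealthy_pods() {
  awk '
    NR == 1 {
      # Locate the READY and STATUS columns from the header row.
      for (i = 1; i <= NF; i++) {
        if ($i == "READY")  ready_col = i
        if ($i == "STATUS") status_col = i
      }
      print
      next
    }
    {
      split($ready_col, r, "/")
      if ($status_col == "Completed") next
      if ($status_col != "Running" || r[1] != r[2]) print
    }
  '
}

# Sample data for illustration only.
unhealthy_pods <<'EOF'
NAME          READY   STATUS             RESTARTS      AGE
nginx-crash   0/1     CrashLoopBackOff   5 (15s ago)   2m
web-app       1/1     Running            0             10m
api           1/2     Running            0             4m
backup-job    0/1     Completed          0             1h
EOF
```

On a cluster the same function would be used as `kubectl get pods --all-namespaces | unhealthy_pods`; the column lookup by header name keeps it working when a NAMESPACE column is present.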

11. Exam Key Points

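For CrashLoopBackOff, use kubectl describe for the exit code and Events, and kubectl logs --previous for the output of the crashed instance
ImagePullBackOff is most often caused by a typo in the image name or tag, or by missing registry credentials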
For Pending, check the scheduling failure reason in Events
The exit code for OOMKilled is 137
Init container logs are viewed with -c <container-name>
The Events section of kubectl describe pod is the most important diagnostic information

🧪 Complete Hands-On Example: Troubleshooting CrashLoopBackOff

Scenario

A Pod repeatedly crashes and restarts (CrashLoopBackOff). Walk through a complete troubleshooting process from start to finish, covering log inspection, configuration checking, fixing, and result verification.

Prerequisites

There is a Pod in CrashLoopBackOff state in the cluster
Permission to use kubectl logs and kubectl describe

Steps

Step 1: Identify the abnormal Pod

kubectl get pods
# NAME                      READY   STATUS             RESTARTS      AGE
# nginx-crash               0/1     CrashLoopBackOff   5 (15s ago)   2m
# web-app                   1/1     Running            0             10m

Step 2: View Pod details (look for clues in Events)

kubectl describe pod nginx-crash
# ...
# Containers:
#   nginx:
#     Container ID:   containerd://abc123
#     State:          Waiting
#       Reason:       CrashLoopBackOff
#     Last State:     Terminated
#       Reason:       Error
#       Exit Code:    1
#       Finished At:  2026-05-27T10:01:00Z
#     ...
# Events:
#   Type     Reason     Age                   From               Message
#   ----     ------     ----                  ----               -------
#   Normal   Scheduled  3m                    default-scheduler  Successfully assigned default/nginx-crash to worker-node1
#   Normal   Pulled     3m                    kubelet            Successfully pulled image "nginx:latest" in 2.345s
#   Normal   Created    3m                    kubelet            Created container nginx
#   Normal   Started    3m                    kubelet            Started container nginx
#   Warning  BackOff    15s (x5 over 2m40s)   kubelet            Back-off restarting failed container

Exit Code is 1, indicating the application process exited abnormally.

Step 3: View current instance logs

kubectl logs nginx-crash
# 2026/05/27 10:00:00 [emerg] 1#1: open() "/etc/nginx/nginx.conf" failed (2: No such file or directory)
# nginx: [emerg] open() "/etc/nginx/nginx.conf" failed (2: No such file or directory)

The logs show that Nginx cannot find its configuration file.

Step 4: View logs from the previous crashed instance (if needed)

kubectl logs nginx-crash --previous
# (Same as current logs, indicating consistent crash cause)

Step 5: Inspect the configuration in a debug copy of the Pod

# The container crashes too quickly for kubectl exec, so copy the Pod and
# override the nginx container's command with a shell that stays alive
kubectl debug nginx-crash -it --copy-to=nginx-debug --container=nginx -- /bin/bash
# From a second terminal, list the configuration directory in the copy
kubectl exec -it nginx-debug -c nginx -- ls -la /etc/nginx/
# nginx.conf is missing → configuration issue
# Delete the copy when finished
kubectl delete pod nginx-debug

Step 6: Fix the problem

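# Check the Pod (or Deployment) configuration to find the root cause
# The original Pod YAML may have mounted an incorrect ConfigMap that hid nginx.conf

# Method 1: Fix the mounted ConfigMap name or path
# nginx-crash is a bare Pod whose volumes cannot be edited in place, so recreate it:
kubectl get pod nginx-crash -o yaml > nginx-crash.yaml
# (edit nginx-crash.yaml, then)
kubectl replace --force -f nginx-crash.yaml
# If the Pod is owned by a Deployment, use kubectl edit deployment <deployment-name> instead

# Method 2: If the ConfigMap content is wrong, edit the ConfigMap
kubectl edit configmap nginx-config
# Ensure it contains the correct nginx.conf content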

Step 7: Verify Pod is running again

kubectl get pods -w
# nginx-crash               1/1     Running            0               30s
# → Pod has returned to normal running state

kubectl describe pod nginx-crash
# State:          Running
#   Started:      ...
# No new BackOff warnings in Events

Verification

# Verify Pod status is stable
kubectl get pods nginx-crash
# NAME          READY   STATUS    RESTARTS   AGE
# nginx-crash   1/1     Running   0          1m

# Verify the service responds normally
kubectl port-forward pod/nginx-crash 8080:80 &
sleep 2   # give the port-forward a moment to start
curl http://localhost:8080
# <!DOCTYPE html>
# <html>... (Nginx homepage returns normally)
kill $!   # stop the background port-forward

Exam Tips

CrashLoopBackOff troubleshooting order: kubectl describe → kubectl logs → kubectl logs --previous
Exit code meanings: 0=normal exit, 1=application error, 137=killed by SIGKILL (128+9, typically OOMKilled), 143=terminated by SIGTERM (128+15)
kubectl logs --previous shows the logs of the previous container instance, which is very valuable when the container restarts repeatedly
If the container exits too quickly to exec into, use kubectl debug to create a debug copy
Misconfigured liveness probes are also a common cause of restarts, so check the probe configuration
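The exit-code meanings above follow a simple rule: codes above 128 mean the process was killed by signal (code minus 128). A minimal decoder can be sketched as follows; `explain_exit_code` is an illustrative helper, and treating every other non-zero code as an application error is a simplification.

```shell
# Decode a container exit code as shown under "Last State" in kubectl describe pod.
explain_exit_code() {
  code=$1
  case "$code" in
    0)   echo "normal exit" ;;
    137) echo "killed by SIGKILL (signal 9), often OOMKilled" ;;
    143) echo "terminated by SIGTERM (signal 15)" ;;
    *)
      if [ "$code" -gt 128 ]; then
        echo "killed by signal $((code - 128))"
      else
        echo "application error"
      fi ;;
  esac
}

for c in 0 1 137 143; do
  echo "$c: $(explain_exit_code "$c")"
done
```

A code of 137 alone does not prove an out-of-memory kill; confirm it with the `Reason: OOMKilled` field in `kubectl describe pod`.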