Qingular

Pod Troubleshooting


CKA Exam Domain 5 — Common Pod troubleshooting, CrashLoopBackOff, ImagePullBackOff, Pending state

Pods are the smallest scheduling unit in Kubernetes. Pod failures are the most common troubleshooting scenario in the CKA exam.


1. Pod Status Quick Reference

Status                               Meaning
Pending                              Pod not yet scheduled, or its images are still being pulled
Running                              Pod running normally
CrashLoopBackOff                     Container repeatedly crashes and restarts
ImagePullBackOff                     Image pull failed; the kubelet is backing off before retrying
ErrImagePull                         Image pull attempt failed
OOMKilled                            Container killed for running out of memory (usually exceeding its memory limit)
CreateContainerConfigError           Container configuration error (e.g., ConfigMap does not exist)
Init:Error / Init:CrashLoopBackOff   Init container failed
Terminating                          Pod is terminating (may be stuck)

2. CrashLoopBackOff Troubleshooting

# 1. Check Pod status
kubectl get pods

# 2. View container logs
kubectl logs <pod-name>

# 3. View logs from the previous crashed instance
kubectl logs <pod-name> --previous

# 4. View Pod details (look for error reasons in the Events section)
kubectl describe pod <pod-name>

# 5. Enter the container for inspection (only works while the container is running)
kubectl exec -it <pod-name> -- /bin/sh
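
The steps above can be chained into one small helper. This is a sketch rather than an official tool: the function name crash_triage and the default namespace are assumptions made for illustration, and only the first container of the Pod is inspected.

```shell
# crash_triage: hypothetical helper that gathers the usual CrashLoopBackOff evidence.
# Usage: crash_triage <pod-name> [namespace]
crash_triage() {
  local pod="$1" ns="${2:-default}"
  local reason code

  # Reason and exit code of the last terminated instance (first container only)
  reason=$(kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}')
  code=$(kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}')
  echo "last termination: reason=${reason:-unknown} exitCode=${code:-unknown}"

  # Logs of the previous (crashed) instance; fall back to the current instance
  kubectl logs "$pod" -n "$ns" --previous --tail=20 2>/dev/null \
    || kubectl logs "$pod" -n "$ns" --tail=20
}
```

Calling crash_triage <pod-name> prints the last termination reason first, so an exit code such as 137 is visible before reading the logs.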

Common Causes:

Cause                     Troubleshooting Method
Application code error    Check the errors with kubectl logs
Startup command failure   Check the Dockerfile ENTRYPOINT / CMD and the Pod's command / args
Configuration error       Check ConfigMap / Secret mounts
Health check failure      Check liveness / startup probe configuration (a failing readiness probe does not restart the container)
Port conflict             Check containerPort configuration

3. ImagePullBackOff / ErrImagePull Troubleshooting

# 1. View Pod details
kubectl describe pod <pod-name>

# Output will show something like:
# Failed to pull image "nginx:latst": rpc error: ...
# Error: ErrImagePull
# Back-off pulling image "nginx:latst"

Common Causes and Solutions:

Cause                                Solution
Image name typo                      Check the image field, e.g., nginx:latst should be nginx:latest
Image tag does not exist             Use kubectl edit pod to modify the tag
Private registry not authenticated   Create an ImagePullSecret
Registry unreachable                 Check network connectivity
Image does not exist                 Confirm image has been pushed to the registry

Private Registry Authentication:

# Create Docker registry Secret
kubectl create secret docker-registry regcred \
  --docker-server=<registry> \
  --docker-username=<user> \
  --docker-password=<pass> \
  --docker-email=<email>

# Reference in Pod
# spec:
#   imagePullSecrets:
#     - name: regcred
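
For a complete picture of where imagePullSecrets sits in a Pod spec, the following sketch writes a minimal manifest to a file. The Pod name, file name, and image path are placeholders chosen for illustration; only the regcred Secret name comes from the command above.

```shell
# Write a minimal Pod manifest that pulls from a private registry using the regcred Secret.
# private-app, private-pod.yaml and the image path are placeholders, not values from this guide.
cat > private-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: private-app
spec:
  imagePullSecrets:
    - name: regcred
  containers:
    - name: app
      image: registry.example.com/team/app:1.0
EOF

# Validate locally, then create the Pod:
#   kubectl apply --dry-run=client -f private-pod.yaml
#   kubectl apply -f private-pod.yaml
```

The Secret must exist in the same namespace as the Pod, otherwise the pull still fails with ImagePullBackOff.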

4. Pending State Troubleshooting

kubectl describe pod <pod-name>

The Events section will show the reason for scheduling failure:

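Event message                                                  Cause and fix
0/1 nodes are available: Insufficient cpu                      Insufficient node CPU: lower the Pod's CPU request or free up capacity
0/1 nodes are available: Insufficient memory                   Insufficient node memory: lower the Pod's memory request or free up capacity
0/1 nodes are available: node(s) had taint                     Node has taints: add a matching toleration or remove the taint
0/1 nodes are available: pod has unbound PVC                   PVC not bound or does not exist: create or fix the PVC / PV
0/1 nodes are available: node(s) didn't match node selector    Node labels do not match: fix the nodeSelector or label a node

The exact wording of these messages varies between Kubernetes versions.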
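
As a memory aid, the mapping from scheduler message to next step can be written as a small lookup. This is an illustrative sketch: pending_hint is a hypothetical function name, and the matching is a simple substring check that returns only the first hit when a message lists several reasons.

```shell
# pending_hint: hypothetical helper mapping a FailedScheduling message to a next step.
# Usage: pending_hint "<event message>"
pending_hint() {
  case "$1" in
    *"Insufficient cpu"*)    echo "lower the CPU request or free up node CPU" ;;
    *"Insufficient memory"*) echo "lower the memory request or free up node memory" ;;
    *taint*)                 echo "add a matching toleration or remove the taint" ;;
    *unbound*)               echo "check the PVC with kubectl get pvc" ;;
    *"node selector"*|*"node affinity"*)
                             echo "fix nodeSelector or affinity, or label a node" ;;
    *)                       echo "read the full message in kubectl describe pod" ;;
  esac
}
```

The message itself can be fetched with kubectl get events --field-selector involvedObject.name=<pod-name>,reason=FailedScheduling and passed to the function as a quoted string.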

Check Resources:

# View node capacity and the requests/limits already allocated (see "Allocated resources")
kubectl describe node <node-name>

# View actual node resource usage (requires metrics-server)
kubectl top node

# View Pod resource requests
kubectl get pod <pod-name> -o yaml | grep -A 5 resources

5. OOMKilled (Out of Memory)

# Status is OOMKilled
kubectl get pod
# NAME    STATUS     RESTARTS
# my-pod  OOMKilled  5

# View logs (logs may be lost after container is OOM killed)
kubectl logs <pod-name> --previous

# View container exit reason
kubectl describe pod <pod-name>
# Last State: Terminated
#   Reason: OOMKilled
#   Exit Code: 137

Solutions:

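# Increase the memory limit on the owning Deployment (triggers a rollout).
# The resources of an existing Pod generally cannot be changed in place.
kubectl set resources deployment <deployment-name> --limits=memory=512Mi
# Or edit the Deployment and change the container's resources section
kubectl edit deployment <deployment-name>

# Under spec.template.spec.containers[]:
resources:
  requests:
    memory: "256Mi"
  limits:
    memory: "512Mi"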
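
To confirm which container was OOM killed and what limit it ran with, the two facts can be printed side by side. A minimal sketch, assuming a hypothetical helper name oom_report; the jsonpath fields are standard Pod status and spec fields.

```shell
# oom_report: hypothetical helper; prints each container's last termination reason
# followed by each container's configured memory limit.
# Usage: oom_report <pod-name> [namespace]
oom_report() {
  local pod="$1" ns="${2:-default}"

  # One line per container: name<TAB>last termination reason (empty if it never terminated)
  kubectl get pod "$pod" -n "$ns" -o \
    jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'

  # One line per container: name<TAB>memory limit (empty if no limit is set)
  kubectl get pod "$pod" -n "$ns" -o \
    jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources.limits.memory}{"\n"}{end}'
}
```

A container that shows OOMKilled next to a low limit is the one whose limit needs raising.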

6. Init Container Failure

# View Init container status
kubectl describe pod <pod-name>

# View Init container logs
kubectl logs <pod-name> -c <init-container-name>

# View previous Init container logs
kubectl logs <pod-name> -c <init-container-name> --previous

Example:

spec:
  initContainers:
    - name: init-setup
      image: busybox
      command: ["sh", "-c", "echo 'init done'"]
  containers:
    - name: app
      image: nginx

7. Readiness / Liveness Probe Failure

kubectl describe pod <pod-name>

The Events section will show:

Warning  Unhealthy  3s (x5 over 30s)  kubelet  Liveness probe failed: HTTP probe failed with statuscode: 500
Warning  Unhealthy  10s (x3 over 50s)  kubelet  Readiness probe failed: Get "http://10.244.1.2:8080/healthz": dial tcp 10.244.1.2:8080: connect: connection refused

Troubleshooting Steps:

# 1. Confirm the application port
kubectl exec <pod-name> -- netstat -tlnp

# 2. Test the probe path
kubectl exec <pod-name> -- wget -qO- http://localhost:8080/healthz

# 3. Check probe configuration
kubectl get pod <pod-name> -o yaml | grep -A 15 livenessProbe
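
For comparison while checking the configuration, here is a sketch of a Pod whose probes match the port and path seen in the events above. The Pod name, file name, image, and timing values are illustrative assumptions, not recommendations from this guide.

```shell
# Write a Pod manifest whose probes target the same port and path the application serves.
# probe-demo, probe-demo.yaml and the image path are placeholders; timings are example values.
cat > probe-demo.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: probe-demo
spec:
  containers:
    - name: app
      image: registry.example.com/team/app:1.0
      ports:
        - containerPort: 8080
      readinessProbe:
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 5
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        initialDelaySeconds: 10
        periodSeconds: 10
        failureThreshold: 3
EOF
```

A probe port or path that differs from what the application actually serves produces the "connection refused" or non-2xx failures shown in the events; an initialDelaySeconds that is too short for a slow-starting application can cause liveness restarts.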

8. kubectl exec for Container Diagnostics

# Enter container shell
kubectl exec -it <pod-name> -- /bin/sh
kubectl exec -it <pod-name> -- /bin/bash

# Execute commands in the container
kubectl exec <pod-name> -- ls /app
kubectl exec <pod-name> -- env
kubectl exec <pod-name> -- cat /etc/config/config.yaml

# Specify container (multi-container Pod)
kubectl exec -it <pod-name> -c <container-name> -- /bin/sh

9. kubectl debug for Temporary Debug Containers

Ephemeral containers, which kubectl debug adds to a running Pod as temporary debug containers, have been stable since Kubernetes v1.25.

# Add a debug container to a running Pod
kubectl debug <pod-name> -it --image=busybox

# Copy a Pod and replace the image for debugging
kubectl debug <pod-name> -it --copy-to=<debug-name> --container=<container> --image=busybox

# Create a debug Pod for a node
kubectl debug node/<node-name> -it --image=busybox

10. General Troubleshooting Command Quick Reference

# Pod status overview
kubectl get pods -o wide
kubectl get pods --all-namespaces | grep -v Running

# View events
kubectl get events --sort-by='.lastTimestamp'
kubectl get events -w

# View full YAML
kubectl get pod <pod-name> -o yaml

# View all resource events
kubectl get events --all-namespaces
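
Filtering with grep -v Running hides Pods that are Running but not ready (for example 0/1 Running). The sketch below filters on the READY column as well. The function name unhealthy_pods is hypothetical, and it assumes single-namespace kubectl get pods output where NAME, READY, and STATUS are the first three columns.

```shell
# unhealthy_pods: hypothetical filter; reads `kubectl get pods` output on stdin and
# prints the name and status of every Pod that is not fully ready.
unhealthy_pods() {
  awk 'NR > 1 {
    split($2, ready, "/")   # READY column, e.g. 0/1
    if ($3 != "Completed" && (ready[1] != ready[2] || $3 != "Running"))
      print $1, $3
  }'
}

# Usage:
#   kubectl get pods | unhealthy_pods
```

With --all-namespaces the columns shift by one (NAMESPACE comes first), so the field numbers would need adjusting.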

11. Exam Key Points

  • For CrashLoopBackOff, first check kubectl logs, then kubectl describe
  • In the exam, ImagePullBackOff is usually caused by a typo in the image name or tag
  • For Pending, check the scheduling failure reason in Events
  • The exit code for OOMKilled is 137
  • Init container logs are viewed with -c <container-name>
  • The Events section of kubectl describe pod is the most important diagnostic information

🧪 Complete Hands-On Example: Troubleshooting CrashLoopBackOff

Scenario

A Pod repeatedly crashes and restarts (CrashLoopBackOff). Walk through a complete troubleshooting process from start to finish, covering log inspection, configuration checking, fixing, and result verification.

Prerequisites

  • There is a Pod in CrashLoopBackOff state in the cluster
  • Permission to use kubectl logs and kubectl describe

Steps

Step 1: Identify the abnormal Pod

kubectl get pods
# NAME                      READY   STATUS             RESTARTS      AGE
# nginx-crash               0/1     CrashLoopBackOff   5 (15s ago)   2m
# web-app                   1/1     Running            0             10m

Step 2: View Pod details (look for clues in Events)

kubectl describe pod nginx-crash
# ...
# Containers:
#   nginx:
#     Container ID:   containerd://abc123
#     State:          Waiting
#       Reason:       CrashLoopBackOff
#     Last State:     Terminated
#       Reason:       Error
#       Exit Code:    1
#       Finished At:  2026-05-27T10:01:00Z
#     ...
# Events:
#   Type     Reason     Age                   From               Message
#   ----     ------     ----                  ----               -------
#   Normal   Scheduled  3m                    default-scheduler  Successfully assigned default/nginx-crash to worker-node1
#   Normal   Pulled     3m                    kubelet            Successfully pulled image "nginx:latest" in 2.345s
#   Normal   Created    3m                    kubelet            Created container nginx
#   Normal   Started    3m                    kubelet            Started container nginx
#   Warning  BackOff    15s (x5 over 2m40s)   kubelet            Back-off restarting failed container

Exit Code is 1, indicating the application process exited abnormally.

Step 3: View current instance logs

kubectl logs nginx-crash
# 2026/05/27 10:00:00 [emerg] 1#1: open() "/etc/nginx/nginx.conf" failed (2: No such file or directory)
# nginx: [emerg] open() "/etc/nginx/nginx.conf" failed (2: No such file or directory)

The logs show that Nginx cannot find its configuration file.

Step 4: View logs from the previous crashed instance (if needed)

kubectl logs nginx-crash --previous
# (Same as current logs, indicating consistent crash cause)

Step 5: Inspect the container filesystem through a debug copy

# The container keeps crashing, so kubectl exec cannot attach reliably.
# Create a copy of the Pod in which the nginx container runs a shell instead of
# its normal command; the copy keeps the original volume mounts.
kubectl debug nginx-crash -it --copy-to=nginx-debug --container=nginx -- /bin/bash
# Inside the debug shell:
ls -la /etc/nginx/
# nginx.conf is missing → configuration issue
# Delete the copy when finished
kubectl delete pod nginx-debug

Step 6: Fix the problem

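# Check the Pod configuration to find the root cause
# The original Pod YAML may have mounted an incorrect ConfigMap that overwrote nginx.conf

# Method 1: Fix the ConfigMap name or mount path in the Pod spec
# nginx-crash appears here as a standalone Pod, and a Pod's volumes cannot be edited
# in place, so export the manifest, fix it, and recreate the Pod
kubectl get pod nginx-crash -o yaml > nginx-crash.yaml
# Fix the mounted ConfigMap name or path in nginx-crash.yaml, then:
kubectl replace --force -f nginx-crash.yaml
# (If the Pod is managed by a Deployment, use kubectl edit deployment <deployment-name> instead)

# Method 2: If the ConfigMap content is wrong, edit the ConfigMap
kubectl edit configmap nginx-config
# Ensure it contains the correct nginx.conf content, then recreate the Pod so it picks up the change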
Step 7: Verify Pod is running again

kubectl get pods -w
# nginx-crash               1/1     Running            0               30s
# → Pod has returned to normal running state

kubectl describe pod nginx-crash
# State:          Running
#   Started:      ...
# No more CrashLoopBackOff in Events

Verification

# Verify Pod status is stable
kubectl get pods nginx-crash
# NAME          READY   STATUS    RESTARTS   AGE
# nginx-crash   1/1     Running   0          1m

# Verify service responds normally
kubectl port-forward pod/nginx-crash 8080:80 &
sleep 2   # give the port-forward a moment to start listening
curl http://localhost:8080
# <!DOCTYPE html>
# <html>...(Nginx homepage returns normally)

Exam Tips

  • CrashLoopBackOff troubleshooting order: kubectl describe → kubectl logs → kubectl logs --previous
  • Exit code meanings: 0=normal exit, 1=application error, 137=OOMKilled (SIGKILL), 143=graceful termination (SIGTERM)
  • kubectl logs --previous views logs before the crash, very valuable when the container restarts repeatedly
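  • If the container crashes too quickly to catch logs or exec into it, use kubectl debug to create a debug copy
  • Misconfigured liveness probes are also a common cause of restarts, so check the probe configuration (a failing readiness probe only marks the Pod as not ready)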
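
The exit codes listed above follow the shell convention that a code above 128 means "killed by signal (code minus 128)". A minimal sketch of that rule, using a hypothetical helper name explain_exit_code:

```shell
# explain_exit_code: hypothetical helper that decodes a container exit code.
# Usage: explain_exit_code <code>
explain_exit_code() {
  local code="$1"
  if [ "$code" -eq 0 ]; then
    echo "normal exit"
  elif [ "$code" -gt 128 ]; then
    # Codes above 128 mean the process was terminated by signal number (code - 128)
    local sig=$((code - 128))
    echo "killed by signal $sig (SIG$(kill -l "$sig"))"
  else
    echo "application error"
  fi
}

# Usage:
#   explain_exit_code 137
#   explain_exit_code 143
```

Note that 137 only says the process received SIGKILL. Whether the cause was an OOM kill is confirmed by Reason: OOMKilled in kubectl describe pod.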
