Node Troubleshooting

2026-05-27·CKA k8s Practice

CKA Exam Domain 5 — Node NotReady troubleshooting, kubelet inspection, system resource diagnosis, certificate handling

Nodes are the worker machines of a Kubernetes cluster. Node failures directly impact Pod operations. Node troubleshooting is a high-frequency topic in the CKA exam.

1. Node NotReady Status Troubleshooting Flow

# 1. Check node status
kubectl get nodes

# 2. View node details (look at the Conditions section)
kubectl describe node <node-name>

# 3. SSH into the problematic node
ssh <user>@<node-ip>

# 4. Check kubelet status
sudo systemctl status kubelet

# 5. View kubelet logs
sudo journalctl -u kubelet -n 100 --no-pager

# 6. Check the container runtime
sudo systemctl status containerd
# or
sudo systemctl status docker
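
As a starting point for step 1, the node list can be filtered down to the nodes that need attention. The following is a minimal sketch, not part of the original notes: the function name `not_ready_nodes` is hypothetical, and it assumes the default column layout of `kubectl get nodes --no-headers` (name first, status second).

```shell
#!/usr/bin/env bash
# Print the names of nodes whose status is anything other than Ready.
# Reads `kubectl get nodes --no-headers` style output on stdin.
# A cordoned but healthy node shows "Ready,SchedulingDisabled", so only the
# part before the first comma is compared.
not_ready_nodes() {
  awk '{ split($2, s, ","); if (s[1] != "Ready") print $1 }'
}

# Usage on a live cluster (not executed here):
#   kubectl get nodes --no-headers | not_ready_nodes
```

Each name printed is a candidate for `kubectl describe node` and the SSH steps that follow.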

Troubleshooting flowchart:

Node NotReady
    │
    ├─ SSH to node
    │
    ├─ systemctl status kubelet
    │   ├─ inactive → systemctl start kubelet
    │   └─ active → check logs
    │
    ├─ journalctl -u kubelet -n 50
    │   ├─ Certificate error → check certificates
    │   ├─ Network plugin error → check CNI
    │   └─ Insufficient resources → check system resources
    │
    ├─ Check disk space
    ├─ Check memory
    └─ Check container runtime
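
The log branch of the flowchart can be approximated with a keyword scan. This is a rough sketch only: the function name `classify_kubelet_log` is hypothetical, and the keyword patterns are illustrative assumptions rather than a complete list of kubelet error messages.

```shell
#!/usr/bin/env bash
# Map kubelet log text (stdin) to the three flowchart branches.
# Matching is case-insensitive and deliberately loose; treat the result as a
# hint about where to look next, not as a diagnosis.
classify_kubelet_log() {
  awk '
    { line = tolower($0) }
    line ~ /x509|certificate/                              { cert = 1 }
    line ~ /network plugin|cni/                            { cni = 1 }
    line ~ /no space left|out of memory|oom-kill|eviction/ { res = 1 }
    END {
      if (cert) print "certificate"
      if (cni)  print "cni"
      if (res)  print "resources"
      if (!cert && !cni && !res) print "unclassified"
    }'
}

# Usage on the node (not executed here):
#   sudo journalctl -u kubelet -n 50 --no-pager | classify_kubelet_log
```

An "unclassified" result simply means none of the assumed keywords appeared; the raw log still has to be read.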

2. kubelet Status Check

# View kubelet service status
sudo systemctl status kubelet

# Start / Stop / Restart kubelet
sudo systemctl start kubelet
sudo systemctl stop kubelet
sudo systemctl restart kubelet

# Enable kubelet to start on boot
sudo systemctl enable kubelet
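
A quick way to see both the running state and the boot-time state in one line is to combine `systemctl is-active` and `systemctl is-enabled`. The sketch below uses a hypothetical helper name, `service_report`; both subcommands print a one-word state and return a non-zero exit code when the unit is not active or not enabled, which is why the exit codes are ignored.

```shell
#!/usr/bin/env bash
# Report whether a systemd unit is running and whether it starts on boot.
service_report() {
  local unit=$1 active enabled
  # Both subcommands exit non-zero for "inactive"/"disabled"; keep the text only.
  active=$(systemctl is-active "$unit" 2>/dev/null) || true
  enabled=$(systemctl is-enabled "$unit" 2>/dev/null) || true
  echo "$unit active=${active:-unknown} enabled=${enabled:-unknown}"
}

# Usage on the node (not executed here):
#   service_report kubelet
```

A unit that is active but not enabled will run now and stay down after the next reboot, which is easy to miss when only `systemctl status` is checked.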

3. Viewing kubelet Logs

# Show the last 100 lines, then follow new entries (Ctrl+C to stop)
sudo journalctl -u kubelet -n 100 -f

# View logs since a given point in time
sudo journalctl -u kubelet --since "5 min ago"

# Dump all logs without the pager (omit --no-pager to page through them)
sudo journalctl -u kubelet --no-pager

# Output logs to a file for analysis
sudo journalctl -u kubelet --no-pager > /tmp/kubelet.log
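
Once the log is captured, it often helps to keep only the error-severity lines. The kubelet writes klog-style entries whose payload starts with a severity letter and a date (for example `E0527`). The filter below is a sketch with a hypothetical function name, `kubelet_error_lines`, and it assumes the default journalctl short output format where each entry contains `kubelet[<pid>]: `.

```shell
#!/usr/bin/env bash
# Keep only kubelet entries logged at error severity (klog prefix "E" + MMDD).
# Reads journalctl output on stdin; lines from other units are dropped.
kubelet_error_lines() {
  grep -E 'kubelet\[[0-9]+\]: E[0-9]{4} '
}

# Usage on the node (not executed here):
#   kubelet_error_lines < /tmp/kubelet.log
```

Lines from systemd itself, such as "Main process exited", are filtered out, so the full log is still worth a look when the filter returns nothing.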

4. kubelet Configuration Check

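# Node overview: kubelet version, OS image, container runtime (run where kubectl is configured)
kubectl get nodes -o wide

# View kubelet configuration (kubeadm deployment; run on the node)
cat /var/lib/kubelet/config.yaml

# kubelet startup parameters (unit file plus drop-ins; the drop-in path can vary by package version)
systemctl cat kubelet
cat /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
# or
ps aux | grep kubelet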

# Check kubelet certificates
ls /var/lib/kubelet/pki/
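
When only one setting matters, such as `cgroupDriver`, reading it directly is quicker than scanning the whole file. This is a minimal sketch with a hypothetical helper, `kubelet_config_value`; it only handles simple top-level `key: value` scalars and is not a YAML parser.

```shell
#!/usr/bin/env bash
# Print the value of a top-level "key: value" entry in a kubelet config file.
# Nested keys, lists, and values containing ": " are out of scope.
kubelet_config_value() {
  local key=$1 file=${2:-/var/lib/kubelet/config.yaml}
  awk -F': *' -v k="$key" '$1 == k { print $2; exit }' "$file"
}

# Usage on the node (not executed here):
#   kubelet_config_value cgroupDriver
```

An empty result means the key is not set at the top level of the file, in which case the kubelet falls back to its built-in default.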

5. System Resource Diagnosis

Disk Space

# Check disk usage
df -h

# Check image storage under /var (only the installed runtime's directory exists)
sudo du -sh /var/lib/containerd/
sudo du -sh /var/lib/docker/

# Clean up unused container images
docker image prune -a
# or
crictl rmi --prune
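
To spot nearly full filesystems without reading the whole `df` table, the output can be filtered by usage percentage. The sketch below uses a hypothetical function, `full_filesystems`, and an arbitrary default threshold of 85%; it assumes `df -P` (POSIX format), which keeps every filesystem on a single line.

```shell
#!/usr/bin/env bash
# Print "mountpoint usage%" for filesystems at or above a usage threshold.
# Reads `df -P` output on stdin; the first line (header) is skipped.
full_filesystems() {
  local threshold=${1:-85}
  awk -v t="$threshold" 'NR > 1 {
    use = $5
    sub(/%/, "", use)            # "95%" -> "95"
    if (use + 0 >= t) print $6 " " $5
  }'
}

# Usage on the node (not executed here):
#   df -P | full_filesystems 85
```

The threshold is only illustrative; the point at which the kubelet reports DiskPressure depends on its configured eviction thresholds.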

Memory Usage

# Check memory
free -h

# View memory-consuming processes
top
# or
htop
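
For a single number instead of the full `free -h` table, the available memory can be expressed as a percentage of the total. This is a sketch under the assumption of a Linux host exposing `MemTotal` and `MemAvailable` in `/proc/meminfo`; the function name `mem_available_pct` is hypothetical.

```shell
#!/usr/bin/env bash
# Print MemAvailable as a whole-number percentage of MemTotal.
# Reads a /proc/meminfo formatted file (defaults to the live one).
mem_available_pct() {
  local file=${1:-/proc/meminfo}
  awk '/^MemTotal:/     { total = $2 }
       /^MemAvailable:/ { avail = $2 }
       END { if (total > 0) printf "%d\n", avail * 100 / total }' "$file"
}

# Usage on the node (not executed here):
#   mem_available_pct
```

A low value here is a prompt to look at `top` for the processes responsible.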

Docker / containerd Status

# containerd (the usual runtime on current kubeadm clusters)
sudo systemctl status containerd
sudo crictl ps

# Docker (clusters older than v1.24, or those using cri-dockerd)
sudo systemctl status docker
sudo docker ps

6. Handling Expired Node Certificates

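# Check certificate validity (kubeadm deployment)
# API server serving certificate (on the control plane node)
sudo openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -dates
# kubelet serving certificate (on the node)
sudo openssl x509 -in /var/lib/kubelet/pki/kubelet.crt -noout -dates
# kubelet client certificate, used to authenticate to the API server (on the node)
sudo openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -dates

# Renew control plane certificates with kubeadm (on the control plane node)
# This does not cover the kubelet client certificate, which the kubelet rotates itself
sudo kubeadm certs renew all

# Update kubeconfig
sudo kubeadm init phase kubeconfig all

# Restart kubelet (control plane static Pods must also restart to load renewed certificates)
sudo systemctl restart kubelet

# Check certificate expiration time
sudo kubeadm certs check-expiration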
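
The `notAfter=` line printed by `openssl x509 -noout -enddate` can be turned into a number of days remaining, which is easier to compare across several certificates. The following is a sketch only: `days_until_expiry` is a hypothetical helper, and it relies on GNU `date -d` to parse the OpenSSL date format, which is an assumption about the host.

```shell
#!/usr/bin/env bash
# Convert an "notAfter=<date>" line into whole days remaining.
# $1: the line from `openssl x509 -noout -enddate`
# $2: optional "now" as epoch seconds (defaults to the current time)
# A negative result means the certificate has already expired.
days_until_expiry() {
  local enddate=${1#notAfter=}
  local now=${2:-$(date +%s)}
  local end
  end=$(date -u -d "$enddate" +%s) || return 1   # GNU date assumed
  echo $(( (end - now) / 86400 ))
}

# Usage on the node (not executed here):
#   days_until_expiry "$(sudo openssl x509 -noout -enddate \
#     -in /var/lib/kubelet/pki/kubelet.crt)"
```

Passing the current time as a second argument is optional; it mainly exists so the calculation can be checked against a fixed date.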

7. kubectl describe node -- View Node Details

# Comprehensive node information view
kubectl describe node <node-name>

# Key areas to focus on:
# - Conditions: Ready, DiskPressure, MemoryPressure, PIDPressure
# - Capacity / Allocatable: CPU, Memory, Pod count
# - Non-terminated Pods: Pods running on this node
# - Events: Node-related events

Conditions explained:

Condition	Healthy value	Description
`Ready`	True	Whether the node is healthy and able to accept Pods
`DiskPressure`	False	Whether disk space is running low
`MemoryPressure`	False	Whether memory is running low
`PIDPressure`	False	Whether there are too many processes on the node
`NetworkUnavailable`	False	Whether the node's network is not correctly configured
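
Because `Ready` is healthy when True while every other condition is healthy when False, the conditions are easy to misread. The sketch below filters a list of `Type=Status` pairs down to the problematic ones; the function name `unhealthy_conditions` is hypothetical, and the input format is an assumption produced by the jsonpath expression shown in the usage comment.

```shell
#!/usr/bin/env bash
# Read "Type=Status" lines on stdin and print the ones that indicate a problem.
# Ready must be True; all other conditions must be False.
unhealthy_conditions() {
  awk -F= '
    NF < 2                         { next }                 # skip blank lines
    $1 == "Ready" && $2 != "True"  { print $1 "=" $2; next }
    $1 != "Ready" && $2 != "False" { print $1 "=" $2 }'
}

# Usage on a live cluster (not executed here):
#   kubectl get node <node-name> \
#     -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}' \
#     | unhealthy_conditions
```

A status of Unknown is reported as unhealthy for every condition, which matches the case where the kubelet has stopped posting node status.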

8. Node Recovery Steps

# Step 1: SSH to the node for diagnosis
ssh <user>@<node-ip>

# Step 2: Restart kubelet
sudo systemctl restart kubelet

# Step 3: Verify kubelet status
sudo systemctl status kubelet

# Step 4: Return to the master node and verify
kubectl get nodes
kubectl describe node <node-name>

# Step 5: If the node is still unavailable, cordon and drain it so workloads move to healthy nodes
kubectl cordon <node-name>        # Mark as unschedulable
kubectl drain <node-name> --ignore-daemonsets  # Evict Pods (add --delete-emptydir-data if Pods use emptyDir volumes)
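
After restarting the kubelet, the node can take a little while to report Ready again, so a single `kubectl get nodes` may be misleading. The following polling loop is a sketch with a hypothetical function name, `wait_for_ready`; the timeout and interval defaults are arbitrary choices.

```shell
#!/usr/bin/env bash
# Poll a node's Ready condition until it is "True" or the timeout expires.
# $1: node name, $2: timeout in seconds (default 120), $3: poll interval (default 5)
wait_for_ready() {
  local node=$1 timeout=${2:-120} interval=${3:-5} waited=0 status
  while [ "$waited" -le "$timeout" ]; do
    status=$(kubectl get node "$node" \
      -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}' 2>/dev/null) || true
    if [ "$status" = "True" ]; then
      echo "$node is Ready after ${waited}s"
      return 0
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  echo "$node still not Ready after ${timeout}s" >&2
  return 1
}

# Usage (not executed here):
#   wait_for_ready <node-name> 120 5
```

The built-in `kubectl wait --for=condition=Ready node/<node-name> --timeout=120s` does the same job in one command; the loop is shown only to make the check explicit.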

9. Exam Key Points

When a node is NotReady, the first step is to SSH into the node
journalctl -u kubelet is the most important diagnostic command
A full disk (/var directory) is a common cause of failure
If control plane certificates have expired, renew them with kubeadm certs renew all
The Condition fields in kubectl describe node are key to pinpointing issues
The exam environment does not support rebooting nodes; focus on kubelet restarts

🧪 Complete Hands-on Example: Troubleshooting a Node NotReady Failure

Scenario

Simulate a node entering the NotReady state and walk through the complete troubleshooting flow: viewing the node status, SSHing into the node, checking the kubelet logs, and recovering the node.

Prerequisites

A cluster with a Master node and Worker nodes
SSH access to the Worker node
kubelet managed by systemd on the node

Steps

Step 1: Discover the Abnormal Node Status

kubectl get nodes
# NAME           STATUS     ROLES           AGE   VERSION
# master-node    Ready      control-plane   10d   v1.28.0
# worker-node1   NotReady   <none>          10d   v1.28.0

Step 2: View Node Details to Find Diagnostic Clues

kubectl describe node worker-node1
# ...
# Conditions:
#   Type                 Status  LastHeartbeatTime                 Reason
#   ----                 ------  -----------------                 ------
#   Ready                Unknown 2026-05-27T10:00:00Z              NodeStatusUnknown
#   ...
#   Message: Kubelet stopped posting node status.

Step 3: SSH into the Problem Node and Check kubelet Status

ssh worker-node1

sudo systemctl status kubelet
# ● kubelet.service - kubelet: The Kubernetes Node Agent
#    Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
#    Active: inactive (dead)      ← kubelet is not running

Step 4: View kubelet Logs to Determine the Root Cause

sudo journalctl -u kubelet -n 50 --no-pager
# May 27 09:55:00 worker-node1 kubelet[1234]: E0527 09:55:00.123456    1234 kubelet.go:1234] "Failed to run kubelet" err="failed to run Kubelet: misconfiguration: kubelet cgroup driver: \"systemd\" is different from docker cgroup driver: \"cgroupfs\""
# May 27 09:55:00 worker-node1 systemd[1]: kubelet.service: Main process exited, code=exited, status=1/FAILURE

Step 5: Check System Resources (Disk Space and Container Runtime)

# Check disk space
df -h
# Filesystem      Size  Used Avail Use% Mounted on
# /dev/sda1        50G   12G   35G   26% /
# → Disk space is sufficient

# Check container runtime
sudo systemctl status containerd
# ● containerd.service - Container Runtime
#    Active: active (running)
# → Container runtime is normal

Step 6: Fix the Configuration and Restart kubelet

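# Based on the logs, align the cgroup drivers of the kubelet and the container runtime
# For the kubelet, this is the cgroupDriver field in /var/lib/kubelet/config.yaml
# (this demo assumes the configuration has already been fixed, then starts the kubelet)
sudo systemctl start kubelet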

# Check startup status
sudo systemctl status kubelet
# ● kubelet.service - kubelet: The Kubernetes Node Agent
#    Active: active (running)    ← Now running

# Enable on boot (ensure it starts automatically after a reboot)
sudo systemctl enable kubelet
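
The configuration edit that Step 6 skips over could look roughly like the sketch below. It is an assumption-laden example, not the fix used in this walkthrough: `set_cgroup_driver` is a hypothetical helper, it relies on GNU `sed -i`, and whether the kubelet or the container runtime should be changed depends on the cluster.

```shell
#!/usr/bin/env bash
# Set cgroupDriver in a kubelet config file, appending the key if it is absent.
# $1: driver name (for example systemd or cgroupfs)
# $2: config file (defaults to the kubeadm location)
# The original file is copied to <file>.bak before any change is made.
set_cgroup_driver() {
  local driver=$1 file=${2:-/var/lib/kubelet/config.yaml}
  cp "$file" "$file.bak" || return 1
  if grep -q '^cgroupDriver:' "$file"; then
    sed -i "s/^cgroupDriver:.*/cgroupDriver: $driver/" "$file"   # GNU sed assumed
  else
    echo "cgroupDriver: $driver" >> "$file"
  fi
}

# Usage on the node (not executed here; needs root for the default path):
#   set_cgroup_driver systemd && sudo systemctl restart kubelet
```

The value passed in has to match what the container runtime is configured to use, otherwise the kubelet will fail with the same mismatch error.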

Step 7: Return to the Master Node and Verify Recovery

exit
# Back on the Master node

kubectl get nodes
# NAME           STATUS   ROLES           AGE   VERSION
# master-node    Ready    control-plane   10d   v1.28.0
# worker-node1   Ready    <none>          10d   v1.28.0
# → Node has recovered to normal

Verification

# Verify the node's Ready condition
kubectl get nodes worker-node1 -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
# True

# Verify kubelet is running normally
ssh worker-node1 'sudo systemctl is-active kubelet'
# active

# Confirm Pods on this node have recovered (all namespaces)
kubectl get pods -A -o wide --field-selector spec.nodeName=worker-node1

Exam Tips

When a node is NotReady, the first step is to SSH into the node and check systemctl status kubelet
journalctl -u kubelet -n 50 is the most critical diagnostic command; it reveals the specific error messages
Common causes: kubelet not running, disk full (df -h), expired certificates, container runtime failure
After fixing, run systemctl restart kubelet, then return to the Master node and verify with kubectl get nodes
If the node remains NotReady, check the Conditions field of kubectl describe node for more information