Qingular

Node Troubleshooting


CKA Exam Domain 5 — Node NotReady troubleshooting, kubelet inspection, system resource diagnosis, certificate handling

Nodes are the worker machines of a Kubernetes cluster. Node failures directly impact Pod operations. Node troubleshooting is a high-frequency topic in the CKA exam.


1. Node NotReady Status Troubleshooting Flow

# 1. Check node status
kubectl get nodes

# 2. View node details (look at the Conditions section)
kubectl describe node <node-name>

# 3. SSH into the problematic node
ssh <user>@<node-ip>

# 4. Check kubelet status
sudo systemctl status kubelet

# 5. View kubelet logs
sudo journalctl -u kubelet -n 100 --no-pager

# 6. Check the container runtime
sudo systemctl status containerd
# or
sudo systemctl status docker

Troubleshooting flowchart:

Node NotReady
    │
    ├─ SSH to node
    │
    ├─ systemctl status kubelet
    │   ├─ inactive → systemctl start kubelet
    │   └─ active → check logs
    │
    ├─ journalctl -u kubelet -n 50
    │   ├─ Certificate error → check certificates
    │   ├─ Network plugin error → check CNI
    │   └─ Insufficient resources → check system resources
    │
    ├─ Check disk space
    ├─ Check memory
    └─ Check container runtime
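The log-reading branch of the flowchart can be sketched as a small shell filter. This is only an illustration: `next_check` is a hypothetical helper, and the match patterns are assumptions rather than a complete list of kubelet error strings.

```shell
# Map kubelet log lines (read from stdin) to the next check in the flowchart.
# The patterns below are illustrative assumptions, not an exhaustive list.
# Lines that match none of them are skipped.
next_check() {
  while IFS= read -r line; do
    case "$line" in
      *x509*|*certificate*)                        echo "check certificates" ;;
      *cni*|*CNI*|*"network plugin"*)              echo "check CNI" ;;
      *"no space left"*|*evict*|*"out of memory"*) echo "check system resources" ;;
    esac
  done
}
```

On a node it could be used as `sudo journalctl -u kubelet -n 50 --no-pager | next_check | sort | uniq -c` to see which branch the recent errors point to.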

2. kubelet Status Check

# View kubelet service status
sudo systemctl status kubelet

# Start / Stop / Restart kubelet
sudo systemctl start kubelet
sudo systemctl stop kubelet
sudo systemctl restart kubelet

# Enable kubelet to start on boot
sudo systemctl enable kubelet
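For a quick one-line answer to "is it running, and will it start on boot?", the two states can be combined. A minimal sketch, assuming a systemd host; `unit_summary` is a hypothetical helper name.

```shell
# Print whether a systemd unit is running now and enabled at boot.
# Usage: unit_summary [unit]   (defaults to kubelet)
unit_summary() {
  unit=${1:-kubelet}
  # is-active / is-enabled exit non-zero for inactive or disabled units,
  # so the exit status is ignored and only the printed state is kept.
  active=$(systemctl is-active "$unit" 2>/dev/null || true)
  enabled=$(systemctl is-enabled "$unit" 2>/dev/null || true)
  echo "$unit active=${active:-unknown} enabled=${enabled:-unknown}"
}
```

The same helper works for the container runtime, for example `unit_summary containerd`.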

3. Viewing kubelet Logs

# Follow kubelet logs live, starting from the last 100 lines (recommended)
sudo journalctl -u kubelet -n 100 -f

# View logs from a specific time range
sudo journalctl -u kubelet --since "5 min ago"

# View all logs without the pager (pipe to less or grep as needed)
sudo journalctl -u kubelet --no-pager

# Output logs to a file for analysis
sudo journalctl -u kubelet --no-pager > /tmp/kubelet.log
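Once the log is saved to a file, it often helps to keep only the error-level entries. The sketch below assumes the klog line format in which the severity letter (`E` for error) is followed by the month and day, for example `kubelet[1234]: E0527 09:55:00.123456`; `kubelet_errors` is a hypothetical helper name.

```shell
# Keep only klog error-level lines: severity letter E followed by MMDD.
# Reads journalctl-formatted kubelet log lines on stdin.
kubelet_errors() {
  grep -E 'kubelet\[[0-9]+\]: E[0-9]{4} '
}
```

For example, `kubelet_errors < /tmp/kubelet.log | tail -n 20` shows the 20 most recent error lines.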

4. kubelet Configuration Check

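# Check each node's kubelet version and container runtime (run where kubectl is configured)
kubectl get nodes -o wide

# View kubelet configuration (kubeadm deployment, on the node)
sudo cat /var/lib/kubelet/config.yaml

# kubelet startup parameters (the drop-in path can differ by distribution)
cat /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
# or
ps aux | grep kubelet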

# Check kubelet certificates
ls /var/lib/kubelet/pki/

5. System Resource Diagnosis

Disk Space

# Check disk usage
df -h

# Check the runtime data directories under /var (Docker/containerd image storage)
sudo du -sh /var/lib/containerd/
sudo du -sh /var/lib/docker/

# Clean up unused container images
sudo docker image prune -a
# or
sudo crictl rmi --prune
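To spot a nearly full filesystem without reading the whole `df` table, the output can be filtered by usage. A minimal sketch: `over_threshold` is a hypothetical helper, and the 85% limit in the usage line is an arbitrary illustration, not a kubelet default.

```shell
# Print mount points whose usage is at or above a percentage limit.
# Reads `df -P` output on stdin; -P keeps every filesystem on a single line.
# Usage: df -P | over_threshold <percent>
over_threshold() {
  awk -v limit="$1" 'NR > 1 {
    use = $5
    sub(/%/, "", use)                       # strip the trailing percent sign
    if (use + 0 >= limit) print $6 " " use "%"
  }'
}
```

For example, `df -P | over_threshold 85` prints one line per filesystem that is at least 85% full.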

Memory Usage

# Check memory
free -h

# View memory-consuming processes
top
# or
htop
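When an interactive tool such as `top` is inconvenient, the available memory can be read directly from `/proc/meminfo`. A small sketch, assuming a Linux node; `mem_available_mib` is a hypothetical helper name.

```shell
# Print the MemAvailable value in MiB.
# Reads /proc/meminfo-formatted input on stdin (values there are in kB).
mem_available_mib() {
  awk '$1 == "MemAvailable:" { printf "%d\n", $2 / 1024 }'
}
```

Usage: `mem_available_mib < /proc/meminfo`. On systems with procps, `ps aux --sort=-%mem | head` gives a non-interactive list of the largest memory consumers.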

Docker / containerd Status

# containerd (newer versions)
sudo systemctl status containerd
sudo crictl ps

# Docker (older versions)
sudo systemctl status docker
sudo docker ps

6. Handling Expired Node Certificates

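# Check certificate validity (kubeadm deployment)
# On a control-plane node: API server certificate
sudo openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -dates
# On any node: kubelet serving certificate and kubelet client certificate
sudo openssl x509 -in /var/lib/kubelet/pki/kubelet.crt -noout -dates
sudo openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -dates

# Renew control-plane certificates with kubeadm (run on a control-plane node;
# the kubelet client certificate is not included because it rotates automatically)
sudo kubeadm certs renew all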

# Update kubeconfig
sudo kubeadm init phase kubeconfig all

# Restart kubelet
sudo systemctl restart kubelet

# Check expiration of the kubeadm-managed certificates (on a control-plane node)
sudo kubeadm certs check-expiration
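Reading the dates by eye is error-prone, so the expiry check can also be scripted. The sketch below relies on `openssl x509 -checkend`, which takes a window in seconds; `cert_expires_within` is a hypothetical helper name.

```shell
# Succeed (exit 0) if the certificate expires within the given number of days.
# Usage: cert_expires_within <cert-file> <days>
cert_expires_within() {
  cert_file=$1
  days=$2
  # -checkend exits 0 when the certificate is still valid at the end of the window.
  if openssl x509 -in "$cert_file" -noout -checkend "$((days * 86400))" >/dev/null 2>&1; then
    return 1   # still valid after the window
  fi
  return 0     # expires within the window, or the file could not be read
}
```

For example: `sudo sh -c` is not needed if the function is run as root, and `cert_expires_within /var/lib/kubelet/pki/kubelet-client-current.pem 30 && echo "renew soon"` prints a warning when fewer than 30 days remain.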

7. kubectl describe node -- View Node Details

# Comprehensive node information view
kubectl describe node <node-name>

# Key areas to focus on:
# - Conditions: Ready, DiskPressure, MemoryPressure, PIDPressure
# - Capacity / Allocatable: CPU, Memory, Pod count
# - Non-terminated Pods: Pods running on this node
# - Events: Node-related events

Conditions explained:

Condition            Description
Ready                True when the node is healthy and can accept Pods
DiskPressure         True when disk space is running low
MemoryPressure       True when memory is running low
PIDPressure          True when too many processes are running on the node
NetworkUnavailable   True when the node's network is not correctly configured
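Because Ready is healthy when True while the other conditions are healthy when False, a small filter can list only the conditions that need attention. A minimal sketch; `abnormal_conditions` is a hypothetical helper that expects one `Type=Status` pair per line.

```shell
# Read "Type=Status" lines on stdin and print the conditions that indicate a problem.
# Ready is healthy when True; every other condition is healthy when False.
abnormal_conditions() {
  while IFS='=' read -r type status; do
    case "$type" in
      "")    continue ;;
      Ready) [ "$status" = "True" ]  || echo "$type=$status" ;;
      *)     [ "$status" = "False" ] || echo "$type=$status" ;;
    esac
  done
}
```

The input can be produced with a jsonpath range expression: `kubectl get node <node-name> -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}' | abnormal_conditions`.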

8. Node Recovery Steps

# Step 1: SSH to the node for diagnosis
ssh <user>@<node-ip>

# Step 2: Restart kubelet
sudo systemctl restart kubelet

# Step 3: Verify kubelet status
sudo systemctl status kubelet

# Step 4: Return to the master node and verify
kubectl get nodes
kubectl describe node <node-name>

# Step 5: If the node is still unavailable, cordon it and drain its Pods so they are rescheduled elsewhere
kubectl cordon <node-name>        # Mark as unschedulable
kubectl drain <node-name> --ignore-daemonsets  # Evict Pods
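After restarting kubelet, the node can take a short while to report Ready, so the verification in Step 4 is sometimes repeated. A sketch of a polling loop, assuming `kubectl` is configured on the machine where it runs; `wait_node_ready` and its default attempt count and delay are hypothetical choices.

```shell
# Poll the node's Ready condition until it is True or the attempts run out.
# Usage: wait_node_ready <node-name> [attempts] [delay-seconds]
wait_node_ready() {
  node=$1
  attempts=${2:-12}
  delay=${3:-5}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    status=$(kubectl get node "$node" \
      -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}' 2>/dev/null || true)
    if [ "$status" = "True" ]; then
      echo "$node is Ready"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "$node is still not Ready" >&2
  return 1
}
```

kubectl also has a built-in alternative: `kubectl wait --for=condition=Ready node/<node-name> --timeout=60s`.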

9. Exam Key Points

  • When a node is NotReady, the first step is to SSH into the node
  • journalctl -u kubelet is the most important diagnostic command
  • A full disk (/var directory) is a common cause of failure
  • After certificate expiry, use kubeadm certs renew all to renew
  • The Condition fields in kubectl describe node are key to pinpointing issues
  • The exam environment does not support rebooting nodes; focus on kubelet restarts

🧪 Complete Hands-on Example: Troubleshooting a Node NotReady Failure

Scenario

Simulate a node entering the NotReady state and walk through the complete troubleshooting flow: viewing node status, SSHing into the node, checking kubelet logs, and recovering the node.

Prerequisites

  • A cluster with a Master node and Worker nodes
  • SSH access to the Worker node
  • kubelet managed by systemd on the node

Steps

Step 1: Discover the Abnormal Node Status

kubectl get nodes
# NAME           STATUS     ROLES           AGE   VERSION
# master-node    Ready      control-plane   10d   v1.28.0
# worker-node1   NotReady   <none>          10d   v1.28.0

Step 2: View Node Details to Find Diagnostic Clues

kubectl describe node worker-node1
# ...
# Conditions:
#   Type                 Status  LastHeartbeatTime                 Reason
#   ----                 ------  -----------------                 ------
#   Ready                Unknown 2026-05-27T10:00:00Z              NodeStatusUnknown
#   ...
#   Message: Kubelet stopped posting node status.

Step 3: SSH into the Problem Node and Check kubelet Status

ssh worker-node1

sudo systemctl status kubelet
# ● kubelet.service - kubelet: The Kubernetes Node Agent
#    Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
#    Active: inactive (dead)      ← kubelet is not running

Step 4: View kubelet Logs to Determine the Root Cause

sudo journalctl -u kubelet -n 50 --no-pager
# May 27 09:55:00 worker-node1 kubelet[1234]: E0527 09:55:00.123456    1234 kubelet.go:1234] "Failed to run kubelet" err="failed to run Kubelet: misconfiguration: kubelet cgroup driver: \"systemd\" is different from docker cgroup driver: \"cgroupfs\""
# May 27 09:55:00 worker-node1 systemd[1]: kubelet.service: Main process exited, code=exited, status=1/FAILURE

Step 5: Check System Resources (Disk Space and Container Runtime)

# Check disk space
df -h
# Filesystem      Size  Used Avail Use% Mounted on
# /dev/sda1        50G   12G   35G   26% /
# → Disk space is sufficient

# Check container runtime
sudo systemctl status containerd
# ● containerd.service - Container Runtime
#    Active: active (running)
# → Container runtime is normal

Step 6: Fix the Configuration and Restart kubelet

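# The log shows a cgroup driver mismatch between kubelet and the container runtime.
# Edit the kubelet configuration (cgroupDriver in /var/lib/kubelet/config.yaml) so that
# both use the same driver; this demo assumes the edit has been made and then starts kubelet.
sudo systemctl start kubelet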

# Check startup status
sudo systemctl status kubelet
# ● kubelet.service - kubelet: The Kubernetes Node Agent
#    Active: active (running)    ← Now running

# Enable on boot (ensure it starts automatically after a reboot)
sudo systemctl enable kubelet
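Before editing anything, it can help to confirm which cgroup driver kubelet is actually configured with. A minimal sketch, assuming the kubeadm layout where the setting is a top-level `cgroupDriver:` key in the kubelet configuration file; `kubelet_cgroup_driver` is a hypothetical helper name.

```shell
# Print the cgroupDriver value from a KubeletConfiguration YAML file.
# Usage: kubelet_cgroup_driver <config-file>
kubelet_cgroup_driver() {
  awk '$1 == "cgroupDriver:" { gsub(/"/, "", $2); print $2 }' "$1"
}
```

Usage: `kubelet_cgroup_driver /var/lib/kubelet/config.yaml`. The runtime side can then be compared by hand, for example with `docker info --format '{{.CgroupDriver}}'` on Docker, or by looking for `SystemdCgroup` in the containerd configuration (commonly `/etc/containerd/config.toml`, though the location is an assumption about the node's setup).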

Step 7: Return to the Master Node and Verify Recovery

exit
# Back on the Master node

kubectl get nodes
# NAME           STATUS   ROLES           AGE   VERSION
# master-node    Ready    control-plane   10d   v1.28.0
# worker-node1   Ready    <none>          10d   v1.28.0
# → Node has recovered to normal

Verification

# Verify the node's Ready condition
kubectl get nodes worker-node1 -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
# True

# Verify kubelet is running normally
ssh worker-node1 'sudo systemctl is-active kubelet'
# active

# Confirm Pods on this node have recovered (all namespaces)
kubectl get pods -A -o wide --field-selector spec.nodeName=worker-node1
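On a busy node the Pod list can be long, so it may be easier to print only the Pods that have not recovered. A small sketch: `not_running` is a hypothetical helper that assumes the column layout of `kubectl get pods -A --no-headers`, where the fourth column is STATUS.

```shell
# Print namespace/name and status for Pods that are neither Running nor Completed.
# Reads `kubectl get pods -A -o wide --no-headers` output on stdin.
not_running() {
  awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2 " " $4 }'
}
```

Usage: `kubectl get pods -A -o wide --no-headers --field-selector spec.nodeName=worker-node1 | not_running`. Empty output means every Pod on the node is Running or Completed.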

Exam Tips

  • When a node is NotReady, the first step is to SSH into the node and check systemctl status kubelet
  • journalctl -u kubelet -n 50 is the most critical diagnostic command; it reveals the specific error messages
  • Common causes: kubelet not running, disk full (df -h), expired certificates, container runtime abnormal
  • After fixing, run systemctl restart kubelet, then return to the Master node and verify with kubectl get nodes
  • If the node remains NotReady, check the Conditions field of kubectl describe node for more information
