etcd Backup and Restore
etcd is the core data store of Kubernetes. Mastering etcd snapshot backup, restore, and member management is a key skill for the CKA exam.
Overview
etcd is the key-value store that holds all Kubernetes cluster state (Pods, Services, ConfigMaps, and every other API resource). Backing it up and restoring it is a core hands-on topic in the CKA exam and an essential disaster recovery skill.
1. etcd Basics
1.1 etcd Architecture and Role in Kubernetes
┌─────────────────────────────────────────┐
│               API Server                │
│  (the only component that talks to      │
│   etcd directly)                        │
└────────────────────┬────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────┐
│                  etcd                   │
│  ┌──────────┬──────────┬──────────┐     │
│  │ member1  │ member2  │ member3  │     │
│  │ (leader) │(follower)│(follower)│     │
│  └──────────┴──────────┴──────────┘     │
│         Raft Consensus Protocol         │
│  A majority (floor(N/2)+1) of members   │
│  must persist a write before it returns │
└─────────────────────────────────────────┘
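The majority rule in the diagram determines how many members a cluster can lose. The following is a minimal sketch of that arithmetic; the function names `quorum` and `fault_tolerance` are illustrative and not part of any etcd tooling.

```shell
#!/usr/bin/env bash
# quorum N: minimum number of members that must acknowledge a write.
quorum() {
  local members=$1
  echo $(( members / 2 + 1 ))   # integer division rounds down
}

# fault_tolerance N: members that can fail while a majority is still reachable.
fault_tolerance() {
  local members=$1
  echo $(( members - (members / 2 + 1) ))
}

for n in 1 3 4 5; do
  echo "members=$n quorum=$(quorum "$n") tolerated_failures=$(fault_tolerance "$n")"
done
```

A 4-member cluster tolerates the same single failure as a 3-member cluster, which is why odd member counts are the usual choice.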
1.2 Key etcd Directories and Files
# etcd data directory (default)
/var/lib/etcd/
# etcd configuration file (static Pod)
/etc/kubernetes/manifests/etcd.yaml
# etcd TLS certificates
/etc/kubernetes/pki/etcd/
├── ca.crt # etcd CA certificate
├── server.crt # etcd server certificate
├── server.key # etcd server key
├── peer.crt # etcd peer certificate (cluster communication)
├── peer.key # etcd peer key
├── healthcheck-client.crt # Health check client certificate
└── healthcheck-client.key # Health check client key
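Before running any etcdctl command it can help to confirm that the three files used throughout this guide are actually readable. The helper below is a sketch; `check_etcd_pki` is a hypothetical name, and the default directory assumes the kubeadm layout shown above.

```shell
# check_etcd_pki [DIR]: report each expected file that is missing or unreadable.
# Returns non-zero if at least one file is missing.
check_etcd_pki() {
  local dir=${1:-/etc/kubernetes/pki/etcd}
  local missing=0 f
  for f in ca.crt server.crt server.key; do
    if [ ! -r "$dir/$f" ]; then
      echo "missing or unreadable: $dir/$f" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

The key files are normally readable by root only, so run the check from a root shell (for example after `sudo -i`).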
2. etcdctl Installation and Configuration
2.1 Installing etcdctl
# Method 1: Use the copy on the control plane node, if one is installed
# (kubeadm itself only ships etcdctl inside the etcd container image, so check first)
which etcdctl
# Method 2: Download the etcd binary
wget https://github.com/etcd-io/etcd/releases/download/v3.5.15/etcd-v3.5.15-linux-amd64.tar.gz
tar xzvf etcd-v3.5.15-linux-amd64.tar.gz
sudo cp etcd-v3.5.15-linux-amd64/etcdctl /usr/local/bin/
# Set environment variables (important!)
export ETCDCTL_API=3
alias etcdctl='etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key'
2.2 TLS Connection Parameters
| Parameter | Description | Typical value (kubeadm) |
|---|---|---|
| --cacert | CA certificate (verifies the etcd server) | /etc/kubernetes/pki/etcd/ca.crt |
| --cert | Client certificate (authentication) | /etc/kubernetes/pki/etcd/server.crt |
| --key | Client private key | /etc/kubernetes/pki/etcd/server.key |
| --endpoints | etcd member client URL(s) | https://127.0.0.1:2379 |
# Set an alias for convenience
alias ectl='ETCDCTL_API=3 etcdctl \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
--endpoints=https://127.0.0.1:2379'
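Aliases are convenient in an interactive shell, but bash does not expand them in non-interactive scripts by default. A shell function is one alternative that works in both places. This is a sketch: the `ETCD_PKI` and `ETCD_ENDPOINT` variables are names introduced here for illustration, with defaults taken from the table above.

```shell
# Override these two variables if your certificates or endpoint differ.
ETCD_PKI=${ETCD_PKI:-/etc/kubernetes/pki/etcd}
ETCD_ENDPOINT=${ETCD_ENDPOINT:-https://127.0.0.1:2379}

# ectl ARGS...: run etcdctl with the v3 API and the TLS flags filled in.
ectl() {
  ETCDCTL_API=3 etcdctl \
    --cacert="$ETCD_PKI/ca.crt" \
    --cert="$ETCD_PKI/server.crt" \
    --key="$ETCD_PKI/server.key" \
    --endpoints="$ETCD_ENDPOINT" \
    "$@"
}
```

With the function defined, `ectl member list -w table` or `ectl endpoint health` behaves like the alias version. Functions are not visible to `sudo`, so define and call it inside a root shell when the key files require root.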
3. etcd Snapshot Backup
3.1 Creating a Snapshot
# Basic backup command
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot.db \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
# Specify the endpoint explicitly (snapshot save reads from exactly one member, so pick one in a multi-member cluster)
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot-$(date +%Y%m%d-%H%M).db \
--endpoints=https://192.168.1.10:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
# Use alias to simplify (if already set)
ectl snapshot save /backup/etcd-snapshot-$(date +%Y%m%d).db
3.2 Verifying a Snapshot
# View snapshot status
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-snapshot.db
# Output example (etcd v3.5; <total-keys> is a placeholder):
# 2f0e0b8, 243850, <total-keys>, 1.8 MB
# (hash, revision, total keys, total size)
# Table view
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-snapshot.db -w table
# Output example (tabular format):
# +----------+----------+--------------+------------+
# |   HASH   | REVISION |  TOTAL KEYS  | TOTAL SIZE |
# +----------+----------+--------------+------------+
# | 2f0e0b8  |   243850 | <total-keys> | 1.8 MB     |
# +----------+----------+--------------+------------+
# There is no corrupted/status column; the integrity hash is verified by snapshot restore.
# Create a dated backup script
cat <<'EOF' > /usr/local/bin/backup-etcd.sh
#!/bin/bash
# Abort on the first error so old backups are never pruned after a failed snapshot
set -euo pipefail
BACKUP_DIR="/backup/etcd"
mkdir -p "$BACKUP_DIR"
DATE=$(date +%Y%m%d-%H%M%S)
ETCDCTL_API=3 etcdctl snapshot save "$BACKUP_DIR/etcd-snapshot-$DATE.db" \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
# Delete backups older than 7 days
find "$BACKUP_DIR" -name "etcd-snapshot-*.db" -mtime +7 -delete
EOF
chmod +x /usr/local/bin/backup-etcd.sh
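A scheduled backup is only useful if someone notices when it stops producing files. The check below is a sketch of one way to detect that; `snapshot_is_fresh` is a hypothetical helper, and the default threshold of 1440 minutes (one day) is an assumption to adjust to your schedule.

```shell
# snapshot_is_fresh FILE [MAX_AGE_MIN]: succeed only if FILE exists, is non-empty,
# and was modified within the last MAX_AGE_MIN minutes.
snapshot_is_fresh() {
  local file=$1 max_age_min=${2:-1440}
  [ -s "$file" ] || return 1
  # find prints the path only when the modification time is recent enough
  [ -n "$(find "$file" -mmin "-$max_age_min" -print)" ]
}
```

For example, `snapshot_is_fresh /backup/etcd/etcd-snapshot-<date>.db 90 || echo "backup is stale"` could run from a monitoring job. It checks age and size only; it does not validate the snapshot contents.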
4. etcd Snapshot Restore
4.1 Single etcd Node Restore
# Complete restore process
# 1. Stop etcd and the API Server (prevents writes during the restore)
# by moving their static Pod manifests out of the manifests directory
sudo mv /etc/kubernetes/manifests/etcd.yaml /tmp/
sudo mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
sleep 30 # Wait for the kubelet to stop the Pods
# 2. Back up the current data directory
sudo mv /var/lib/etcd /var/lib/etcd.bak
# 3. Restore from snapshot (needs root to write under /var/lib;
# etcdctl 3.5 prints a deprecation notice because etcdutl snapshot restore is the successor)
sudo ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
--data-dir=/var/lib/etcd
# 4. Set ownership, only if etcd runs as a dedicated etcd user
# (kubeadm static Pods run etcd as root by default; skip this step if the etcd user does not exist)
sudo chown -R etcd:etcd /var/lib/etcd
# 5. Restore static Pod manifests
sudo mv /tmp/etcd.yaml /etc/kubernetes/manifests/
sudo mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/
# 6. Wait for Pods to start
sleep 30
kubectl get pods -n kube-system | grep -E "etcd|kube-apiserver"
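The fixed `sleep 30` calls in the procedure above are a guess at how long the kubelet needs. A polling helper makes the wait explicit and fails loudly when it runs out of time. This is a generic sketch; `wait_until` is a name introduced here, not a kubectl or etcdctl feature.

```shell
# wait_until TIMEOUT INTERVAL CMD...: retry CMD until it succeeds,
# giving up after roughly TIMEOUT seconds.
wait_until() {
  local timeout=$1 interval=$2 waited=0
  shift 2
  until "$@" >/dev/null 2>&1; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "timed out after ${timeout}s waiting for: $*" >&2
      return 1
    fi
    sleep "$interval"
    waited=$(( waited + interval ))
  done
}
```

After moving the manifests back, `wait_until 120 5 kubectl get nodes` returns as soon as the API Server answers instead of always waiting the full interval.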
4.2 Specifying Restore Parameters
# Available parameters for snapshot restore
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
--data-dir=/var/lib/etcd-restored \
--name=etcd-0 \
--initial-cluster=etcd-0=https://192.168.1.10:2380 \
--initial-cluster-token=etcd-cluster \
--initial-advertise-peer-urls=https://192.168.1.10:2380
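Restoring into a new directory such as /var/lib/etcd-restored has no effect until etcd actually reads from it. On a kubeadm node the data directory is mounted through a hostPath volume in etcd.yaml, so one approach is to repoint that volume. The helper below is a sketch under that assumption: `retarget_etcd_data` is a hypothetical name, and it expects the default kubeadm manifest, where the volume appears as a line ending in `path: /var/lib/etcd`.

```shell
# retarget_etcd_data MANIFEST NEW_DIR: point the etcd data hostPath at NEW_DIR.
retarget_etcd_data() {
  local manifest=$1 new_dir=$2 tmp
  tmp=$(mktemp)
  # The end-of-line anchor keeps the pki hostPath untouched; the match is
  # case-sensitive, so "mountPath: /var/lib/etcd" is not rewritten either.
  sed "s|path: /var/lib/etcd\$|path: $new_dir|" "$manifest" > "$tmp" \
    && cat "$tmp" > "$manifest"
  rm -f "$tmp"
}
```

Run it on the copy staged outside the manifests directory (for example /tmp/etcd.yaml) before moving the file back. Avoid leaving backup copies inside /etc/kubernetes/manifests, because the kubelet may treat them as additional static Pods.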
4.3 Multi-Node etcd Cluster Restore
# Run the restore on every etcd node, using the same snapshot file (stop etcd on all nodes first)
# Node 1 (restored as initial cluster member)
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
--data-dir=/var/lib/etcd \
--name=etcd-1 \
--initial-cluster="etcd-1=https://192.168.1.10:2380,etcd-2=https://192.168.1.11:2380,etcd-3=https://192.168.1.12:2380" \
--initial-cluster-token=etcd-cluster-token \
--initial-advertise-peer-urls=https://192.168.1.10:2380
# Node 2
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
--data-dir=/var/lib/etcd \
--name=etcd-2 \
--initial-cluster="etcd-1=https://192.168.1.10:2380,etcd-2=https://192.168.1.11:2380,etcd-3=https://192.168.1.12:2380" \
--initial-cluster-token=etcd-cluster-token \
--initial-advertise-peer-urls=https://192.168.1.11:2380
# Node 3
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
--data-dir=/var/lib/etcd \
--name=etcd-3 \
--initial-cluster="etcd-1=https://192.168.1.10:2380,etcd-2=https://192.168.1.11:2380,etcd-3=https://192.168.1.12:2380" \
--initial-cluster-token=etcd-cluster-token \
--initial-advertise-peer-urls=https://192.168.1.12:2380
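The three commands above differ only in the member name and advertised peer URL, while the --initial-cluster value must be identical on every node. Generating them from a single member list reduces copy-and-paste mistakes. This sketch only prints the commands; `build_initial_cluster` and the `MEMBERS` variable are illustrative names, and the addresses are the example values used above.

```shell
# build_initial_cluster NAME=IP...: join members into one --initial-cluster value.
build_initial_cluster() {
  local out="" pair
  for pair in "$@"; do
    out="${out:+$out,}${pair%%=*}=https://${pair#*=}:2380"
  done
  echo "$out"
}

MEMBERS="etcd-1=192.168.1.10 etcd-2=192.168.1.11 etcd-3=192.168.1.12"
CLUSTER=$(build_initial_cluster $MEMBERS)

# Print (do not run) the restore command intended for each node.
for pair in $MEMBERS; do
  name=${pair%%=*}
  ip=${pair#*=}
  printf '# on %s\n' "$name"
  printf 'ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db'
  printf ' --data-dir=/var/lib/etcd --name=%s' "$name"
  printf ' --initial-cluster="%s"' "$CLUSTER"
  printf ' --initial-cluster-token=etcd-cluster-token'
  printf ' --initial-advertise-peer-urls=https://%s:2380\n' "$ip"
done
```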
5. etcd Member Management
5.1 Viewing Members
# List etcd cluster members
ETCDCTL_API=3 etcdctl member list \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
# View in table format
ETCDCTL_API=3 etcdctl member list -w table \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
# Output example:
# +------------------+---------+--------+---------------------------+---------------------------+
# | ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS |
# +------------------+---------+--------+---------------------------+---------------------------+
# | 8e9e05c52164694d | started | cp-1 | https://192.168.1.10:2380 | https://192.168.1.10:2379 |
# | 6a4d1c8352a47abd | started | cp-2 | https://192.168.1.11:2380 | https://192.168.1.11:2379 |
# | 4f2c7a9621c4a3ef | started | cp-3 | https://192.168.1.12:2380 | https://192.168.1.12:2379 |
# +------------------+---------+--------+---------------------------+---------------------------+
5.2 Adding/Removing Members
# Add a new member
ETCDCTL_API=3 etcdctl member add etcd-4 \
--peer-urls=https://192.168.1.13:2380 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
# Remove a member
ETCDCTL_API=3 etcdctl member remove <member-id> \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
# Update a member
ETCDCTL_API=3 etcdctl member update <member-id> \
--peer-urls=https://192.168.1.14:2380 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
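The remove and update commands need a member ID rather than a name. A small filter can look the ID up from the default comma-separated `member list` output. This is a sketch that assumes the etcd v3.4+ line format (ID, status, name, peer URLs, client URLs, learner flag); `member_id_by_name` is a hypothetical helper, and the sample lines reuse the values from the table in 5.1.

```shell
# member_id_by_name NAME: read `etcdctl member list` output on stdin
# and print the ID of the member whose name matches exactly.
member_id_by_name() {
  awk -F', ' -v name="$1" '$3 == name { print $1 }'
}

# Illustrative input in the assumed format; on a real node, pipe the
# output of `etcdctl member list` (with the TLS flags) into the function.
printf '%s\n' \
  '8e9e05c52164694d, started, cp-1, https://192.168.1.10:2380, https://192.168.1.10:2379, false' \
  '6a4d1c8352a47abd, started, cp-2, https://192.168.1.11:2380, https://192.168.1.11:2379, false' \
  | member_id_by_name cp-2
# prints 6a4d1c8352a47abd
```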
5.3 Health Check
# Check the health of a single etcd endpoint
ETCDCTL_API=3 etcdctl endpoint health \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
# Check all cluster endpoints
ETCDCTL_API=3 etcdctl endpoint health --cluster \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
# View endpoint status (including version, DB size, etc.)
ETCDCTL_API=3 etcdctl endpoint status --cluster -w table \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
6. Complete Disaster Recovery Workflow
6.1 Complete Corruption -- Single-Node etcd
# Scenario: The only etcd node's data is completely corrupted
# 1. Stop all control plane components
# (stage the manifests in a dedicated directory so unrelated /tmp/*.yaml files are not moved back later)
sudo mkdir -p /tmp/manifests-backup
sudo mv /etc/kubernetes/manifests/*.yaml /tmp/manifests-backup/
sleep 30
# 2. Delete corrupted data
sudo rm -rf /var/lib/etcd
# 3. Restore from snapshot
sudo ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
--data-dir=/var/lib/etcd
# 4. Set ownership (only if etcd runs as a dedicated etcd user; kubeadm runs it as root by default)
sudo chown -R etcd:etcd /var/lib/etcd
# 5. Restore control plane components
sudo mv /tmp/manifests-backup/*.yaml /etc/kubernetes/manifests/
# 6. Verify restoration
sleep 60
kubectl get nodes
kubectl get pods --all-namespaces
6.2 Majority etcd Node Failure -- HA Cluster
# Scenario: 2 out of 3 etcd nodes in the cluster are unrecoverable
# 1. Back up on the surviving etcd node
ETCDCTL_API=3 etcdctl snapshot save /tmp/etcd-emergency.db \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
# 2. Stop etcd on the surviving node
sudo mv /etc/kubernetes/manifests/etcd.yaml /tmp/
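# 3. Restore the snapshot as a new single-member cluster
# (snapshot restore always creates a new logical cluster; --force-new-cluster is a flag of the etcd server, not of snapshot restore)
# <node-name> and <node-ip> are placeholders: use this node's --name and peer URL from etcd.yaml
sudo ETCDCTL_API=3 etcdctl snapshot restore /tmp/etcd-emergency.db \
--data-dir=/var/lib/etcd-new \
--name=<node-name> \
--initial-cluster=<node-name>=https://<node-ip>:2380 \
--initial-advertise-peer-urls=https://<node-ip>:2380
# 4. Replace the data directory
sudo rm -rf /var/lib/etcd
sudo mv /var/lib/etcd-new /var/lib/etcd
sudo chown -R etcd:etcd /var/lib/etcd # only if etcd runs as a dedicated etcd user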
# 5. Restore the etcd static Pod
sudo mv /tmp/etcd.yaml /etc/kubernetes/manifests/
# 6. Add new etcd members one by one
ETCDCTL_API=3 etcdctl member add new-member \
--peer-urls=https://192.168.1.14:2380 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
7. Checking etcd with kubeadm
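# This kubeadm phase does not check etcd health; it (re)generates the static Pod manifest
# for a local etcd member, which helps when /etc/kubernetes/manifests/etcd.yaml is lost or damaged
sudo kubeadm init phase etcd local --config=/etc/kubernetes/kubeadm-config.yaml
# View etcd Pod logs
kubectl logs -n kube-system etcd-<node-name> --tail=100
# Enter the etcd Pod (works only if the etcd image ships a shell; otherwise exec etcdctl directly)
kubectl exec -n kube-system etcd-<node-name> -it -- sh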
CKA Exam Key Points
- Set ETCDCTL_API=3 -- etcdctl 3.3 and earlier default to the v2 API, where the snapshot command is unavailable; etcdctl 3.4 and later already default to v3, so setting it is harmless
- TLS certificate parameters -- in the exam, etcdctl must be given --cacert, --cert, and --key
- Stop the API Server before restoring -- move the etcd and kube-apiserver static Pod manifests out of the manifests directory
- Point --data-dir at the restore path -- the restored data directory must match the one in the etcd configuration
- Set ownership after restore -- sudo chown -R etcd:etcd /var/lib/etcd (needed only when etcd runs as a dedicated etcd user)
🧪 Complete Hands-on Example: etcd Backup and Disaster Recovery
Scenario Description
Take an etcd snapshot backup, then simulate a data corruption scenario and restore the cluster from the snapshot.
Prerequisites
- sudo access to the control plane node
- etcdctl installed (v3 API)
- etcd TLS certificate files present in /etc/kubernetes/pki/etcd/
Steps
Step 1: Create an etcd snapshot backup
# Set environment variable (important: must specify API=3)
export ETCDCTL_API=3
# Create backup directory
sudo mkdir -p /backup
# Execute snapshot backup
sudo ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot-$(date +%Y%m%d).db \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
# Snapshot saved at /backup/etcd-snapshot-20250527.db
Step 2: Verify the snapshot file
# Check snapshot status
sudo ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-snapshot-20250527.db
# 2f0e0b8, 243850, <total-keys>, 1.8 MB
# (hash, revision, total keys, total size; <total-keys> is a placeholder)
# View in table format
sudo ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-snapshot-20250527.db -w table
# +----------+----------+--------------+------------+
# |   HASH   | REVISION |  TOTAL KEYS  | TOTAL SIZE |
# +----------+----------+--------------+------------+
# | 2f0e0b8  |   243850 | <total-keys> | 1.8 MB     |
# +----------+----------+--------------+------------+
Step 3: Simulate a failure (stop etcd and API Server)
# Move etcd and API Server static Pod manifests out of the manifests directory
sudo mv /etc/kubernetes/manifests/etcd.yaml /tmp/
sudo mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
# Wait for Pods to completely stop
sleep 30
# Verify etcd Pod has stopped
sudo crictl ps | grep etcd
# (no output, meaning etcd has stopped)
# Delete the current etcd data directory (simulate data corruption)
sudo rm -rf /var/lib/etcd
Step 4: Restore from snapshot
# Restore from snapshot to the data directory
sudo ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot-20250527.db \
--data-dir=/var/lib/etcd
# Set ownership (only if etcd runs as a dedicated etcd user; kubeadm static Pods run etcd as root by default)
sudo chown -R etcd:etcd /var/lib/etcd
# Verify data directory has been restored
ls -la /var/lib/etcd/
# total 24
# drwx------ 4 etcd etcd 4096 May 27 10:00 .
# drwxr-xr-x 3 root root 4096 May 27 10:00 ..
# drwx------ 3 etcd etcd 4096 May 27 10:00 member
Step 5: Restore control plane components
# Move etcd and API Server manifests back
sudo mv /tmp/etcd.yaml /etc/kubernetes/manifests/
sudo mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/
# Wait for Pods to start (approximately 30-60 seconds)
sleep 60
Verification
# Verify etcd Pod is running
kubectl get pods -n kube-system | grep etcd
# etcd-control-plane-1 1/1 Running 0 1m
# Verify API Server is running
kubectl get pods -n kube-system | grep kube-apiserver
# kube-apiserver-control-plane-1 1/1 Running 0 1m
# Verify cluster resources have been restored
kubectl get nodes
kubectl get pods --all-namespaces
# All resources from before the restore should be visible
# Verify etcd health
kubectl exec -n kube-system etcd-control-plane-1 -- etcdctl \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
endpoint health
# https://127.0.0.1:2379 is healthy: successfully committed proposal: took = 3.245672ms
Exam Tips
- Set ETCDCTL_API=3 -- etcdctl 3.3 and earlier fall back to the v2 API without it, making the snapshot command unavailable; newer versions default to v3
- Stop the API Server before restoring -- this prevents writes during the restore that would cause inconsistency
- TLS certificate parameters cannot be omitted -- every etcdctl command that talks to the cluster needs --cacert, --cert, and --key
- Set ownership after restore -- when etcd runs as a dedicated etcd user, forgetting sudo chown -R etcd:etcd /var/lib/etcd prevents etcd from starting
- Use the -w table parameter for clearer etcdctl output