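etcd Backup and Restore
etcd is the core data store of Kubernetes. Taking etcd snapshots, restoring from them, and managing cluster members are key skills for the CKA exam.
Overview
etcd is the key-value store behind a Kubernetes cluster and holds all cluster state (the data for Pods, Services, ConfigMaps, and other resources). Backup and restore is a heavily tested hands-on topic in the CKA exam and a critical part of disaster recovery.
1. etcd Basics
1.1 The Role of etcd in the Kubernetes Architecture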
┌─────────────────────────────────────────┐
│               API Server                │
│ (the only component that talks to etcd) │
└────────────────────┬────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────┐
│                  etcd                   │
│  ┌──────────┬──────────┬──────────┐     │
│  │ member1  │ member2  │ member3  │     │
│  │ (leader) │(follower)│(follower)│     │
│  └──────────┴──────────┴──────────┘     │
│         Raft Consensus Protocol         │
│ A write returns only after a majority   │
│ (floor(N/2)+1) of members persists it   │
└─────────────────────────────────────────┘
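The majority rule in the diagram determines how many members a cluster can lose. The sketch below only does the arithmetic; the function names `quorum_size` and `fault_tolerance` are made up for this illustration and are not part of etcd.

```shell
#!/bin/sh
# quorum_size N: minimum number of members that must acknowledge a write,
# i.e. floor(N/2)+1 (shell integer division already rounds down).
quorum_size() {
    echo $(( $1 / 2 + 1 ))
}

# fault_tolerance N: how many members can fail while a majority remains.
fault_tolerance() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 1 3 4 5; do
    echo "members=$n quorum=$(quorum_size "$n") tolerated_failures=$(fault_tolerance "$n")"
done
```

A 4-member cluster tolerates one failure, the same as a 3-member cluster, which is why odd member counts are the usual choice.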
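1.2 Key etcd Directories and Files
# etcd data directory (default)
/var/lib/etcd/
# etcd configuration (static Pod manifest)
/etc/kubernetes/manifests/etcd.yaml
# etcd TLS certificates
/etc/kubernetes/pki/etcd/
├── ca.crt # etcd CA certificate
├── server.crt # etcd server certificate
├── server.key # etcd server private key
├── peer.crt # etcd peer certificate (member-to-member traffic)
├── peer.key # etcd peer private key
├── healthcheck-client.crt # health check client certificate
└── healthcheck-client.key # health check client private key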
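2. Installing and Configuring etcdctl
2.1 Installing etcdctl
# Option 1: use the copy on a kubeadm control plane node, if one is present.
# etcdctl is not guaranteed to be on the host PATH; it is included in the etcd container image.
which etcdctl
# Option 2: download the etcd release binaries
wget https://github.com/etcd-io/etcd/releases/download/v3.5.15/etcd-v3.5.15-linux-amd64.tar.gz
tar xzvf etcd-v3.5.15-linux-amd64.tar.gz
sudo cp etcd-v3.5.15-linux-amd64/etcdctl /usr/local/bin/
# Select the v3 API (required for etcdctl older than 3.4; 3.4 and later already default to v3)
export ETCDCTL_API=3
alias etcdctl='etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key'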
2.2 TLS Connection Parameters
| Flag | Description | Typical value on a kubeadm cluster |
|---|---|---|
| --cacert | CA certificate (verifies the etcd server) | /etc/kubernetes/pki/etcd/ca.crt |
| --cert | Client certificate (authenticates the caller) | /etc/kubernetes/pki/etcd/server.crt |
| --key | Client private key | /etc/kubernetes/pki/etcd/server.key |
| --endpoints | etcd client URL(s) | https://127.0.0.1:2379 |
# Define an alias for convenience
alias ectl='ETCDCTL_API=3 etcdctl \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
--endpoints=https://127.0.0.1:2379'
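Most etcdctl TLS failures come down to a wrong or unreadable certificate path, so it can help to check the three files before running anything else. The following is a minimal sketch; `check_etcd_certs` is a hypothetical helper name, and the default directory is the kubeadm path from the table above.

```shell
#!/bin/sh
# check_etcd_certs [DIR]: succeed only if ca.crt, server.crt and server.key
# are readable in DIR (defaults to the kubeadm PKI directory).
check_etcd_certs() {
    pki_dir=${1:-/etc/kubernetes/pki/etcd}
    missing=0
    for f in ca.crt server.crt server.key; do
        if [ ! -r "$pki_dir/$f" ]; then
            echo "missing or unreadable: $pki_dir/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `check_etcd_certs && echo "certificates present"`. The private key is normally readable by root only, so run the check with the same privileges you will use for etcdctl.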
3. etcd Snapshot Backup
3.1 Creating a Snapshot
# Basic backup command (the /backup directory must already exist)
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot.db \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
# Specify --endpoints explicitly (in a multi-member cluster, take the snapshot from one member only)
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot-$(date +%Y%m%d-%H%M).db \
--endpoints=https://192.168.1.10:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
# Shorter form using the alias (if it has been defined)
ectl snapshot save /backup/etcd-snapshot-$(date +%Y%m%d).db
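3.2 Verifying a Snapshot
# Show the snapshot status
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-snapshot.db
# Example output (<total-keys> stands for the key count):
# 2f0e0b8, 243850, <total-keys>, 1.8 MB
# (hash, revision, total keys, total size)
# Table output is easier to read
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-snapshot.db -w table
# Example output (table format):
# +----------+----------+--------------+------------+
# | HASH | REVISION | TOTAL KEYS | TOTAL SIZE |
# +----------+----------+--------------+------------+
# | 2f0e0b8 | 243850 | <total-keys> | 1.8 MB |
# +----------+----------+--------------+------------+
# There is no status or "corrupted" column; integrity is checked by snapshot restore, which fails on a hash mismatch unless --skip-hash-check is given.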
# Backup script that writes timestamped snapshots
cat <<'EOF' | sudo tee /usr/local/bin/backup-etcd.sh >/dev/null
#!/bin/bash
set -euo pipefail
BACKUP_DIR="/backup/etcd"
mkdir -p "$BACKUP_DIR"
DATE=$(date +%Y%m%d-%H%M%S)
ETCDCTL_API=3 etcdctl snapshot save "$BACKUP_DIR/etcd-snapshot-$DATE.db" \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
# Delete backups older than 7 days
find "$BACKUP_DIR" -name "etcd-snapshot-*.db" -mtime +7 -delete
EOF
sudo chmod +x /usr/local/bin/backup-etcd.sh
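The script above prunes by age. If backups stop running for a week, an age-based rule eventually deletes every snapshot, so a count-based rule is a possible alternative. This is a sketch only: `prune_snapshots` is a hypothetical helper, and it relies on the timestamped file names sorting chronologically.

```shell
#!/bin/sh
# prune_snapshots DIR KEEP: delete all but the KEEP newest
# etcd-snapshot-*.db files in DIR. "Newest" is decided by file name,
# which works because the names embed a YYYYmmdd-HHMMSS timestamp.
prune_snapshots() {
    dir=$1
    keep=$2
    for f in "$dir"/etcd-snapshot-*.db; do
        # Skip the unexpanded pattern when no snapshot exists.
        [ -e "$f" ] && printf '%s\n' "$f"
    done | sort -r | tail -n +"$((keep + 1))" |
    while IFS= read -r old; do
        rm -f -- "$old"
    done
}
```

Calling `prune_snapshots /backup/etcd 7` after each successful `snapshot save` would keep the seven most recent files regardless of their age.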
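4. Restoring from an etcd Snapshot
4.1 Restoring a Single etcd Node
# Full restore procedure
# 1. Stop etcd and the API Server so that nothing writes to etcd during the restore:
# move their static Pod manifests out of the manifests directory
sudo mv /etc/kubernetes/manifests/etcd.yaml /tmp/
sudo mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
sleep 30 # wait for the kubelet to stop the Pods
# 2. Move the current data directory aside
sudo mv /var/lib/etcd /var/lib/etcd.bak
# 3. Restore from the snapshot (if etcdctl no longer offers this subcommand, as on etcd 3.6+, use etcdutl snapshot restore)
sudo ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
--data-dir=/var/lib/etcd
# 4. Fix ownership only if etcd runs as a dedicated etcd user (the kubeadm static Pod runs as root by default)
id etcd >/dev/null 2>&1 && sudo chown -R etcd:etcd /var/lib/etcd
# 5. Move the static Pod manifests back
sudo mv /tmp/etcd.yaml /etc/kubernetes/manifests/
sudo mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/
# 6. Wait for the Pods to start
sleep 30
kubectl get pods -n kube-system | grep -E "etcd|kube-apiserver"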
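A fixed `sleep 30` is either too long or too short depending on the node. One option is to poll a condition with a timeout instead. The helper below is a sketch; `wait_until` is a name invented for this example.

```shell
#!/bin/sh
# wait_until TIMEOUT_SECONDS COMMAND...: re-run COMMAND about once per second
# until it succeeds (return 0) or TIMEOUT_SECONDS have elapsed (return 1).
wait_until() {
    timeout=$1
    shift
    elapsed=0
    while ! "$@" >/dev/null 2>&1; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}
```

After moving the manifests back, something like `wait_until 120 kubectl get --raw=/readyz` waits for the API Server to report ready and gives up after two minutes.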
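4.2 Restore Parameters
# Commonly used snapshot restore flags.
# If --data-dir is not /var/lib/etcd, the etcd-data hostPath in etcd.yaml must be pointed at the new directory.
sudo ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \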
--data-dir=/var/lib/etcd-restored \
--name=etcd-0 \
--initial-cluster=etcd-0=https://192.168.1.10:2380 \
--initial-cluster-token=etcd-cluster \
--initial-advertise-peer-urls=https://192.168.1.10:2380
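When the snapshot is restored to a new directory such as `/var/lib/etcd-restored`, the static Pod still mounts the old host path until the manifest is changed. The sketch below rewrites only the hostPath entry; it assumes the kubeadm manifest layout (a line reading `path: /var/lib/etcd` under `hostPath`), and `set_etcd_hostpath` is a hypothetical helper name.

```shell
#!/bin/sh
# set_etcd_hostpath MANIFEST NEW_DIR: point the hostPath "path: /var/lib/etcd"
# entry in MANIFEST at NEW_DIR. The pattern is anchored to the whole line, so
# "mountPath: /var/lib/etcd" and the PKI hostPath are left untouched.
set_etcd_hostpath() {
    manifest=$1
    new_dir=$2
    tmp=$(mktemp) || return 1
    sed "s|^\([[:space:]]*\)path: /var/lib/etcd\$|\1path: ${new_dir}|" \
        "$manifest" > "$tmp" && cat "$tmp" > "$manifest"
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

It is safest to run this as root against the copy of `etcd.yaml` that was moved out of the manifests directory, before moving it back. Because only the host side of the mount changes, the `--data-dir` flag inside the container can stay as it is.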
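4.3 Restoring a Multi-Node etcd Cluster
# Stop etcd on every member first, then run the restore on each node from the same snapshot file
# Node 1 (becomes a member of the new initial cluster)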
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
--data-dir=/var/lib/etcd \
--name=etcd-1 \
--initial-cluster="etcd-1=https://192.168.1.10:2380,etcd-2=https://192.168.1.11:2380,etcd-3=https://192.168.1.12:2380" \
--initial-cluster-token=etcd-cluster-token \
--initial-advertise-peer-urls=https://192.168.1.10:2380
# Node 2
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
--data-dir=/var/lib/etcd \
--name=etcd-2 \
--initial-cluster="etcd-1=https://192.168.1.10:2380,etcd-2=https://192.168.1.11:2380,etcd-3=https://192.168.1.12:2380" \
--initial-cluster-token=etcd-cluster-token \
--initial-advertise-peer-urls=https://192.168.1.11:2380
# Node 3
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
--data-dir=/var/lib/etcd \
--name=etcd-3 \
--initial-cluster="etcd-1=https://192.168.1.10:2380,etcd-2=https://192.168.1.11:2380,etcd-3=https://192.168.1.12:2380" \
--initial-cluster-token=etcd-cluster-token \
--initial-advertise-peer-urls=https://192.168.1.12:2380
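The three commands above differ only in `--name` and `--initial-advertise-peer-urls`, while `--initial-cluster` must be identical on every node. Generating them from one member list avoids copy-and-paste mistakes. This sketch prints the commands rather than running them; the `MEMBERS` variable and both function names are assumptions made for the example.

```shell
#!/bin/sh
# Member list, one "name=ip" pair per word (values taken from the example above).
MEMBERS="etcd-1=192.168.1.10 etcd-2=192.168.1.11 etcd-3=192.168.1.12"

# initial_cluster: print the comma-separated --initial-cluster value for MEMBERS.
initial_cluster() {
    out=""
    for m in $MEMBERS; do
        name=${m%%=*}
        ip=${m#*=}
        out="${out:+$out,}${name}=https://${ip}:2380"
    done
    echo "$out"
}

# restore_command NAME IP SNAPSHOT: print (not run) the restore command for one member.
restore_command() {
    echo "etcdctl snapshot restore $3 --data-dir=/var/lib/etcd --name=$1" \
         "--initial-cluster=$(initial_cluster)" \
         "--initial-cluster-token=etcd-cluster-token" \
         "--initial-advertise-peer-urls=https://$2:2380"
}

for m in $MEMBERS; do
    restore_command "${m%%=*}" "${m#*=}" /backup/etcd-snapshot.db
done
```

Each printed line can be reviewed and then executed on the matching node.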
5. etcd Member Management
5.1 Listing Members
# List the members of the etcd cluster
ETCDCTL_API=3 etcdctl member list \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
# Show the list as a table
ETCDCTL_API=3 etcdctl member list -w table \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
# Example output:
# +------------------+---------+--------+---------------------------+---------------------------+
# | ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS |
# +------------------+---------+--------+---------------------------+---------------------------+
# | 8e9e05c52164694d | started | cp-1 | https://192.168.1.10:2380 | https://192.168.1.10:2379 |
# | 6a4d1c8352a47abd | started | cp-2 | https://192.168.1.11:2380 | https://192.168.1.11:2379 |
# | 4f2c7a9621c4a3ef | started | cp-3 | https://192.168.1.12:2380 | https://192.168.1.12:2379 |
# +------------------+---------+--------+---------------------------+---------------------------+
5.2 Adding and Removing Members
# Add a new member (add one member at a time)
ETCDCTL_API=3 etcdctl member add etcd-4 \
--peer-urls=https://192.168.1.13:2380 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
# Remove a member
ETCDCTL_API=3 etcdctl member remove <member-id> \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
# Update the peer URLs of an existing member
ETCDCTL_API=3 etcdctl member update <member-id> \
--peer-urls=https://192.168.1.14:2380 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
5.3 Health Checks
# Check the health of a single etcd endpoint
ETCDCTL_API=3 etcdctl endpoint health \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
# Check every endpoint in the cluster
ETCDCTL_API=3 etcdctl endpoint health --cluster \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
# Show endpoint status (version, DB size, and so on)
ETCDCTL_API=3 etcdctl endpoint status --cluster -w table \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
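In a script it is often enough to know whether every endpoint is healthy. The sketch below summarizes the text output of `endpoint health`. It assumes healthy lines contain "is healthy" (as in the sample output later in this document) and unhealthy lines contain "is unhealthy"; `summarize_health` is a hypothetical helper name.

```shell
#!/bin/sh
# summarize_health: read `etcdctl endpoint health` text output on stdin,
# print "healthy=N unhealthy=M", and succeed only when at least one
# endpoint was reported and none of them is unhealthy.
summarize_health() {
    input=$(cat)
    healthy=$(printf '%s\n' "$input" | grep -c ' is healthy' || true)
    unhealthy=$(printf '%s\n' "$input" | grep -c ' is unhealthy' || true)
    echo "healthy=$healthy unhealthy=$unhealthy"
    [ "$healthy" -gt 0 ] && [ "$unhealthy" -eq 0 ]
}
```

A possible invocation is `etcdctl endpoint health --cluster <TLS flags> 2>&1 | summarize_health`. The `2>&1` is there because some etcdctl versions may write the health lines to standard error.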
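6. Full Disaster Recovery Procedures
6.1 Total Data Loss: Single-Node etcd
# Scenario: the data on the only etcd node is completely corrupted
# 1. Stop all control plane components (a dedicated directory keeps unrelated /tmp/*.yaml files from being moved back later)
sudo mkdir -p /tmp/manifests-backup
sudo sh -c 'mv /etc/kubernetes/manifests/*.yaml /tmp/manifests-backup/'
sleep 30
# 2. Remove the corrupted data
sudo rm -rf /var/lib/etcd
# 3. Restore from the snapshot
sudo ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
--data-dir=/var/lib/etcd
# 4. Fix ownership only if etcd runs as a dedicated etcd user
id etcd >/dev/null 2>&1 && sudo chown -R etcd:etcd /var/lib/etcd
# 5. Bring the control plane components back
sudo sh -c 'mv /tmp/manifests-backup/*.yaml /etc/kubernetes/manifests/'
# 6. Verify the recovery
sleep 60
kubectl get nodes
kubectl get pods --all-namespaces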
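6.2 Majority of etcd Nodes Lost: HA Cluster
# Scenario: 2 of the 3 etcd nodes are unrecoverable, so the cluster has lost quorum
# 1. Take a snapshot on the surviving node (if this fails because quorum is lost, use the most recent earlier snapshot)
sudo ETCDCTL_API=3 etcdctl snapshot save /tmp/etcd-emergency.db \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
# 2. Stop etcd on the surviving node
sudo mv /etc/kubernetes/manifests/etcd.yaml /tmp/
# 3. Restore into a new single-member cluster. snapshot restore has no --force-new-cluster flag
# (that is an etcd server flag); the restore itself writes fresh cluster and member IDs.
# <node-name> and <node-ip> are placeholders for this member's etcd name and peer address.
sudo ETCDCTL_API=3 etcdctl snapshot restore /tmp/etcd-emergency.db \
--data-dir=/var/lib/etcd-new \
--name=<node-name> \
--initial-cluster=<node-name>=https://<node-ip>:2380 \
--initial-advertise-peer-urls=https://<node-ip>:2380
# 4. Swap in the restored data directory
sudo rm -rf /var/lib/etcd
sudo mv /var/lib/etcd-new /var/lib/etcd
id etcd >/dev/null 2>&1 && sudo chown -R etcd:etcd /var/lib/etcd
# 5. Bring the etcd static Pod back
sudo mv /tmp/etcd.yaml /etc/kubernetes/manifests/
# 6. Add replacement etcd members one at a time; start each new member with --initial-cluster-state=existing
ETCDCTL_API=3 etcdctl member add new-member \
--peer-urls=https://192.168.1.14:2380 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key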
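7. Inspecting etcd on a kubeadm Cluster
# Note: this command is not a health check. It (re)generates the static Pod manifest for the
# local etcd member from the given kubeadm configuration and overwrites the existing etcd.yaml.
sudo kubeadm init phase etcd local --config=/etc/kubernetes/kubeadm-config.yaml
# View the etcd Pod logs
kubectl logs -n kube-system etcd-<node-name> --tail=100
# Open a shell in the etcd Pod (fails if the etcd image ships without a shell; in that case run etcdctl directly through kubectl exec)
kubectl exec -n kube-system etcd-<node-name> -it -- sh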
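CKA Exam Essentials
- Set ETCDCTL_API=3: etcdctl releases older than 3.4 default to the v2 API, which has no snapshot command (3.4 and later already default to v3, so setting it is harmless)
- TLS certificate flags: in the exam, etcdctl must be given --cacert, --cert, and --key
- Stop the API Server before restoring: move the etcd and kube-apiserver static Pod manifests out of the manifests directory
- --data-dir sets the restore target: the restored data directory must match the one in the etcd configuration
- Fix ownership after restoring when etcd runs as a dedicated user: sudo chown -R etcd:etcd /var/lib/etcd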
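In the exam the certificate paths and data directory are not always the defaults, and the etcd static Pod manifest is the authoritative place to look them up. The sketch below extracts a single flag value; `etcd_flag` is a hypothetical helper, and it assumes the kubeadm layout in which each flag sits on its own line as `- --flag=value`.

```shell
#!/bin/sh
# etcd_flag MANIFEST FLAG: print the value of "- --FLAG=value" from the etcd
# static Pod manifest. The flag name must match from the leading dashes, so
# "cert-file" does not also match "peer-cert-file".
etcd_flag() {
    sed -n "s/^[[:space:]]*- --$2=//p" "$1"
}
```

For example, `etcd_flag /etc/kubernetes/manifests/etcd.yaml trusted-ca-file` prints the CA path to pass as `--cacert`, and the `cert-file`, `key-file`, `data-dir`, and `listen-client-urls` flags can be read the same way.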
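🧪 Complete Walkthrough: etcd Backup and Disaster Recovery
Scenario
Take a snapshot backup of etcd, simulate data corruption, and then restore the cluster from the snapshot.
Prerequisites
- sudo access on the control plane node
- etcdctl installed (v3 API)
- etcd TLS certificate files present in /etc/kubernetes/pki/etcd/
Steps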
Step 1: Create an etcd snapshot backup
# Select the v3 API (needed for etcdctl older than 3.4)
export ETCDCTL_API=3
# Create the backup directory
sudo mkdir -p /backup
# Take the snapshot
sudo ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot-$(date +%Y%m%d).db \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
# Snapshot saved at /backup/etcd-snapshot-20250527.db
Step 2: Verify the snapshot file
# Check the snapshot status
sudo ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-snapshot-20250527.db
# 2f0e0b8, 243850, <total-keys>, 1.8 MB
# (hash, revision, total keys, total size)
# View it as a table
sudo ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-snapshot-20250527.db -w table
# +----------+----------+--------------+------------+
# | HASH | REVISION | TOTAL KEYS | TOTAL SIZE |
# +----------+----------+--------------+------------+
# | 2f0e0b8 | 243850 | <total-keys> | 1.8 MB |
# +----------+----------+--------------+------------+
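Step 3: Simulate the failure (stop etcd and the API Server)
# Move the etcd and API Server static Pod manifests out of the manifests directory
sudo mv /etc/kubernetes/manifests/etcd.yaml /tmp/
sudo mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
# Wait for the Pods to stop completely
sleep 30
# Confirm that the etcd container has stopped
sudo crictl ps | grep etcd
# (no output means etcd has stopped)
# Delete the current etcd data directory (simulates data corruption)
sudo rm -rf /var/lib/etcd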
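Step 4: Restore from the snapshot
# Restore the snapshot into the data directory
sudo ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot-20250527.db \
--data-dir=/var/lib/etcd
# Fix ownership only if etcd runs as a dedicated etcd user (the kubeadm static Pod runs as root by default)
id etcd >/dev/null 2>&1 && sudo chown -R etcd:etcd /var/lib/etcd
# Confirm that the data directory was restored (the owner shown depends on the chown above)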
ls -la /var/lib/etcd/
# total 24
# drwx------ 4 etcd etcd 4096 May 27 10:00 .
# drwxr-xr-x 3 root root 4096 May 27 10:00 ..
# drwx------ 3 etcd etcd 4096 May 27 10:00 member
Step 5: Bring the control plane components back
# Move the etcd and API Server manifests back
sudo mv /tmp/etcd.yaml /etc/kubernetes/manifests/
sudo mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/
# Wait for the Pods to start (roughly 30 to 60 seconds)
sleep 60
Verification
# Confirm that the etcd Pod is running
kubectl get pods -n kube-system | grep etcd
# etcd-control-plane-1 1/1 Running 0 1m
# Confirm that the API Server is running
kubectl get pods -n kube-system | grep kube-apiserver
# kube-apiserver-control-plane-1 1/1 Running 0 1m
# Confirm that the cluster resources are back
kubectl get nodes
kubectl get pods --all-namespaces
# Every resource that existed when the snapshot was taken should be listed
# Check etcd health
kubectl exec -n kube-system etcd-control-plane-1 -- etcdctl \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
endpoint health
# https://127.0.0.1:2379 is healthy: successfully committed proposal: took = 3.245672ms
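Exam Tips
- Set ETCDCTL_API=3: with etcdctl older than 3.4, forgetting it selects the v2 API and the snapshot command is unavailable
- Stop the API Server before restoring: this prevents writes during the restore from leaving the data inconsistent
- Do not omit the TLS certificate flags: every etcdctl command that talks to the server needs --cacert, --cert, and --key
- Fix ownership after restoring when etcd runs as a dedicated etcd user: in that setup etcd cannot start without sudo chown -R etcd:etcd /var/lib/etcd; the default kubeadm static Pod runs as root and does not need it
- The -w table flag makes etcdctl output easier to read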