etcd (https://etcd.io/) is the database of Kubernetes; internally it uses the Raft protocol as its consensus algorithm. etcd stores all Kubernetes-related data, such as Pod and Service objects.
etcd stores data as key-value pairs, where keys look like paths: /a/b/c/d -> value
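For example, writing and then reading a value under a path-style key (a minimal sketch; /a/b/c/d is just a placeholder key):
etcdctl put /a/b/c/d "value"
etcdctl get /a/b/c/d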
Properties of etcd
Fully replicated: every node in the cluster holds a complete copy of the data
Highly available: runs as a multi-node cluster
Consistent: the Raft algorithm provides consistency and leader election
Secure: supports TLS authentication
Fast: on the order of 10,000 writes per second
Official hardware recommendations for etcd: https://etcd.io/docs/v3.5/op-guide/hardware/
- 8 CPU cores, 8 GB RAM, SSD data disk: clusters with hundreds of Pods
- 8 CPU cores, 16 GB RAM, SSD data disk: clusters with thousands of Pods
- 16 CPU cores, 32 GB RAM, SSD data disk: clusters with tens of thousands of Pods
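Disk fdatasync latency matters most for etcd. A quick way to check whether a data disk is fast enough is an fio write test along the lines of what the etcd docs suggest (a sketch; the target directory is an assumption and must already exist on the etcd data disk):
fio --rw=write --ioengine=sync --fdatasync=1 --directory=/var/lib/etcd/fio-test --size=22m --bs=2300 --name=etcd-fsync-check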
Take a look at the etcd service file:
root@etcd-2:~# vi /etc/systemd/system/etcd.service
[Unit]
Description=Etcd Server
After=network.target
After=network-online.target
Wants=network-online.target
Documentation=https://github.com/coreos
[Service]
Type=notify
WorkingDirectory=/var/lib/etcd/
ExecStart=/usr/local/bin//etcd \
--name=etcd-192.168.10.108 \
--cert-file=/etc/kubernetes/ssl/etcd.pem \
--key-file=/etc/kubernetes/ssl/etcd-key.pem \
--peer-cert-file=/etc/kubernetes/ssl/etcd.pem \
--peer-key-file=/etc/kubernetes/ssl/etcd-key.pem \
--trusted-ca-file=/etc/kubernetes/ssl/ca.pem \
--peer-trusted-ca-file=/etc/kubernetes/ssl/ca.pem \
--initial-advertise-peer-urls=https://192.168.10.108:2380 \
--listen-peer-urls=https://192.168.10.108:2380 \
--listen-client-urls=https://192.168.10.108:2379,http://127.0.0.1:2379 \
--advertise-client-urls=https://192.168.10.108:2379 \
--initial-cluster-token=etcd-cluster-0 \
--initial-cluster=etcd-192.168.10.107=https://192.168.10.107:2380,etcd-192.168.10.108=https://192.168.10.108:2380,etcd-192.168.10.109=https://192.168.10.109:2380 \
--initial-cluster-state=new \
--data-dir=/var/lib/etcd \
--wal-dir= \
--snapshot-count=50000 \
--auto-compaction-retention=1 \
--auto-compaction-mode=periodic \
--max-request-bytes=10485760 \
--quota-backend-bytes=8589934592
Restart=always
RestartSec=15
LimitNOFILE=65536
OOMScoreAdjust=-999
[Install]
WantedBy=multi-user.target
Inspect the etcd data directory:
root@etcd-2:~# ll /var/lib/etcd/
total 4
drwx------ 3 root root 20 Apr 19 17:44 ./
drwxr-xr-x 43 root root 4096 Apr 20 06:11 ../
drwx------ 4 root root 29 Apr 19 17:44 member/
root@etcd-2:~# ll /var/lib/etcd/member/
total 0
drwx------ 4 root root 29 Apr 19 17:44 ./
drwx------ 3 root root 20 Apr 19 17:44 ../
drwx------ 2 root root 246 Apr 21 17:36 snap/
drwx------ 2 root root 199 Apr 21 10:13 wal/
snap stores snapshot data
wal stores the write-ahead log (when data is inserted, the log entry is written before the data; if the log write does not succeed, the insert is considered failed, and the data can later be recovered from the log)
List the members of the etcd cluster:
root@etcd-1:~# etcdctl member list
71745e1fe53ea3d2, started, etcd-192.168.10.107, https://192.168.10.107:2380, https://192.168.10.107:2379, false
b3497c3662525c94, started, etcd-192.168.10.108, https://192.168.10.108:2380, https://192.168.10.108:2379, false
cff05c5d2e5d7019, started, etcd-192.168.10.109, https://192.168.10.109:2380, https://192.168.10.109:2379, false
Columns: ID, status, name, peer URL (cluster port 2380), client URL (client port 2379), is-learner (whether the member is still catching up on data)
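etcdctl can also print this as a table with column headers, which is easier to read (a minimal sketch; --write-out=table is a standard etcdctl flag):
etcdctl member list --write-out=table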
etcd health checks
This form only checks the local endpoint:
root@etcd-1:~# etcdctl endpoint health
127.0.0.1:2379 is healthy: successfully committed proposal: took = 3.52306ms
To monitor the whole cluster, use a for loop:
root@etcd-1:~# export NODE_IPS="192.168.10.107 192.168.10.108 192.168.10.109"
root@etcd-1:~# for ip in ${NODE_IPS}; do ETCDCTL_API=3 /usr/local/bin/etcdctl --endpoints=https://${ip}:2379 --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/kubernetes/ssl/etcd.pem --key=/etc/kubernetes/ssl/etcd-key.pem endpoint health; done
https://192.168.10.107:2379 is healthy: successfully committed proposal: took = 15.070727ms
https://192.168.10.108:2379 is healthy: successfully committed proposal: took = 9.874537ms
https://192.168.10.109:2379 is healthy: successfully committed proposal: took = 8.872484ms
Output in table form:
root@etcd-1:~# for ip in ${NODE_IPS}; do ETCDCTL_API=3 /usr/local/bin/etcdctl --write-out=table --endpoints=https://${ip}:2379 --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/kubernetes/ssl/etcd.pem --key=/etc/kubernetes/ssl/etcd-key.pem endpoint status; done
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://192.168.10.107:2379 | 71745e1fe53ea3d2 | 3.4.13 | 2.7 MB | false | false | 4 | 521493 | 521493 | |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://192.168.10.108:2379 | b3497c3662525c94 | 3.4.13 | 2.7 MB | false | false | 4 | 521493 | 521493 | |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://192.168.10.109:2379 | cff05c5d2e5d7019 | 3.4.13 | 2.7 MB | true | false | 4 | 521493 | 521493 | |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
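Recent etcdctl versions can also query every member in a single call with the --cluster flag, instead of looping over ${NODE_IPS} (a sketch, assuming the same certificate paths as above):
ETCDCTL_API=3 /usr/local/bin/etcdctl --write-out=table --endpoints=https://192.168.10.107:2379 --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/kubernetes/ssl/etcd.pem --key=/etc/kubernetes/ssl/etcd-key.pem endpoint status --cluster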
Basic etcd operations
List all keys in etcd:
root@etcd-1:~# etcdctl get / --prefix --keys-only
/calico/ipam/v2/assignment/ipv4/block/10.200.205.192-26
/calico/ipam/v2/assignment/ipv4/block/10.200.247.0-26
/calico/ipam/v2/assignment/ipv4/block/10.200.39.0-26
/calico/ipam/v2/assignment/ipv4/block/10.200.84.128-26
/calico/ipam/v2/handle/ipip-tunnel-addr-master-1
/calico/ipam/v2/handle/ipip-tunnel-addr-master-2
/calico/ipam/v2/handle/ipip-tunnel-addr-node-1
/calico/ipam/v2/handle/ipip-tunnel-addr-node-2
/calico/ipam/v2/handle/k8s-pod-network.3844a5799fbfdd20ab3ee16c6b176626d04c635ed8aa57a36d9e43a11b028713
/calico/ipam/v2/handle/k8s-pod-network.52d7e2ca8546bf0739c79c425ea421c63be1653fe74811c2d4b6c9242111fb22
/calico/ipam/v2/handle/k8s-pod-network.ab0b92bfc89fef7eb4486080bff1aa4e6f28109a105a70aceafb325d1d514d23
/calico/ipam/v2/handle/k8s-pod-network.c8dc5605cd5ed0a43a6169cf74d2f1738ddde1d5e72f2b4bd0cbfffe14a1232e
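All Kubernetes objects live under the /registry prefix, so the same kind of query can be narrowed to a single resource type (a minimal sketch):
etcdctl get /registry/pods --prefix --keys-only
etcdctl get /registry/services --prefix --keys-only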
Find the key of a specific Pod
net-test1 is the Pod name:
root@master-1:~# kubectl get pod
NAME READY STATUS RESTARTS AGE
net-test1 1/1 Running 0 2d21h
net-test2 1/1 Running 0 2d21h
root@etcd-1:~# etcdctl get / --prefix --keys-only|grep net-test1
/registry/pods/default/net-test1
View the value of the key /registry/pods/default/net-test1:
root@etcd-1:~# etcdctl get /registry/pods/default/net-test1
/registry/pods/default/net-test1
k8s
v1Podې
net-test1default"*$d9e53134-3638-4e51-bb43-b944037bd5652¯䏚
run net-test1z
kubectl-runUpdatev¯FieldsV1:
{"f:metadata":{"f:labels":{".":{},"f:run":{}}},"f:spec":{"f:containers":{"k:{\"name\":\"net-test1\"}":{".":{},"f:args":{},"f:image":{},"f:imagePullPolicy":{},"f:name":{},"f:resources":{},"f:terminationMessagePath":{},"f:terminationMessagePolicy":{}}},"f:dnsPolicy":{},"f:enableServiceLinks":{},"f:restartPolicy":{},"f:schedulerName":{},"f:securityContext":{},"f:terminationGracePeriodSeconds":{}}}·
kubeletUpdatev®¯FieldsV1:
{"f:status":{"f:conditions":{"k:{\"type\":\"ContainersReady\"}":{".":{},"f:lastProbeTime":{},"f:lastTransitionTime":{},"f:status":{},"f:type":{}},"k:{\"type\":\"Initialized\"}":{".":{},"f:lastProbeTime":{},"f:lastTransitionTime":{},"f:status":{},"f:type":{}},"k:{\"type\":\"Ready\"}":{".":{},"f:lastProbeTime":{},"f:lastTransitionTime":{},"f:status":{},"f:type":{}}},"f:containerStatuses":{},"f:hostIP":{},"f:phase":{},"f:podIP":{},"f:podIPs":{".":{},"k:{\"ip\":\"10.200.84.129\"}":{".":{},"f:ip":{}}},"f:startTime":{}}}«
kube-api-access-tc2lvkЁh
"
token
(&
kube-root-ca.crt
ca.crtca.crt
)'
%
namespace
v1metadata.namespace¤±
net-test1centos:7.9.2009"sleep"300000*BJL
kube-api-access-tc2lv-/var/run/secrets/kubernetes.io/serviceaccount"2j/dev/termination-logr
IfNotPresent¢FileAlways 2
ClusterFirstBdefaultJdefaultR192.168.10.104X`hrdefault-scheduler²6
node.kubernetes.io/not-readyExists" NoExecute(¬²8
node.kubernetes.io/unreachableExists" NoExecute(¬ƁPreemptLowerPriorityȃ
Running#
InitializedTru¯䎪2
ReadyTru®¯䎪2'
ContainersReadyTru®¯䎪2$
10.200.84.12¯䏂݁u¯䎪2"*192.168.10.1042
net-test1
®¯䎚 (2centos:7.9.2009:`docker-pullable://centos@sha256:9d4bcbbb213dfd745b58be38b13b996ebb5ac315fe75711bd618426a630e0987BIdocker://d261d1933b0740fb2d478d4248371b89cbb95422fc75b05b88e3b7f032e6c818HJ
BestEffortZb
10.200.84.129"
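The value is stored in the Kubernetes protobuf encoding, which is why most of it is unreadable; to inspect the same object in readable form, it is easier to ask the API server (an aside, not part of the etcd-level operations):
kubectl get pod net-test1 -o yaml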
Deleting a Pod is simply a matter of deleting its corresponding key:
root@etcd-1:~# etcdctl del /registry/pods/default/net-test1
1
root@master-1:~# kubectl get pod
NAME READY STATUS RESTARTS AGE
net-test2 1/1 Running 0 2d21h
The Pod net-test1 is now gone. This operation is extremely dangerous; use it with great care.
Write data
root@etcd-1:~# etcdctl put /qijia "0324"
OK
root@etcd-1:~# etcdctl get /qijia
/qijia
0324
The etcd watch mechanism
It works by continuously monitoring data and proactively notifying the client whenever a change occurs.
Testing the watch mechanism
When I put a key value in the right-hand terminal, the watch window on the left picks it up in real time; when the key value is updated, the watch also receives the change immediately.
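A minimal sketch of that test, reusing the /qijia key from above (run the watch in one terminal, then write from a second; each change is printed by the watch as it happens):
# terminal 1
etcdctl watch /qijia
# terminal 2
etcdctl put /qijia "0325"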
Single-node etcd backup and restore
Backup
root@etcd-1:~# etcdctl snapshot save /data/backup/etcd-backup-`date +%F%H%M`
{"level":"info","ts":1650619578.2251773,"caller":"snapshot/v3_snapshot.go:119","msg":"created temporary db file","path":"/data/backup/etcd-backup-2022-04-221726.part"}
{"level":"info","ts":"2022-04-22T17:26:18.225+0800","caller":"clientv3/maintenance.go:200","msg":"opened snapshot stream; downloading"}
{"level":"info","ts":1650619578.2258816,"caller":"snapshot/v3_snapshot.go:127","msg":"fetching snapshot","endpoint":"127.0.0.1:2379"}
{"level":"info","ts":"2022-04-22T17:26:18.245+0800","caller":"clientv3/maintenance.go:208","msg":"completed snapshot read; closing"}
{"level":"info","ts":1650619578.2582552,"caller":"snapshot/v3_snapshot.go:142","msg":"fetched snapshot","endpoint":"127.0.0.1:2379","size":"2.7 MB","took":0.032995421}
{"level":"info","ts":1650619578.258354,"caller":"snapshot/v3_snapshot.go:152","msg":"saved","path":"/data/backup/etcd-backup-2022-04-221726"}
Snapshot saved at /data/backup/etcd-backup-2022-04-221726
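The snapshot can be sanity-checked with snapshot status, which reports its hash, revision, total key count and size (a sketch using the file created above):
etcdctl snapshot status /data/backup/etcd-backup-2022-04-221726 --write-out=table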
Restore the data. /data/etcd must be an empty directory; etcd creates it automatically, so there is no need to create it yourself.
root@etcd-1:~# etcdctl snapshot restore /data/backup/etcd-backup-2022-04-221726 --data-dir=/data/etcd
{"level":"info","ts":1650619683.6522489,"caller":"snapshot/v3_snapshot.go:296","msg":"restoring snapshot","path":"/data/backup/etcd-backup-2022-04-221726","wal-dir":"/data/etcd/member/wal","data-dir":"/data/etcd","snap-dir":"/data/etcd/member/snap"}
{"level":"info","ts":1650619683.6754777,"caller":"mvcc/kvstore.go:380","msg":"restored last compact revision","meta-bucket-name":"meta","meta-bucket-name-key":"finishedCompactRev","restored-compact-revision":421636}
{"level":"info","ts":1650619683.682951,"caller":"membership/cluster.go:392","msg":"added member","cluster-id":"cdf818194e3a8c32","local-member-id":"0","added-peer-id":"8e9e05c52164694d","added-peer-peer-urls":["http://localhost:2380"]}
{"level":"info","ts":1650619683.6885314,"caller":"snapshot/v3_snapshot.go:309","msg":"restored snapshot","path":"/data/backup/etcd-backup-2022-04-221726","wal-dir":"/data/etcd/member/wal","data-dir":"/data/etcd","snap-dir":"/data/etcd/member/snap"}
root@etcd-1:~# ll /data/etcd/
total 0
drwx------ 3 root root 20 Apr 22 17:28 ./
drwxr-xr-x 4 root root 32 Apr 22 17:28 ../
drwx------ 4 root root 29 Apr 22 17:28 member/
root@etcd-1:~# ll /data/etcd/member/
total 0
drwx------ 4 root root 29 Apr 22 17:28 ./
drwx------ 3 root root 20 Apr 22 17:28 ../
drwx------ 2 root root 62 Apr 22 17:28 snap/
drwx------ 2 root root 51 Apr 22 17:28 wal/
The data has now been restored into /data/etcd. All we need to do is change WorkingDirectory and --data-dir in etcd.service to the restored directory and restart etcd:
root@etcd-1:~# vi /etc/systemd/system/etcd.service
[Unit]
Description=Etcd Server
After=network.target
After=network-online.target
Wants=network-online.target
Documentation=https://github.com/coreos
[Service]
Type=notify
WorkingDirectory=/data/etcd/
ExecStart=/usr/local/bin//etcd \
--name=etcd-192.168.10.107 \
--cert-file=/etc/kubernetes/ssl/etcd.pem \
--key-file=/etc/kubernetes/ssl/etcd-key.pem \
--peer-cert-file=/etc/kubernetes/ssl/etcd.pem \
--peer-key-file=/etc/kubernetes/ssl/etcd-key.pem \
--trusted-ca-file=/etc/kubernetes/ssl/ca.pem \
--peer-trusted-ca-file=/etc/kubernetes/ssl/ca.pem \
--initial-advertise-peer-urls=https://192.168.10.107:2380 \
--listen-peer-urls=https://192.168.10.107:2380 \
--listen-client-urls=https://192.168.10.107:2379,http://127.0.0.1:2379 \
--advertise-client-urls=https://192.168.10.107:2379 \
--initial-cluster-token=etcd-cluster-0 \
--initial-cluster=etcd-192.168.10.107=https://192.168.10.107:2380,etcd-192.168.10.108=https://192.168.10.108:2380,etcd-192.168.10.109=https://192.168.10.109:2380 \
--initial-cluster-state=new \
--data-dir=/data/etcd \
--wal-dir= \
--snapshot-count=50000 \
--auto-compaction-retention=1 \
--auto-compaction-mode=periodic \
--max-request-bytes=10485760 \
--quota-backend-bytes=8589934592
Restart=always
RestartSec=15
LimitNOFILE=65536
OOMScoreAdjust=-999
[Install]
WantedBy=multi-user.target
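After editing the unit file, reload systemd, restart etcd, and confirm the member is healthy again (a minimal sketch):
systemctl daemon-reload
systemctl restart etcd
etcdctl endpoint health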
Cluster-wide etcd backup and restore
Because our Kubernetes cluster was installed with the kubeasz tool, we can use the playbooks that ship with kubeasz to back up and restore etcd data.
The required playbooks are:
94.backup.yml - the backup playbook
95.restore.yml - the restore playbook
root@master-1:~# ll /etc/kubeasz/playbooks/
total 92
-rw-rw-r-- 1 root root 1786 Apr 26 2021 94.backup.yml
-rw-rw-r-- 1 root root 999 Apr 26 2021 95.restore.yml
Backup: ezctl backup <cluster name>
root@master-1:~# ezctl --help
backup <cluster> to backup the cluster state (etcd snapshot)
restore <cluster> to restore the cluster state from backups
Our cluster is named qijia01:
root@master-1:~# ll /etc/kubeasz/clusters/
total 0
drwxr-xr-x 3 root root 21 Feb 23 16:22 ./
drwxrwxr-x 12 root root 225 Feb 23 16:22 ../
drwxr-xr-x 5 root root 203 Apr 20 18:44 qijia01/
Start the backup:
root@master-1:~# ezctl backup qijia01
ansible-playbook -i clusters/qijia01/hosts -e @clusters/qijia01/config.yml playbooks/94.backup.yml
2022-04-22 18:02:54 INFO cluster:qijia01 backup begins in 5s, press any key to abort:
PLAY [localhost] *************************************************************************************************************************************************************************************************************************************************
TASK [Gathering Facts] *******************************************************************************************************************************************************************************************************************************************
ok: [localhost]
TASK [set NODE_IPS of the etcd cluster] **************************************************************************************************************************************************************************************************************************
ok: [localhost]
TASK [get etcd cluster status] ***********************************************************************************************************************************************************************************************************************************
changed: [localhost]
TASK [debug] *****************************************************************************************************************************************************************************************************************************************************
ok: [localhost] => {
"ETCD_CLUSTER_STATUS": {
"changed": true,
"cmd": "for ip in 192.168.10.107 192.168.10.108 192.168.10.109 ;do ETCDCTL_API=3 /etc/kubeasz/bin/etcdctl --endpoints=https://\"$ip\":2379 --cacert=/etc/kubeasz/clusters/qijia01/ssl/ca.pem --cert=/etc/kubeasz/clusters/qijia01/ssl/etcd.pem --key=/etc/kubeasz/clusters/qijia01/ssl/etcd-key.pem endpoint health; done",
"delta": "0:00:00.526961",
"end": "2022-04-22 18:03:04.644297",
"failed": false,
"msg": "",
"rc": 0,
"start": "2022-04-22 18:03:04.117336",
"stderr": "https://192.168.10.107:2379 is healthy: successfully committed proposal: took = 42.136716ms\nhttps://192.168.10.108:2379 is healthy: successfully committed proposal: took = 12.285904ms\nhttps://192.168.10.109:2379 is healthy: successfully committed proposal: took = 11.06195ms",
"stderr_lines": [
"https://192.168.10.107:2379 is healthy: successfully committed proposal: took = 42.136716ms",
"https://192.168.10.108:2379 is healthy: successfully committed proposal: took = 12.285904ms",
"https://192.168.10.109:2379 is healthy: successfully committed proposal: took = 11.06195ms"
],
"stdout": "",
"stdout_lines": []
}
}
TASK [get a running ectd node] ***********************************************************************************************************************************************************************************************************************************
changed: [localhost]
TASK [debug] *****************************************************************************************************************************************************************************************************************************************************
ok: [localhost] => {
"RUNNING_NODE.stdout": "192.168.10.107"
}
TASK [get current time] ******************************************************************************************************************************************************************************************************************************************
changed: [localhost]
TASK [make a backup on the etcd node] ****************************************************************************************************************************************************************************************************************************
changed: [localhost -> 192.168.10.107]
TASK [fetch the backup data] *************************************************************************************************************************************************************************************************************************************
changed: [localhost -> 192.168.10.107]
TASK [update the latest backup] **********************************************************************************************************************************************************************************************************************************
changed: [localhost]
PLAY RECAP *******************************************************************************************************************************************************************************************************************************************************
localhost : ok=10 changed=6 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
Check that the backup files exist.
snapshot.db is a copy of the most recent backup. The restore playbook hard-codes the file name snapshot.db, while the backups themselves are named with a timestamp, so after each backup the playbook runs cp snapshot_202204221803.db snapshot.db.
root@master-1:~# ll /etc/kubeasz/clusters/qijia01/backup/
total 5248
drwxr-xr-x 2 root root 57 Apr 22 18:03 ./
drwxr-xr-x 5 root root 203 Apr 20 18:44 ../
-rw------- 1 root root 2682912 Apr 22 18:03 snapshot.db
-rw------- 1 root root 2682912 Apr 22 18:03 snapshot_202204221803.db
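Because the restore playbook only reads snapshot.db, restoring from an older backup is just a matter of copying the desired timestamped file over snapshot.db before running ezctl restore (a sketch based on the naming convention shown above):
cp /etc/kubeasz/clusters/qijia01/backup/snapshot_202204221803.db /etc/kubeasz/clusters/qijia01/backup/snapshot.db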
Delete a Pod and check whether the restore brings it back:
root@master-1:~# kubectl get pod -A
NAMESPACE NAME READY STATUS RESTARTS AGE
default net-test1 1/1 Running 0 66m
default net-test2 1/1 Running 0 2d22h
kube-system calico-kube-controllers-647f956d86-zrjq9 1/1 Running 0 2d23h
kube-system calico-node-47phc 1/1 Running 0 2d23h
kube-system calico-node-9ghhw 1/1 Running 0 2d23h
kube-system calico-node-c7stp 1/1 Running 0 2d23h
kube-system calico-node-lcjsx 1/1 Running 0 2d23h
kube-system coredns-74c56d8f8d-d2jbp 1/1 Running 0 2d
kube-system coredns-74c56d8f8d-vds9h 1/1 Running 0 2d
kubernetes-dashboard dashboard-metrics-scraper-c45b7869d-5h8t7 1/1 Running 0 47h
kubernetes-dashboard kubernetes-dashboard-576cb95f94-mzwpz 1/1 Running 0 47h
root@master-1:~# kubectl delete pod net-test1
pod "net-test1" deleted
root@master-1:~#
root@master-1:~# kubectl get pod -A
NAMESPACE NAME READY STATUS RESTARTS AGE
default net-test2 1/1 Running 0 2d22h
kube-system calico-kube-controllers-647f956d86-zrjq9 1/1 Running 0 2d23h
kube-system calico-node-47phc 1/1 Running 0 2d23h
kube-system calico-node-9ghhw 1/1 Running 0 2d23h
kube-system calico-node-c7stp 1/1 Running 0 2d23h
kube-system calico-node-lcjsx 1/1 Running 0 2d23h
kube-system coredns-74c56d8f8d-d2jbp 1/1 Running 0 2d
kube-system coredns-74c56d8f8d-vds9h 1/1 Running 0 2d
kubernetes-dashboard dashboard-metrics-scraper-c45b7869d-5h8t7 1/1 Running 0 47h
kubernetes-dashboard kubernetes-dashboard-576cb95f94-mzwpz 1/1 Running 0 47h
Restore the data and verify that the deleted Pod is brought back:
root@master-1:~# ezctl restore qijia01
ansible-playbook -i clusters/qijia01/hosts -e @clusters/qijia01/config.yml playbooks/95.restore.yml
2022-04-22 18:06:01 INFO cluster:qijia01 restore begins in 5s, press any key to abort:
PLAY [kube_master] ***********************************************************************************************************************************************************************************************************************************************
TASK [Gathering Facts] *******************************************************************************************************************************************************************************************************************************************
ok: [192.168.10.101]
ok: [192.168.10.102]
TASK [stopping kube_master services] *****************************************************************************************************************************************************************************************************************************
changed: [192.168.10.102] => (item=kube-apiserver)
changed: [192.168.10.101] => (item=kube-apiserver)
changed: [192.168.10.102] => (item=kube-controller-manager)
changed: [192.168.10.102] => (item=kube-scheduler)
changed: [192.168.10.101] => (item=kube-controller-manager)
changed: [192.168.10.101] => (item=kube-scheduler)
PLAY [kube_master,kube_node] *************************************************************************************************************************************************************************************************************************************
TASK [Gathering Facts] *******************************************************************************************************************************************************************************************************************************************
ok: [192.168.10.104]
ok: [192.168.10.105]
TASK [stopping kube_node services] *******************************************************************************************************************************************************************************************************************************
changed: [192.168.10.105] => (item=kubelet)
changed: [192.168.10.101] => (item=kubelet)
changed: [192.168.10.102] => (item=kubelet)
changed: [192.168.10.104] => (item=kubelet)
changed: [192.168.10.105] => (item=kube-proxy)
changed: [192.168.10.101] => (item=kube-proxy)
changed: [192.168.10.102] => (item=kube-proxy)
changed: [192.168.10.104] => (item=kube-proxy)
PLAY [etcd] ******************************************************************************************************************************************************************************************************************************************************
TASK [Gathering Facts] *******************************************************************************************************************************************************************************************************************************************
ok: [192.168.10.107]
ok: [192.168.10.109]
ok: [192.168.10.108]
TASK [cluster-restore : 停止ectd 服务] ***************************************************************************************************************************************************************************************************************************
changed: [192.168.10.109]
changed: [192.168.10.108]
changed: [192.168.10.107]
TASK [cluster-restore : 清除etcd 数据目录] ***********************************************************************************************************************************************************************************************************************
changed: [192.168.10.108]
changed: [192.168.10.109]
changed: [192.168.10.107]
TASK [cluster-restore : 生成备份目录] ****************************************************************************************************************************************************************************************************************************
ok: [192.168.10.107]
changed: [192.168.10.109]
changed: [192.168.10.108]
TASK [cluster-restore : 准备指定的备份etcd 数据] *****************************************************************************************************************************************************************************************************************
changed: [192.168.10.109]
changed: [192.168.10.108]
changed: [192.168.10.107]
TASK [cluster-restore : 清理上次备份恢复数据] ********************************************************************************************************************************************************************************************************************
ok: [192.168.10.107]
ok: [192.168.10.108]
ok: [192.168.10.109]
TASK [cluster-restore : etcd 数据恢复] ***************************************************************************************************************************************************************************************************************************
changed: [192.168.10.107]
changed: [192.168.10.108]
changed: [192.168.10.109]
TASK [cluster-restore : 恢复数据至etcd 数据目录] *****************************************************************************************************************************************************************************************************************
changed: [192.168.10.108]
changed: [192.168.10.107]
changed: [192.168.10.109]
TASK [cluster-restore : 重启etcd 服务] ***************************************************************************************************************************************************************************************************************************
changed: [192.168.10.107]
changed: [192.168.10.109]
changed: [192.168.10.108]
TASK [cluster-restore : 以轮询的方式等待服务同步完成] ************************************************************************************************************************************************************************************************************
changed: [192.168.10.107]
changed: [192.168.10.108]
changed: [192.168.10.109]
PLAY [kube_master] ***********************************************************************************************************************************************************************************************************************************************
TASK [starting kube_master services] *****************************************************************************************************************************************************************************************************************************
changed: [192.168.10.102] => (item=kube-apiserver)
changed: [192.168.10.102] => (item=kube-controller-manager)
changed: [192.168.10.101] => (item=kube-apiserver)
changed: [192.168.10.102] => (item=kube-scheduler)
changed: [192.168.10.101] => (item=kube-controller-manager)
changed: [192.168.10.101] => (item=kube-scheduler)
PLAY [kube_master,kube_node] *************************************************************************************************************************************************************************************************************************************
TASK [starting kube_node services] *******************************************************************************************************************************************************************************************************************************
changed: [192.168.10.104] => (item=kubelet)
changed: [192.168.10.102] => (item=kubelet)
changed: [192.168.10.105] => (item=kubelet)
changed: [192.168.10.101] => (item=kubelet)
changed: [192.168.10.105] => (item=kube-proxy)
changed: [192.168.10.102] => (item=kube-proxy)
changed: [192.168.10.104] => (item=kube-proxy)
changed: [192.168.10.101] => (item=kube-proxy)
PLAY RECAP *******************************************************************************************************************************************************************************************************************************************************
192.168.10.101 : ok=5 changed=4 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
192.168.10.102 : ok=5 changed=4 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
192.168.10.104 : ok=3 changed=2 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
192.168.10.105 : ok=3 changed=2 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
192.168.10.107 : ok=10 changed=7 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
192.168.10.108 : ok=10 changed=8 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
192.168.10.109 : ok=10 changed=8 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
We can see that net-test1 has been restored:
root@master-1:~# kubectl get pod -A
NAMESPACE NAME READY STATUS RESTARTS AGE
default net-test1 1/1 Running 0 73m
default net-test2 1/1 Running 0 2d23h
kube-system calico-kube-controllers-647f956d86-zrjq9 1/1 Running 0 2d23h
kube-system calico-node-47phc 1/1 Running 0 2d23h
kube-system calico-node-9ghhw 1/1 Running 0 2d23h
kube-system calico-node-c7stp 1/1 Running 0 2d23h
kube-system calico-node-lcjsx 1/1 Running 0 2d23h
kube-system coredns-74c56d8f8d-d2jbp 1/1 Running 0 2d
kube-system coredns-74c56d8f8d-vds9h 1/1 Running 0 2d1h
kubernetes-dashboard dashboard-metrics-scraper-c45b7869d-5h8t7 1/1 Running 0 2d
kubernetes-dashboard kubernetes-dashboard-576cb95f94-mzwpz 1/1 Running 1 2d
root@master-1:~#
Unfortunately, this method is a full backup and full restore; if I only want to restore the Pods in a single namespace, it cannot do that.
The etcd recovery procedure
When more than half of the etcd members are down, the whole cluster goes down. The procedure to recover the data afterwards is:
- Recover the server operating systems
- Redeploy the etcd cluster
- Stop kube-apiserver / kube-controller-manager / kube-scheduler / kubelet / kube-proxy
- Stop the etcd cluster
- Restore the same backup data on every etcd node
- Start the etcd cluster and verify member health
- Start kube-apiserver / kube-controller-manager / kube-scheduler / kubelet / kube-proxy
- Verify the Kubernetes master state and Pod data (see the verification sketch after this list)
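A minimal verification sketch for the last two steps, reusing the certificate paths and NODE_IPS from the earlier examples:
export NODE_IPS="192.168.10.107 192.168.10.108 192.168.10.109"
for ip in ${NODE_IPS}; do ETCDCTL_API=3 /usr/local/bin/etcdctl --endpoints=https://${ip}:2379 --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/kubernetes/ssl/etcd.pem --key=/etc/kubernetes/ssl/etcd-key.pem endpoint health; done
kubectl get nodes
kubectl get pod -A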