How to remove a dead etcd node from an etcd cluster
Introduction
In my kubernetes cluster I have 6 nodes - 5 virtual machines and one physical machine.
The physical node was the remaining master/control-plane node from when my cluster was running on physical machines.
I wanted to shut down the remaining physical machine so I could save a little power (very little, since the Wyse machine uses around 10-15 W when idle).
But I also wanted it removed entirely, so my whole cluster would run on my Proxmox cluster, which also hosts the Ceph cluster - that way everything is contained on the 5 physical machines that run Proxmox & Ceph.
When I tried to join one of the 5 virtual machines as a master node, I got a preflight error stating that my etcd cluster was not healthy.
At that moment it dawned on me that I had most likely just yanked the old control plane off the network and removed it via kubectl delete node node2.xxx - without ever removing it from the etcd cluster.
So I had to make my etcd cluster healthy again before I could join a new control plane.
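Before changing anything, you can confirm the diagnosis by asking etcd about the health of every endpoint it knows about. This is a sketch assuming a kubeadm-style setup - the certificate paths below are the kubeadm defaults, so adjust them if your etcd lives elsewhere:

```shell
# Check the health of all etcd endpoints in the cluster
# (kubeadm default certificate paths assumed)
etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  endpoint health --cluster
```

A dead member will show up here as an unhealthy or unreachable endpoint, while the healthy members report `is healthy`.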
Guide
Find dead node
Either ssh into one of the etcd nodes - or, if etcd runs embedded inside your kubernetes cluster, find one of the etcd pods, open a shell in it, and run:
etcdctl --endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
--key=/etc/kubernetes/pki/etcd/healthcheck-client.key member list
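If you'd rather not open a shell inside the pod, the same command can be run from outside with kubectl exec. On a kubeadm cluster the etcd static pods live in kube-system and carry the `component=etcd` label; the pod name `etcd-node10` below is just an illustration - substitute one from your own cluster:

```shell
# Find the etcd pods (kubeadm names them etcd-<node-name>)
kubectl -n kube-system get pods -l component=etcd

# Run member list through one of them - replace etcd-node10
# with a pod name from the output above
kubectl -n kube-system exec etcd-node10 -- etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key member list
```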
If your cluster is healthy it will give you a list similar to:
4b5ac0e4312a4c7, started, node12.k8s.root.dom, https://192.168.5.12:2380, https://192.168.5.12:2379, false
4ba6195025ea3779, started, node11.k8s.root.dom, https://192.168.5.11:2380, https://192.168.5.11:2379, false
e34bf52a228f034d, started, node10.k8s.root.dom, https://192.168.5.10:2380, https://192.168.5.10:2379, false
If it's not healthy - like mine was - the started
column will say something else; it could, for example, say unstarted.
The dead node needs to be removed, so take note of its id
- in the example above, 4b5ac0e4312a4c7
is the id for node node12.k8s.root.dom.
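Since member list prints comma-separated fields (id, status, name, peer URL, client URL, learner flag), you can also pull out the id of any member that isn't started with a little awk. A sketch, assuming the dead member reports a status such as unstarted, demonstrated on sample output:

```shell
# Sample member list output with one dead member (status "unstarted")
members='4b5ac0e4312a4c7, unstarted, node12.k8s.root.dom, https://192.168.5.12:2380, https://192.168.5.12:2379, false
4ba6195025ea3779, started, node11.k8s.root.dom, https://192.168.5.11:2380, https://192.168.5.11:2379, false
e34bf52a228f034d, started, node10.k8s.root.dom, https://192.168.5.10:2380, https://192.168.5.10:2379, false'

# Print the id (field 1) of every member whose status (field 2) is not "started"
echo "$members" | awk -F', ' '$2 != "started" {print $1}'
# → 4b5ac0e4312a4c7
```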
Remove dead node
With the id of the dead node in hand, it's as simple as running:
etcdctl --endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
--key=/etc/kubernetes/pki/etcd/healthcheck-client.key member remove <id>
This will immediately remove the member from the etcd cluster and allow you to join a new master node to your kubernetes cluster.
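After the removal, you can re-run the member list command to confirm only the healthy members remain, and then generate the join command for the new control plane. The commands below are standard kubeadm, but this is a sketch - the placeholders in angle brackets come from your own cluster:

```shell
# On an existing control-plane node: print a fresh join command
kubeadm token create --print-join-command

# Re-upload the control-plane certificates and note the certificate key
# it prints
kubeadm init phase upload-certs --upload-certs

# On the new node: join as a control plane, filling in the values from
# the two commands above
# kubeadm join <endpoint>:6443 --token <token> \
#   --discovery-token-ca-cert-hash sha256:<hash> \
#   --control-plane --certificate-key <key>
```

With the dead member gone from etcd, the preflight check passes and the join completes normally.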
I hope you enjoyed this post - and if you spot any errors, please let me know in the comments below or by email directly.