How to remove a dead etcd node from an etcd cluster

Mon, Sep 4, 2023 2-minute read

Introduction

In my Kubernetes cluster I have 6 nodes: 5 virtual machines and one physical machine.

The physical node was the remaining master/control-plane node from when my cluster was running on physical machines.

I wanted to shut down the remaining physical machine so I could save a little power (very little, since the Wyse machines use around 10-15 W when idle).

But I also wanted it removed so that my entire cluster runs on my Proxmox cluster, which also hosts the Ceph cluster. That way everything is contained on the 5 physical machines that run Proxmox and Ceph.

When I tried to join one of the 5 virtual machines as a master node I got a preflight error stating that my etcd cluster was not healthy.

At that moment it dawned on me that I had most likely just yanked the old control plane from the network and removed it via kubectl delete node node2.xxx, without ever removing it from etcd.

So I had to make my etcd cluster healthy before I could join a new control plane.

Guide

Find dead node

Either ssh into one of the etcd nodes or, if your etcd is embedded inside your Kubernetes cluster, find one of the etcd pods, open a shell and run:

etcdctl --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
    --key=/etc/kubernetes/pki/etcd/healthcheck-client.key member list
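
If etcd runs as pods inside the cluster, the same command can also be run from your workstation through kubectl exec. The following is only a minimal sketch: the kube-system namespace and the component=etcd label are what a default kubeadm setup uses and are assumptions here, and the helper names first_etcd_pod and etcd_member_list are made up for illustration.

```shell
# Allow overriding the kubectl binary (e.g. for a dry run); defaults to kubectl.
KUBECTL="${KUBECTL:-kubectl}"

# Print the name of the first etcd pod.
# Assumption: kubeadm-style static pods labelled component=etcd in kube-system.
first_etcd_pod() {
    $KUBECTL -n kube-system get pods -l component=etcd -o name | head -n 1
}

# Run `etcdctl member list` inside the given pod, with the same certs as above.
etcd_member_list() {
    $KUBECTL -n kube-system exec "$1" -- etcdctl \
        --endpoints=https://127.0.0.1:2379 \
        --cacert=/etc/kubernetes/pki/etcd/ca.crt \
        --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
        --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
        member list
}

# Usage (needs a working kubeconfig):
#   etcd_member_list "$(first_etcd_pod)"
```

Calling etcdctl directly through kubectl exec avoids needing a shell inside the etcd container.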

If your cluster is healthy it will give you a list similar to:

4b5ac0e4312a4c7, started, node12.k8s.root.dom, https://192.168.5.12:2380, https://192.168.5.12:2379, false
4ba6195025ea3779, started, node11.k8s.root.dom, https://192.168.5.11:2380, https://192.168.5.11:2379, false
e34bf52a228f034d, started, node10.k8s.root.dom, https://192.168.5.10:2380, https://192.168.5.10:2379, false

If it's not healthy, like mine was, the status column (started in the example above) will say something else; it could be unstarted.

So the dead node needs to be removed. Take note of its ID: in the example above, 4b5ac0e4312a4c7 is the ID for node node12.k8s.root.dom.
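
With more than a handful of members, the status column can be filtered with awk instead of read by eye. This is a small sketch: find_unhealthy_members is a made-up helper name, and the sample input is the listing above with one status changed to unstarted purely for illustration, not real output from my cluster. As far as I know, a member that had already started before it disappeared can still be listed as started, so running etcdctl endpoint health --cluster with the same certificate flags is a useful second check.

```shell
# Print ID, name and peer URL of every member whose status is not "started".
# Reads the default comma-separated `etcdctl member list` output on stdin.
find_unhealthy_members() {
    awk -F', ' '$2 != "started" { print $1, $3, $4 }'
}

# Hypothetical sample: the listing above with node12 changed to "unstarted".
sample='4b5ac0e4312a4c7, unstarted, node12.k8s.root.dom, https://192.168.5.12:2380, https://192.168.5.12:2379, false
4ba6195025ea3779, started, node11.k8s.root.dom, https://192.168.5.11:2380, https://192.168.5.11:2379, false
e34bf52a228f034d, started, node10.k8s.root.dom, https://192.168.5.10:2380, https://192.168.5.10:2379, false'

printf '%s\n' "$sample" | find_unhealthy_members
# prints: 4b5ac0e4312a4c7 node12.k8s.root.dom https://192.168.5.12:2380
```

Against a live cluster, pipe the member list command from above into find_unhealthy_members instead of the sample.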

Remove dead node

With the ID of the dead node, it's as simple as running:

etcdctl --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
    --key=/etc/kubernetes/pki/etcd/healthcheck-client.key member remove <id>

This will remove the member from the etcd cluster immediately and allow you to join a new master node to your Kubernetes cluster.
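
Before joining the new control plane, it may be worth confirming that the dead member is really gone and that the remaining members respond. A minimal sketch, assuming the same kubeadm certificate paths as above; etcdctl_local and verify_etcd_cluster are hypothetical helper names, not part of etcdctl.

```shell
# Allow overriding the etcdctl binary (e.g. for a dry run); defaults to etcdctl.
ETCDCTL="${ETCDCTL:-etcdctl}"

# Wrapper that adds the endpoint and certificate flags used throughout this post.
etcdctl_local() {
    $ETCDCTL --endpoints=https://127.0.0.1:2379 \
        --cacert=/etc/kubernetes/pki/etcd/ca.crt \
        --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
        --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
        "$@"
}

# List the remaining members, then health-check every member's client endpoint.
verify_etcd_cluster() {
    etcdctl_local member list &&
    etcdctl_local endpoint health --cluster
}

# Usage (on an etcd node or inside an etcd pod):
#   verify_etcd_cluster
```

The removed ID should no longer appear in the member list, and every remaining endpoint should report as healthy.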

I hope you enjoyed this post. If you spot errors, please let me know in the comments below or by email directly.