How to remove a dead etcd node from an etcd cluster
Introduction
In my kubernetes cluster I have 6 nodes - 5 virtual machines and one physical machine.
The physical node was the remaining master/control-plane node from when my cluster was running on physical machines.
I wanted to shut down the remaining physical machine so I could save a little power (very little, since the Wyse machine uses around 10-15 W when idle).
But I also wanted it removed entirely, so my whole cluster would run on my Proxmox cluster, which also hosts the Ceph cluster - that way everything is contained on the 5 physical machines that run Proxmox & Ceph.
When I tried to join one of the 5 virtual machines as a master node, I got a preflight error stating that my etcd cluster was not healthy.
At that moment it dawned on me that I had most likely just yanked the old control plane off the network and removed it via kubectl delete node node2.xxx - without ever removing it from the etcd cluster.
So I had to make my etcd cluster healthy again before I could join a new control plane.
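Before changing anything, you can confirm the diagnosis by asking etcd about the health of every endpoint it knows about. This is a sketch assuming a kubeadm-style setup - the certificate paths below are the kubeadm defaults, so adjust them if your etcd lives elsewhere:

```shell
# Check the health of all etcd endpoints in the cluster
# (kubeadm default certificate paths assumed)
etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  endpoint health --cluster
```

A dead member will show up here as an unhealthy or unreachable endpoint, while the healthy members report `is healthy`.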
Guide
Find dead node
Either ssh into one of the etcd nodes - or, if etcd runs embedded inside your kubernetes cluster, find one of the etcd pods, open a shell in it, and run:
etcdctl --endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
--key=/etc/kubernetes/pki/etcd/healthcheck-client.key member list
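If you'd rather not open a shell inside the pod, the same command can be run from outside with kubectl exec. On a kubeadm cluster the etcd static pods live in kube-system and carry the `component=etcd` label; the pod name `etcd-node10` below is just an illustration - substitute one from your own cluster:

```shell
# Find the etcd pods (kubeadm names them etcd-<node-name>)
kubectl -n kube-system get pods -l component=etcd

# Run member list through one of them - replace etcd-node10
# with a pod name from the output above
kubectl -n kube-system exec etcd-node10 -- etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key member list
```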
If your cluster is healthy it will give you a list similar to:
4b5ac0e4312a4c7, started, node12.k8s.root.dom, https://192.168.5.12:2380, https://192.168.5.12:2379, false
4ba6195025ea3779, started, node11.k8s.root.dom, https://192.168.5.11:2380, https://192.168.5.11:2379, false
e34bf52a228f034d, started, node10.k8s.root.dom, https://192.168.5.10:2380, https://192.168.5.10:2379, false
If it's not healthy - like mine was - the started
column will say something else; it could, for example, say unstarted.
The dead node needs to be removed, so take note of its id
- in the example above, 4b5ac0e4312a4c7
is the id for node node12.k8s.root.dom.
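Since member list prints comma-separated fields (id, status, name, peer URL, client URL, learner flag), you can also pull out the id of any member that isn't started with a little awk. A sketch, assuming the dead member reports a status such as unstarted, demonstrated on sample output:

```shell
# Sample member list output with one dead member (status "unstarted")
members='4b5ac0e4312a4c7, unstarted, node12.k8s.root.dom, https://192.168.5.12:2380, https://192.168.5.12:2379, false
4ba6195025ea3779, started, node11.k8s.root.dom, https://192.168.5.11:2380, https://192.168.5.11:2379, false
e34bf52a228f034d, started, node10.k8s.root.dom, https://192.168.5.10:2380, https://192.168.5.10:2379, false'

# Print the id (field 1) of every member whose status (field 2) is not "started"
echo "$members" | awk -F', ' '$2 != "started" {print $1}'
# → 4b5ac0e4312a4c7
```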
Remove dead node
With the id of the dead node in hand, it's as simple as running:
etcdctl --endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
--key=/etc/kubernetes/pki/etcd/healthcheck-client.key member remove <id>
This will immediately remove the member from the etcd cluster and allow you to join a new master node to your kubernetes cluster.
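After the removal, you can re-run the member list command to confirm only the healthy members remain, and then generate the join command for the new control plane. The commands below are standard kubeadm, but this is a sketch - the placeholders in angle brackets come from your own cluster:

```shell
# On an existing control-plane node: print a fresh join command
kubeadm token create --print-join-command

# Re-upload the control-plane certificates and note the certificate key
# it prints
kubeadm init phase upload-certs --upload-certs

# On the new node: join as a control plane, filling in the values from
# the two commands above
# kubeadm join <endpoint>:6443 --token <token> \
#   --discovery-token-ca-cert-hash sha256:<hash> \
#   --control-plane --certificate-key <key>
```

With the dead member gone from etcd, the preflight check passes and the join completes normally.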
I hope you enjoyed this post - and if you spot any errors, please let me know in the comments below or by email directly.