
kubeadm-join: don't re-add etcd member if it already exists #2095

Closed
fabriziopandini opened this issue Mar 30, 2020 · 13 comments
@fabriziopandini
Member

Is this a BUG REPORT or FEATURE REQUEST?

FEATURE REQUEST

Versions

kubeadm version: v1.17.*

What happened?

While running Cluster API tests, kubeadm join failures were observed in some cases.

xref kubernetes-sigs/cluster-api#2769

What you expected to happen?

I would like to be able to re-run kubeadm join, have the command complete the join process, and have this behavior guaranteed in future versions.

Anything else we need to know?

Most of the phases should already be idempotent. However, this behavior should be verified (especially for the "add new etcd member" phase).

NB. it is not required that preflight checks are idempotent
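
For reference, the individual phases can be exercised directly on an already-joined control-plane node to see which ones are not idempotent (a rough sketch only; phase names as of v1.17/v1.18, with kubeadm.conf standing in for the join configuration):

# re-run the join phases one by one and note which ones fail on a second run
kubeadm join phase preflight --config kubeadm.conf
kubeadm join phase control-plane-prepare all --config kubeadm.conf
kubeadm join phase kubelet-start --config kubeadm.conf
kubeadm join phase control-plane-join etcd --config kubeadm.conf
kubeadm join phase control-plane-join update-status --config kubeadm.conf
kubeadm join phase control-plane-join mark-control-plane --config kubeadm.conf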

@neolit123 neolit123 added this to the v1.19 milestone Mar 30, 2020
@neolit123 neolit123 added kind/bug Categorizes issue or PR as related to a bug. kind/feature Categorizes issue or PR as related to a new feature. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Mar 30, 2020
@RA489
Contributor

RA489 commented Apr 7, 2020

/assign

@MikaelSmith

I've encountered this, and the only step that caused problems for me was etcd. If I abort a prior attempt while etcd is waiting to connect, etcd eventually connects on its own, and when I go to re-run kubeadm join I get:

[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...
...
error execution phase control-plane-join/etcd: error creating local etcd static pod manifest file: etcdserver: Peer URLs already exists
To see the stack trace of this error execute with --v=5 or higher
Failed to join the kubernetes cluster.

I've worked around this by manually running the final steps:

kubeadm join phase control-plane-join update-status --config kubeadm.conf
kubeadm join phase control-plane-join mark-control-plane --config kubeadm.conf

For context, how I've run into this: etcd failed on the host and was automatically removed from the member list. I went to re-add the host by re-running kubeadm join, but forgot to delete etcd's data dir, so etcd gets into a failure loop. I then quit kubeadm join to go investigate why etcd startup is timing out, see the notice to delete the data dir and do so, and etcd joins the member list. When I re-run kubeadm join I get this error.

@neolit123
Member

neolit123 commented May 13, 2020

hi, @MikaelSmith

Peer URLs already exists

i think this error only happens if the new member already exists in the cluster.

For context, how I've run into this: etcd failed on the host and was automatically removed from the member list. I went to re-add the host by re-running kubeadm join, but forgot to delete etcd's data dir, so etcd gets into a failure loop.

did you use --ignore-preflight-errors=all when calling kubeadm join?
normally kubeadm will fail if this directory is not empty.
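
(if checks have to be skipped at all, it is safer to skip only that specific one instead of all, e.g. something like the following; the exact check name may vary by version:)

kubeadm join <args> --ignore-preflight-errors=DirAvailable--var-lib-etcd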

I then quit kubeadm join to go investigate why etcd startup is timing out, see the notice to delete the data dir and do so, and etcd joins the member list.

i don't think etcd local storage is something we considered in terms of idempotency, so we might have to revise here.

When I re-run kubeadm join I get this error.

to clarify, you got the error from join even after clearing the data directory?

@MikaelSmith

I did use --ignore-preflight-errors=all with the first call to `kubeadm join` (it's part of a script from https://kurl.sh, so I can go back to them about why they ignore them).

Yes, I got the error even after clearing the data directory, I think because the etcd pod still existed from the prior run and succeeded in joining the cluster before I re-ran kubeadm join.

@neolit123
Member

neolit123 commented May 13, 2020

it's part of a script from https://kurl.sh, so I can go back to them about why they ignore them

i see that this is a core principle of the service:
curl https://kurl.sh/latest | sudo bash

but overall not a recommended security practice.

Yes, I got the error even after clearing the data directory, I think because the etcd pod still existed from the prior run and succeeded in joining the cluster before I re-ran kubeadm join.

the idea of "join" for control-plane being idempotent is tricky, if you did a kubeadm reset on that node, the subsequent join would have worked i assume.

@neolit123
Member

@fabriziopandini
i don't think this is feasible for 1.19, since it's unclear to me how we are planning to reuse existing etcd member snapshots. also, preflight checks throw errors for a number of existing folders, so does this mean that to make the join action idempotent one has to skip these checks?

@fabriziopandini
Member Author

@neolit123 my idea was to simply check whether the etcd member already exists before adding it, so that if the command is executed again the operation becomes a NOP
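
In etcdctl terms the check would be roughly the following (just a sketch to illustrate the idea; <NODE_IP> and <NODE_NAME> are placeholders for the joining node, the TLS paths are the kubeadm defaults for stacked etcd, and the real change would live in kubeadm's etcd client code):

# reuse the same client flags for both calls (kubeadm default stacked-etcd paths)
ETCD="etcdctl --endpoints https://127.0.0.1:2379 \
  --cacert /etc/kubernetes/pki/etcd/ca.crt \
  --cert /etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key /etc/kubernetes/pki/etcd/healthcheck-client.key"
PEER_URL="https://<NODE_IP>:2380"   # placeholder: the joining node's peer URL

if ETCDCTL_API=3 $ETCD member list | grep -q "$PEER_URL"; then
  # the member is already registered: adding it again fails, so treat this as a NOP
  echo "etcd member for $PEER_URL already exists, skipping member add"
else
  ETCDCTL_API=3 $ETCD member add "<NODE_NAME>" --peer-urls="$PEER_URL"
fi
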
WDYT about this solution?

@neolit123
Member

ok, makes sense. i will rename this ticket slightly

/retitle kubeadm-join: don't re-add etcd member if it already exists

@k8s-ci-robot k8s-ci-robot changed the title from "make kubeadm join idempotent" to "kubeadm-join: don't re-add etcd member if it already exists" Jun 8, 2020
@neolit123
Member

/lifecycle active

@k8s-ci-robot k8s-ci-robot added the lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. label Jun 11, 2020
@neolit123
Member

@fabriziopandini
PR is here:
kubernetes/kubernetes#92118

i have not tested if this idea works...

if the member is already in the cluster, the new code should detect that. but my assumption here is that the existing member would be unhealthy. so, i don't know if just creating this new server instance, without re-adding the member, would make the existing member healthy.
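
one way to verify would be to check cluster health after the re-run (a sketch, assuming stacked etcd and the default kubeadm cert paths):

# all members, including the re-joined one, should report healthy
ETCDCTL_API=3 etcdctl --endpoints https://127.0.0.1:2379 \
  --cacert /etc/kubernetes/pki/etcd/ca.crt \
  --cert /etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key /etc/kubernetes/pki/etcd/healthcheck-client.key \
  endpoint health --cluster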

if you know this works for sure, fine by me.

@neolit123 neolit123 assigned neolit123 and unassigned RA489 Jun 14, 2020
@neolit123
Member

cherry picks:
kubernetes/kubernetes#92236
kubernetes/kubernetes#92235

@neolit123 neolit123 modified the milestones: v1.19, v1.20 Jul 27, 2020
@neolit123
Member

this was cherry picked but someone mentioned that there could be an issue WRT how we get the list of etcd members in kubeadm. so let's keep an eye on that but open a new issue.

/close

@k8s-ci-robot
Contributor

@neolit123: Closing this issue.

In response to this:

this was cherry picked but someone mentioned that there could be an issue WRT how we get the list of etcd members in kubeadm. so let's keep an eye on that but open a new issue.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
