
kubeadm-join: don't re-add etcd member if it already exists #2095

Closed
fabriziopandini opened this issue Mar 30, 2020 · 13 comments
@fabriziopandini
Member

Is this a BUG REPORT or FEATURE REQUEST?

FEATURE REQUEST

Versions

kubeadm version: v1.17.*

What happened?

While running Cluster API tests, kubeadm join failures were observed in some cases.

xref kubernetes-sigs/cluster-api#2769

What you expected to happen?

I would like to be able to re-run kubeadm join, have the command complete the join process, and have this behavior guaranteed in future versions.

Anything else we need to know?

Most of the phases should already be idempotent. However, this behavior should be verified (especially for the "add new etcd member" phase).

NB. it is not required that preflight checks are idempotent
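
For reference, the individual phases can be exercised directly on an already-joined control-plane node to see which ones are not idempotent (a rough sketch only; phase names as of v1.17/v1.18, with kubeadm.conf standing in for the join configuration):

# re-run the join phases one by one and note which ones fail on a second run
kubeadm join phase preflight --config kubeadm.conf
kubeadm join phase control-plane-prepare all --config kubeadm.conf
kubeadm join phase kubelet-start --config kubeadm.conf
kubeadm join phase control-plane-join etcd --config kubeadm.conf
kubeadm join phase control-plane-join update-status --config kubeadm.conf
kubeadm join phase control-plane-join mark-control-plane --config kubeadm.conf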

@neolit123 neolit123 added this to the v1.19 milestone Mar 30, 2020
@neolit123 neolit123 added kind/bug Categorizes issue or PR as related to a bug. kind/feature Categorizes issue or PR as related to a new feature. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Mar 30, 2020
@RA489
Contributor

RA489 commented Apr 7, 2020

/assign

@MikaelSmith

I've encountered this, and the only step that caused problems for me was etcd. If I abort a prior attempt while etcd is waiting to connect, etcd eventually connects on its own, and when I go to re-run kubeadm join I get:

[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...
...
error execution phase control-plane-join/etcd: error creating local etcd static pod manifest file: etcdserver: Peer URLs already exists
To see the stack trace of this error execute with --v=5 or higher
Failed to join the kubernetes cluster.

I've worked around this by manually running the final steps:

kubeadm join phase control-plane-join update-status --config kubeadm.conf
kubeadm join phase control-plane-join mark-control-plane --config kubeadm.conf

For context, how I've run into this: etcd failed on the host and was automatically removed from the member list. I went to re-add the host by re-running kubeadm join, but forgot to delete etcd's data dir, so etcd gets into a failure loop. I then quit kubeadm join to go investigate why etcd startup is timing out, see the notice to delete the data dir and do so, and etcd joins the member list. When I re-run kubeadm join I get this error.

@neolit123
Member

neolit123 commented May 13, 2020

hi, @MikaelSmith

Peer URLs already exists

i think this error only happens if the new member already exists in the cluster.

For context, how I've run into this: etcd failed on the host and was automatically removed from the member list. I went to re-add the host by re-running kubeadm join, but forgot to delete etcd's data dir, so etcd gets into a failure loop.

did you use --ignore-preflight-errors=all when calling kubeadm join?
normally kubeadm will fail if this directory is not empty.
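
(if checks have to be skipped at all, it is safer to skip only that specific one instead of all, e.g. something like the following; the exact check name may vary by version:)

kubeadm join <args> --ignore-preflight-errors=DirAvailable--var-lib-etcd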

I then quit kubeadm join to go investigate why etcd startup is timing out, see the notice to delete the data dir and do so, and etcd joins the member list.

i don't think etcd local storage is something we considered in terms of idempotency, so we might have to revise here.

When I re-run kubeadm join I get this error.

to clarify, you got the error from join even after clearing the data directory?

@MikaelSmith

I did use --ignore-preflight-errors=all with the first call to `kubeadm join` (it's part of a script from https://kurl.sh, so I can go back to them about why they ignore them).

Yes, I got the error even after clearing the data directory, I think because the etcd pod still existed from the prior run and succeeded in joining the cluster before I re-ran kubeadm join.

@neolit123
Member

neolit123 commented May 13, 2020

it's part of a script from https://kurl.sh, so I can go back to them about why they ignore them

i see that this is a core principle of the service:
curl https://kurl.sh/latest | sudo bash

but overall not a recommended security practice.

Yes, I got the error even after clearing the data directory, I think because the etcd pod still existed from the prior run and succeeded in joining the cluster before I re-ran kubeadm join.

the idea of "join" for control-plane being idempotent is tricky, if you did a kubeadm reset on that node, the subsequent join would have worked i assume.

@neolit123
Member

@fabriziopandini
i don't think this is feasible for 1.19, since it's unclear to me how we are planning to reuse existing etcd member snapshots. also, preflight checks throw errors for a number of existing folders, so does this mean that to make the join action idempotent one has to skip these checks?

@fabriziopandini
Member Author

@neolit123 my idea was to simply check whether the etcd member already exists before adding it, so that if the command is executed again the operation becomes a NOP
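
In etcdctl terms the check would be roughly the following (just a sketch to illustrate the idea; <NODE_IP> and <NODE_NAME> are placeholders for the joining node, the TLS paths are the kubeadm defaults for stacked etcd, and the real change would live in kubeadm's etcd client code):

# reuse the same client flags for both calls (kubeadm default stacked-etcd paths)
ETCD="etcdctl --endpoints https://127.0.0.1:2379 \
  --cacert /etc/kubernetes/pki/etcd/ca.crt \
  --cert /etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key /etc/kubernetes/pki/etcd/healthcheck-client.key"
PEER_URL="https://<NODE_IP>:2380"   # placeholder: the joining node's peer URL

if ETCDCTL_API=3 $ETCD member list | grep -q "$PEER_URL"; then
  # the member is already registered: adding it again fails, so treat this as a NOP
  echo "etcd member for $PEER_URL already exists, skipping member add"
else
  ETCDCTL_API=3 $ETCD member add "<NODE_NAME>" --peer-urls="$PEER_URL"
fi
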
WDYT about this solution?

@neolit123
Member

ok, makes sense. i will rename this ticket slightly

/retitle kubeadm-join: don't re-add etcd member if it already exists

@k8s-ci-robot k8s-ci-robot changed the title from "make kubeadm join idempotent" to "kubeadm-join: don't re-add etcd member if it already exists" Jun 8, 2020
@neolit123
Member

/lifecycle active

@k8s-ci-robot k8s-ci-robot added the lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. label Jun 11, 2020
@neolit123
Member

@fabriziopandini
PR is here:
kubernetes/kubernetes#92118

i have not tested if this idea works...

if the member is already in the cluster, the new code should detect that. but my assumption here is that the existing member would be unhealthy. so, i don't know if just creating this new server instance, without re-adding the member, would make the existing member healthy.
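
one way to verify would be to check cluster health after the re-run (a sketch, assuming stacked etcd and the default kubeadm cert paths):

# all members, including the re-joined one, should report healthy
ETCDCTL_API=3 etcdctl --endpoints https://127.0.0.1:2379 \
  --cacert /etc/kubernetes/pki/etcd/ca.crt \
  --cert /etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key /etc/kubernetes/pki/etcd/healthcheck-client.key \
  endpoint health --cluster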

if you know this works for sure, fine by me.

@neolit123 neolit123 assigned neolit123 and unassigned RA489 Jun 14, 2020
@neolit123
Member

cherry picks:
kubernetes/kubernetes#92236
kubernetes/kubernetes#92235

@neolit123 neolit123 modified the milestones: v1.19, v1.20 Jul 27, 2020
@neolit123
Member

this was cherry picked but someone mentioned that there could be an issue WRT how we get the list of etcd members in kubeadm. so let's keep an eye on that but open a new issue.

/close

@k8s-ci-robot
Contributor

@neolit123: Closing this issue.

In response to this:

this was cherry picked but someone mentioned that there could be an issue WRT how we get the list of etcd members in kubeadm. so let's keep an eye on that but open a new issue.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
