✨ Add UseExperimentalRetryJoin to KubeadmConfig #2763

Merged
merged 1 commit into from
Mar 25, 2020

randomvariable
Member

Signed-off-by: Naadir Jeewa [email protected]

What this PR does / why we need it:
Resolve flaky control plane joins by creating a bash script that retries the kubeadm control plane join phases. This matters particularly for CAPV: HAProxy always starts new backends as ready, pending health checks (if anyone knows how to change that, I'm all ears), whereas AWS ELB does the exact opposite and hasn't demonstrated these issues before.
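
A minimal sketch of the retry idea, written in Go for illustration only. The PR itself implements the retries in a bash script; the names `retry` and `errNotReady`, the attempt count, and the delay are all assumptions made for this example and do not come from the PR.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNotReady is a hypothetical transient failure, for example a load
// balancer routing traffic to a backend that is not serving yet.
var errNotReady = errors.New("backend not ready")

// retry runs step up to attempts times and sleeps for delay between
// failed attempts. It returns nil on the first success, or the last
// error wrapped with the step name once all attempts are used up.
func retry(name string, attempts int, delay time.Duration, step func() error) error {
	if attempts < 1 {
		return fmt.Errorf("%s: attempts must be at least 1", name)
	}
	var err error
	for i := 1; i <= attempts; i++ {
		if err = step(); err == nil {
			return nil
		}
		fmt.Printf("%s: attempt %d/%d failed: %v\n", name, i, attempts, err)
		if i < attempts {
			time.Sleep(delay)
		}
	}
	return fmt.Errorf("%s: giving up after %d attempts: %w", name, attempts, err)
}

func main() {
	calls := 0
	// Hypothetical join phase that fails twice before succeeding.
	err := retry("control-plane-join", 5, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errNotReady
		}
		return nil
	})
	fmt.Println("calls:", calls, "err:", err)
}
```

Retrying each phase separately, rather than the whole join, means a transient failure in a late phase does not repeat work that already succeeded.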

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 24, 2020
@randomvariable
Member Author

In another PR, I'll add tests to the test framework and CAPD that inject connection failures (à la Jepsen).

@vincepri, you wanted to put this behind a feature flag, right?

Member

@fabriziopandini fabriziopandini left a comment

@randomvariable thanks for this PR!
I hope this change, together with all the added logging, will help surface the underlying problem and lead to a proper fix in kubeadm.

bootstrap/kubeadm/internal/cloudinit/controlplane_join.go Outdated Show resolved Hide resolved
@randomvariable randomvariable force-pushed the boootstrappy branch 2 times, most recently from 74b16cc to b07c80b Compare March 24, 2020 14:50
feature/feature.go Outdated Show resolved Hide resolved
@randomvariable randomvariable changed the title [wip] :run: cabpk: Add retries to control plane join :run: cabpk: Add retries to control plane join Mar 24, 2020
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 24, 2020
@randomvariable
Member Author

  • Shell script separated out and embedded using go-bindata
  • Flag added to the kubeadm types to enable the behaviour
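
As a rough sketch of how such an opt-in flag could gate the behaviour: the field name `UseExperimentalRetryJoin` comes from this PR's title, but the `joinSpec` type, the `joinCommand` helper, and both path values below are hypothetical placeholders and are not taken from the PR's code.

```go
package main

import "fmt"

// joinSpec is a hypothetical, trimmed-down stand-in for the bootstrap
// config type that carries the new flag.
type joinSpec struct {
	// UseExperimentalRetryJoin opts in to the retrying join script.
	UseExperimentalRetryJoin bool
}

const (
	// Both values are placeholders for illustration, not paths from the PR.
	plainJoinCommand = "kubeadm join --config /path/to/join-config.yaml"
	retryJoinScript  = "/path/to/retry-join-script"
)

// joinCommand returns the command a node would run at first boot.
// The default stays the plain kubeadm join; the script is used only
// when the flag is set.
func joinCommand(spec joinSpec) string {
	if spec.UseExperimentalRetryJoin {
		return retryJoinScript
	}
	return plainJoinCommand
}

func main() {
	fmt.Println(joinCommand(joinSpec{}))
	fmt.Println(joinCommand(joinSpec{UseExperimentalRetryJoin: true}))
}
```

Because the zero value of a Go `bool` is `false`, existing configs that do not set the field would keep the old behaviour under this sketch.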

@frapposelli
Member

Tested with several failure scenarios, including etcd failures on join, and it's WAY more robust than before. 100% success rate vs. 20% success rate with the previous implementation when etcd is overloaded. 🎉

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 25, 2020
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 25, 2020
Member

@yastij yastij left a comment

/lgtm
/assign @vincepri

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 25, 2020
@vincepri
Member

Reviewing now

@vincepri
Member

/milestone v0.3.3

@k8s-ci-robot k8s-ci-robot added this to the v0.3.3 milestone Mar 25, 2020
@neolit123
Member

BTW, we had a meeting about the problem and I proposed:

We should retry potential etcdctl / kubectl health checks before the join, instead of retrying the kubeadm join commands (or phases).

ideally...
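
The alternative proposed in this comment could look roughly like the following sketch: poll a health check first, then run the join at most once. The helper names (`waitHealthy`, `joinWhenHealthy`, `errUnhealthy`) are hypothetical, and the check function is a stand-in for whatever etcdctl or kubectl probe would actually be used.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errUnhealthy is a hypothetical stand-in for a failed health probe.
var errUnhealthy = errors.New("cluster not healthy yet")

// waitHealthy polls check until it succeeds or attempts run out,
// sleeping for delay between failed probes.
func waitHealthy(attempts int, delay time.Duration, check func() error) error {
	err := errors.New("no health check attempts were made")
	for i := 0; i < attempts; i++ {
		if err = check(); err == nil {
			return nil
		}
		if i < attempts-1 {
			time.Sleep(delay)
		}
	}
	return fmt.Errorf("health check did not pass: %w", err)
}

// joinWhenHealthy runs join at most once, and only after the health
// check has passed. The join itself is not retried.
func joinWhenHealthy(attempts int, delay time.Duration, check, join func() error) error {
	if err := waitHealthy(attempts, delay, check); err != nil {
		return err
	}
	return join()
}

func main() {
	probes := 0
	err := joinWhenHealthy(5, 10*time.Millisecond,
		func() error { // hypothetical probe: healthy on the third try
			probes++
			if probes < 3 {
				return errUnhealthy
			}
			return nil
		},
		func() error { // hypothetical join step
			fmt.Println("joining after", probes, "probes")
			return nil
		},
	)
	fmt.Println("err:", err)
}
```

The trade-off in this sketch is that the retried operation is a read-only probe, so repeated attempts have no side effects, while the join, which changes cluster state, runs a single time.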

@vincepri
Member

/retitle ✨ Add UseExperimentalRetryJoin to KubeadmConfig

@k8s-ci-robot k8s-ci-robot changed the title :run: cabpk: Add retries to control plane join ✨ Add UseExperimentalRetryJoin to KubeadmConfig Mar 25, 2020
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 25, 2020
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 25, 2020
Member

@vincepri vincepri left a comment

/approve
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 25, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: randomvariable, vincepri

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 25, 2020
@k8s-ci-robot k8s-ci-robot merged commit 4672438 into kubernetes-sigs:master Mar 25, 2020
@cahillsf cahillsf mentioned this pull request Dec 19, 2023
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
9 participants