Add retries for kubeadm join / MarkControlPlane #2093

fabriziopandini · 2020-03-30T14:35:46Z

Is this a BUG REPORT or FEATURE REQUEST?

BUG REPORT

Versions

kubeadm version: v1.17.*

What happened?

While executing Cluster API tests, in some cases it was observed kubeadm join failures when adding the master label to the joining node.

xref kubernetes-sigs/cluster-api#2769

What you expected to happen?

To make mark control plane more resilient by adding a retry loop to this operation

How to reproduce it (as minimally and precisely as possible)?

This error happens only sometimes, most probably when there is a temporary blackout of the load balancer that sits in front of the API servers (HA proxy reloading his configuration).
Also, the error might happen when the new API server enters the load balancing pool but the underlying etcd member is not yet available due to slow network/slow I/O causing delays in etcd getting online or in some cases, also change fo the etcd leader.

Anything else we need to know?

Important: if possible the change should be kept as small and possible and backported

RA489 · 2020-04-02T12:12:54Z

/assign

neolit123 · 2020-05-19T15:30:21Z

@RA489 i'm going to take this ticket as i have some time later today and tomorrow.
/assign

neolit123 · 2020-06-02T21:31:36Z

@fabriziopandini

kubeadm join --control-plane calls MarkControlPlane():
https://github.com/kubernetes/kubernetes/blob/1dc5235d0a93f4594402dfc0d7bcd5db88b3b4be/cmd/kubeadm/app/phases/markcontrolplane/markcontrolplane.go#L28-L44

PatchNode() already has retries:
https://github.com/kubernetes/kubernetes/blob/5708511499fe500ae3b4bbd40204cef382f652e8/cmd/kubeadm/app/util/apiclient/idempotency.go#L312-L319

APICallRetryInterval (500 ms)
https://github.com/kubernetes/kubernetes/blob/3d7847ed01ba7bc389f22e48704087750e659997/cmd/kubeadm/app/constants/constants.go#L179

PatchNodeTimeout (2 minutes):
https://github.com/kubernetes/kubernetes/blob/3d7847ed01ba7bc389f22e48704087750e659997/cmd/kubeadm/app/constants/constants.go#L182-L183

if 2 minutes is not sufficient what value would you recommend?
possibly doubling that to 4 minutes?

fabriziopandini · 2020-06-03T11:24:57Z

@neolit123 thanks for pointing this out.
I think that two minutes are ok unless we get some more reports about failures
/close

k8s-ci-robot · 2020-06-03T11:25:12Z

@fabriziopandini: Closing this issue.

In response to this:

@neolit123 thanks for pointing this out.
I think that two minutes are ok unless we get some more reports about failures
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

neolit123 added this to the v1.19 milestone Mar 30, 2020

neolit123 added kind/bug Categorizes issue or PR as related to a bug. kind/feature Categorizes issue or PR as related to a new feature. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. labels Mar 30, 2020

neolit123 mentioned this issue Mar 30, 2020

improve kubeadm's preflight and cluster health assurance #2096

Closed

k8s-ci-robot assigned RA489 Apr 2, 2020

neolit123 mentioned this issue Apr 6, 2020

kubeadm join does not explicitly wait for etcd to have grown when joining secondary control plane #1353

Closed

k8s-ci-robot assigned neolit123 May 19, 2020

neolit123 unassigned RA489 May 19, 2020

k8s-ci-robot closed this as completed Jun 3, 2020

neolit123 mentioned this issue Jun 16, 2020

Insulate users from kubeadm API version changes kubernetes-sigs/cluster-api#2769

Closed

killianmuldoon mentioned this issue Jul 15, 2022

Deprecate experimentalRetryJoin in CABPK kubernetes-sigs/cluster-api#5597

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add retries for kubeadm join / MarkControlPlane #2093

Add retries for kubeadm join / MarkControlPlane #2093

fabriziopandini commented Mar 30, 2020

RA489 commented Apr 2, 2020

neolit123 commented May 19, 2020

neolit123 commented Jun 2, 2020 •

edited

Loading

fabriziopandini commented Jun 3, 2020

k8s-ci-robot commented Jun 3, 2020

Add retries for kubeadm join / MarkControlPlane #2093

Add retries for kubeadm join / MarkControlPlane #2093

Comments

fabriziopandini commented Mar 30, 2020

Is this a BUG REPORT or FEATURE REQUEST?

Versions

What happened?

What you expected to happen?

How to reproduce it (as minimally and precisely as possible)?

Anything else we need to know?

RA489 commented Apr 2, 2020

neolit123 commented May 19, 2020

neolit123 commented Jun 2, 2020 • edited Loading

fabriziopandini commented Jun 3, 2020

k8s-ci-robot commented Jun 3, 2020

neolit123 commented Jun 2, 2020 •

edited

Loading