This repository has been archived by the owner on Dec 16, 2024. It is now read-only.

CI: Add delay between joining masters #844

Merged
merged 1 commit into SUSE:master from add-delay-join-nodes on Nov 21, 2019

Conversation

pablochacin (Contributor)

Why is this PR needed?

After joining a master, some time is required for etcd to become stable. Attempting to join another node may fail with various errors depending on the state of etcd at the moment of the join. For example:

E1120 04:33:15.495795   21470 ssh.go:192] error execution phase control-plane-join/update-status: error uploading configuration: etcdserver: request timed out
I1120 04:33:15.500262   21470 ssh.go:167] running command: "sudo sh -c 'rm /tmp/kubeadm-init.conf'"
[join] failed to apply join to node failed to apply state kubeadm.join: Process exited with status 1
F1120 04:33:15.677530   21470 join.go:62] error joining node caasp-master-101-caasp-jobs-e2e-caasp-v4-openstack-test-ci-1: failed to apply state kubeadm.join: Process exited with status 1

This is a well-known issue upstream, and the only reliable way to work around it is to force a wait (checking node status or etcd status is not reliable, as it may give false positives).

Fixes https://github.com/SUSE/avant-garde/issues/1061

What does this PR do?

Forces a delay after joining master nodes and exposes this delay as a testrunner argument.
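
For illustration, a minimal sketch of how such a delay could be wired into a testrunner; the `--join-delay` flag name and the `join_master` helper are hypothetical, not the actual skuba testrunner interface:

```python
import argparse
import time


def join_master(node):
    # Placeholder for the real join logic (e.g. invoking `skuba node join` over SSH).
    print(f"joining {node}...")


def main():
    parser = argparse.ArgumentParser(description="CI cluster bootstrap (sketch)")
    # Hypothetical argument; the real testrunner option name may differ.
    parser.add_argument("--join-delay", type=int, default=60,
                        help="seconds to wait after joining each master so etcd can stabilize")
    args = parser.parse_args()

    masters = ["caasp-master-0", "caasp-master-1", "caasp-master-2"]  # placeholder node names
    for node in masters:
        join_master(node)
        # Unconditional wait: polling node or etcd status can give false positives,
        # so a fixed delay is the reliable workaround described above.
        time.sleep(args.join_delay)


if __name__ == "__main__":
    main()
```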

Merge restrictions

(Please do not edit this)

We are in v4-maintenance phase, so we will restrict what can be merged to prevent unexpected surprises:

What can be merged (merge criteria):
    2 approvals:
        1 developer: code is fine
        1 QA: QA is fine
    there is a PR for updating documentation (or a statement that this is not needed)

Add delay to allow etcd to become ready.

Signed-off-by: Pablo Chacin <[email protected]>
@ereslibre ereslibre (Contributor) left a comment

I'm +1 on the change in CI, but eventually we should check what's ultimately going wrong in skuba and kubeadm.

kubeadm introduced some changes recently that might help in this regard: kubernetes/kubernetes#85201. Also, in theory, kubeadm checks etcd's health before continuing: https://github.com/kubernetes/kubernetes/blob/5bac42bff9bfb9dfe0f2ea40f1c80cac47fc12b2/cmd/kubeadm/app/cmd/join.go#L207, which runs https://github.com/kubernetes/kubernetes/blob/5bac42bff9bfb9dfe0f2ea40f1c80cac47fc12b2/cmd/kubeadm/app/phases/etcd/local.go#L85.

So maybe this is something we want to dig into; I'm a bit worried about seeing these errors, since kubeadm is supposed to be checking for this.

@tdaines42 tdaines42 (Contributor) left a comment

LGTM

@innobead innobead (Contributor) left a comment

We need to have a way to fix the issue first.

ref: today's run failed due to the same issue. https://ci.suse.de/view/CaaSP/job/caasp-jobs/job/caasp-v4-vmware-nightly/183/execution/node/44/log/

cc @c3y1huang

@Itxaka Itxaka (Contributor) left a comment

Let's fix CI first, find the root issue later!

@ereslibre ereslibre (Contributor)

Ah, okay, looking at https://ci.suse.de/view/CaaSP/job/caasp-jobs/job/caasp-v4-vmware-nightly/183/execution/node/44/log/, I see what's going on. We are checking healthiness correctly with kubeadm; the problem comes from the etcd blackout when growing from 1 to 2 members. We have also been adapting these timeouts lately in kubeadm.

One question: on every CI run, are we checking the disk speed with dd or something similar? It would be great to have that as well, just to check whether we are on a node with a slow disk. I wouldn't expect this etcd blackout to take this long on nodes as short-lived as the ones in CI (although even with fast disks the etcd blackout can still last for some time if there's a lot to sync, e.g. if the cluster has been up for a long time or has had many events).
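
For reference, a rough sketch of the kind of disk-speed check a CI run could add; the file path and sizes are placeholders, and `oflag=dsync` is used because synchronous write latency is what matters most for etcd:

```python
import subprocess


def disk_write_check(path="/tmp/dd-test.img", block_size="1M", count=256):
    """Run a synchronous sequential write with dd and return its throughput report."""
    cmd = [
        "dd", "if=/dev/zero", f"of={path}",
        f"bs={block_size}", f"count={count}",
        "oflag=dsync",  # fsync each block, roughly approximating etcd's write pattern
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    # dd prints its summary (bytes copied, elapsed time, throughput) on stderr.
    return result.stderr.strip()


if __name__ == "__main__":
    print(disk_write_check())
```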

@pablochacin pablochacin (Contributor, Author)

@innobead this is an upstream issue. I've invested some time researching it and, as @ereslibre mentions, the main factor seems to be how quickly etcd becomes ready, which in turn depends on the disk and maybe also the network. I wanted to check the readiness of etcd itself by calling it from the testrunner, but that is not easy as I would need a client certificate. So, as @Itxaka says, let's fix the CI and keep investigating the root cause.
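
For illustration, such a readiness check from the testrunner could look roughly like the sketch below; the endpoint and certificate paths are assumptions (etcd's /health on port 2379, with the healthcheck client certificate kubeadm generates under /etc/kubernetes/pki/etcd), and retrieving those files from the node is exactly the client-certificate problem mentioned above:

```python
import requests


def etcd_is_healthy(host,
                    ca="/etc/kubernetes/pki/etcd/ca.crt",
                    crt="/etc/kubernetes/pki/etcd/healthcheck-client.crt",
                    key="/etc/kubernetes/pki/etcd/healthcheck-client.key"):
    """Query etcd's /health endpoint with a client certificate.

    The certificate files live on the master node itself, which is why the
    testrunner cannot easily run this check from the outside.
    """
    try:
        resp = requests.get(f"https://{host}:2379/health",
                            cert=(crt, key), verify=ca, timeout=5)
        return resp.ok and resp.json().get("health") == "true"
    except (requests.RequestException, ValueError):
        return False
```

Even with such a check in place, a healthy response can be a false positive while the cluster is still settling after a member is added, which is why a fixed delay was chosen for CI.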

@pablochacin pablochacin merged commit 9508f11 into SUSE:master Nov 21, 2019
@pablochacin pablochacin (Contributor, Author)

@ereslibre @jordimassaguerpla tried to adapt the Terraform configuration to use local storage, but apparently it was not easy. Maybe we could give it a second try?

@innobead innobead (Contributor)

> @innobead this is an upstream issue. [...] So, as @Itxaka says, let's fix the CI and keep investigating the root cause.

Yep, got your point. Let's do it this way. Thanks.

@pablochacin pablochacin deleted the add-delay-join-nodes branch July 10, 2020 10:24