This repository has been archived by the owner on Dec 16, 2024. It is now read-only.

CI: Add delay between joining masters #844

Merged
merged 1 commit into SUSE:master from add-delay-join-nodes on Nov 21, 2019

Conversation

pablochacin (Contributor)

Why is this PR needed?

After joining a master, some time is required for etcd to become stable. Attempting to join another node may fail with various errors depending on the state of etcd at the moment of the join. For example:

E1120 04:33:15.495795   21470 ssh.go:192] error execution phase control-plane-join/update-status: error uploading configuration: etcdserver: request timed out
I1120 04:33:15.500262   21470 ssh.go:167] running command: "sudo sh -c 'rm /tmp/kubeadm-init.conf'"
[join] failed to apply join to node failed to apply state kubeadm.join: Process exited with status 1
F1120 04:33:15.677530   21470 join.go:62] error joining node caasp-master-101-caasp-jobs-e2e-caasp-v4-openstack-test-ci-1: failed to apply state kubeadm.join: Process exited with status 1

This is a well-known issue upstream, and the only reliable way to work around it is to force a wait (checking node status or etcd status is not reliable, as it may give false positives).

Fixes https://github.com/SUSE/avant-garde/issues/1061

What does this PR do?

Forces a delay after joining master nodes and exposes this delay as a testrunner argument.
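
For illustration, a minimal sketch of how such a delay could be wired into a testrunner; the `--join-delay` flag name and the `join_master` helper are hypothetical, not the actual skuba testrunner interface:

```python
import argparse
import time


def join_master(node):
    # Placeholder for the real join logic (e.g. invoking `skuba node join` over SSH).
    print(f"joining {node}...")


def main():
    parser = argparse.ArgumentParser(description="CI cluster bootstrap (sketch)")
    # Hypothetical argument; the real testrunner option name may differ.
    parser.add_argument("--join-delay", type=int, default=60,
                        help="seconds to wait after joining each master so etcd can stabilize")
    args = parser.parse_args()

    masters = ["caasp-master-0", "caasp-master-1", "caasp-master-2"]  # placeholder node names
    for node in masters:
        join_master(node)
        # Unconditional wait: polling node or etcd status can give false positives,
        # so a fixed delay is the reliable workaround described above.
        time.sleep(args.join_delay)


if __name__ == "__main__":
    main()
```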

Merge restrictions

(Please do not edit this)

We are in v4-maintenance phase, so we will restrict what can be merged to prevent unexpected surprises:

What can be merged (merge criteria):
    2 approvals:
        1 developer: code is fine
        1 QA: QA is fine
    there is a PR for updating documentation (or a statement that this is not needed)

Add delay to allow etcd to become ready.

Signed-off-by: Pablo Chacin <[email protected]>
@ereslibre ereslibre (Contributor) left a comment

I'm +1 on the change in CI, but eventually we should check what's ultimately going wrong in skuba and kubeadm.

kubeadm introduced some changes recently that might help in this regard: kubernetes/kubernetes#85201. Also, in theory, kubeadm checks etcd's health before continuing: https://github.com/kubernetes/kubernetes/blob/5bac42bff9bfb9dfe0f2ea40f1c80cac47fc12b2/cmd/kubeadm/app/cmd/join.go#L207, which runs https://github.com/kubernetes/kubernetes/blob/5bac42bff9bfb9dfe0f2ea40f1c80cac47fc12b2/cmd/kubeadm/app/phases/etcd/local.go#L85.

So maybe this is something we want to dig into; I'm a bit worried about seeing these errors, since kubeadm is supposed to be checking for this.

@tdaines42 tdaines42 (Contributor) left a comment

LGTM

@innobead innobead (Contributor) left a comment

We need to have a way to fix the issue first.

ref: today's run failed due to the same issue. https://ci.suse.de/view/CaaSP/job/caasp-jobs/job/caasp-v4-vmware-nightly/183/execution/node/44/log/

cc @c3y1huang

@Itxaka Itxaka (Contributor) left a comment

Let's fix CI first, find the root issue later!

@ereslibre ereslibre (Contributor)

Ah, okay, looking at https://ci.suse.de/view/CaaSP/job/caasp-jobs/job/caasp-v4-vmware-nightly/183/execution/node/44/log/, I see what's going on. We are checking healthiness correctly with kubeadm; the problem comes from the etcd blackout when growing from 1 to 2 members. We have also been adapting these timeouts lately in kubeadm.

One question: on every CI run, are we checking the disk speed with dd or something similar? It would be great to have that as well, just to check whether we are on a node with a slow disk. I wouldn't expect this etcd blackout to take this long on nodes as short-lived as the ones in CI (although even with fast disks the etcd blackout can still last for some time if there's a lot to sync, e.g. if the cluster has been up for a long time or has had many events).
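
For reference, a rough sketch of the kind of disk-speed check a CI run could add; the file path and sizes are placeholders, and `oflag=dsync` is used because synchronous write latency is what matters most for etcd:

```python
import subprocess


def disk_write_check(path="/tmp/dd-test.img", block_size="1M", count=256):
    """Run a synchronous sequential write with dd and return its throughput report."""
    cmd = [
        "dd", "if=/dev/zero", f"of={path}",
        f"bs={block_size}", f"count={count}",
        "oflag=dsync",  # fsync each block, roughly approximating etcd's write pattern
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    # dd prints its summary (bytes copied, elapsed time, throughput) on stderr.
    return result.stderr.strip()


if __name__ == "__main__":
    print(disk_write_check())
```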

@pablochacin pablochacin (Contributor, Author)

@innobead this is an upstream issue. I've invested some time researching it and, as @ereslibre mentions, the main factor seems to be how quickly etcd becomes ready, which in turn depends on the disk and maybe also the network. I wanted to check the readiness of etcd itself by calling it from the testrunner, but that is not easy as I would need a client certificate. So, as @Itxaka says, let's fix the CI and keep investigating the root cause.
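
For illustration, such a readiness check from the testrunner could look roughly like the sketch below; the endpoint and certificate paths are assumptions (etcd's /health on port 2379, with the healthcheck client certificate kubeadm generates under /etc/kubernetes/pki/etcd), and retrieving those files from the node is exactly the client-certificate problem mentioned above:

```python
import requests


def etcd_is_healthy(host,
                    ca="/etc/kubernetes/pki/etcd/ca.crt",
                    crt="/etc/kubernetes/pki/etcd/healthcheck-client.crt",
                    key="/etc/kubernetes/pki/etcd/healthcheck-client.key"):
    """Query etcd's /health endpoint with a client certificate.

    The certificate files live on the master node itself, which is why the
    testrunner cannot easily run this check from the outside.
    """
    try:
        resp = requests.get(f"https://{host}:2379/health",
                            cert=(crt, key), verify=ca, timeout=5)
        return resp.ok and resp.json().get("health") == "true"
    except (requests.RequestException, ValueError):
        return False
```

Even with such a check in place, a healthy response can be a false positive while the cluster is still settling after a member is added, which is why a fixed delay was chosen for CI.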

@pablochacin pablochacin merged commit 9508f11 into SUSE:master Nov 21, 2019
@pablochacin pablochacin (Contributor, Author)

@ereslibre @jordimassaguerpla tried to adapt the Terraform configuration to use local storage, but apparently it was not easy. Maybe we could give it a second try?

@innobead innobead (Contributor)

> @innobead this is an upstream issue. [...] So, as @Itxaka says, let's fix the CI and keep investigating the root cause.

Yep, got your point. Let's do it this way. Thanks.

@pablochacin pablochacin deleted the add-delay-join-nodes branch July 10, 2020 10:24