Conversation
Add delay to allow etcd to become ready. Signed-off-by: Pablo Chacin <[email protected]>
I'm +1 to the change in CI, but eventually we should check what's wrong in skuba and, ultimately, in kubeadm. kubeadm introduced some changes recently that might help in this regard: kubernetes/kubernetes#85201. Also, in theory, kubeadm checks the etcd health status before continuing: https://github.com/kubernetes/kubernetes/blob/5bac42bff9bfb9dfe0f2ea40f1c80cac47fc12b2/cmd/kubeadm/app/cmd/join.go#L207, which runs https://github.com/kubernetes/kubernetes/blob/5bac42bff9bfb9dfe0f2ea40f1c80cac47fc12b2/cmd/kubeadm/app/phases/etcd/local.go#L85.
So maybe this is something we want to dig into; I'm a bit worried that we are seeing these errors at all, since kubeadm is supposed to be checking this already.
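For reference, here is a minimal sketch of the kind of health probe kubeadm performs against a local etcd member. The endpoint, certificate paths, and timeouts are illustrative assumptions, not the exact values kubeadm uses:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"go.etcd.io/etcd/clientv3"
	"go.etcd.io/etcd/pkg/transport"
)

// etcdHealthy probes a single etcd endpoint. Client certificates are
// required to talk to etcd's client port; the paths below are the usual
// kubeadm locations, used here as assumptions.
func etcdHealthy(endpoint string) error {
	tlsInfo := transport.TLSInfo{
		CertFile:      "/etc/kubernetes/pki/etcd/healthcheck-client.crt",
		KeyFile:       "/etc/kubernetes/pki/etcd/healthcheck-client.key",
		TrustedCAFile: "/etc/kubernetes/pki/etcd/ca.crt",
	}
	tlsConfig, err := tlsInfo.ClientConfig()
	if err != nil {
		return err
	}

	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{endpoint},
		DialTimeout: 5 * time.Second,
		TLS:         tlsConfig,
	})
	if err != nil {
		return err
	}
	defer cli.Close()

	// Status succeeds as soon as the member at the endpoint answers. A
	// member that is still catching up on the raft log can still answer,
	// which is one way a "healthy" result can be a false positive.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	_, err = cli.Status(ctx, endpoint)
	return err
}

func main() {
	if err := etcdHealthy("https://127.0.0.1:2379"); err != nil {
		fmt.Println("etcd not ready:", err)
		return
	}
	fmt.Println("etcd ready")
}
```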
LGTM
We need to have a way to fix the issue first.
ref: today's nightly run failed due to the same issue: https://ci.suse.de/view/CaaSP/job/caasp-jobs/job/caasp-v4-vmware-nightly/183/execution/node/44/log/
cc @c3y1huang
Let's fix CI first, find the root issue later!
Ah, okay, looking at https://ci.suse.de/view/CaaSP/job/caasp-jobs/job/caasp-v4-vmware-nightly/183/execution/node/44/log/, I see what's going on. We are checking healthiness correctly with kubeadm; the problem comes with the [...]

One question: on every CI run, are we checking the disk speed with [...]?
@innobead this is an upstream issue. I've invested some time researching it and, as @ereslibre mentions, the main factor seems to be how quickly etcd becomes ready, which in turn depends on the disk and maybe also on the network. I wanted to check the readiness of etcd itself by calling it from the testrunner, but that is not easy, as I would need a client certificate. So, as @Itxaka says, let's fix the CI and keep investigating the root cause.
@ereslibre @jordimassaguerpla tried to adapt the Terraform configuration to use local storage, but apparently it was not easy. Maybe we could give it a second try?
Yep, got your point. Let's do it this way. Thanks.
Why is this PR needed?
After joining a master, some time is required to allow etcd to become stable. Trying to join another node may fail with diverse errors, depending on the status of etcd at the moment of the join.
This is a well-known issue upstream, and the only reliable way to work around it is to force a wait (checking the node status or the etcd status is not reliable, as it may give false positives).
Fixes https://github.com/SUSE/avant-garde/issues/1061
What does this PR do?
Forces a delay after joining master nodes and exposes this delay as a testrunner argument.
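As a rough illustration of the shape of the change, here is a sketch of a fixed, configurable wait after each master join. The flag name and default are hypothetical; the actual testrunner argument may differ:

```go
package main

import (
	"flag"
	"fmt"
	"time"
)

func main() {
	// Hypothetical flag; the real testrunner option may be named differently.
	joinDelay := flag.Duration("join-delay", 60*time.Second,
		"fixed wait after joining a master node, to let etcd stabilize")
	flag.Parse()

	// ... run the join for the master node here ...

	// A plain sleep is used instead of polling node/etcd status, since
	// those checks can report ready before etcd is actually stable.
	fmt.Printf("waiting %s for etcd to stabilize\n", *joinDelay)
	time.Sleep(*joinDelay)
}
```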
Merge restrictions
(Please do not edit this)
We are in the v4-maintenance phase, so we will restrict what can be merged to prevent unexpected surprises: