Wait for validation to succeed N consecutive times #8515
Conversation
Force-pushed from 41e2875 to 2938dd8 (Compare)
I think we might want to replicate From #8088 / #8078 it's
...and we should default
I don't think I want to replicate that; it is an internal feature. This is meant to replace node validation on the e2e tests side, where things will not be predictable. This is why it requires 10 successes in a row. Maybe that can be decreased, but not to 2.
This would change the current behaviour of the feature. Not sure this is desired.
Do we have any data on how the e2e flakiness is behaving? I actually tend to doubt that the needs are all that different. If we're routinely getting flapping, the users of

For two validations to be insufficient, the cluster would have to flap at least twice and two of the flapping successes would have to line up with the checks. Our data so far show that nodes flap once upon creation.

No matter what we do, we're going to have a trade-off of wall-clock time versus the risk of returning success before a flap; I'd prefer we have a fair idea of the behavior we're addressing, rather than just throw a wildly guessed set of delays at the process.
This code is waiting the whole
@johngmyers this is NOT about "known" flakiness; it is about making sure the env is stable before running e2e. You cannot predict how a bad change will affect flakiness. You are just assuming that I am expecting a regularly occurring flakiness for which we have a quick fix. It is not a command that you will run (unless you want to), but it is expected to run in e2e tests and replace this: https://github.com/kubernetes/test-infra/blob/8eefa866327706ca2ed7048e5a53437917d92f0d/kubetest/kops.go#L445. The time interval between successful checks can probably be reduced, but I would like to get some more feedback on this before adding more changes.
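For context, a minimal sketch of the naive approach being discussed for replacement: retry until a single validation succeeds within the wait window. This is only an illustration, not the linked kubetest code; `validateOnce` and the fixed interval are assumed placeholders.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// validateOnce stands in for a single `kops validate cluster` run.
// It is a hypothetical placeholder, not the real kubetest or kops code.
func validateOnce() error { return nil }

// naiveValidate returns as soon as one validation succeeds within the wait
// window, roughly the "pass validation once" behavior described above.
func naiveValidate(wait, interval time.Duration) error {
	deadline := time.Now().Add(wait)
	for time.Now().Before(deadline) {
		if err := validateOnce(); err == nil {
			return nil // a single success is treated as "cluster is ready"
		}
		time.Sleep(interval)
	}
	return errors.New("timed out waiting for the cluster to validate")
}

func main() {
	fmt.Println(naiveValidate(10*time.Minute, 30*time.Second))
}
```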
What is known as "flakiness" is a too-high rate of intermittent false positives, when a change that is good incorrectly fails the tests. When evaluating fixes to these, one may reasonably take as an axiom that the change is good. One does have to keep in mind the risk of increasing false negatives. Passing validation too early, however, is unlikely to cause the e2e tests to pass when they otherwise would have failed.
/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hakman, zetaab

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing
My objections to this still stand. e2e tests should not be a special snowflake and the
There are cases where clusters are unstable in unpredictable ways before starting e2e. I would rather wait a bit more and know in advance that the cluster is unstable than investigate the root cause of failing tests. Here is an example of such an unstable cluster that passes 4 checks in a row:

Rolling update uses 30 seconds between unsuccessful checks and 10 seconds between successful checks. Rolling update has to be fast; you cannot press a button to proceed to the next node. 2 successful validations may be good enough there, and for sure better than 1. I don't want to debate statistics.
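As a side note on those intervals, here is a minimal sketch of the "different pause after success vs. failure" idea, using the 30-second and 10-second values quoted above; the helper name is illustrative only and not a kops API.

```go
package main

import (
	"fmt"
	"time"
)

// nextInterval picks the pause before the next validation attempt, using the
// rolling-update values quoted above: a longer back-off after a failed check
// and a shorter pause between successful checks. Illustrative only.
func nextInterval(lastSucceeded bool) time.Duration {
	if lastSucceeded {
		return 10 * time.Second
	}
	return 30 * time.Second
}

func main() {
	fmt.Println(nextInterval(false), nextInterval(true)) // 30s 10s
}
```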
That log file does not have any output from
That is because

Also, you can check the comments in #kops-dev from yesterday:
This PR is to change

If kops-controller is crashing, then improving the probability of e2e passing is undesirable.
This would replace the naive implementation e2e is using. Validating the cluster N times before proceeding should improve things.
It proves that the cluster can fail in unexpected ways. If you want to only validate a cluster 2 consecutive times, it is your choice. Anyone can choose how many times to do it in their own setup. I don't see any reason to have very strong opinions here. The default should stay 0, at least until this gets tested a bit more and maybe the intervals are tweaked.
The intention of this PR is exactly the opposite. Validation should fail earlier.
Force-pushed from 2938dd8 to ad247a9 (Compare)
/test pull-kops-e2e-kubernetes-aws
/retest
I suggest, before replacing the implementation e2e is using, first trying it out in a new testgrid job. Get the data on how
I don't see how that leads to the conclusion that
By waiting for N consecutive checks to pass before proceeding, it prevents the failures during those N times from causing the test to fail. If, on the other hand, a failure of one of those subsequent cluster validations caused e2e to fail, that would have likely caught the change that caused kops-controller to crash.
As discussed during office hours, validation will fail if any of the consecutive validations fails.
@johngmyers can you check if it looks ok now?
Fixes #8743
/lgtm
Thanks @zetaab :)
/retest
Validation of the cluster during e2e tests is not that reliable. Nodes are expected to be in `Ready` state and pass validation once: https://github.com/kubernetes/test-infra/blob/8eefa866327706ca2ed7048e5a53437917d92f0d/kubetest/kops.go#L444-L452

A more reliable approach would be for Kops validation to succeed N consecutive times during a time interval. This PR improves cluster validation by adding a `--count` flag that can be used together with `--wait` to achieve this.
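To make the intended behavior concrete, below is a minimal sketch of such a loop: retry until the first success within the `--wait` window, then require `--count` consecutive successes, failing early if one of the follow-up validations fails (the semantics discussed above). This is an illustration under those assumptions, not the actual kops implementation; `validateOnce` and the fixed 10-second interval are placeholders.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// validateOnce stands in for a single cluster validation pass
// (hypothetical placeholder, not the real kops validation code).
func validateOnce() error { return nil }

// validateConsecutive retries until the first success within `wait`, then
// requires `count` consecutive successes, failing as soon as a follow-up
// validation fails -- the behavior described for --count together with --wait.
func validateConsecutive(wait time.Duration, count int, interval time.Duration) error {
	deadline := time.Now().Add(wait)
	successes := 0
	for {
		err := validateOnce()
		switch {
		case err == nil:
			successes++
			if successes >= count {
				return nil // validated `count` times in a row
			}
		case successes > 0:
			// A failure after counting has started fails validation early.
			return fmt.Errorf("cluster failed validation after %d consecutive successes: %w", successes, err)
		case time.Now().After(deadline):
			return errors.New("timed out waiting for the cluster to validate")
		}
		time.Sleep(interval)
	}
}

func main() {
	// Roughly analogous to `kops validate cluster --wait 10m --count 10`.
	if err := validateConsecutive(10*time.Minute, 10, 10*time.Second); err != nil {
		fmt.Println("validation failed:", err)
		return
	}
	fmt.Println("cluster validated 10 consecutive times")
}
```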