
Wait for validation to succeed N consecutive times #8515

Merged
2 commits merged into kubernetes:master from validate-wait-consecutive on Mar 20, 2020

Conversation

@hakman (Member) commented Feb 8, 2020

Validation of the cluster during e2e tests is not very reliable. Nodes only need to be in the Ready state and pass validation once:
https://github.com/kubernetes/test-infra/blob/8eefa866327706ca2ed7048e5a53437917d92f0d/kubetest/kops.go#L444-L452

A more reliable approach would be for kops validation to succeed N consecutive times within a time interval. This PR improves cluster validation by adding a --count flag that can be used together with --wait to achieve this.
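
For illustration, here is a minimal Go sketch of the behaviour described here: keep validating until N attempts in a row succeed or the wait expires. The function and parameter names are hypothetical, not the actual kops implementation.

```go
package main

import (
	"fmt"
	"time"
)

// waitForConsecutiveSuccesses is a hypothetical sketch, not the kops code:
// it calls validateOnce until `count` consecutive attempts succeed within
// `wait`, resetting the streak whenever an attempt fails.
func waitForConsecutiveSuccesses(validateOnce func() error, count int, interval, wait time.Duration) error {
	deadline := time.Now().Add(wait)
	successes := 0
	for {
		if err := validateOnce(); err != nil {
			successes = 0 // a failed validation resets the streak
		} else {
			successes++
			if successes >= count {
				return nil
			}
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("cluster did not validate %d consecutive times within %v", count, wait)
		}
		time.Sleep(interval)
	}
}

func main() {
	// Toy usage: a validation function that always succeeds.
	err := waitForConsecutiveSuccesses(func() error { return nil }, 3, time.Second, time.Minute)
	fmt.Println("validation result:", err)
}
```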

@k8s-ci-robot added the cncf-cla: yes and size/M labels on Feb 8, 2020
@hakman force-pushed the validate-wait-consecutive branch from 41e2875 to 2938dd8 on February 8, 2020 04:41
@hakman (Member, Author) commented Feb 8, 2020

/assign @rifelpet
/cc @justinsb

@johngmyers (Member) commented:

I think we might want to replicate RollingUpdateCluster's ValidateSuccessDuration. Only require two successful validations in a row and possibly allow adjusting the interval after the first successful validation.

From #8088 / #8078, it's kube-controller-manager that causes the flapping, and a node tends to flap only once.

@johngmyers (Member) commented:

...and we should default kops validate cluster --wait to requiring two successful validations.

@hakman (Member, Author) commented Feb 8, 2020

I think we might want to replicate RollingUpdateCluster's ValidateSuccessDuration. Only require two successful validations in a row and possibly allow adjusting the interval after the first successful validation.

I don't think I want to replicate that; it is an internal feature. This is intended to replace node validation on the e2e tests side, where things will not be predictable. This is why it requires 10 successes in a row. Maybe it can be decreased, but not to 2.

...and we should default kops validate cluster --wait to requiring two successful validations.

This would change the current behaviour of the feature. I'm not sure that is desired.

@johngmyers (Member) commented:

Do we have any data on how the e2e flakiness is behaving?

I actually tend to doubt that the needs are all that different. If we're routinely getting flapping, the users of kops validate cluster --wait probably want it to correct for that.

For two validations to be insufficient, the cluster would have to flap at least twice and two of the flapping successes would have to line up with the checks. Our data so far show that nodes flap once upon creation. No matter what we do, we're going to have a trade-off of wall-clock time versus risk of returning success before a flap; I'd prefer we have a fair idea of what the behavior we're addressing is, rather than just throw a wildly guessed set of delays at the process.

@johngmyers (Member) commented Feb 9, 2020

This code waits the whole pollInterval after a successful validation. In the rolling update code, we wait a much shorter amount of time after a successful validation, because the time a cluster stayed successful before flipping back to failure was much shorter.
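
A small hedged sketch of the distinction being drawn here: re-check sooner after a success than after a failure. The 30s/10s values are the rolling-update figures quoted later in this thread, and the function name is made up, not read from the kops source.

```go
package polling

import "time"

// nextInterval sketches the rolling-update style of polling described
// above: wait less before re-checking after a successful validation than
// after a failed one. Durations are illustrative only.
func nextInterval(lastValidationSucceeded bool) time.Duration {
	if lastValidationSucceeded {
		return 10 * time.Second // confirm a success quickly
	}
	return 30 * time.Second // back off while the cluster is still failing
}
```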

@hakman (Member, Author) commented Feb 9, 2020

@johngmyers this is NOT about "known" flakiness; it is about making sure the environment is stable before running e2e. You cannot predict how a bad change will affect flakiness. You are just assuming that I am expecting a regularly occurring flakiness for which we have a quick fix.

It is not a command that you will run (unless you want to), but it is expected to run in e2e tests and replace this: https://github.com/kubernetes/test-infra/blob/8eefa866327706ca2ed7048e5a53437917d92f0d/kubetest/kops.go#L445.

Probably the time interval between successful checks can be reduced, but I would like to get some more feedback on this before adding more changes.

@johngmyers (Member) commented:

What is known as "flakiness" is an overly high rate of intermittent false positives, where a change that is good incorrectly fails the tests. When evaluating fixes for these, one may reasonably take as an axiom that the change is good.

One does have to keep in mind the risk of increasing false negatives. Passing validation too early, however, is unlikely to cause the e2e tests to pass when they otherwise would have failed.

@zetaab (Member) left a review comment:

/approve

@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hakman, zetaab

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label on Feb 20, 2020
@johngmyers (Member) commented:

My objections to this still stand. The e2e tests should not be a special snowflake, and the kops validate cluster command should not be so different from validation during rolling update.

@hakman (Member, Author) commented Feb 21, 2020

There are cases where clusters are unstable in unpredictable ways before starting e2e. I would rather wait a bit longer and know in advance that the cluster is unstable than investigate the root cause of failing tests. Here is an example of such an unstable cluster that passes 4 checks in a row:
https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-kops-aws-containerd/1227778845420032003/build-log.txt

Rolling update uses 30 seconds between unsuccessful checks and 10 seconds between successful checks.
Validate cluster uses 10 seconds between any check. It was already different before I added the consecutive validations counter.

Rolling update has to be fast; there is no button you can press to proceed to the next node. 2 successful validations may be good enough there, and it is certainly better than 1. I don't want to debate statistics.
Validate cluster is a command that users run when they want to make sure the cluster is stable. Anyone can decide how many successful validations in a row are good enough for them, and anyone can press Ctrl+C at any time to stop it.

@johngmyers (Member) commented:

That log file does not have any output from kops validate cluster. What relevance does it have?

@hakman (Member, Author) commented Feb 21, 2020

That log file does not have any output from kops validate cluster. What relevance does it have?

That is because kops validate cluster runs after the current, simpler node validation sequence that this PR wants to improve.
Look for success count = 4. It should appear 2 times, and it might have appeared more if the run hadn't timed out.

You can also check the comments in #kops-dev from yesterday:

kops-controller is broken for me in master (including my typo fix PR). the cluster sporadically validates because all the kops-controllers are Running for some seconds before crashing. The error is a timeout talking to the internal API address, so it may be a cilium thing (although I cannot say why). Still, some health checking would have caught this earlier

@johngmyers (Member) commented:

This PR is to change kops validate cluster. How would that improve the much more naïve and flappy sequence that e2e is using? The flapping of e2e's "validation" proves nothing about the flapping behavior of kops validate cluster.

If kops-controller is crashing, then improving the probability of e2e passing is undesirable.

@hakman (Member, Author) commented Feb 21, 2020

How would that improve the much more naïve and flappy sequence that e2e is using?

This would replace the naive implementation e2e is using. Validating the cluster N times before proceeding should improve things.

The flapping of e2e's "validation" proves nothing about the flapping behavior of kops validate cluster.

It proves that the cluster can fail in unexpected ways. If you want to only validate a cluster 2 consecutive times, that is your choice; anyone can choose how many times to do it in their own setup. I don't see any reason to have very strong opinions here. The default should stay 0, at least until this gets tested a bit more and maybe the intervals are tweaked.

If kops-controller is crashing, then improving the probability of e2e passing is undesirable.

The intention of this PR is exactly the opposite: validation should fail earlier.
I really don't get how it would improve the probability of e2e passing.

@hakman force-pushed the validate-wait-consecutive branch from 2938dd8 to ad247a9 on February 21, 2020 14:18
@hakman (Member, Author) commented Feb 22, 2020

/test pull-kops-e2e-kubernetes-aws

@hakman (Member, Author) commented Feb 22, 2020

/retest

@johngmyers (Member) commented:

This would replace the naive implementation e2e is using. Validating the cluster N times before proceeding should improve things.

I suggest replacing the implementation e2e is using first, trying it out in a new testgrid job. Get the data on how kops validate cluster is behaving, then suggest improvements to it.

It proves that the cluster can fail in unexpected ways.

I don't see how that leads to the conclusion that kops validate cluster should pass N times before allowing e2e to start. And since kops validate cluster will report failure in many situations where current e2e will report success, it doesn't necessarily prove that it can fail in ways that kops validate cluster doesn't expect.

The intention of this PR is exactly the opposite: validation should fail earlier.
I really don't get how it would improve the probability of e2e passing.

By waiting for N consecutive checks to pass before proceeding, it prevents failures during that waiting period from causing the test run to fail.

If, on the other hand, a failure of one of those subsequent cluster validations caused e2e to fail, that would have likely caught the change that caused kops-controller to crash.

@hakman (Member, Author) commented Mar 2, 2020

As discussed during office hours, validation will fail if any of the consecutive validations fails.
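
A hedged Go sketch of one reading of that revised behaviour (hypothetical names, not the actual kops code): keep retrying until the first success, then require the rest of the streak to pass, failing immediately if any validation in the streak fails.

```go
package main

import (
	"fmt"
	"time"
)

// validateWithStreak sketches the revised semantics discussed above:
// retry until the first successful validation (or the wait expires),
// then the next count-1 validations must all succeed; any failure in
// that streak fails the whole run. Helper names are hypothetical.
func validateWithStreak(validateOnce func() error, count int, interval, wait time.Duration) error {
	deadline := time.Now().Add(wait)

	// Phase 1: wait for the first successful validation.
	for {
		if err := validateOnce(); err == nil {
			break
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("no successful validation within %v", wait)
		}
		time.Sleep(interval)
	}

	// Phase 2: the remaining validations must all succeed.
	for i := 1; i < count; i++ {
		time.Sleep(interval)
		if err := validateOnce(); err != nil {
			return fmt.Errorf("validation %d of %d failed: %w", i+1, count, err)
		}
	}
	return nil
}

func main() {
	// Toy usage with an always-succeeding validation function.
	err := validateWithStreak(func() error { return nil }, 3, time.Second, time.Minute)
	fmt.Println("result:", err)
}
```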

@hakman (Member, Author) commented Mar 3, 2020

@johngmyers can you check if it looks ok now?

@hakman (Member, Author) commented Mar 20, 2020

Fixes #8743

@zetaab (Member) left a review comment:

/lgtm

@k8s-ci-robot added the lgtm label on Mar 20, 2020
@hakman (Member, Author) commented Mar 20, 2020

Thanks @zetaab :)

@hakman (Member, Author) commented Mar 20, 2020

/retest

@k8s-ci-robot merged commit 51e8563 into kubernetes:master on Mar 20, 2020
@k8s-ci-robot added this to the v1.18 milestone on Mar 20, 2020
@hakman deleted the validate-wait-consecutive branch on March 27, 2020 07:43