
Wait for validation to succeed N consecutive times #8515

Merged
2 commits merged into kubernetes:master from validate-wait-consecutive on Mar 20, 2020

Conversation

@hakman (Member) commented Feb 8, 2020

Validation of the cluster during e2e tests is not very reliable. Nodes only need to be in the Ready state and pass validation once:
https://github.com/kubernetes/test-infra/blob/8eefa866327706ca2ed7048e5a53437917d92f0d/kubetest/kops.go#L444-L452

A more reliable approach would be for kops validation to succeed N consecutive times within a time interval. This PR improves cluster validation by adding a --count flag that can be used together with --wait to achieve this.
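
For illustration, here is a minimal Go sketch of the behaviour described here: keep validating until N attempts in a row succeed or the wait expires. The function and parameter names are hypothetical, not the actual kops implementation.

```go
package main

import (
	"fmt"
	"time"
)

// waitForConsecutiveSuccesses is a hypothetical sketch, not the kops code:
// it calls validateOnce until `count` consecutive attempts succeed within
// `wait`, resetting the streak whenever an attempt fails.
func waitForConsecutiveSuccesses(validateOnce func() error, count int, interval, wait time.Duration) error {
	deadline := time.Now().Add(wait)
	successes := 0
	for {
		if err := validateOnce(); err != nil {
			successes = 0 // a failed validation resets the streak
		} else {
			successes++
			if successes >= count {
				return nil
			}
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("cluster did not validate %d consecutive times within %v", count, wait)
		}
		time.Sleep(interval)
	}
}

func main() {
	// Toy usage: a validation function that always succeeds.
	err := waitForConsecutiveSuccesses(func() error { return nil }, 3, time.Second, time.Minute)
	fmt.Println("validation result:", err)
}
```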

@k8s-ci-robot added the cncf-cla: yes and size/M labels on Feb 8, 2020
@hakman force-pushed the validate-wait-consecutive branch from 41e2875 to 2938dd8 on February 8, 2020 04:41
@hakman (Member, Author) commented Feb 8, 2020

/assign @rifelpet
/cc @justinsb

@johngmyers (Member) commented:

I think we might want to replicate RollingUpdateCluster's ValidateSuccessDuration. Only require two successful validations in a row and possibly allow adjusting the interval after the first successful validation.

From #8088 / #8078, it's kube-controller-manager that causes the flapping, and a node tends to flap only once.

@johngmyers (Member) commented:

...and we should default kops validate cluster --wait to requiring two successful validations.

@hakman (Member, Author) commented Feb 8, 2020

I think we might want to replicate RollingUpdateCluster's ValidateSuccessDuration. Only require two successful validations in a row and possibly allow adjusting the interval after the first successful validation.

I don't think I want to replicate that; it is an internal feature. This is intended to replace node validation on the e2e tests side, where things will not be predictable. This is why it requires 10 successes in a row. Maybe it can be decreased, but not to 2.

...and we should default kops validate cluster --wait to requiring two successful validations.

This would change the current behaviour of the feature. I'm not sure that is desired.

@johngmyers (Member) commented:

Do we have any data on how the e2e flakiness is behaving?

I actually tend to doubt that the needs are all that different. If we're routinely getting flapping, the users of kops validate cluster --wait probably want it to correct for that.

For two validations to be insufficient, the cluster would have to flap at least twice and two of the flapping successes would have to line up with the checks. Our data so far show that nodes flap once upon creation. No matter what we do, we're going to have a trade-off of wall-clock time versus risk of returning success before a flap; I'd prefer we have a fair idea of what the behavior we're addressing is, rather than just throw a wildly guessed set of delays at the process.

@johngmyers (Member) commented Feb 9, 2020

This code waits the whole pollInterval after a successful validation. In the rolling update code, we wait a much shorter amount of time after a successful validation, because the time a cluster stayed successful before flipping back to failure was much shorter.
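
A small hedged sketch of the distinction being drawn here: re-check sooner after a success than after a failure. The 30s/10s values are the rolling-update figures quoted later in this thread, and the function name is made up, not read from the kops source.

```go
package polling

import "time"

// nextInterval sketches the rolling-update style of polling described
// above: wait less before re-checking after a successful validation than
// after a failed one. Durations are illustrative only.
func nextInterval(lastValidationSucceeded bool) time.Duration {
	if lastValidationSucceeded {
		return 10 * time.Second // confirm a success quickly
	}
	return 30 * time.Second // back off while the cluster is still failing
}
```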

@hakman (Member, Author) commented Feb 9, 2020

@johngmyers this is NOT about "known" flakiness; it is about making sure the environment is stable before running e2e. You cannot predict how a bad change will affect flakiness. You are just assuming that I am expecting a regularly occurring flakiness for which we have a quick fix.

It is not a command that you will run (unless you want to), but it is expected to run in e2e tests and replace this: https://github.com/kubernetes/test-infra/blob/8eefa866327706ca2ed7048e5a53437917d92f0d/kubetest/kops.go#L445.

Probably the time interval between successful checks can be reduced, but I would like to get some more feedback on this before adding more changes.

@johngmyers (Member) commented:

What is known as "flakiness" is an overly high rate of intermittent false positives, where a change that is good incorrectly fails the tests. When evaluating fixes for these, one may reasonably take as an axiom that the change is good.

One does have to keep in mind the risk of increasing false negatives. Passing validation too early, however, is unlikely to cause the e2e tests to pass when they otherwise would have failed.

@zetaab (Member) left a review comment:

/approve

@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hakman, zetaab

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label on Feb 20, 2020
@johngmyers (Member) commented:

My objections to this still stand. The e2e tests should not be a special snowflake, and the kops validate cluster command should not be so different from validation during rolling update.

@hakman (Member, Author) commented Feb 21, 2020

There are cases where clusters are unstable in unpredictable ways before starting e2e. I would rather wait a bit longer and know in advance that the cluster is unstable than investigate the root cause of failing tests. Here is an example of such an unstable cluster that passes 4 checks in a row:
https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-kops-aws-containerd/1227778845420032003/build-log.txt

Rolling update uses 30 seconds between unsuccessful checks and 10 seconds between successful checks.
Validate cluster uses 10 seconds between any check. It was already different before I added the consecutive validations counter.

Rolling update has to be fast; there is no button you can press to proceed to the next node. 2 successful validations may be good enough there, and it is certainly better than 1. I don't want to debate statistics.
Validate cluster is a command that users run when they want to make sure the cluster is stable. Anyone can decide how many successful validations in a row are good enough for them, and anyone can press Ctrl+C at any time to stop it.

@johngmyers (Member) commented:

That log file does not have any output from kops validate cluster. What relevance does it have?

@hakman (Member, Author) commented Feb 21, 2020

That log file does not have any output from kops validate cluster. What relevance does it have?

That is because kops validate cluster runs after the current, simpler node validation sequence that this PR wants to improve.
Look for success count = 4. It should appear 2 times, and it might have appeared more if the run hadn't timed out.

You can also check the comments in #kops-dev from yesterday:

kops-controller is broken for me in master (including my typo fix PR). the cluster sporadically validates because all the kops-controllers are Running for some seconds before crashing. The error is a timeout talking to the internal API address, so it may be a cilium thing (although I cannot say why). Still, some health checking would have caught this earlier

@johngmyers (Member) commented:

This PR is to change kops validate cluster. How would that improve the much more naïve and flappy sequence that e2e is using? The flapping of e2e's "validation" proves nothing about the flapping behavior of kops validate cluster.

If kops-controller is crashing, then improving the probability of e2e passing is undesirable.

@hakman (Member, Author) commented Feb 21, 2020

How would that improve the much more naïve and flappy sequence that e2e is using?

This would replace the naive implementation e2e is using. Validating the cluster N times before proceeding should improve things.

The flapping of e2e's "validation" proves nothing about the flapping behavior of kops validate cluster.

It proves that the cluster can fail in unexpected ways. If you want to only validate a cluster 2 consecutive times, that is your choice; anyone can choose how many times to do it in their own setup. I don't see any reason to have very strong opinions here. The default should stay 0, at least until this gets tested a bit more and maybe the intervals are tweaked.

If kops-controller is crashing, then improving the probability of e2e passing is undesirable.

The intention of this PR is exactly the opposite: validation should fail earlier.
I really don't get how it would improve the probability of e2e passing.

@hakman force-pushed the validate-wait-consecutive branch from 2938dd8 to ad247a9 on February 21, 2020 14:18
@hakman (Member, Author) commented Feb 22, 2020

/test pull-kops-e2e-kubernetes-aws

@hakman (Member, Author) commented Feb 22, 2020

/retest

@johngmyers (Member) commented:

This would replace the naive implementation e2e is using. Validating the cluster N times before proceeding should improve things.

I suggest replacing the implementation e2e is using first, trying it out in a new testgrid job. Get the data on how kops validate cluster is behaving, then suggest improvements to it.

It proves that the cluster can fail in unexpected ways.

I don't see how that leads to the conclusion that kops validate cluster should pass N times before allowing e2e to start. And since kops validate cluster will report failure in many situations where current e2e will report success, it doesn't necessarily prove that it can fail in ways that kops validate cluster doesn't expect.

The intention of this PR is exactly the opposite: validation should fail earlier.
I really don't get how it would improve the probability of e2e passing.

By waiting for N consecutive checks to pass before proceeding, it prevents failures during that waiting period from causing the test run to fail.

If, on the other hand, a failure of one of those subsequent cluster validations caused e2e to fail, that would have likely caught the change that caused kops-controller to crash.

@hakman (Member, Author) commented Mar 2, 2020

As discussed during office hours, validation will fail if any of the consecutive validations fails.
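
A hedged Go sketch of one reading of that revised behaviour (hypothetical names, not the actual kops code): keep retrying until the first success, then require the rest of the streak to pass, failing immediately if any validation in the streak fails.

```go
package main

import (
	"fmt"
	"time"
)

// validateWithStreak sketches the revised semantics discussed above:
// retry until the first successful validation (or the wait expires),
// then the next count-1 validations must all succeed; any failure in
// that streak fails the whole run. Helper names are hypothetical.
func validateWithStreak(validateOnce func() error, count int, interval, wait time.Duration) error {
	deadline := time.Now().Add(wait)

	// Phase 1: wait for the first successful validation.
	for {
		if err := validateOnce(); err == nil {
			break
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("no successful validation within %v", wait)
		}
		time.Sleep(interval)
	}

	// Phase 2: the remaining validations must all succeed.
	for i := 1; i < count; i++ {
		time.Sleep(interval)
		if err := validateOnce(); err != nil {
			return fmt.Errorf("validation %d of %d failed: %w", i+1, count, err)
		}
	}
	return nil
}

func main() {
	// Toy usage with an always-succeeding validation function.
	err := validateWithStreak(func() error { return nil }, 3, time.Second, time.Minute)
	fmt.Println("result:", err)
}
```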

@hakman (Member, Author) commented Mar 3, 2020

@johngmyers can you check if it looks ok now?

@hakman (Member, Author) commented Mar 20, 2020

Fixes #8743

@zetaab (Member) left a review comment:

/lgtm

@k8s-ci-robot added the lgtm label on Mar 20, 2020
@hakman (Member, Author) commented Mar 20, 2020

Thanks @zetaab :)

@hakman (Member, Author) commented Mar 20, 2020

/retest

@k8s-ci-robot merged commit 51e8563 into kubernetes:master on Mar 20, 2020
@k8s-ci-robot added this to the v1.18 milestone on Mar 20, 2020
@hakman deleted the validate-wait-consecutive branch on March 27, 2020 07:43