test: more resilient “wait for successful pod readiness checks” #2015

jackfrancis · 2019-09-24T00:27:28Z

Reason for Change:

This PR changes how we validate that a pod, or a common set of pods, is in a persistent, reliable "Ready" state. Instead of counting "successful" checks and comparing them against "unsuccessful" checks, we now detect "flaps" (transitions between Ready and non-Ready, which are in fact normal Kubernetes pod behavior) and ensure that the number of flaps stays within a tolerable range.

Specifically:

Once we get a positive Ready signal, we continue attempting to accumulate the desired success count until we detect a non-Ready signal. A ready --> non-ready transition counts as a single "flap". We then wait for the re-occurrence of a successive Ready signal, which then counts towards the targeted number of successes. Rinse, repeat.

We return success if the desired number of Ready counts is reached and that number is at least 2 more than the number of flaps. In practice this means that a simple check of "just give me a signal that one Ready state has occurred" works fine (flap detection is never engaged in this scenario), and that 4 is probably the minimum number of successes to target for scenarios where you want more than 1 Ready count. A target of 2 is very aggressive, since any single flap fails the test, and 3 is arguably not tolerant enough for "normal" reconciliation conditions.
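To make the counting rule above concrete, here is a minimal sketch of how the flap tracking could be expressed as a pure function over a recorded sequence of readiness observations. It is not the code in this PR: `flapTracker`, `observe`, and `replay` are hypothetical names, and failing early as soon as the target can no longer be at least 2 more than the flap count is an assumption based on the description.

```go
package main

import "fmt"

// flapTracker is a hypothetical helper for illustration; it is not the code in this PR.
type flapTracker struct {
	desired   int  // number of Ready observations to accumulate
	successes int  // Ready observations seen so far
	flaps     int  // Ready --> non-Ready transitions seen so far
	lastReady bool // whether the previous observation was Ready
}

// observe records one readiness check. done reports that a verdict was
// reached; a non-nil err means the verdict is failure due to too many flaps.
func (t *flapTracker) observe(ready bool) (done bool, err error) {
	if ready {
		t.lastReady = true
		t.successes++
		return t.successes >= t.desired, nil
	}
	if t.lastReady {
		// Only a Ready --> non-Ready transition counts as a flap;
		// repeated non-Ready observations do not.
		t.lastReady = false
		t.flaps++
		// Assumption: fail early once the target can no longer be at
		// least 2 more than the number of flaps.
		if t.desired < t.flaps+2 {
			return true, fmt.Errorf("flapped %d times while targeting %d Ready checks (%d seen)", t.flaps, t.desired, t.successes)
		}
	}
	return false, nil
}

// replay feeds a recorded sequence of observations through a tracker.
func replay(desired int, observations []bool) (bool, error) {
	t := &flapTracker{desired: desired}
	for _, ready := range observations {
		if done, err := t.observe(ready); done {
			return err == nil, err
		}
	}
	return false, fmt.Errorf("ran out of observations after %d of %d Ready checks", t.successes, desired)
}
```

Under these assumptions, a target of 1 succeeds on the first Ready signal before any flap can be counted, a target of 2 fails on the first flap, and a target of 4 tolerates two flaps: `replay(4, []bool{true, false, true, false, true, true})` succeeds, while a third flap before the fourth Ready signal fails.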

As such all of the relevant "wait for more than 1 Ready signal" usages are set to 4.

Also renamed the func to WaitOnSuccesses to more easily disambiguate it from the single-validation-call WaitOnReady.

Issue Fixed:

Requirements:

uses conventional commit messages
includes documentation
adds unit tests
tested upgrade from previous version

Notes:

acs-bot · 2019-09-24T00:29:50Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jackfrancis

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [jackfrancis]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

devigned · 2019-09-24T00:35:32Z

test/e2e/kubernetes/pod/pod.go

-						PrintPodsLogs(podPrefix, namespace)
-						return false, errors.Errorf("Pods from deployment (%s) in namespace (%s) have been checked out as all Ready %d times, but NotReady %d times. This behavior may mean it is in a crashloop", podPrefix, namespace, successCount, failureCount)
+			mostRecentWaitOnSuccessesErr = result.err
+			if mostRecentWaitOnSuccessesErr == nil {


Total nit, but might be good to invert the if statement to reduce nesting. For example:

if mostRecentWaitOnSuccessesErr != nil {
	continue
}
// do the rest of the things without the if nesting

This is interesting. I propose we bikeshed this somewhat abstractly (at a later time), and then apply the team-approved pattern generally, because this pattern is the one we currently use to handle channel responses in these goroutine validations.
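For reference when that discussion happens, here is a minimal sketch of the inverted-if (guard clause) shape applied to a loop that consumes results from a channel. It is not the code under review: `checkResult`, `errCheckFailed`, and `countReady` are hypothetical names, and the real loop also has to handle timeouts, which this sketch leaves out.

```go
package main

import "errors"

// errCheckFailed is a hypothetical sentinel error used only in this sketch.
var errCheckFailed = errors.New("readiness check failed")

// checkResult is a hypothetical stand-in for the value a validation
// goroutine sends back on its channel.
type checkResult struct {
	ready bool
	err   error
}

// countReady drains results until the channel is closed, counting Ready
// observations. It returns the count and the error from the most recent check.
func countReady(results <-chan checkResult) (ready int, mostRecentErr error) {
	for result := range results {
		mostRecentErr = result.err
		if mostRecentErr != nil {
			// Guard clause: skip errored checks up front so the
			// happy path below needs no extra nesting.
			continue
		}
		if result.ready {
			ready++
		}
	}
	return ready, mostRecentErr
}
```

The behavior is the same as wrapping the body in `if mostRecentErr == nil { ... }`; only the nesting depth changes, which is why this is a style decision for the team and not a correctness fix.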

codecov · 2019-09-24T01:01:44Z

Codecov Report

Merging #2015 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master    #2015   +/-   ##
=======================================
  Coverage   76.73%   76.73%           
=======================================
  Files         135      135           
  Lines       20574    20574           
=======================================
  Hits        15787    15787           
  Misses       3871     3871           
  Partials      916      916

test: more resilient “wait for successful pod readiness checks”

1fda87f

acs-bot added the size/M label Sep 24, 2019

acs-bot added the approved label Sep 24, 2019

devigned reviewed Sep 24, 2019

jackfrancis merged commit 50d8411 into Azure:master Sep 24, 2019

jackfrancis deleted the e2e-pod-successes branch September 24, 2019 21:07
