This repository has been archived by the owner on Oct 24, 2023. It is now read-only.

test: more resilient “wait for successful pod readiness checks” #2015

Merged
merged 1 commit into from
Sep 24, 2019


jackfrancis
Member

Reason for Change:

This PR tweaks the way we validate that a pod or common set of pods is in a persistent, reliable "Ready" state. Instead of counting "successful" checks and comparing them against "unsuccessful" checks, we now detect "flaps" (transitions from Ready to non-Ready, which are in fact normal Kubernetes pod behavior) and ensure that the number of flaps stays within a tolerable range.

Specifically:

Once we get a positive Ready signal, we continue attempting to accumulate the desired success count until we detect a non-Ready signal. A ready --> non-ready transition counts as a single "flap". We then wait for the re-occurrence of a successive Ready signal, which then counts towards the targeted number of successes. Rinse, repeat.

We return success if the desired number of Ready counts is reached and is at least 2 more than the number of flaps. In practice this means that a simple check of "just give me a signal that one Ready state has occurred" works fine (the "flap" detection is never engaged in this scenario), and that 4 is probably the minimum number of successes to target for scenarios where you want more than 1 Ready count. 2 is super aggressive, as basically any one flap will fail the test, and 3 is arguably not tolerant enough for "normal" reconciliation conditions.

As such, all of the relevant "wait for more than 1 Ready signal" usages are now set to 4.
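The rule described above can be sketched as a small, self-contained Go function. This is an illustrative reading of the description, not the code merged in this PR: the names `evaluate` and `outcome` and the idea of replaying a slice of observations are assumptions made for the example, and the real check polls pods over time rather than reading a fixed slice.

```go
package main

import "fmt"

// outcome is a hypothetical result type used only in this sketch.
type outcome string

const (
	pending      outcome = "pending"
	succeeded    outcome = "succeeded"
	tooManyFlaps outcome = "too many flaps"
)

// evaluate replays a series of readiness observations (true = Ready) and
// applies one reading of the rule above: succeed once successesNeeded Ready
// signals have been seen, and fail as soon as the flap count exceeds
// successesNeeded-2 (so that a success would no longer be at least 2 more
// than the number of flaps).
func evaluate(observations []bool, successesNeeded int) (outcome, int, int) {
	successes, flaps := 0, 0
	wasReady := false
	for _, ready := range observations {
		if ready {
			successes++
			if successes >= successesNeeded {
				return succeeded, successes, flaps
			}
		} else if wasReady {
			// A Ready --> non-Ready transition counts as a single flap.
			flaps++
			if flaps > successesNeeded-2 {
				return tooManyFlaps, successes, flaps
			}
		}
		wasReady = ready
	}
	// The observations ran out before either threshold was reached.
	return pending, successes, flaps
}

func main() {
	// A single Ready signal satisfies a target of 1; no flap can occur first.
	fmt.Println(evaluate([]bool{true}, 1))
	// A target of 4 tolerates two flaps on the way to four Ready signals.
	fmt.Println(evaluate([]bool{true, false, true, false, true, true}, 4))
	// A target of 2 fails on the first flap.
	fmt.Println(evaluate([]bool{true, false}, 2))
}
```

Under this reading, a target of 1 can never fail on flaps, because a flap requires an earlier Ready signal and the first Ready signal already returns success.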

Also renamed the func to WaitOnSuccesses to disambiguate it more easily from the single-validation-call WaitOnReady.

Issue Fixed:

Requirements:

Notes:

@acs-bot acs-bot added the size/M label Sep 24, 2019

acs-bot commented Sep 24, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jackfrancis

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

PrintPodsLogs(podPrefix, namespace)
return false, errors.Errorf("Pods from deployment (%s) in namespace (%s) have been checked out as all Ready %d times, but NotReady %d times. This behavior may mean it is in a crashloop", podPrefix, namespace, successCount, failureCount)
mostRecentWaitOnSuccessesErr = result.err
if mostRecentWaitOnSuccessesErr == nil {
Member


Total nit, but might be good to invert the if statement to reduce nesting. For example:

if mostRecentWaitOnSuccessesErr != nil {
  continue
}
// do the rest of the things without the if nesting

Member Author


This is interesting. I propose we bikeshed this in the abstract (at a later time) and then apply the team-approved pattern generally, because this pattern is the one we currently use to handle channel responses in these goroutine validations.


codecov bot commented Sep 24, 2019

Codecov Report

Merging #2015 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master    #2015   +/-   ##
=======================================
  Coverage   76.73%   76.73%           
=======================================
  Files         135      135           
  Lines       20574    20574           
=======================================
  Hits        15787    15787           
  Misses       3871     3871           
  Partials      916      916

@jackfrancis jackfrancis merged commit 50d8411 into Azure:master Sep 24, 2019
@jackfrancis jackfrancis deleted the e2e-pod-successes branch September 24, 2019 21:07
3 participants