Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Checks to pass critical only after reaching a failure threshold #5739

Merged
merged 1 commit into from
Oct 14, 2019

Conversation

PHBourquin
Copy link
Contributor

A check may be set to become critical only if a specified number of successive
checks return critical in a row. Status will stay identical as before until
the threshold is reached.
This feature is available for HTTP, TCP, gRPC, Docker & Monitor checks.

@hashicorp-cla
Copy link

hashicorp-cla commented Apr 30, 2019

CLA assistant check
All committers have signed the CLA.

@PHBourquin PHBourquin force-pushed the check-failure-count branch 2 times, most recently from 7309450 to 5ddddae Compare May 5, 2019 15:12
statusHandler.updateCheck(checkID, api.HealthCritical, "bar")

retry.Run(t, func(r *retry.R) {
if got, want := notif.Updates("foo"), 1; got != want {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use the require lib for this which is easier to read.
In this case

require.Equal(t, 1, notif.Updates("foo"))

@PHBourquin PHBourquin force-pushed the check-failure-count branch from 5ddddae to 5a6aa68 Compare May 6, 2019 20:24
@banks banks added the type/enhancement Proposed improvement or new feature label Jun 4, 2019
@banks banks added the needs-discussion Topic needs discussion with the larger Consul maintainers before committing to for a release label Jul 2, 2019
@hanshasselberg
Copy link
Member

@PHBourquin Thanks for your PR, this would be a nice addition.
As long as we stick to counters to store we don't have any concerns about our memory consumption. I was wondering what you think about adding the same for marking a service as healthy?

Would you be able to rebase your PR so that we can review it?

@PHBourquin PHBourquin force-pushed the check-failure-count branch from 5a6aa68 to 442148d Compare July 3, 2019 13:16
@PHBourquin
Copy link
Contributor Author

@i0rek Rebase done :)
Regarding doing the same thing for marking a service as healthy, I'm not seeing the benefits of such thing ; we'd like to avoid sending 2 notifications passing -> critical -> passing if a service's instance is flapping (GC occuring during health check for exemple), but I don't see in which scenario would we want to delay a service passing to healthy while the instance is passing the HC ?

@shantanugadgil
Copy link
Contributor

@PHBourquin The count for success in addition to failure is typically already found at many places like AWS LB configuration, HAProxy rise/fall options, any typical health check prober. :)

Using only 1 for healthy is ok, but I typically use 2 healthy probes for success and 5 unhealthy probes for failure. (Just my thought)

Great work on the PR, btw.

@PHBourquin PHBourquin force-pushed the check-failure-count branch 3 times, most recently from 24409ee to 6d3fbc2 Compare September 12, 2019 13:27
@banks banks removed the needs-discussion Topic needs discussion with the larger Consul maintainers before committing to for a release label Oct 1, 2019
Copy link
Member

@banks banks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry this has sat for so long @PHBourquin there is clear interest and I think it's looking good.

I can see one issue with the threshold logic.

website/source/docs/agent/checks.html.md Outdated Show resolved Hide resolved
website/source/docs/agent/checks.html.md Outdated Show resolved Hide resolved
return
}
s.logger.Printf("[WARN] agent: Check %q failed but has not reached failure threshold %d/%d", checkID, s.failuresCounter, s.failuresBeforeCritical)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We never reset the counters which means that the thresholds will only work the first time round doesn't it?

e.g. if you have successBeforePassing = 2, failuresBeforeCritical = 2 then you get the sequence: OK, OK, OK, FAIL, FAIL, FAIL, OK

The check should not yet be passing again since there is only one OK since the last FAIL but the state here would be: successCounter = 4, successBeforePassing = 2 and so s.successCounter >= s.successBeforePassing would be true.

Unless I'm missing something? I don't think there is a test case that exercised the path beyond a single threshold traversal which would presumably catch this issue.

I think it's really easy to fix though - just need to reset success counter when failure threshold hits and vice versa but some through table tests would be good here because if you reset it too soon you also break the thresholding effect. I think there are at least 6-8 interesting check histories to enumerate here and test that this does the right think in each to be confident in the logic so a table test seems like an efficient way to do that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed the counter are not reset after different checks as they should ; counters are now set to 0 after updateCheck() with the opposite status is called, plus a test to ensure that sequences such as FAIL PASS FAIL FAIL won't update status to critical if threshold is set to 3
I think the current tests bring good coverage, but if you think there is added value to go for the full table we can go for it, your call.

Thanks for the catch !

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree those tests seem to cover the cases. I have a preference for table tests over multiple long story-based tests for cases like this because it's easier to read the table and see which cases are tested than having to keep state in your head as you read but I think this is fine for now.

website/source/docs/agent/checks.html.md Show resolved Hide resolved
@PHBourquin PHBourquin force-pushed the check-failure-count branch 3 times, most recently from a667e3c to a7f2ea1 Compare October 14, 2019 16:16
Copy link
Member

@banks banks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks great thanks!

return
}
s.logger.Printf("[WARN] agent: Check %q failed but has not reached failure threshold %d/%d", checkID, s.failuresCounter, s.failuresBeforeCritical)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree those tests seem to cover the cases. I have a preference for table tests over multiple long story-based tests for cases like this because it's easier to read the table and see which cases are tested than having to keep state in your head as you read but I think this is fine for now.

checks return passing/critical. Status will stay identical as before until
the threshold is reached.
This feature is available for HTTP, TCP, gRPC, Docker & Monitor checks.
By default, will be set to 0 for both passing & critical - a single check is needed to update status.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
By default, will be set to 0 for both passing & critical - a single check is needed to update status.
By default, both passing and critical thresholds will be set to 0 so the check status will always reflect the last check result.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@PHBourquin are you able to commit this tweak? If not I can make it but I can't commit to your fork so it makes it a bit more cumbersome.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On it

@banks
Copy link
Member

banks commented Oct 14, 2019

Thanks, I think this is read to merge - the integration test failure is a tooling issue with a change in nomad master and fixed in another adjacent PR - the acceptance tests with released versions of nomad all pass so this is OK to go.

…failure threshold

A check may be set to become passing/critical only if a specified number of successive
checks return passing/critical in a row. Status will stay identical as before until
the threshold is reached.
This feature is available for HTTP, TCP, gRPC, Docker & Monitor checks.
@banks
Copy link
Member

banks commented Oct 14, 2019

Thanks @PHBourquin! This is great.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/enhancement Proposed improvement or new feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants