Checks to pass critical only after reaching a failure threshold #5739
Conversation
Force-pushed 7309450 to 5ddddae.
agent/checks/check_test.go (outdated diff):
statusHandler.updateCheck(checkID, api.HealthCritical, "bar")

retry.Run(t, func(r *retry.R) {
	if got, want := notif.Updates("foo"), 1; got != want {
You can use the require lib for this, which is easier to read. In this case:

require.Equal(t, 1, notif.Updates("foo"))
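For illustration, the rewritten assertion might look like this (a sketch assuming *retry.R satisfies require's TestingT interface; if it does not in the version in use, the manual comparison can stay inside the retry loop):

retry.Run(t, func(r *retry.R) {
	// One assertion replaces the manual got/want comparison above.
	require.Equal(r, 1, notif.Updates("foo"))
})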
Force-pushed 5ddddae to 5a6aa68.
@PHBourquin Thanks for your PR, this would be a nice addition. Would you be able to rebase your PR so that we can review it?
Force-pushed 5a6aa68 to 442148d.
@i0rek Rebase done :)
@PHBourquin A count for success, in addition to failure, is already typical in many places: AWS LB configuration, HAProxy rise/fall options, and most health-check probers. :) Using only 1 for healthy is OK, but I typically use 2 healthy probes for success and 5 unhealthy probes for failure. (Just my thought.) Great work on the PR, btw.
Force-pushed 24409ee to 6d3fbc2.
Sorry this has sat for so long, @PHBourquin. There is clear interest and I think it's looking good.
I can see one issue with the threshold logic.
		return
	}
	s.logger.Printf("[WARN] agent: Check %q failed but has not reached failure threshold %d/%d", checkID, s.failuresCounter, s.failuresBeforeCritical)
}
We never reset the counters, which means the thresholds will only work the first time round, won't they?
e.g. if you have successBeforePassing = 2, failuresBeforeCritical = 2
then you get the sequence: OK, OK, OK, FAIL, FAIL, FAIL, OK
The check should not yet be passing again, since there has been only one OK since the last FAIL, but the state here would be successCounter = 4, successBeforePassing = 2, and so s.successCounter >= s.successBeforePassing would be true.
Unless I'm missing something? I don't think there is a test case that exercises the path beyond a single threshold traversal, which would presumably have caught this issue.
I think it's really easy to fix though: just reset the success counter when the failure threshold hits, and vice versa. But some thorough table tests would be good here, because if you reset too soon you also break the thresholding effect. I think there are at least 6-8 interesting check histories to enumerate and test to be confident the logic does the right thing in each, so a table test seems like an efficient way to do that.
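A minimal, self-contained sketch of the suggested fix: reset the opposite counter on every result, so a streak must be unbroken to cross a threshold. The thresholdState type and observe method are illustrative names, not the PR's actual code.

package main

import "fmt"

// thresholdState tracks consecutive check results; names are illustrative.
type thresholdState struct {
	successCounter, failuresCounter              int
	successBeforePassing, failuresBeforeCritical int
	status                                       string // last propagated status
}

// observe applies one check result, resetting the opposite counter so the
// threshold applies on every transition, not just the first one.
func (s *thresholdState) observe(pass bool) {
	if pass {
		s.successCounter++
		s.failuresCounter = 0 // a pass breaks any failure streak
		if s.successCounter >= s.successBeforePassing {
			s.status = "passing"
		}
		return
	}
	s.failuresCounter++
	s.successCounter = 0 // a failure breaks any success streak
	if s.failuresCounter >= s.failuresBeforeCritical {
		s.status = "critical"
	}
}

func main() {
	s := &thresholdState{successBeforePassing: 2, failuresBeforeCritical: 2, status: "passing"}
	// The sequence from the review: OK, OK, OK, FAIL, FAIL, FAIL, OK.
	for _, pass := range []bool{true, true, true, false, false, false, true} {
		s.observe(pass)
	}
	fmt.Println(s.status) // "critical": one OK after the FAIL streak is below the threshold
}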
Indeed, the counters were not reset on opposite results as they should be; counters are now set to 0 when updateCheck() is called with the opposite status, plus a test to ensure that sequences such as FAIL PASS FAIL FAIL won't update the status to critical if the threshold is set to 3.
I think the current tests bring good coverage, but if you think there is added value in going for the full table we can go for it, your call.
Thanks for the catch!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree those tests seem to cover the cases. I have a preference for table tests over multiple long story-based tests for cases like this, because it's easier to read the table and see which cases are tested than to keep state in your head as you read, but I think this is fine for now.
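Sketched against the hypothetical thresholdState type above, such a table test might look like this (case names and threshold values are assumptions for illustration, not the PR's actual tests):

func TestThresholdSequences(t *testing.T) {
	cases := []struct {
		name     string
		sequence []bool // true = passing result, false = critical result
		want     string
	}{
		{"stays passing below the failure threshold", []bool{false, false}, "passing"},
		{"goes critical once the failure streak reaches the threshold", []bool{false, false, false}, "critical"},
		{"a pass resets the failure streak (FAIL PASS FAIL FAIL)", []bool{false, true, false, false}, "passing"},
	}
	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			s := &thresholdState{successBeforePassing: 2, failuresBeforeCritical: 3, status: "passing"}
			for _, pass := range tc.sequence {
				s.observe(pass)
			}
			require.Equal(t, tc.want, s.status)
		})
	}
}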
Force-pushed a667e3c to a7f2ea1.
I think this looks great, thanks!
checks return passing/critical. Status will stay identical as before until
the threshold is reached.
This feature is available for HTTP, TCP, gRPC, Docker & Monitor checks.
By default, will be set to 0 for both passing & critical - a single check is needed to update status.
Suggested change:
- By default, will be set to 0 for both passing & critical - a single check is needed to update status.
+ By default, both passing and critical thresholds will be set to 0 so the check status will always reflect the last check result.
@PHBourquin are you able to commit this tweak? If not, I can make it, but I can't commit to your fork, so it makes it a bit more cumbersome.
On it
Thanks, I think this is ready to merge. The integration test failure is a tooling issue from a change in nomad master and is fixed in another adjacent PR; the acceptance tests with released versions of nomad all pass, so this is OK to go.
…failure threshold

A check may be set to become passing/critical only if a specified number of successive checks return passing/critical. Status will stay identical as before until the threshold is reached. This feature is available for HTTP, TCP, gRPC, Docker & Monitor checks.
Force-pushed a7f2ea1 to c40afbf.
Thanks @PHBourquin! This is great.
A check may be set to become critical only if a specified number of successive checks return critical. Status will stay identical as before until the threshold is reached. This feature is available for HTTP, TCP, gRPC, Docker & Monitor checks.
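For context, a hedged sketch of how this surfaces in a check definition: the field names success_before_passing and failures_before_critical match the option names documented for this feature, but the exact shape should be verified against the Consul release in use.

{
  "check": {
    "id": "web-health",
    "name": "HTTP health on port 8080",
    "http": "http://localhost:8080/health",
    "interval": "10s",
    "success_before_passing": 2,
    "failures_before_critical": 3
  }
}

With these values, the check must succeed twice in a row before it is reported passing, and fail three times in a row before it is reported critical.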