-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Metricbeat: add configurable failure threshold before reporting streams as degraded #41570
Metricbeat: add configurable failure threshold before reporting streams as degraded #41570
Conversation
This pull request does not have a backport label.
To fixup this pull request, you need to add the backport labels for the needed
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some optional style comments but overall LGTM.
76e04f3
to
239eaa5
Compare
Change the failure threshold to be an unsigned integer: - if failureThreshold == 0, the feature is deactivated - if failureThreshold == n, where n > 0, the stream will be marked DEGRADED after n consecutive errors This changes the previous logic that was zero-based, had 2 values for failing after the first error (0 and 1) and was generally weirder to look at (to have a stream fail after 3 errors we had to set failureThreshold=2)
239eaa5
to
4d93fcc
Compare
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane) |
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane) |
metricbeat/mb/module/wrapper.go
Outdated
|
||
case err != nil: | ||
reporter.Error(err) | ||
msw.consecutiveErrors++ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I am not an expert in MetricBeat/MetricSets code but if fetch
is called by multiple goroutines here we should go with atomics?! If it shouldn't be called like that maybe we should enrich the fetch godoc to mark it explicitly as a concurrent unsafe
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding from here is that each metricset is run within its own goroutine so I didn't add any extra synchronization around the counters.
@leehinman can maybe have a look and if there's the chance of a race condition we can easily switch to atomics
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets move the consecutiveErrors
to the stats
struct in metricSetWrapper
, then make it a *monitoring.Int
, that way we observe it and it will be atomic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@leehinman I had a look at using a *monitoring.Int (which is just a struct wrapping an atomic.Int64) and I was quite surprised to find some of the new unit tests failing.
I then realized that stats
keeps state from previous tests thanks to this and the getMetricSetStats() function.
Is it normal that we keep state from a previous wrapper instance (in the unit tests the metricsetWrapper is recreated along with the mocks for each testcase) just matching on metricset name ? I am not sure we want to do the same for the consecutive errors part just because a previous wrapper existed that failed all the time...
Any thoughts ? Is remembering previous states for previous wrappers of the metricset the standard behavior in metricbeat ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After a quick zoom with @leehinman we determined that the stat
struct is shared by design between different metricsetWrapper
that may run the same metricset on different hosts from the same module
config block to aggregate the success/failures/events counters (and now also the consecutiveFailures)
Fixed the tests (releasing correctly the stats structs) in caeafcb
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
…ms as degraded (#41570) * Metricbeat: add configurable failure threshold before reporting streams as degraded With this change it is possible to configure a threshold for the number of consecutive errors that may happen while fetching metrics for a given stream before the stream gets marked as DEGRADED. To configure such threshold, add a "failure_threshold": <n> to a module configuration block. Depending on the value of <n> the threshold will be configured in different ways: n == 0: status reporting for the stream has been disabled, the stream will never become DEGRADED no matter how many errors are encountered while fetching metrics n==1 or failure_threshold not specified: backward compatible behavior, the stream will become DEGRADED at the first error encountered n > 1: stream will become DEGRADED after at least n consecutive errors have been encountered When a fetch operation completes without errors the consecutive errors counter is reset and the stream is set to HEALTHY. (cherry picked from commit f84c05b)
…ms as degraded (#41570) (#41685) * Metricbeat: add configurable failure threshold before reporting streams as degraded With this change it is possible to configure a threshold for the number of consecutive errors that may happen while fetching metrics for a given stream before the stream gets marked as DEGRADED. To configure such threshold, add a "failure_threshold": <n> to a module configuration block. Depending on the value of <n> the threshold will be configured in different ways: n == 0: status reporting for the stream has been disabled, the stream will never become DEGRADED no matter how many errors are encountered while fetching metrics n==1 or failure_threshold not specified: backward compatible behavior, the stream will become DEGRADED at the first error encountered n > 1: stream will become DEGRADED after at least n consecutive errors have been encountered When a fetch operation completes without errors the consecutive errors counter is reset and the stream is set to HEALTHY. (cherry picked from commit f84c05b) Co-authored-by: Paolo Chilà <[email protected]>
…ms as degraded (#41570) * Metricbeat: add configurable failure threshold before reporting streams as degraded With this change it is possible to configure a threshold for the number of consecutive errors that may happen while fetching metrics for a given stream before the stream gets marked as DEGRADED. To configure such threshold, add a "failure_threshold": <n> to a module configuration block. Depending on the value of <n> the threshold will be configured in different ways: n == 0: status reporting for the stream has been disabled, the stream will never become DEGRADED no matter how many errors are encountered while fetching metrics n==1 or failure_threshold not specified: backward compatible behavior, the stream will become DEGRADED at the first error encountered n > 1: stream will become DEGRADED after at least n consecutive errors have been encountered When a fetch operation completes without errors the consecutive errors counter is reset and the stream is set to HEALTHY. (cherry picked from commit f84c05b)
…ms as degraded (#41570) * Metricbeat: add configurable failure threshold before reporting streams as degraded With this change it is possible to configure a threshold for the number of consecutive errors that may happen while fetching metrics for a given stream before the stream gets marked as DEGRADED. To configure such threshold, add a "failure_threshold": <n> to a module configuration block. Depending on the value of <n> the threshold will be configured in different ways: n == 0: status reporting for the stream has been disabled, the stream will never become DEGRADED no matter how many errors are encountered while fetching metrics n==1 or failure_threshold not specified: backward compatible behavior, the stream will become DEGRADED at the first error encountered n > 1: stream will become DEGRADED after at least n consecutive errors have been encountered When a fetch operation completes without errors the consecutive errors counter is reset and the stream is set to HEALTHY. (cherry picked from commit f84c05b) # Conflicts: # metricbeat/mb/module/wrapper.go
…ms as degraded (#41570) (#41722) * Metricbeat: add configurable failure threshold before reporting streams as degraded With this change it is possible to configure a threshold for the number of consecutive errors that may happen while fetching metrics for a given stream before the stream gets marked as DEGRADED. To configure such threshold, add a "failure_threshold": <n> to a module configuration block. Depending on the value of <n> the threshold will be configured in different ways: n == 0: status reporting for the stream has been disabled, the stream will never become DEGRADED no matter how many errors are encountered while fetching metrics n==1 or failure_threshold not specified: backward compatible behavior, the stream will become DEGRADED at the first error encountered n > 1: stream will become DEGRADED after at least n consecutive errors have been encountered When a fetch operation completes without errors the consecutive errors counter is reset and the stream is set to HEALTHY. (cherry picked from commit f84c05b) Co-authored-by: Paolo Chilà <[email protected]>
Proposed commit message
Add configurable failure threshold before reporting streams as degraded
With this change it is possible to configure a threshold for the number of consecutive errors that may happen while fetching metrics for a given stream before the stream gets marked as DEGRADED.
To configure such threshold, add a
"failure_threshold": <n>
to a module configuration block.Depending on the value of
<n>
the threshold will be configured in different ways:failure_threshold
not specified: backward compatible behavior, the stream will become DEGRADED at the first error encounteredn
consecutive errors have been encounteredWhen a
fetch
operation completes without errors the consecutive errors counter is reset and the stream is set to HEALTHY.Checklist
[ ] I have made corresponding change to the default configuration files[ ] I have added an entry inCHANGELOG.next.asciidoc
orCHANGELOG-developer.next.asciidoc
.Disruptive User Impact
No disruptive user impact since not specifying the new configuration key maintains the previous behavior
Author's Checklist
How to test this PR locally
Related issues
Use cases
Screenshots
Logs