Metricbeat: add configurable failure threshold before reporting streams as degraded #41570
Merged
pchila merged 10 commits into elastic:main from pchila:add-failure-threshold-for-streams on Nov 19, 2024
Commits (10, all by pchila):

6e28f29 Metricbeat: add configurable failure threshold before reporting strea…
7ce8809 remove async unit test for metribeat stream failureThreshold
7595020 Change failure threshold to unsigned and change the threshold evaluation
50e2336 use switch statement in handleFetchError
4d93fcc Add unit tests for ReportingMetricSetV2WithContext
47c90eb Rename failureThreshold config key to failure_threshold
25bc54e linting
b788b43 Fix imports
b32f635 flip case statements in 'metricsetWrapper.handleFetchError()'
caeafcb move consecutiveFailures counter to metricSetWrapper.stat struct
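Taken together, the commits describe the shape of the change: a per-stream failure_threshold setting and a counter of consecutive fetch failures that is consulted in metricsetWrapper.handleFetchError() before the stream is reported as degraded. The sketch below only illustrates that idea; the types, field names, status values, and reporting callback are stand-ins, not the actual Beats code, and the exact threshold semantics are an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// streamStatus models the two health states discussed in this PR.
// It stands in for the real status-reporting API in Beats.
type streamStatus int

const (
	statusHealthy streamStatus = iota
	statusDegraded
)

// metricSetWrapper is an illustrative stand-in for Metricbeat's wrapper:
// it counts consecutive fetch failures and only reports the stream as
// degraded once the configured threshold of failures has been reached.
type metricSetWrapper struct {
	failureThreshold    uint // value of the failure_threshold config key (semantics for 0 are a guess here)
	consecutiveFailures uint
	report              func(status streamStatus, msg string)
}

func (msw *metricSetWrapper) handleFetchError(err error) {
	switch {
	case err == nil:
		// Successful fetch: reset the counter and report the stream healthy.
		msw.consecutiveFailures = 0
		msw.report(statusHealthy, "")

	default:
		msw.consecutiveFailures++
		if msw.consecutiveFailures < msw.failureThreshold {
			// Below the threshold: tolerate the error without degrading the stream.
			return
		}
		msw.report(statusDegraded, fmt.Sprintf("fetch failed: %v", err))
	}
}

func main() {
	msw := &metricSetWrapper{
		failureThreshold: 3,
		report: func(s streamStatus, msg string) {
			fmt.Println("status:", s, msg)
		},
	}
	// Two failures are tolerated; the third one degrades the stream.
	for i := 0; i < 3; i++ {
		msw.handleFetchError(errors.New("connection refused"))
	}
}
```

The switch mirrors the structure named in the 50e2336 and b32f635 commit messages, but the case ordering and the reset-on-success behavior shown here are assumptions based on those messages, not on the diff itself.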
Reviewer:
nit: I am not an expert in the Metricbeat/MetricSets code, but if fetch is called by multiple goroutines here, shouldn't we go with atomics? If it isn't meant to be called like that, maybe we should enrich the fetch godoc to mark it explicitly as not safe for concurrent use.
pchila:
My understanding from here is that each metricset runs within its own goroutine, so I didn't add any extra synchronization around the counters.
@leehinman can maybe have a look, and if there is a chance of a race condition we can easily switch to atomics.
leehinman:
Let's move the consecutiveErrors counter to the stats struct in metricSetWrapper, then make it a *monitoring.Int; that way we can observe it and it will be atomic.
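For context, a counter like this can be made concurrency-safe without locks by storing it as an atomic value; as the next reply notes, *monitoring.Int is essentially a wrapper around an atomic int64. The sketch below uses sync/atomic directly with illustrative names rather than the libbeat monitoring API, so it shows the idea, not the actual code:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// stats is an illustrative stand-in for the metricSetWrapper stats struct:
// keeping the counter as an atomic value makes concurrent fetches safe and
// lets an external observer read it without extra locking.
type stats struct {
	consecutiveFailures atomic.Int64
}

func (s *stats) recordFetch(err error) {
	if err != nil {
		s.consecutiveFailures.Add(1)
		return
	}
	// Reset on a successful fetch.
	s.consecutiveFailures.Store(0)
}

func main() {
	var s stats
	var wg sync.WaitGroup

	// Simulate several goroutines reporting failures concurrently.
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			s.recordFetch(fmt.Errorf("timeout"))
		}()
	}
	wg.Wait()
	fmt.Println("consecutive failures:", s.consecutiveFailures.Load())
}
```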
pchila:
@leehinman I had a look at using a *monitoring.Int (which is just a struct wrapping an atomic.Int64) and I was quite surprised to find some of the new unit tests failing.
I then realized that stats keeps state from previous tests, thanks to this and the getMetricSetStats() function. Is it normal that we keep state from a previous wrapper instance (in the unit tests the metricsetWrapper is recreated along with the mocks for each test case), just matching on the metricset name? I am not sure we want to do the same for the consecutive-errors part just because a previous wrapper existed that failed all the time...
Any thoughts? Is remembering the state of previous wrappers of a metricset the standard behavior in Metricbeat?
pchila:
After a quick Zoom call with @leehinman we determined that the stat struct is shared by design between different metricsetWrapper instances that may run the same metricset on different hosts from the same module config block, so that the success/failures/events counters (and now also consecutiveFailures) are aggregated.
Fixed the tests (correctly releasing the stats structs) in caeafcb.
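To illustrate why the tests were inheriting counters, the behavior described above amounts to a registry of stats entries keyed by metricset name: every wrapper for the same metricset gets the same struct, and a test has to release its entry to start from a clean slate. The names and the reference-counting details below are illustrative, not the actual getMetricSetStats() implementation:

```go
package main

import (
	"fmt"
	"sync"
)

// metricSetStats aggregates counters across all wrappers of one metricset.
type metricSetStats struct {
	refs     int
	success  int64
	failures int64
}

var (
	statsMu  sync.Mutex
	registry = map[string]*metricSetStats{}
)

// getMetricSetStats returns the shared stats entry for a metricset name,
// creating it on first use. Two wrappers for the same metricset (e.g. the
// same module block pointed at different hosts) get the same struct.
func getMetricSetStats(name string) *metricSetStats {
	statsMu.Lock()
	defer statsMu.Unlock()
	s, ok := registry[name]
	if !ok {
		s = &metricSetStats{}
		registry[name] = s
	}
	s.refs++
	return s
}

// releaseStats drops a reference and deletes the entry once unused, which is
// the kind of cleanup the unit tests needed so a new wrapper does not inherit
// counters from a previous one.
func releaseStats(name string) {
	statsMu.Lock()
	defer statsMu.Unlock()
	if s, ok := registry[name]; ok {
		s.refs--
		if s.refs <= 0 {
			delete(registry, name)
		}
	}
}

func main() {
	// "system/cpu" is just an example key for the shared lookup.
	a := getMetricSetStats("system/cpu")
	b := getMetricSetStats("system/cpu")
	a.failures++
	fmt.Println("shared counter seen by b:", b.failures) // 1: same struct

	releaseStats("system/cpu")
	releaseStats("system/cpu")
	fmt.Println("entries left:", len(registry)) // 0 after both releases
}
```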