Metricbeat: add configurable failure threshold before reporting streams as degraded #41570

pchila · 2024-11-08T16:11:03Z

Proposed commit message

Add configurable failure threshold before reporting streams as degraded

This change makes it possible to configure how many consecutive errors may occur while fetching metrics for a given stream before the stream is marked as DEGRADED.
To configure the threshold, add "failure_threshold": <n> to a module configuration block.
The value of <n> determines the behavior:

n == 0: status reporting for the stream is disabled; the stream never becomes DEGRADED, no matter how many errors are encountered while fetching metrics
n == 1 or failure_threshold not specified: backward-compatible behavior; the stream becomes DEGRADED at the first error encountered
n > 1: the stream becomes DEGRADED after at least n consecutive errors have been encountered

When a fetch operation completes without errors the consecutive errors counter is reset and the stream is set to HEALTHY.
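
The rules above can be sketched as a small state machine. This is only an illustration of the described semantics, not the code in metricbeat/mb/module/wrapper.go: the type thresholdTracker, its Observe method, the Status constants and errFetch are hypothetical names, and the sketch assumes the caller has already defaulted an unspecified failure_threshold to 1.

```go
package main

import (
	"errors"
	"fmt"
)

// Status is a hypothetical stand-in for the stream health states named in the PR.
type Status string

const (
	Healthy  Status = "HEALTHY"
	Degraded Status = "DEGRADED"
)

// errFetch simulates a failed fetch (illustrative only).
var errFetch = errors.New("fetch failed")

// thresholdTracker is an illustrative type, not the Metricbeat implementation.
type thresholdTracker struct {
	failureThreshold  uint // 0 disables status reporting
	consecutiveErrors uint
	status            Status
}

// newThresholdTracker assumes the caller already mapped "not specified" to 1.
func newThresholdTracker(threshold uint) *thresholdTracker {
	return &thresholdTracker{failureThreshold: threshold, status: Healthy}
}

// Observe records the outcome of one fetch and returns the resulting status.
func (t *thresholdTracker) Observe(err error) Status {
	switch {
	case err == nil:
		// A successful fetch resets the counter and marks the stream healthy.
		t.consecutiveErrors = 0
		t.status = Healthy
	case t.failureThreshold == 0:
		// Reporting disabled: errors never degrade the stream.
	default:
		t.consecutiveErrors++
		if t.consecutiveErrors >= t.failureThreshold {
			t.status = Degraded
		}
	}
	return t.status
}

func main() {
	t := newThresholdTracker(3)
	outcomes := []error{errFetch, errFetch, nil, errFetch, errFetch, errFetch}
	for i, err := range outcomes {
		fmt.Printf("fetch %d: %s\n", i+1, t.Observe(err))
	}
}
```

With a threshold of 3, the two errors before the successful fetch do not count towards the three needed later, so only the sixth fetch reports DEGRADED.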

Checklist

My code follows the style guidelines of this project
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
~~[ ] I have made corresponding change to the default configuration files~~
I have added tests that prove my fix is effective or that my feature works
~~[ ] I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.~~

Disruptive User Impact

No disruptive user impact: omitting the new configuration key preserves the previous behavior.

Related issues

Relates to: Agent gets unhealthy temporarily because Beat monitoring sockets are not available elastic/elastic-agent#5332

mergify · 2024-11-08T16:11:41Z

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @pchila? 🙏.
For such, you'll need to label your PR with:

The upcoming major version of the Elastic Stack
The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit

mergify · 2024-11-08T16:11:41Z

backport-8.x has been added to help with the transition to the new branch 8.x.
If you don't need it please use backport-skip label and remove the backport-8.x label.

leehinman

Some optional style comments but overall LGTM.

metricbeat/mb/module/wrapper.go

…ms as degraded

Change the failure threshold to be an unsigned integer:
- if failureThreshold == 0, the feature is deactivated
- if failureThreshold == n, where n > 0, the stream will be marked DEGRADED after n consecutive errors

This changes the previous logic that was zero-based, had 2 values for failing after the first error (0 and 1) and was generally weirder to look at (to have a stream fail after 3 errors we had to set failureThreshold=2)

elasticmachine · 2024-11-12T09:09:27Z

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

elasticmachine · 2024-11-12T09:09:28Z

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

pkoutsovasilis · 2024-11-12T10:41:16Z

metricbeat/mb/module/wrapper.go

+
+	case err != nil:
+		reporter.Error(err)
+		msw.consecutiveErrors++


nit: I am not an expert in Metricbeat/MetricSets code, but if fetch can be called by multiple goroutines here, shouldn't we use atomics? If it is not meant to be called that way, maybe we should extend the fetch godoc to mark it explicitly as not safe for concurrent use.

pchila · 2024-11-12

My understanding from here is that each metricset runs within its own goroutine, so I didn't add any extra synchronization around the counters.
@leehinman can maybe have a look, and if there is a chance of a race condition we can easily switch to atomics.

leehinman · 2024-11-14

Let's move consecutiveErrors to the stats struct in metricSetWrapper and make it a *monitoring.Int; that way we can observe it and it will be atomic.

pchila · 2024-11-18

@leehinman I had a look at using a *monitoring.Int (which is just a struct wrapping an atomic.Int64) and I was quite surprised to find some of the new unit tests failing.
I then realized that stats keeps state from previous tests thanks to this and the getMetricSetStats() function.

Is it normal that we keep state from a previous wrapper instance, matching only on metricset name? (In the unit tests the metricsetWrapper is recreated along with the mocks for each test case.) I am not sure we want to do the same for the consecutive errors counter just because a previous wrapper existed that failed all the time...
Any thoughts? Is remembering the state of previous wrappers of the same metricset the standard behavior in Metricbeat?

pchila · 2024-11-18

After a quick Zoom call with @leehinman we determined that the stats struct is shared by design between different metricsetWrapper instances, which may run the same metricset on different hosts from the same module config block, in order to aggregate the success/failures/events counters (and now also consecutiveFailures).
Fixed the tests (correctly releasing the stats structs) in caeafcb
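
The sharing behavior discussed in this thread can be illustrated with a minimal sketch. It is not the Metricbeat code: sharedStats, acquireStats, releaseStats and the "system/cpu" key are hypothetical names, the reference counting is an assumption about how releasing works, and the counter uses atomic.Int64 from sync/atomic (Go 1.19+) as a stand-in for *monitoring.Int.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// sharedStats is a hypothetical stand-in for the per-metricset stats struct.
type sharedStats struct {
	refs                int          // guarded by mu
	consecutiveFailures atomic.Int64 // safe to update from several goroutines
}

var (
	mu       sync.Mutex
	registry = map[string]*sharedStats{}
)

// acquireStats returns the stats registered under name, creating them on first use.
func acquireStats(name string) *sharedStats {
	mu.Lock()
	defer mu.Unlock()
	s, ok := registry[name]
	if !ok {
		s = &sharedStats{}
		registry[name] = s
	}
	s.refs++
	return s
}

// releaseStats drops one reference and forgets the stats once nobody uses them.
func releaseStats(name string) {
	mu.Lock()
	defer mu.Unlock()
	s, ok := registry[name]
	if !ok {
		return
	}
	s.refs--
	if s.refs <= 0 {
		delete(registry, name)
	}
}

func main() {
	// Two wrappers for the same metricset share one stats struct.
	a := acquireStats("system/cpu")
	b := acquireStats("system/cpu")

	var wg sync.WaitGroup
	for _, s := range []*sharedStats{a, b} {
		wg.Add(1)
		go func(s *sharedStats) {
			defer wg.Done()
			for i := 0; i < 1000; i++ {
				s.consecutiveFailures.Add(1)
			}
		}(s)
	}
	wg.Wait()
	fmt.Println(a.consecutiveFailures.Load()) // prints 2000

	// Releasing every reference means a later wrapper starts from a clean state.
	releaseStats("system/cpu")
	releaseStats("system/cpu")
	c := acquireStats("system/cpu")
	fmt.Println(c.consecutiveFailures.Load()) // prints 0
}
```

In this sketch, a test that creates a wrapper without releasing its stats would leave the counter behind for the next test using the same name, which matches the kind of failure described above.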

leehinman

LGTM

pkoutsovasilis

LGTM

…ms as degraded (#41570)
* Metricbeat: add configurable failure threshold before reporting streams as degraded
(cherry picked from commit f84c05b)

…ms as degraded (#41570) (#41685)
* Metricbeat: add configurable failure threshold before reporting streams as degraded
(cherry picked from commit f84c05b)
Co-authored-by: Paolo Chilà

…ms as degraded (#41570)
* Metricbeat: add configurable failure threshold before reporting streams as degraded
(cherry picked from commit f84c05b)
# Conflicts:
# metricbeat/mb/module/wrapper.go

…ms as degraded (#41570) (#41722)
* Metricbeat: add configurable failure threshold before reporting streams as degraded
(cherry picked from commit f84c05b)
Co-authored-by: Paolo Chilà

pchila added the enhancement label Nov 8, 2024

pchila requested review from pkoutsovasilis, leehinman and VihasMakwana November 8, 2024 16:11

pchila self-assigned this Nov 8, 2024

botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Nov 8, 2024

mergify bot added the backport-8.x Automated backport to the 8.x branch with mergify label Nov 8, 2024

pchila added Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team labels Nov 8, 2024

botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Nov 8, 2024

leehinman reviewed Nov 8, 2024

pchila added 2 commits November 12, 2024 10:02

Metricbeat: add configurable failure threshold before reporting strea…

6e28f29

…ms as degraded

remove async unit test for metribeat stream failureThreshold

7ce8809

pchila force-pushed the add-failure-threshold-for-streams branch from 76e04f3 to 239eaa5 Compare November 12, 2024 09:02

pchila added release-note:skip The PR should be ignored when processing the changelog and removed release-note:skip The PR should be ignored when processing the changelog labels Nov 12, 2024

pchila added 3 commits November 12, 2024 10:08

use switch statement in handleFetchError

50e2336

Add unit tests for ReportingMetricSetV2WithContext

4d93fcc

pchila force-pushed the add-failure-threshold-for-streams branch from 239eaa5 to 4d93fcc Compare November 12, 2024 09:08

pchila marked this pull request as ready for review November 12, 2024 09:09

pchila requested a review from a team as a code owner November 12, 2024 09:09

pchila requested a review from mauri870 November 12, 2024 09:09

pchila added 2 commits November 12, 2024 11:10

Rename failureThreshold config key to failure_threshold

47c90eb

linting

25bc54e

pkoutsovasilis reviewed Nov 12, 2024

pchila mentioned this pull request Nov 12, 2024

Add failureThreshold to elastic-agent self-monitoring config elastic/elastic-agent#5999

Merged

3 tasks

pchila added 2 commits November 12, 2024 15:35

Fix imports

b788b43

flip case statements in 'metricsetWrapper.handleFetchError()'

b32f635

pchila requested a review from leehinman November 13, 2024 14:47

pchila linked an issue Nov 14, 2024 that may be closed by this pull request

Agent gets unhealthy temporarily because Beat monitoring sockets are not available elastic/elastic-agent#5332

Closed

move consecutiveFailures counter to metricSetWrapper.stat struct

caeafcb

leehinman approved these changes Nov 18, 2024

pkoutsovasilis approved these changes Nov 19, 2024

pchila merged commit f84c05b into elastic:main Nov 19, 2024
31 checks passed

This was referenced Nov 19, 2024

[8.x](backport #41570) Metricbeat: add configurable failure threshold before reporting streams as degraded #41685

Merged

[8.x](backport #5999) Add failureThreshold to elastic-agent self-monitoring config elastic/elastic-agent#6090

Merged

cmacknz added backport-8.16 Automated backport with mergify backport-8.15 Automated backport to the 8.15 branch with mergify labels Nov 20, 2024

mergify bot mentioned this pull request Nov 20, 2024

[8.15](backport #41570) Metricbeat: add configurable failure threshold before reporting streams as degraded #41723

Closed

4 tasks
