Metricbeat: add configurable failure threshold before reporting streams as degraded #41570

Merged on Nov 19, 2024 (10 commits)

pchila (Member) commented Nov 8, 2024

Proposed commit message

Add configurable failure threshold before reporting streams as degraded

This change makes it possible to configure how many consecutive errors may occur while fetching metrics for a given stream before the stream is marked as DEGRADED.
To configure the threshold, add "failure_threshold": <n> to a module configuration block.
The value of <n> determines the behavior:

  • n == 0: status reporting for the stream is disabled; the stream never becomes DEGRADED, no matter how many errors occur while fetching metrics
  • n == 1 or failure_threshold not specified: backward-compatible behavior; the stream becomes DEGRADED at the first error
  • n > 1: the stream becomes DEGRADED once n consecutive errors have been encountered

When a fetch operation completes without errors, the consecutive-error counter is reset and the stream is set to HEALTHY.
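
The threshold rules above can be sketched as a small state tracker. This is only an illustration under stated assumptions, not the code in metricbeat/mb/module/wrapper.go: the names Status, thresholdTracker, newThresholdTracker, observe, and errFetch are all hypothetical, and the caller is assumed to pass 1 when failure_threshold is not specified.

```go
package main

import (
	"errors"
	"fmt"
)

// Status is a hypothetical stand-in for the stream status reported upstream.
type Status string

const (
	Healthy  Status = "HEALTHY"
	Degraded Status = "DEGRADED"
)

// errFetch is a placeholder error used only by this example.
var errFetch = errors.New("fetch failed")

// thresholdTracker is an illustrative type, not the one used by Metricbeat.
type thresholdTracker struct {
	failureThreshold  uint // 0 disables status reporting
	consecutiveErrors uint
	status            Status
}

func newThresholdTracker(threshold uint) *thresholdTracker {
	return &thresholdTracker{failureThreshold: threshold, status: Healthy}
}

// observe records the outcome of one fetch and returns the resulting status.
func (t *thresholdTracker) observe(err error) Status {
	if err == nil {
		// A successful fetch resets the counter and restores HEALTHY.
		t.consecutiveErrors = 0
		t.status = Healthy
		return t.status
	}
	t.consecutiveErrors++
	// With a threshold of 0 this branch is never taken, so the
	// stream can never become DEGRADED.
	if t.failureThreshold > 0 && t.consecutiveErrors >= t.failureThreshold {
		t.status = Degraded
	}
	return t.status
}

func main() {
	t := newThresholdTracker(3)
	outcomes := []error{errFetch, errFetch, nil, errFetch, errFetch, errFetch}
	for i, err := range outcomes {
		fmt.Printf("fetch %d: %s\n", i+1, t.observe(err))
	}
}
```

With a threshold of 3, the success on the third fetch resets the counter, so the stream only becomes DEGRADED on the sixth fetch, after three errors in a row.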

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • [ ] I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • [ ] I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Disruptive User Impact

No disruptive user impact: omitting the new configuration key preserves the previous behavior.


@pchila pchila self-assigned this Nov 8, 2024
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Nov 8, 2024
mergify bot (Contributor) commented Nov 8, 2024

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @pchila? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit

mergify bot (Contributor) commented Nov 8, 2024

backport-8.x has been added to help with the transition to the new branch 8.x.
If you don't need it please use backport-skip label and remove the backport-8.x label.

@mergify mergify bot added the backport-8.x Automated backport to the 8.x branch with mergify label Nov 8, 2024
@pchila pchila added Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team labels Nov 8, 2024
@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Nov 8, 2024
leehinman (Contributor) left a comment

Some optional style comments but overall LGTM.

metricbeat/mb/module/wrapper.go (three outdated, resolved review threads)
@pchila pchila force-pushed the add-failure-threshold-for-streams branch from 76e04f3 to 239eaa5 Compare November 12, 2024 09:02
@pchila pchila added release-note:skip The PR should be ignored when processing the changelog and removed release-note:skip The PR should be ignored when processing the changelog labels Nov 12, 2024
Change the failure threshold to be an unsigned integer:
- if failureThreshold == 0, the feature is deactivated
- if failureThreshold == n, where n > 0, the stream will be marked
  DEGRADED after n consecutive errors

This changes the previous logic that was zero-based, had 2 values
for failing after the first error (0 and 1) and was generally weirder to
look at (to have a stream fail after 3 errors we had to set
failureThreshold=2)
@pchila pchila force-pushed the add-failure-threshold-for-streams branch from 239eaa5 to 4d93fcc Compare November 12, 2024 09:08
@pchila pchila marked this pull request as ready for review November 12, 2024 09:09
@pchila pchila requested a review from a team as a code owner November 12, 2024 09:09
@pchila pchila requested a review from mauri870 November 12, 2024 09:09
elasticmachine (Collaborator) commented

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

elasticmachine (Collaborator) commented

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)


case err != nil:
    reporter.Error(err)
    msw.consecutiveErrors++
Contributor

nit: I am not an expert in the Metricbeat/MetricSets code, but if fetch can be called by multiple goroutines here, should we use atomics? If it shouldn't be called that way, maybe we should extend the fetch godoc to mark it explicitly as unsafe for concurrent use.

pchila (Member, Author)

My understanding from here is that each metricset runs in its own goroutine, so I didn't add any extra synchronization around the counters.
@leehinman, could you have a look? If there is a chance of a race condition, we can easily switch to atomics.

Contributor

Let's move consecutiveErrors to the stats struct in metricSetWrapper and make it a *monitoring.Int; that way we can observe it and it will be atomic.
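
A minimal sketch of the atomic-counter idea discussed in this thread, assuming fetch outcomes may be recorded from several goroutines. It uses atomic.Int64 from the Go standard library directly rather than *monitoring.Int, and the names fetchStats, recordFailure, and recordSuccess are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// fetchStats is a hypothetical stand-in for a stats struct that holds
// the consecutive-error counter.
type fetchStats struct {
	consecutiveErrors atomic.Int64
}

// recordFailure atomically increments the counter and returns the new value.
func (s *fetchStats) recordFailure() int64 {
	return s.consecutiveErrors.Add(1)
}

// recordSuccess atomically resets the counter.
func (s *fetchStats) recordSuccess() {
	s.consecutiveErrors.Store(0)
}

func main() {
	var stats fetchStats
	var wg sync.WaitGroup

	// 100 goroutines each record one failure; no increment is lost
	// because Add is an atomic read-modify-write.
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			stats.recordFailure()
		}()
	}
	wg.Wait()
	fmt.Println(stats.consecutiveErrors.Load()) // prints 100

	stats.recordSuccess()
	fmt.Println(stats.consecutiveErrors.Load()) // prints 0
}
```

A plain integer with `++` would be a data race under the same workload, which is the concern raised in the nit above.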

pchila (Member, Author)

@leehinman I had a look at using a *monitoring.Int (which is just a struct wrapping an atomic.Int64) and was quite surprised to find some of the new unit tests failing.
I then realized that stats keeps state from previous tests thanks to this and the getMetricSetStats() function.

Is it normal that we keep state from a previous wrapper instance, matching only on metricset name? (In the unit tests, the metricSetWrapper is recreated along with the mocks for each test case.) I am not sure we want to do the same for the consecutive errors just because a previous wrapper existed that failed all the time.
Any thoughts? Is remembering the state of previous wrappers of a metricset the standard behavior in Metricbeat?

pchila (Member, Author)

After a quick Zoom call with @leehinman, we determined that the stats struct is shared by design between the metricSetWrapper instances that may run the same metricset on different hosts from the same module config block, so that the success/failure/event counters (and now also consecutiveFailures) are aggregated.
I fixed the tests (correctly releasing the stats structs) in caeafcb.
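
The sharing behavior described above can be illustrated with a reference-counted registry keyed by metricset name. This is a sketch under stated assumptions, not Metricbeat's implementation: statsRegistry, sharedStats, acquire, and release are hypothetical names, and the counter is a plain integer because the example is single-threaded.

```go
package main

import (
	"fmt"
	"sync"
)

// sharedStats is a hypothetical counter set shared by every wrapper
// that runs the same metricset.
type sharedStats struct {
	refs                int
	consecutiveFailures int64
}

// statsRegistry hands out one sharedStats per key and drops it when
// the last user releases it.
type statsRegistry struct {
	mu    sync.Mutex
	byKey map[string]*sharedStats
}

func newStatsRegistry() *statsRegistry {
	return &statsRegistry{byKey: make(map[string]*sharedStats)}
}

// acquire returns the stats for key, creating them on first use.
func (r *statsRegistry) acquire(key string) *sharedStats {
	r.mu.Lock()
	defer r.mu.Unlock()
	s, ok := r.byKey[key]
	if !ok {
		s = &sharedStats{}
		r.byKey[key] = s
	}
	s.refs++
	return s
}

// release drops one reference and forgets the stats when none remain.
func (r *statsRegistry) release(key string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	s, ok := r.byKey[key]
	if !ok {
		return
	}
	s.refs--
	if s.refs <= 0 {
		delete(r.byKey, key)
	}
}

func main() {
	reg := newStatsRegistry()

	first := reg.acquire("system/cpu")
	first.consecutiveFailures = 5

	// A second wrapper for the same metricset sees the shared count.
	second := reg.acquire("system/cpu")
	fmt.Println(second.consecutiveFailures) // prints 5

	// Only after every holder releases does the next acquire start fresh.
	reg.release("system/cpu")
	reg.release("system/cpu")
	fresh := reg.acquire("system/cpu")
	fmt.Println(fresh.consecutiveFailures) // prints 0
}
```

In a design like this, a test that creates a wrapper without releasing its stats leaves the old counters behind for the next test that uses the same metricset name, which matches the test failures described earlier in the thread.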

leehinman (Contributor) left a comment

LGTM

pkoutsovasilis (Contributor) left a comment

LGTM

@pchila pchila merged commit f84c05b into elastic:main Nov 19, 2024
31 checks passed
mergify bot pushed a commit that referenced this pull request Nov 19, 2024
…ms as degraded (#41570)

(cherry picked from commit f84c05b)
pchila added a commit that referenced this pull request Nov 20, 2024
…ms as degraded (#41570) (#41685)

(cherry picked from commit f84c05b)

Co-authored-by: Paolo Chilà <[email protected]>
@cmacknz cmacknz added backport-8.16 Automated backport with mergify backport-8.15 Automated backport to the 8.15 branch with mergify labels Nov 20, 2024
mergify bot pushed a commit that referenced this pull request Nov 20, 2024
…ms as degraded (#41570)

(cherry picked from commit f84c05b)
mergify bot pushed a commit that referenced this pull request Nov 20, 2024
…ms as degraded (#41570)

(cherry picked from commit f84c05b)

# Conflicts:
#	metricbeat/mb/module/wrapper.go
pierrehilbert pushed a commit that referenced this pull request Nov 22, 2024
…ms as degraded (#41570) (#41722)

(cherry picked from commit f84c05b)

Co-authored-by: Paolo Chilà <[email protected]>
Labels

  • backport-8.x: Automated backport to the 8.x branch with mergify
  • backport-8.15: Automated backport to the 8.15 branch with mergify
  • backport-8.16: Automated backport with mergify
  • enhancement
  • Team:Elastic-Agent-Control-Plane: Label for the Agent Control Plane team
  • Team:Elastic-Agent-Data-Plane: Label for the Agent Data Plane team
Development

Successfully merging this pull request may close these issues.

Agent gets unhealthy temporarily because Beat monitoring sockets are not available
5 participants