Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Alerting] Active alerts do not recover after re-enabling a rule #111671

Merged

Conversation

YulNaumenko
Copy link
Contributor

@YulNaumenko YulNaumenko commented Sep 9, 2021

Summary

Partially resolves #110080
Based on the multiple discussions about, how alerting framework should resolve the alert instances in the case when the rule was disabled, the decision was made:
to move forward with the recovery option for the short term, with a solution that will not trigger resolve-on-recovery for disabled alerts that are set up to resolve incidents on recovery.

Checklist

@YulNaumenko
Copy link
Contributor Author

@elasticmachine merge upstream

@YulNaumenko YulNaumenko added v7.16.0 v8.0.0 Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) Feature:Alerting/RulesFramework Issues related to the Alerting Rules Framework release_note:skip Skip the PR/issue when compiling release notes bug Fixes for quality problems that affect the customer experience labels Oct 13, 2021
@YulNaumenko YulNaumenko marked this pull request as ready for review October 13, 2021 23:58
@YulNaumenko YulNaumenko requested a review from a team as a code owner October 13, 2021 23:58
@elasticmachine
Copy link
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@YulNaumenko
Copy link
Contributor Author

@pmuellr do you think the additional event log flag could be introduced as a separate issue or it is possible to be added in the current scope?

Copy link
Contributor

@ymao1 ymao1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! I verified this works as expected by

  • creating a rule that will have active alerts, let it run
  • disable the rule
  • verify event log docs for recovered-instance are written for each active instance
  • edit the rule so that the condition will no longer be met
  • enable the rule and see Recovered in the Rule Details view

I also tried

  • creating a rule that will have active alerts, let it run
  • disable the rule
  • verify event log docs for recovered-instance are written for each active instance
  • enable the rule (without editing) and I see Recovered in the Rule Details view, but they quickly change back to Active.

I guess in the second scenario if the rule takes a while to execute, you would see Recovered in the UI for the execution duration, which may be a little confusing? But I think that is ok.

Can we add functional tests for this?

Copy link
Contributor

@ymao1 ymao1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Works as expected

@pmuellr
Copy link
Member

pmuellr commented Oct 14, 2021

do you think the additional event log flag could be introduced as a separate issue or it is possible to be added in the current scope?

I assume a new property in the event log mappings? This would be something we add to the recovered events to indicate that they are "recovered" because the rule was disabled? We'd have to come up with a name and location for it in the doc, etc. Something like kibana.alerting.recovered_because_disabled - doesn't feel right, maybe we need a recovered_reason field or something? Feels like we need some design work to figure out the right way to indicate this in the event log

One thing we could do short-term to make these mostly searchable in the event log, would be to update the message field for recovered events. For instance, add some "easily searchable via KQL" text in the message like recovered due to alert being disabled, or similar.

Copy link
Member

@pmuellr pmuellr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM; love the new createAlertEventLogRecordObject() so we can have a single place (or at least fewer places) to update when we add new fields ...

@YulNaumenko
Copy link
Contributor Author

@elasticmachine merge upstream

@YulNaumenko
Copy link
Contributor Author

@elasticmachine merge upstream

@YulNaumenko
Copy link
Contributor Author

@elasticmachine merge upstream

@kibanamachine
Copy link
Contributor

💚 Build Succeeded

Metrics [docs]

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @YulNaumenko

@YulNaumenko YulNaumenko merged commit 84df569 into elastic:master Oct 18, 2021
jloleysens added a commit to jloleysens/kibana that referenced this pull request Oct 18, 2021
…-migrate-away-from-injected-css-js

* 'master' of github.com:elastic/kibana: (237 commits)
  [Uptime] Added uptime query inspector panel (elastic#115170)
  [Osquery] Add packs (elastic#107345)
  [App Search] Allow for query parameter to indicate ingestion mechanism for new engines (elastic#115188)
  [Alerting] Active alerts do not recover after re-enabling a rule (elastic#111671)
  skip flaky tests.  elastic#115308, elastic#115313
  [Breaking] Remove deprecated `enabled` settings from plugins. (elastic#113495)
  skip flaky suite.  elastic#107057
  skip flaky tests. elastic#89052, elastic#113418, elastic#115304
  skip flaky test. elastic#113892
  Bump node to 16.11.1 (elastic#110684)
  [Security Solution] Restores Alerts table local storage persistence and the Remove Column action (elastic#114742)
  skip flaky suite.  elastic#115130
  one line remove assert (elastic#115127)
  Fixes migration bug where I was deleting attributes (elastic#115098)
  [Security Solutions] Fixes the newer notification system throttle resets and enabling immediate execution on first detection of a signal  (elastic#114214)
  [build] Dockerfile update (elastic#115237)
  Fixes Cypress flake cypress test (elastic#115270)
  Disable APM e2e tests
  log an invalid type for SO (elastic#115175)
  [Fleet] Don't auto upgrade policies for AUTO_UPDATE packages (elastic#115199)
  ...

# Conflicts:
#	src/plugins/dashboard/public/application/dashboard_app.tsx
#	src/plugins/dashboard/public/types.ts
#	x-pack/plugins/reporting/server/lib/layouts/print_layout.ts
jloleysens added a commit to jloleysens/kibana that referenced this pull request Oct 18, 2021
…-link-to-kibana-app

* 'master' of github.com:elastic/kibana: (287 commits)
  [Security Solution][Endpoint] Change `trustedAppByPolicyEnabled` flag to `true` by default (elastic#115264)
  [APM] generator: support error events and application metrics (elastic#115311)
  [kibanaUtils] Don't import full `semver` client side (elastic#114986)
  [RAC] Link inventory alerts to the right inventory view (elastic#113553)
  [Uptime] Added uptime query inspector panel (elastic#115170)
  [Osquery] Add packs (elastic#107345)
  [App Search] Allow for query parameter to indicate ingestion mechanism for new engines (elastic#115188)
  [Alerting] Active alerts do not recover after re-enabling a rule (elastic#111671)
  skip flaky tests.  elastic#115308, elastic#115313
  [Breaking] Remove deprecated `enabled` settings from plugins. (elastic#113495)
  skip flaky suite.  elastic#107057
  skip flaky tests. elastic#89052, elastic#113418, elastic#115304
  skip flaky test. elastic#113892
  Bump node to 16.11.1 (elastic#110684)
  [Security Solution] Restores Alerts table local storage persistence and the Remove Column action (elastic#114742)
  skip flaky suite.  elastic#115130
  one line remove assert (elastic#115127)
  Fixes migration bug where I was deleting attributes (elastic#115098)
  [Security Solutions] Fixes the newer notification system throttle resets and enabling immediate execution on first detection of a signal  (elastic#114214)
  [build] Dockerfile update (elastic#115237)
  ...

# Conflicts:
#	x-pack/plugins/reporting/public/management/__snapshots__/report_listing.test.tsx.snap
YulNaumenko added a commit that referenced this pull request Oct 18, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Fixes for quality problems that affect the customer experience Feature:Alerting/RulesFramework Issues related to the Alerting Rules Framework release_note:skip Skip the PR/issue when compiling release notes Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v7.16.0 v8.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Active alerts do not recover after re-enabling a rule
5 participants