[POC] [Response Ops] Onboard detection rules to use alerting framework summaries. #147539

Closed
ymao1 wants to merge 34 commits

ymao1 (Contributor) commented on Dec 14, 2022

Towards #147379

Summary

POC to remove custom notification scheduling from detection rules in favor of reporting alerts 1-to-1 back to the alerting platform and using the new alert summaries feature.

In this POC:

  • updated the detection rule creation code to inject the correct alert summarization options
  • updated the persistence rule type wrapper to report alerts back to the framework
  • updated the alerting framework to optimize the alerting task runner for persistent (non-lifecycle) rule types
    • framework does not calculate recovery alerts for persistent rule types
    • framework does not serialize alerts into task manager state for persistent rule types
  • added a custom alias for detection rule context variables to support backward compatibility with existing notification templates

Not in this POC:

  • no migration of existing detection rules

ymao1 added the skip-ci label on Dec 14, 2022
kibana-ci (Collaborator) commented on Dec 14, 2022

💔 Build Failed

Failed CI Steps

Test Failures

  • [job] [logs] Rules, Alerts and Exceptions ResponseOps Cypress Tests on Security Solution / Alerts timeline Add a non-empty property to default timeline
  • [job] [logs] FTR Configs #32 / detection engine api security and spaces enabled - Group 1 find_rules should be able to find a scheduled action correctly
  • [job] [logs] FTR Configs #26 / detection engine api security and spaces enabled - Group 10 migrate_legacy_actions migrates legacy actions for rule with action run daily
  • [job] [logs] FTR Configs #26 / detection engine api security and spaces enabled - Group 10 migrate_legacy_actions migrates legacy actions for rule with action run hourly
  • [job] [logs] FTR Configs #26 / detection engine api security and spaces enabled - Group 10 migrate_legacy_actions migrates legacy actions for rule with action run on every run
  • [job] [logs] FTR Configs #26 / detection engine api security and spaces enabled - Group 10 migrate_legacy_actions migrates legacy actions for rule with action run weekly
  • [job] [logs] FTR Configs #26 / detection engine api security and spaces enabled - Group 10 patch_rules patch rules should return the rule with migrated actions after the enable patch
  • [job] [logs] FTR Configs #26 / detection engine api security and spaces enabled - Group 10 patch_rules_bulk patch rules bulk should bulk disable two rules and migrate their actions
  • [job] [logs] FTR Configs #26 / detection engine api security and spaces enabled - Group 10 read_rules reading rules should be able to a read a scheduled action correctly
  • [job] [logs] FTR Configs #26 / detection engine api security and spaces enabled - Group 10 throttle adding actions creating a rule When creating throttle with "1h" set and actions set, the rule should have its kibana alerting "mute_all" set to "false" and notify_when set to "onThrottleInterval"
  • [job] [logs] FTR Configs #26 / detection engine api security and spaces enabled - Group 10 throttle adding actions creating a rule When creating throttle with "1h" set and no actions, the rule should have its kibana alerting "mute_all" set to "false" and notify_when set to "onThrottleInterval"
  • [job] [logs] FTR Configs #26 / detection engine api security and spaces enabled - Group 10 update_rules update rules should update a single rule property of name using an auto-generated rule_id and migrate the actions
  • [job] [logs] FTR Configs #26 / detection engine api security and spaces enabled - Group 10 update_rules_bulk update rules bulk should update two rule properties of name using the two rules rule_id and migrate actions
  • [job] [logs] Jest Tests #8 / patchRules regression tests updates the rule's actions if provided
  • [job] [logs] Jest Tests #8 / Task Runner actionsPlugin.execute is called per alert alert that is scheduled
  • [job] [logs] Jest Tests #8 / Task Runner actionsPlugin.execute is called per alert alert that is scheduled (with ephemeral support)
  • [job] [logs] Jest Tests #8 / Task Runner actionsPlugin.execute is called when notifyWhen=onActionGroupChange and alert state has changed
  • [job] [logs] Jest Tests #8 / Task Runner actionsPlugin.execute is called when notifyWhen=onActionGroupChange and alert state has changed (with ephemeral support)
  • [job] [logs] Jest Tests #8 / Task Runner actionsPlugin.execute is not called when notifyWhen=onActionGroupChange and alert state does not change
  • [job] [logs] Jest Tests #8 / Task Runner actionsPlugin.execute is skipped if muteAll is true
  • [job] [logs] Jest Tests #8 / Task Runner fire actions under a custom recovery group when specified on an alert type for alertInstances which are in the recovered state
  • [job] [logs] Jest Tests #8 / Task Runner fire actions under a custom recovery group when specified on an alert type for alertInstances which are in the recovered state (with ephemeral support)
  • [job] [logs] Jest Tests #8 / Task Runner fire recovered actions for execution for the alertInstances which is in the recovered state
  • [job] [logs] Jest Tests #8 / Task Runner fire recovered actions for execution for the alertInstances which is in the recovered state (with ephemeral support)
  • [job] [logs] Jest Tests #8 / Task Runner includes the apiKey in the request used to initialize the actionsClient
  • [job] [logs] Jest Tests #8 / Task Runner includes the apiKey in the request used to initialize the actionsClient (with ephemeral support)
  • [job] [logs] Jest Tests #8 / Task Runner should skip alertInstances which weren't active on the previous execution
  • [job] [logs] Jest Tests #8 / Task Runner should skip alertInstances which weren't active on the previous execution (with ephemeral support)
  • [job] [logs] Jest Tests #8 / Task Runner skips firing actions for active alert if alert is muted
  • [job] [logs] Jest Tests #8 / Task Runner skips firing actions for active alert if alert is muted (with ephemeral support)
  • [job] [logs] Jest Tests #8 / Task Runner triggers summary actions (Custom Frequency)
  • [job] [logs] Jest Tests #8 / Task Runner triggers summary actions (Per rule run)

Metrics [docs]

Page load bundle

Size of the bundles that are downloaded on every page load. Target size is below 100kb

id before after diff
alerting 39.4KB 39.5KB +22.0B

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@@ -123,7 +132,7 @@ function processAlertsHelper<
updateAlertFlappingHistory(activeAlerts[id], false);
}
}
- } else if (existingAlertIds.has(id)) {
+ } else if (existingAlertIds.has(id) && autoRecoverAlerts) {

Skips the recovered-alert calculation for persistent alert rule types (a slight time optimization when running a rule).
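
As a rough illustration of the gate described in this comment, the sketch below categorizes alert IDs and only derives recovered alerts when `autoRecoverAlerts` is set. The `categorizeAlerts` function and `CategorizedAlerts` shape are hypothetical simplifications, not the framework's actual `processAlertsHelper`; only the `autoRecoverAlerts` flag and the `existingAlertIds.has(id)` check come from the diff.

```typescript
// Hypothetical, simplified stand-in for the framework's alert categorization.
interface CategorizedAlerts {
  newIds: string[];
  ongoingIds: string[];
  recoveredIds: string[];
}

function categorizeAlerts(
  reportedIds: ReadonlySet<string>, // alerts reported during this rule run
  existingIds: ReadonlySet<string>, // alerts tracked from the previous run
  autoRecoverAlerts: boolean // assumed to be false for persistent rule types
): CategorizedAlerts {
  const result: CategorizedAlerts = { newIds: [], ongoingIds: [], recoveredIds: [] };

  reportedIds.forEach((id) => {
    if (existingIds.has(id)) {
      result.ongoingIds.push(id);
    } else {
      result.newIds.push(id);
    }
  });

  // Recovery is only computed when the rule type opts in; persistent rule
  // types skip this pass entirely.
  if (autoRecoverAlerts) {
    existingIds.forEach((id) => {
      if (!reportedIds.has(id)) {
        result.recoveredIds.push(id);
      }
    });
  }

  return result;
}
```

With `autoRecoverAlerts` set to `false`, the second pass over previously tracked alerts never runs, which is where the small time saving would come from.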

@@ -197,6 +207,13 @@ export interface RuleType<
cancelAlertsOnRuleTimeout?: boolean;
doesSetRecoveryContext?: boolean;
getSummarizedAlerts?: GetSummarizedAlertsFn;
getRuleUrl?: GetRuleUrlFn<Params>;

Allows rule types to specify a custom function for building rule URLs.
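
A minimal sketch of how an optional per-rule-type URL builder could be resolved, assuming a fallback when the rule type does not define one. The parameter shape (`RuleUrlParams`), the `buildRuleUrl` helper, and both URL paths are hypothetical; the PR only shows that `getRuleUrl?: GetRuleUrlFn<Params>` was added to `RuleType`.

```typescript
// Hypothetical parameter shape; the real GetRuleUrlFn signature is not shown in this PR excerpt.
interface RuleUrlParams {
  baseUrl: string;
  ruleId: string;
  start?: string; // optional time bounds, e.g. ISO 8601 strings
  end?: string;
}

type GetRuleUrlFn = (params: RuleUrlParams) => string;

interface RuleTypeLike {
  id: string;
  getRuleUrl?: GetRuleUrlFn;
}

// Illustrative default path, not a real Kibana route.
const defaultRuleUrl: GetRuleUrlFn = ({ baseUrl, ruleId }) =>
  `${baseUrl}/rules/${encodeURIComponent(ruleId)}`;

// Prefer the rule type's own builder and fall back to the default otherwise.
function buildRuleUrl(ruleType: RuleTypeLike, params: RuleUrlParams): string {
  return (ruleType.getRuleUrl ?? defaultRuleUrl)(params);
}

// Example rule type with a custom builder (illustrative path and query parameters).
const exampleDetectionRuleType: RuleTypeLike = {
  id: 'example.detection',
  getRuleUrl: ({ baseUrl, ruleId, start, end }) => {
    const url = `${baseUrl}/detections/rules/${encodeURIComponent(ruleId)}`;
    return start && end
      ? `${url}?from=${encodeURIComponent(start)}&to=${encodeURIComponent(end)}`
      : url;
  },
};
```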

...alerts,
all: {
count: total,
data: [...alerts.new.data, ...alerts.ongoing.data, ...alerts.recovered.data],
},
};

if (summarizedAlerts.all.count > 0) {

Returns the time bounds for the summarized alerts, which can be used for building rule URLs. This ensures that the time bounds used to load alerts in the UI match the time bounds of the alert summary, so the alert counts match.

The time bounds are calculated from the summarized alerts instead of reusing existing time bounds because, when we query for alerts per rule execution UUID, there are no existing time bounds.
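
One way to derive such bounds is to take the earliest and latest alert timestamps. The sketch below assumes each summarized alert carries an ISO 8601 `timestamp` string; that field name, the `SummarizedAlert` shape, and the `getTimeBounds` helper are assumptions for illustration, not the code in this PR.

```typescript
// Hypothetical alert shape: only a parseable timestamp is assumed.
interface SummarizedAlert {
  timestamp: string;
}

interface TimeBounds {
  start: string;
  end: string;
}

// Returns the earliest and latest alert timestamps, or undefined when no
// alert has a parseable timestamp.
function getTimeBounds(alerts: readonly SummarizedAlert[]): TimeBounds | undefined {
  let min = Infinity;
  let max = -Infinity;

  for (const alert of alerts) {
    const millis = Date.parse(alert.timestamp);
    if (isNaN(millis)) {
      continue; // skip alerts with an unparseable timestamp
    }
    min = Math.min(min, millis);
    max = Math.max(max, millis);
  }

  if (min === Infinity) {
    return undefined;
  }
  return { start: new Date(min).toISOString(), end: new Date(max).toISOString() };
}
```

Because the bounds come from the same alerts that feed the summary, a URL built from them should load the same set of alerts in the UI.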

@@ -32,6 +33,7 @@ interface CreateGetSummarizedAlertsFnOpts {
ruleDataClient: PublicContract<IRuleDataClient>;
useNamespace: boolean;
isLifecycleAlert: boolean;
formatAlert?: (alert: unknown) => unknown;

Allows rule types to pass in custom functions for formatting summarized alerts. This allows detection rules to be formatted exactly as they were before.
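
A small sketch of how an optional `formatAlert` hook of type `(alert: unknown) => unknown` might be applied to summarized alerts. The `formatSummarizedAlerts` helper and the `toLegacyShape` formatter are hypothetical; the legacy shape used here is made up for illustration and is not the actual detection rule alert format.

```typescript
type FormatAlert = (alert: unknown) => unknown;

interface SummarizedGroup {
  count: number;
  data: unknown[];
}

// Applies the optional formatter to every alert; without one, alerts pass through unchanged.
function formatSummarizedAlerts(
  rawAlerts: readonly unknown[],
  formatAlert?: FormatAlert
): SummarizedGroup {
  const data = formatAlert ? rawAlerts.map((alert) => formatAlert(alert)) : rawAlerts.slice();
  return { count: data.length, data };
}

// Illustrative formatter only: wraps each alert in a made-up legacy envelope.
const toLegacyShape: FormatAlert = (alert) => {
  const source = (alert ?? {}) as Record<string, unknown>;
  return { _id: source['id'], _source: source };
};
```

Keeping the hook optional means rule types that do not need a custom shape are unaffected.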

})
.filter((_, idx) => response.body.items[idx].create?.status === 201);

createdAlerts.forEach((alert) =>

Reports alerts 1-to-1 back to the framework.
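
The reporting step can be sketched as follows: keep only the alerts whose bulk `create` item returned status 201, then schedule the `default` action group for each one. The interfaces below (`BulkResponseItem`, `AlertFactoryLike`, and so on) and the `reportCreatedAlerts` helper are simplified, hypothetical stand-ins rather than the framework's real types; the filter condition and the `create(...).scheduleActions('default', {})` call mirror the diff.

```typescript
// Simplified, hypothetical stand-ins for the bulk response and alert factory types.
interface BulkResponseItem {
  create?: { status: number };
}

interface WrittenAlert {
  _id: string;
}

interface AlertLike {
  scheduleActions(actionGroup: string, context: Record<string, unknown>): void;
}

interface AlertFactoryLike {
  create(id: string): AlertLike;
}

// Reports each successfully created alert document back to the framework
// and returns the IDs that were reported.
function reportCreatedAlerts(
  alerts: readonly WrittenAlert[],
  items: readonly BulkResponseItem[],
  alertFactory: AlertFactoryLike
): string[] {
  // Alerts and bulk response items are assumed to be index-aligned.
  const createdAlerts = alerts.filter((_, idx) => items[idx]?.create?.status === 201);

  createdAlerts.forEach((alert) =>
    alertFactory.create(alert._id).scheduleActions('default', {})
  );

  return createdAlerts.map((alert) => alert._id);
}
```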

frequency: {
summary: true,
notifyWhen: 'onThrottleInterval',
throttle: throttle === '1h' ? '5m' : throttle,

Setting 'hourly' to 5 minutes for easier testing.

ymao1 changed the title from "Poc/onboard detection rules" to "[POC] [Response Ops] Onboard detection rules to use alerting framework summaries." on Dec 20, 2022
.filter((_, idx) => response.body.items[idx].create?.status === 201);

createdAlerts.forEach((alert) =>
options.services.alertFactory.create(alert._id).scheduleActions('default', {})

Suggested change
- options.services.alertFactory.create(alert._id).scheduleActions('default', {})
+ options.services.alertFactory.create(alert._id).scheduleActions('default', {
+   alerts: [alert],
+   results_link: <ruleUrl>
+ }).replaceState({ signals_count: 1 });
