Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Response Ops] Onboard metric threshold rule type to use framework alerts as data #166664

Merged
merged 7 commits into from
Oct 18, 2023

Conversation

ymao1
Copy link
Contributor

@ymao1 ymao1 commented Sep 18, 2023

Resolves #164220

Summary

Removes the lifecycle executor wrapper around the metric threshold rule type executor so that this rule type is using the framework alerts client to write alerts as data documents.

Response ops changes

  • Passing in task startedAt date to the alerts client. Lifecycle executor rules use this standardized timestamp for the @timestamp field of the AaD doc, as well as for the start and end time of an alert

Metric threshold rule changes

  • Switch to using the alerts client in the executor to report alerts and to get recovered alert information.

@ymao1 ymao1 force-pushed the metric-threshold-rule-aad branch from f8fcdee to e1326ae Compare October 4, 2023 12:11
@ymao1 ymao1 changed the title Metric threshold rule aad [Response Ops] Onboard metric threshold rule type to use framework alerts as data Oct 4, 2023
@@ -275,7 +275,7 @@ export class AlertsService implements IAlertsService {
// check whether this context has been registered before
if (this.registeredContexts.has(context)) {
const registeredOptions = this.registeredContexts.get(context);
if (!isEqual(opts, registeredOptions)) {
if (!isEqual(omit(opts, 'shouldWrite'), omit(registeredOptions, 'shouldWrite'))) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The metric threshold alert registration options was actually used by multiple rule types within O11y so to limit the scope of this onboarding effort to a single rule type, I'm allowing different shouldWrite flags for the same context. Everything else should be the same (field maps, component template refs, etc).

> & {
// Defining a custom type for this because the schema generation script doesn't allow explicit null values
'kibana.alert.evaluation.values'?: Array<number | null>;
};
Copy link
Contributor Author

@ymao1 ymao1 Oct 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We auto-generate a schema for each context based on the registered field map. The auto-generated schema can be found at packages/kbn-alerts-as-data-utils/src/schemas/generated/observability_metrics_schema.ts. However, the mapping definition for kibana.alert.evaluation.values is an array of scaled_float values which translates to a schema Array<string | number> (since ES accepts numerical strings and will coerce into number). However, the type that's generated by the code is Array<number | null>. To accommodate, I'm modifying the auto-generated schema (which is just meant to be a convenience based on the field map anyway).

getAlertByAlertUuid,
} = services;
const { alertsClient, savedObjectsClient } = services;
if (!alertsClient) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently we have the alerts client initialized under undocumented feature flag (which defaults to true), so the alertsClient is typed as possibly undefined. We plan to remove the feature flag so we will be able to remove this check at that point.

[ALERT_REASON]: reason,
[ALERT_ACTION_GROUP]: actionGroup,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

action group is set by the framework so does not need to be explicitly reported back

@ymao1 ymao1 self-assigned this Oct 5, 2023
@ymao1 ymao1 added release_note:skip Skip the PR/issue when compiling release notes Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) Feature:Alerting/Alerts-as-Data Issues related to Alerts-as-data and RuleRegistry v8.12.0 labels Oct 5, 2023
@ymao1 ymao1 marked this pull request as ready for review October 5, 2023 19:01
@ymao1 ymao1 requested review from a team as code owners October 5, 2023 19:01
@elasticmachine
Copy link
Contributor

Pinging @elastic/response-ops (Team:ResponseOps)

@ymao1
Copy link
Contributor Author

ymao1 commented Oct 9, 2023

@elasticmachine merge upstream

@mikecote mikecote self-requested a review October 10, 2023 17:54
Copy link
Contributor

@mikecote mikecote left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Response Ops changes LGTM! Pulled down and tested locally, tried the migration scenario and the alerts kept persisting as expected 👍

@maryam-saeidi maryam-saeidi self-requested a review October 13, 2023 08:09
Copy link
Member

@maryam-saeidi maryam-saeidi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested locally and alerts were generated as expected 👏🏻 Just added some clarification questions.

image

I also checked some action variables and I didn't see context.host, I will double check it on Monday. Please let me know if you have any other specific scenario in mind that needs to be tested :)

alerts: MetricsRulesTypeAlertDefinition,
alerts: {
...MetricsRulesTypeAlertDefinition,
shouldWrite: true,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this shouldWrite mean?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We added this flag to the framework registration because we didn't want rules that were registered with both the rule registry and the alerting framework to have both write alerting docs. We should be able to remove this once we move all rules away from the rule registry

const indexedStartedAt =
getAlertStartedDate(UNGROUPED_FACTORY_KEY) ?? startedAt.toISOString();

alert.scheduleActions(actionGroupId, {
Copy link
Member

@maryam-saeidi maryam-saeidi Oct 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where does the scheduling action happen?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is rolled into the alertsClient.report call. It doesn't seem like any rule executor reports an alert without wanting an action scheduled for it, so we do it in a single call instead of asking rule executors to explicitly create and then schedule.

@ymao1
Copy link
Contributor Author

ymao1 commented Oct 13, 2023

@elasticmachine merge upstream

Copy link
Member

@maryam-saeidi maryam-saeidi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested context.host with only host.name group by field and it worked as expected.

@ymao1
Copy link
Contributor Author

ymao1 commented Oct 17, 2023

@elasticmachine merge upstream

@ymao1
Copy link
Contributor Author

ymao1 commented Oct 18, 2023

@elasticmachine merge upstream

@kibana-ci
Copy link
Collaborator

💛 Build succeeded, but was flaky

Failed CI Steps

Test Failures

  • [job] [logs] FTR Configs #66 / Cases - group 1 Create case "before each" hook for "creates a case with custom fields"
  • [job] [logs] Explore - Security Solution Cypress Tests #2 / risk tab with new risk score renders the table renders the table

Metrics [docs]

Unknown metric groups

ESLint disabled line counts

id before after diff
infra 45 43 -2

References to deprecated APIs

id before after diff
infra 49 48 -1

Total ESLint disabled count

id before after diff
infra 54 52 -2

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @ymao1

@ymao1 ymao1 merged commit f4dda26 into elastic:main Oct 18, 2023
@kibanamachine kibanamachine added the backport:skip This commit does not require backporting label Oct 18, 2023
@ymao1 ymao1 deleted the metric-threshold-rule-aad branch October 18, 2023 16:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport:skip This commit does not require backporting Feature:Alerting/Alerts-as-Data Issues related to Alerts-as-data and RuleRegistry release_note:skip Skip the PR/issue when compiling release notes Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v8.12.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Onboard an O11y rule type to use FAAD
7 participants