[Cases] Case action #168369

Merged
merged 73 commits into from
Apr 12, 2024

Conversation

@cnasikas cnasikas commented Oct 9, 2023

Summary

Depends on: #166267, #170326, #169484, #173740, #173763, #178068, #178307, #178600, #180437

PRs:

Fixes: #153837

Testing

Run Kibana with `--run-examples` if you want to use the "Always firing" rule.

Create a rule with a case action in observability and the stack. The security solution is not supported. You should not be able to assign a case action in a security solution rule.

  1. Test the "Reopen closed cases" configuration.
  2. Test the "Grouping by" configuration. Only one field is allowed. Not all fields are persisted in alerts. If you select a field that is not part of the alert, the case action will create a case where the grouping value is set to unknown.
  3. Test the "Time window" feature. You can comment out the validation to test for shorter times.
  4. Verify that the case action is experimental.
  5. Verify that based on the rule type the case is created in the correct solution.
  6. Verify that you cannot create a rule with the case action on the basic license.
  7. Verify that the execution of the case action fails if you do not have permission for cases. Work is pending at the system actions framework level to prevent users from creating rules with system actions they do not have permission for.
  8. Stress test the case action by creating multiple rules.

Release notes

Automatically create cases when an alert is triggered.

@cnasikas cnasikas self-assigned this Oct 9, 2023
@cnasikas cnasikas added the Team:ResponseOps, Feature:Cases, and release_note:feature labels Oct 9, 2023
cnasikas and others added 25 commits October 16, 2023 11:29
## Summary

This PR is a continuation of the work for the Case action. This PR
implements the basic logic of the case connector. Specifically:

1. Group the alerts based on the grouping provided by the user
2. Create the Oracle's SO IDs to fetch the records. If they do not exist
they will get created and the counter will be set to 1.
3. Create the cases' SO IDs to fetch the Cases. If they do not exist
they will get created.
4. Attach the alerts to the corresponding cases.
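
The grouping in step 1 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the connector's actual code: the `Alert` shape, `groupAlertsByField`, and `UNKNOWN_GROUP` are hypothetical names, and the "unknown" fallback for alerts that lack the field is taken from the testing notes of the parent PR.

```typescript
// Hypothetical alert shape for illustration; not the real Kibana type.
type Alert = Record<string, unknown>;

// Assumed bucket name for alerts that do not carry the grouping field.
const UNKNOWN_GROUP = 'unknown';

// Group alerts by the value of a single field. Alerts without the field
// are collected under the "unknown" bucket.
function groupAlertsByField(alerts: Alert[], field: string): Map<string, Alert[]> {
  const groups = new Map<string, Alert[]>();

  for (const alert of alerts) {
    const raw = alert[field];
    const key = raw === undefined || raw === null ? UNKNOWN_GROUP : String(raw);

    const bucket = groups.get(key) ?? [];
    bucket.push(alert);
    groups.set(key, bucket);
  }

  return groups;
}
```

Each resulting bucket would then map to one case, with the alerts in that bucket attached to it.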

Not in this PR:
- Handle errors
- Retries on errors
- Reopen cases
- Time window
- Race conditions
- Circuit breakers

Depends on: #168370,
#169484

### Checklist

Delete any items that are not applicable to this PR.

- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios

### For maintainers

- [x] This was checked for breaking API changes and was [labeled
appropriately](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)

---------

Co-authored-by: kibanamachine <[email protected]>
## Summary

Depends on: #171754

### Checklist

- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or

### For maintainers

- [x] This was checked for breaking API changes and was [labeled
appropriately](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)

---------

Co-authored-by: kibanamachine <[email protected]>
## Summary

This PR:

1. Creates the `CasesConnectorError` error
2. Separates the execution logic by moving the current logic to a new
class called `CasesConnectorExecutor`
3. Lets the `CasesConnector` class handle only the retry logic of the
connector
4. Implements the [Full jitter backoff
algorithm](https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/)
which is used as the retry strategy of the connector
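
For reference, the full jitter strategy from the linked article picks a random delay between zero and the capped exponential backoff. The sketch below only illustrates that formula; `fullJitterDelay`, the default `baseMs` and `capMs` values, and the injectable `random` parameter are assumptions and do not reflect the connector's real settings.

```typescript
// Full jitter: delay = random(0, min(cap, base * 2^attempt)).
// baseMs and capMs defaults are illustrative assumptions only.
function fullJitterDelay(
  attempt: number,
  baseMs: number = 100,
  capMs: number = 10000,
  random: () => number = Math.random
): number {
  // Exponential backoff, capped so delays do not grow without bound.
  const cappedBackoff = Math.min(capMs, baseMs * Math.pow(2, attempt));

  // Pick a uniformly random delay in [0, cappedBackoff).
  return Math.floor(random() * cappedBackoff);
}
```

Randomizing over the whole interval spreads retries from concurrent executions, which is the motivation given in the article for preferring full jitter over plain exponential backoff.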

Depends on: #172709

### Checklist

Delete any items that are not applicable to this PR.

- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios

### For maintainers

- [x] This was checked for breaking API changes and was [labeled
appropriately](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)

---------

Co-authored-by: kibanamachine <[email protected]>
## Summary

This PR adds logging to the case action

### Checklist

Delete any items that are not applicable to this PR.

- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios

### For maintainers

- [x] This was checked for breaking API changes and was [labeled
appropriately](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)

---------

Co-authored-by: kibanamachine <[email protected]>

@pgayvallet pgayvallet left a comment

LGTM

@js-jankisalvi js-jankisalvi left a comment

Found a weird bug, created a log threshold rule in o11y, alert overview shows created case in o11y. But it created a case in stack management, because the consumer for log threshold ruleType is alerts.

obs-stack.case.action.conflict.mov

cnasikas commented Apr 10, 2024

> Found a weird bug, created a log threshold rule in o11y, alert overview shows created case in o11y. But it created a case in stack management, because the consumer for log threshold ruleType is alerts.
>
> obs-stack.case.action.conflict.mov

Interesting bug. @XavierM recommended falling back to the producer of the rule if the consumer is a legacy one. Because it needs some slight changes in the system actions framework, I wonder if we can fix the bug in another PR.
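
A rough sketch of the proposed fallback, for discussion only. The function name, the `LEGACY_CONSUMERS` list, and the example producer value are hypothetical; the only facts taken from this thread are that `alerts` is the consumer reported for the log threshold rule and that the producer should be used when the consumer is a legacy one.

```typescript
// Hypothetical list of legacy consumers; only "alerts" is mentioned in this thread.
const LEGACY_CONSUMERS = new Set<string>(['alerts']);

// Prefer the rule's consumer, but fall back to the rule type's producer
// when the consumer is a legacy value that does not identify a solution.
function resolveConsumerOrProducer(consumer: string, producer: string): string {
  return LEGACY_CONSUMERS.has(consumer) ? producer : consumer;
}
```

For example, `resolveConsumerOrProducer('alerts', 'logs')` would return the producer, while a non-legacy consumer is returned unchanged.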

}

this.logger.debug(
`[CasesConnector][CasesConnectorExecutor][attachAlertsToCases] Attaching alerts to ${casesUnderAlertLimit.length} cases that do not have reach the alert limit per case`,

nit:

Suggested change
`[CasesConnector][CasesConnectorExecutor][attachAlertsToCases] Attaching alerts to ${casesUnderAlertLimit.length} cases that do not have reach the alert limit per case`,
`[CasesConnector][CasesConnectorExecutor][attachAlertsToCases] Attaching alerts to ${casesUnderAlertLimit.length} cases that have not reached the alert limit per case`,

: params.rule.name;

const groupingDescription = this.getGroupingDescription(grouping);
const description = `This case is auto-created by ${ruleName}. \n\n Grouping: ${groupingDescription}`;

nit:

Suggested change
const description = `This case is auto-created by ${ruleName}. \n\n Grouping: ${groupingDescription}`;
const description = `This case is created automatically by ${ruleName}. \n\n Grouping: ${groupingDescription}`;

Member Author

Shani suggested some changes in the description too. I was thinking of doing it on another PR. Do you mind if we also address this in the next PR?

@cnasikas

Some updates:

  • Janki's bug: We will fall back to the producer of the rule. Needs some changes in the system actions framework to pass the producer to the case action. I would prefer to do it on another PR.
  • Per Shani's feedback: Remove years and months from the time window - Done in f73ec73 (#168369)
  • Do not allow users to create a rule with a case action if they do not have access to it. Blocked by [Actions] Authorize system action when create or editing rules #180437. Done in a7ddc63 (#168369)
  • Show only aggregatable fields. Done in 888b043 (#168369)
  • The user is "Unknown" when creating a new case from the case action. We need to make some changes in the cases client. It will be done on another PR.

@kibana-ci

💚 Build Succeeded

Metrics [docs]

Module Count

Fewer modules leads to a faster build time

| id | before | after | diff |
| --- | --- | --- | --- |
| cases | 717 | 752 | +35 |

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run `node scripts/build_api_docs --plugin [yourplugin] --stats comments` for more detailed information.

| id | before | after | diff |
| --- | --- | --- | --- |
| @kbn/core-saved-objects-base-server-internal | 180 | 181 | +1 |
| actions | 290 | 292 | +2 |
| triggersActionsUi | 567 | 568 | +1 |
| total | | | +4 |

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

| id | before | after | diff |
| --- | --- | --- | --- |
| cases | 446.5KB | 475.2KB | +28.7KB |
| triggersActionsUi | 1.6MB | 1.6MB | +174.0B |
| total | | | +28.9KB |

Page load bundle

Size of the bundles that are downloaded on every page load. Target size is below 100kb

| id | before | after | diff |
| --- | --- | --- | --- |
| cases | 148.9KB | 153.0KB | +4.1KB |

Unknown metric groups

API count

| id | before | after | diff |
| --- | --- | --- | --- |
| @kbn/core-saved-objects-base-server-internal | 223 | 224 | +1 |
| actions | 296 | 298 | +2 |
| triggersActionsUi | 593 | 594 | +1 |
| total | | | +4 |

async chunk count

| id | before | after | diff |
| --- | --- | --- | --- |
| cases | 26 | 27 | +1 |

ESLint disabled line counts

| id | before | after | diff |
| --- | --- | --- | --- |
| cases | 58 | 61 | +3 |

Total ESLint disabled count

| id | before | after | diff |
| --- | --- | --- | --- |
| cases | 74 | 77 | +3 |

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @cnasikas

@js-jankisalvi js-jankisalvi left a comment

What amazing work!! 🤯 🎉 👏
Finally, the case action is here!! 📢 😃
Thank you for the detailed scenarios, they made testing easier ❤️

Create a rule with a case action in observability and the stack. The security solution is not supported. You should not be able to assign a case action in a security solution rule. ✅

Test the "Reopen closed cases" configuration. ✅
Test the "Grouping by" configuration. Only one field is allowed. Not all fields are persisted in alerts. If you select a field that is not part of the alert, the case action will create a case where the grouping value is set to unknown.

Only a single field is allowed, and the grouping value is unknown for non-alert fields ✅ - we still need to prevent new field selection which is not in the list which @adcoelho mentioned, could be done in another PR.

Test the "Time window" feature. You can comment out the validation to test for shorter times. ✅
Verify that the case action is experimental. ✅

Verify that based on the rule type the case is created in the correct solution.

Worked as expected, except for the bug with the log threshold rule that has alerts as its consumer.

Verify that you cannot create a rule with the case action on the basic license. ✅
Verify that the execution of the case action fails if you do not have permission for cases. Pending work on the system actions framework level to not allow users to create rules with system actions where they do not have permission.

Verified with none or read permission; in both scenarios the case action is disabled ✅

Stress test the case action by creating multiple rules. ✅

@cnasikas

> we still need to prevent new field selection which is not in the list which @adcoelho mentioned, could be done in another PR.

Thanks for that! I totally forgot it.

@jcger jcger left a comment

🚀

@cnasikas cnasikas merged commit b735d8c into main Apr 12, 2024
36 checks passed
@cnasikas cnasikas deleted the case_action branch April 12, 2024 09:01
@kibanamachine kibanamachine added the backport:skip label Apr 12, 2024
klacabane added a commit that referenced this pull request Apr 15, 2024
## Summary

Creates a system connector that can call the observability AI assistant
to execute actions on behalf of the user. The connector is tagged as tech
preview.

The connector can be triggered when an alert fires. It can be configured
with an initial message to the assistant, which generates an answer and
triggers potential actions on the assistant side. The current
experimental scenario is to ask the assistant to generate a report of
the alert that fired (by initially providing some context in the first
message), recalling any information or potential resolutions of previous
occurrences stored in the knowledge base and also including other active
alerts that may be related. As a last step, the assistant can be asked
to trigger an action; currently, only sending the report (or any other
message) to a preconfigured Slack webhook is supported.

## Testing
_Note: when asked to send a message to another connector (in our case
slack), we'll try to include a link to the generated conversation. It is
only possible to generate this link if
[server.publicBaseUrl](https://www.elastic.co/guide/en/kibana/current/settings.html#server-publicBaseUrl)
is correctly set in kibana settings._

- Create a slack webhook connector
- Get slack webhook. I can share one and invite you to the workspace, or
if you want to create one:
    - create personal workspace at https://slack.com/signin#workspaces
    - create an app for that workspace at https://api.slack.com/apps
    - under Features > OAuth & Permissions > Scopes > Bot Token Scopes, add
      `incoming-webhook` permission
    - install the app
    - webhook url is available under Features > Incoming Webhooks
- Create a rule that can be triggered with available documents and
attach observability AI assistant connector. (I use `Error Count
Threshold` and generate errors via `node scripts/synthtrace
many_errors.ts --live`)
- configure the connector with one genai connector and a message with
instructions. Example:
```
High error count alert has triggered. Execute the following steps:
  - create a graph of the error count for the service impacted by the alert for the last 24h
  - to help troubleshoot recall past occurrences of this alarm, also any other active alerts. Generate a report with all the found informations and send it to slack connector as a single message. Also include the link to this conversation in the report
```
- Track alert status and verify connector was executed. You should get a
slack notification sent by the assistant, and a new conversation will be
stored

TODO
- unit/integration tests - see
#168369 for reference
implementation
- documentation

---------

Co-authored-by: kibanamachine <[email protected]>
Co-authored-by: Dario Gieselaar <[email protected]>
cnasikas added a commit that referenced this pull request Apr 16, 2024
## Summary

In this PR:

- Address @adcoelho comments regarding documentation.
- Fix @js-jankisalvi bug about unsupported consumers
(#168369 (review)).
- Address @shanisagiv1 feedback regarding the title and the description.
Specifically:
- The title changed to "<rule_name> - Grouping by <grouping_by_value>
(Auto-created)".
- The description changed to "This case was created by the Case action
in <rule_name_link>. The assigned alerts are grouped by
<grouping_by_key>:<grouping_by_value>".
- Add the grouping key as a tag.
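
The new title, description, and tag formats above can be illustrated with a small sketch. The `AutoCaseInput` and `AutoCaseFields` interfaces and the `buildAutoCaseFields` function are hypothetical names used only for illustration, and `ruleLink` is assumed to be an already formatted link to the rule; only the string templates come from the description above.

```typescript
// Hypothetical input and output shapes, for illustration only.
interface AutoCaseInput {
  ruleName: string;
  ruleLink: string; // assumed to be a pre-formatted link to the rule
  groupingKey: string;
  groupingValue: string;
}

interface AutoCaseFields {
  title: string;
  description: string;
  tags: string[];
}

// Build the case fields following the formats listed in this PR description.
function buildAutoCaseFields(input: AutoCaseInput): AutoCaseFields {
  const { ruleName, ruleLink, groupingKey, groupingValue } = input;

  return {
    title: `${ruleName} - Grouping by ${groupingValue} (Auto-created)`,
    description: `This case was created by the Case action in ${ruleLink}. The assigned alerts are grouped by ${groupingKey}:${groupingValue}`,
    // The grouping key is added as a tag.
    tags: [groupingKey],
  };
}
```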

<img width="2289" alt="Screenshot 2024-04-13 at 4 41 36 PM"
src="https://github.com/elastic/kibana/assets/7871006/63e17947-5f39-4437-820b-7c69f42bfbe3">

The issue about the "Unknown" user will be fixed in another PR.

About @adcoelho bug:


https://github.com/elastic/kibana/assets/7871006/c46aa7c4-9d1a-475b-9d07-6bdff3ef00c8

I think it is fine to leave it as it is because a) the value will not be
saved even if it is added, b) an error is shown, and c) the only way
to do it properly is to validate while the user is typing, which would
lead to bad UX. If you feel otherwise, let me know.

### Checklist

Delete any items that are not applicable to this PR.

- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios

### For maintainers

- [x] This was checked for breaking API changes and was [labeled
appropriately](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)
Labels
backport:skip, Feature:Cases, release_note:feature, Team:ResponseOps, v8.14.0
Development

Successfully merging this pull request may close these issues.

Auto case creation when alerts are detected
9 participants