Scaling the alerting throughput ceiling from 3,200 to 32,000+ rules per minute #188194

Open
38 of 48 tasks
mikecote opened this issue Jul 12, 2024 · 1 comment
Labels
Meta Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams)

Comments

mikecote commented Jul 12, 2024

Problem Statement

Usage of Kibana background tasks and alerting rules keeps growing, and we are approaching our scalability ceiling of 3,200 tasks per minute.

Objective

Increase the overall alerting rule throughput by 10x before January 2025.

Goals

  • Raise the per cluster ceiling from 3,200 rules per minute to 32,000 rules per minute (10x)
  • Scale horizontally past the current 16 Kibana node limit
  • Scale vertically by raising per-node throughput above the current 200 rules per minute
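
The three figures in the goals appear to be related by simple multiplication: 16 nodes at 200 rules per minute each gives the 3,200 ceiling. The sketch below works through that arithmetic and shows how horizontal and vertical scaling trade off against each other. The function names and the 1,000 rules per minute figure are illustrative assumptions, not values from this issue.

```typescript
// Figures taken from the goals above.
const CURRENT_NODE_LIMIT = 16;
const CURRENT_PER_NODE_THROUGHPUT = 200; // rules per minute

// Assumes the cluster ceiling is simply nodes x per-node throughput.
function clusterCeiling(nodes: number, perNodeThroughput: number): number {
  return nodes * perNodeThroughput;
}

// Number of nodes needed to reach a target at a given per-node throughput.
function nodesNeeded(target: number, perNodeThroughput: number): number {
  return Math.ceil(target / perNodeThroughput);
}

const currentCeiling = clusterCeiling(CURRENT_NODE_LIMIT, CURRENT_PER_NODE_THROUGHPUT);
const targetCeiling = currentCeiling * 10;

console.log(currentCeiling); // 3200
console.log(targetCeiling); // 32000
console.log(nodesNeeded(targetCeiling, 200)); // 160 nodes with no vertical gains
console.log(nodesNeeded(targetCeiling, 1000)); // 32 nodes at a hypothetical 1,000 per node
```

Under these assumptions, horizontal scaling alone would need 160 nodes, which is why the goals pair it with higher per-node throughput.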

Scope

  • Making it possible to attain 10x scale at the framework level (32,000 rules per minute)
  • Attain 10x scale when alerting rules are optimized to use Kibana and Elasticsearch resources efficiently
    • The Elasticsearch Query rule, Index Threshold rule, etc. are already optimized for efficiency
  • Attain 10x scale when alerting rules complete running within a few seconds
  • Work with solution teams to optimize their rule types

Workstreams

  • 10x at the framework level
  • Rule type optimizations
  • Elasticsearch API key performance
  • Vertical scaling on Cloud
  • Regular performance testing

Roadmap for "10x at the framework level" workstream

1. PoC

  • PoC to attain 10x alerting throughput (32,000 rules per minute) kibana#182394

2. Solve the horizontal scalability limits by allowing more Kibana nodes to run tasks

  • Allow multiple task claiming strategies (feature flag) kibana#171677
  • Create a task claiming strategy that doesn't rely on forced index refreshes and performs search+get+update within Kibana kibana#181325
  • Create a Kibana discovery service to help assign partitions to the Kibana nodes kibana#187696
  • Assign tasks to partitions kibana#187698
  • Assign partitions to Kibana nodes kibana#187700
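
As a rough illustration of the partitioning items above, the sketch below maps each task to a partition and splits the partitions across the nodes known to a discovery service. It is a minimal sketch under stated assumptions, not the Kibana implementation: the partition count of 256, the hash function, the round-robin split, and all function names are hypothetical.

```typescript
const PARTITION_COUNT = 256; // assumed value, for illustration only

// Simple deterministic string hash (djb2 variant); a hypothetical choice.
function hashString(value: string): number {
  let hash = 5381;
  for (let i = 0; i < value.length; i++) {
    hash = ((hash * 33) ^ value.charCodeAt(i)) >>> 0;
  }
  return hash;
}

// Each task always lands in the same partition.
function partitionForTask(taskId: string): number {
  return hashString(taskId) % PARTITION_COUNT;
}

// Split partitions round-robin across the sorted list of active nodes,
// so every node computes the same split without coordinating.
function partitionsForNode(nodeId: string, activeNodeIds: string[]): number[] {
  const sorted = [...activeNodeIds].sort();
  const index = sorted.indexOf(nodeId);
  if (index === -1) return [];
  const partitions: number[] = [];
  for (let p = 0; p < PARTITION_COUNT; p++) {
    if (p % sorted.length === index) partitions.push(p);
  }
  return partitions;
}

console.log(partitionsForNode("node-a", ["node-a", "node-b"]).length); // 128
```

With a split like this, each node only needs to search for tasks in its own partitions, so nodes are less likely to contend for the same tasks.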

3. Solve the vertical scalability limits by running more tasks per Kibana node

  • Implement resource based task scheduling to run more tasks on larger Kibana configurations (memory and CPU) kibana#185043
  • Adjust default capacity when running in ECH kibana#189117
  • Set Indicator match rules as ExtraLarge cost kibana#189112
  • Change poll interval default to 500ms kibana#190059
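
One way to picture resource-based scheduling is as a cost budget per node: each task type carries a cost, and a node claims tasks until its capacity is used up. The sketch below is a hedged illustration only. The cost names Tiny and ExtraLarge come from the items in this issue; the Normal tier, the numeric values, and the selection function are assumptions.

```typescript
// Cost tiers; the numeric weights are assumed for illustration.
const TaskCost = { Tiny: 1, Normal: 2, ExtraLarge: 10 } as const;
type TaskCost = (typeof TaskCost)[keyof typeof TaskCost];

interface PendingTask {
  id: string;
  cost: TaskCost;
}

// Walk the tasks in priority order and take each one that still fits
// in the remaining capacity; tasks that do not fit are skipped.
function selectTasksWithinCapacity(tasks: PendingTask[], capacity: number): PendingTask[] {
  const selected: PendingTask[] = [];
  let remaining = capacity;
  for (const task of tasks) {
    if (task.cost <= remaining) {
      selected.push(task);
      remaining -= task.cost;
    }
  }
  return selected;
}

const pending: PendingTask[] = [
  { id: "a", cost: TaskCost.Normal },
  { id: "b", cost: TaskCost.ExtraLarge },
  { id: "c", cost: TaskCost.Tiny },
  { id: "d", cost: TaskCost.Normal },
];

// With a capacity of 5, this selects a, c and d; b is too expensive to fit.
console.log(selectTasksWithinCapacity(pending, 5).map((t) => t.id));
```

A larger Kibana configuration would simply be given a larger capacity, which lets it run more tasks per poll without changing the task definitions.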

4. Work items remaining before rolling out to Serverless

  • Cache partitions calculation in Kibana for 10s kibana#189119
  • Rename task claimers kibana#190542
  • Add more functional tests for resource based task claiming kibana#189111
  • Make some new task manager constants configurable kibana#190734
  • Fix starvation issue when there are multiple limited concurrency tasks kibana#184937
  • Fix errors during processing task result that are not shown in metrics kibana#184173
  • Convert logger.warn to thrown errors (or something alike) in mget claims strategy so serverless metrics picks them up kibana#190082
  • Update some runbooks in preparation for rolling out mget to serverless response-ops-team#226
  • Fix errors during marking tasks as running that are not shown in metrics kibana#184171
  • Create ad-hoc rollout overview dashboard
  • Performance test at small and large scale on serverless
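
The "cache partitions calculation for 10s" item above can be pictured as a small time-to-live wrapper around the calculation. This is a minimal sketch assuming a synchronous calculation; `cacheFor` and the placeholder result are hypothetical names and values, not Kibana APIs.

```typescript
// Wrap a computation so its result is reused until ttlMs has elapsed.
// The clock is injectable to make the behaviour easy to test.
function cacheFor<T>(ttlMs: number, compute: () => T, now: () => number = Date.now): () => T {
  let cached: { value: T; expiresAt: number } | undefined;
  return () => {
    const current = now();
    if (cached === undefined || current >= cached.expiresAt) {
      cached = { value: compute(), expiresAt: current + ttlMs };
    }
    return cached.value;
  };
}

let computations = 0;
const getPartitions = cacheFor(10000, () => {
  computations++;
  return [0, 1, 2]; // placeholder for the real partition calculation
});

getPartitions();
getPartitions();
console.log(computations); // 1
```

The trade-off is staleness: a change in the set of nodes may take up to the TTL to be reflected in the cached result.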

5. Initial rollout to Serverless

6. Work items remaining before 8.16 feature freeze

  • Increase the rules per minute circuit breaker from 10k to 32k kibana#190057
  • Set action tasks as Tiny cost kibana#190542
  • Move tasks directly from idle state to running (skip claiming) kibana#184739
  • Skip loading dataView and searchSourceClient services unless necessary kibana#184322
  • Perform partial updates when claiming and releasing tasks kibana#187704
  • Cache query delay settings kibana#184321
  • Cache flapping settings and only load when necessary kibana#149884
  • Allow list on ECH and Docker some new kibana.yml settings kibana#192183
  • Modify task manager docs to use xpack.task_manager.capacity kibana#192185
  • Skip loading the rule's schedule a second time kibana#192396
  • Partial update tasks after they run kibana#192398
  • Cache and load maintenance windows only when necessary kibana#184324
  • Partial update rules after they run kibana#192397
  • Finish jest, functional, integration tests for mget task claimer kibana#184942
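
Several items above mention partial updates. The general idea, as understood here, is to send only the fields that changed instead of rewriting the whole document. The sketch below illustrates that idea with a hypothetical flat task document; the field names and the `partialUpdate` helper are assumptions, not the task manager's actual schema or code.

```typescript
// Hypothetical flat task document, for illustration only.
type TaskDoc = Record<string, string | number | null>;

// Return only the fields of `next` whose values differ from `previous`.
function partialUpdate(previous: TaskDoc, next: TaskDoc): TaskDoc {
  const changed: TaskDoc = {};
  for (const key of Object.keys(next)) {
    if (previous[key] !== next[key]) {
      changed[key] = next[key];
    }
  }
  return changed;
}

const before: TaskDoc = { status: "idle", attempts: 0, ownerId: null, params: "{}" };
const after: TaskDoc = { ...before, status: "running", ownerId: "node-1" };

// Only status and ownerId are included in the update body.
console.log(partialUpdate(before, after));
```

Smaller update bodies mean less data serialized and sent per task run, which matters when the same write happens tens of thousands of times per minute.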

7. Optional follow-ups and optimizations after 8.16

  • Recurring tasks to be rescheduled based on runAt rather than startedAt when the gap between the two is less than 10 seconds kibana#189114
  • Remove Kibana validation for event log documents
  • Run CPU profile to determine other CPU-intensive operations
  • Run heap snapshots to determine other memory-intensive operations
  • Fix the bulk API performance so users can manage 32,000 rules kibana#188558
  • Move the task claiming away from using scripted sorting (maybe using ES|QL)
  • [Task Manager] Explore using process.hrtime for the Kibana discovery service #188465
  • Review rules per minute limits set on serverless and max number of bg nodes (6)
  • Autoscaling bg nodes on ECH; currently, scaling to 32 Kibana nodes provides 32 UI and 32 BG nodes (could halve the needs)
  • Ensure we rotate whenever pulling tasks from the limited concurrency queue
  • Continue claiming tasks on conflicts kibana#184940
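
The first follow-up above, rescheduling from runAt instead of startedAt when the gap is under 10 seconds, can be sketched as follows. This is a hedged reading of that item: the function name and the handling of a negative gap are assumptions.

```typescript
const MAX_GAP_MS = 10000; // the 10 second threshold from the item above

// Pick the base timestamp for the next run of a recurring task.
// If the task started less than 10 seconds after it was due, schedule from
// runAt so small delays do not accumulate; otherwise fall back to startedAt.
function nextRunAt(runAt: Date, startedAt: Date, intervalMs: number): Date {
  const gap = startedAt.getTime() - runAt.getTime();
  const base = gap >= 0 && gap < MAX_GAP_MS ? runAt : startedAt;
  return new Date(base.getTime() + intervalMs);
}

const due = new Date("2024-07-12T00:00:00.000Z");
const startedSoon = new Date("2024-07-12T00:00:03.000Z");
const startedLate = new Date("2024-07-12T00:00:15.000Z");

console.log(nextRunAt(due, startedSoon, 60000).toISOString()); // 2024-07-12T00:01:00.000Z
console.log(nextRunAt(due, startedLate, 60000).toISOString()); // 2024-07-12T00:01:15.000Z
```

In the first case the 3 second delay is absorbed and the rule stays on its one-minute cadence; in the second the delay is large enough that the schedule shifts.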

8. Blogpost

@mikecote mikecote added Meta Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) labels Jul 12, 2024
@elasticmachine

Pinging @elastic/response-ops (Team:ResponseOps)

@mikecote mikecote self-assigned this Jul 17, 2024
mikecote pushed a commit that referenced this issue Nov 6, 2024
kibanamachine pushed a commit to kibanamachine/kibana that referenced this issue Nov 6, 2024
…lastic#199043)

## Summary

This PR adds elastic#188194 to the
8.16.0 Kibana release notes.
It also fixes a formatting issue.

### Preview

https://kibana_bk_199043.docs-preview.app.elstc.co/guide/en/kibana/master/release-notes-8.16.0.html
(cherry picked from commit 14a1a92)
kibanamachine added a commit that referenced this issue Nov 6, 2024
…otes (#199043) (#199105)

# Backport

This will backport the following commits from `main` to `8.16`:
- [[DOCS] Add alerting performance enhancements to 8.16 release notes
(#199043)](#199043)


Co-authored-by: Lisa Cawley <[email protected]>
kibanamachine added a commit that referenced this issue Nov 6, 2024
…tes (#199043) (#199106)

# Backport

This will backport the following commits from `main` to `8.x`:
- [[DOCS] Add alerting performance enhancements to 8.16 release notes
(#199043)](#199043)


Co-authored-by: Lisa Cawley <[email protected]>
@mikecote mikecote changed the title Scaling the alerting throughput ceiling from 3,200 to 32,000 rules per minute Scaling the alerting throughput ceiling from 3,200 to 32,000+ rules per minute Nov 6, 2024
mgadewoll pushed a commit to mgadewoll/kibana that referenced this issue Nov 7, 2024