[Metrics UI] Increase composite size to 10K for Metric Threshold Rule and optimize processing #121904
Conversation
```diff
@@ -96,27 +99,22 @@ export const evaluateRule = <Params extends EvaluatedRuleParams = EvaluatedRuleP
   // If any previous groups are no longer being reported, backfill them with null values
   const currentGroups = Object.keys(currentValues);

-  const missingGroups = prevGroups.filter((g) => !currentGroups.includes(g));
+  const missingGroups = difference(prevGroups, currentGroups);
```
If `currentGroups` is large enough, we could convert it to a hash/set for constant-time reads to bring this down to linear execution, but that's probably what lodash does?
The gain from using Lodash's `difference` is probably sufficient, plus it's easy enough to read. If the new code had been in place when I was doing the performance testing, I probably wouldn't have even bothered with it.
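For anyone curious, the Set-based variant alluded to above would look roughly like this (illustrative only; the PR itself keeps Lodash's `difference`):

```ts
// Build a Set of the current groups so each membership check is O(1),
// making the whole difference linear instead of quadratic.
function findMissingGroups(prevGroups: string[], currentGroups: string[]): string[] {
  const current = new Set(currentGroups);
  return prevGroups.filter((group) => !current.has(group));
}
```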
Pinging @elastic/infra-monitoring-ui (Team:Infra Monitoring UI)
Is this processing exposed to an endpoint, or is it a background task? I'm wondering if we could leverage performance tooling to tune the `group_by_page_size` value depending on its impact on different hardware.
@klacabane We consulted with the Elasticsearch team quite a bit, and 10K was their general recommendation. I recently tested this on a smaller 8G node and it was similarly performant. I'm feeling pretty confident about 10K, but the setting will give users an escape hatch if they need it.
… and optimize processing (#121904) (#126506)

* [Metrics UI] Increase composite size for Metric Threshold Rule to 10K
* Adding performance optimizations
* Fixing metrics_alerting integration test
* fixing tests
* Fixing integration test and config mock
* Removing the setTimeout code to simplify to a for/of
* Adding new setting to docs
* Adding metric_threshold identifier to the config setting

(cherry picked from commit ae0c8d5)

Conflicts: x-pack/plugins/infra/server/lib/alerting/metric_threshold/lib/evaluate_rule.ts
Summary
This PR closes #119501 by introducing a new configuration value, `xpack.infra.alerting.metric_threshold.group_by_page_size`, which controls the composite size for the group-by queries. The default value is now set to `10000`; the original value was `100`.
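For reference, the setting can be overridden in `kibana.yml`; the value shown below is the new default, and lowering it is the escape hatch mentioned above:

```yaml
# kibana.yml: composite page size used by the Metric Threshold rule's group-by queries
xpack.infra.alerting.metric_threshold.group_by_page_size: 10000
```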
I also took the opportunity to do some performance optimizations, which also closes #120249, by refactoring how the Metric Threshold rule processes the data. I set up a baseline by indexing 10K unique events into Elasticsearch every 60 seconds and creating a Metric Threshold rule with a condition that checks for document count, grouped by a unique event identifier. Then I instrumented all the code where it was looping through the results using `console.time`/`console.timeEnd`.
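The instrumentation pattern was roughly the following (sketch only; the label matches one of the event names in the results, and the surrounding rule code is elided):

```ts
// Sketch of the timing instrumentation: wrap each hot section in
// console.time / console.timeEnd with a descriptive label.
console.time('Reduce results into groups');
// ... reduce the composite aggregation buckets into a keyed object ...
console.timeEnd('Reduce results into groups'); // prints the elapsed time for this section
```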
Ultimately, I ended up identifying three sections to focus on.

The first was a `reduce` that converts the ES buckets into a large hash with the groupings as the keys and the parsed results as the values. While I was performance testing this code, I noticed that it was also blocking the event loop for about 20 seconds with 10K groups. This code is identified by "Reduce results into groups" in the results (`x-pack/plugins/infra/server/lib/alerting/metric_threshold/lib/evaluate_rule.ts`, lines 205 to 219 in 89cdb47).
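As a rough illustration of the shape of this step (simplified, hypothetical types rather than the actual Kibana code): building the accumulator with an object spread inside `reduce` is quadratic in the number of groups, while mutating a single object in a plain loop keeps it linear, which is one way to avoid this kind of event-loop stall.

```ts
// Simplified sketch: collapse composite-aggregation buckets into a record
// keyed by the group-by value. The bucket shape here is hypothetical.
interface CompositeBucket {
  key: Record<string, string>;
  doc_count: number;
}

function reduceResultsIntoGroups(buckets: CompositeBucket[]): Record<string, number> {
  const results: Record<string, number> = {};
  for (const bucket of buckets) {
    // e.g. { groupBy0: 'event-0001' } -> 'event-0001'
    const group = Object.values(bucket.key).join(', ');
    results[group] = bucket.doc_count; // mutate in place, no per-iteration object spread
  }
  return results;
}
```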
The second was a `filter`/`includes` that was basically finding the difference between the groups returned from Elasticsearch and the groups the rule had previously seen. In the 10K entity scenario the gains seem modest, but when I changed to 50K this code was adding an additional 1 second of processing. This code is identified by "Find missing groups" in the results (`x-pack/plugins/infra/server/lib/alerting/metric_threshold/lib/evaluate_rule.ts`, line 99 in 89cdb47).
The third was another `reduce` that was used to backfill the missing groups with either `zero` or `null`. When the groups go missing, this step takes ~19 seconds and also blocks the event loop. This code is identified by "Backfill previous groups" in the results (`x-pack/plugins/infra/server/lib/alerting/metric_threshold/lib/evaluate_rule.ts`, lines 105 to 119 in 89cdb47).
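For context, the backfill step boils down to something like the sketch below (hypothetical shapes and names; the real code also carries the full evaluation state for each group). As with the grouping step, mutating one accumulator object keeps the work linear in the number of missing groups.

```ts
// Simplified sketch: give every group that stopped reporting a filler value
// (null, or 0 for document-count conditions) so the rule still evaluates it.
function backfillMissingGroups(
  missingGroups: string[],
  fillValue: number | null
): Record<string, number | null> {
  const backfilled: Record<string, number | null> = {};
  for (const group of missingGroups) {
    backfilled[group] = fillValue;
  }
  return backfilled;
}
```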
There are two scenarios which I measured: 10K entities reporting, and 10K entities that stop reporting.
I also tested this with 50K unique events. The Metric Threshold rule executor handles the 50K group-bys with little effort; it takes about 2-4 seconds to process. BUT the alerting framework ends up with an error, which is being tracked via #122288.
Results
10K entities reporting
As you can see above, the biggest gain was ~21 seconds from the "Reduce results into groups" event. There is also a modest gain of ~80 milliseconds with the "Find missing groups" event.
10K entities stop reporting
The biggest gain in this scenario is the "Backfill previous groups" event with a ~19 second difference.
Setup for PR Review
1. In `docker-compose.yml`, set `EVENTS_PER_CYCLE` to `10000` and `PAYLOAD_SIZE` to `5000`.
2. Run `ELASTIC_VERSION=8.1.0-SNAPSHOT docker compose up --remove-orphans`.
3. Start Kibana with `yarn start --ssl`. DO NOT start Elasticsearch from the Kibana directory; the Docker cluster is configured to work with Kibana source.
4. Create a Metric Threshold rule with a document count condition, grouped by `label.eventId`.