[Logs UI] Optimize grouped rule execution in the log threshold rule type #124130
Comments
Pinging @elastic/infra-monitoring-ui (Team:Infra Monitoring UI)
@Kerry350 I'd like to hear your thoughts on this. Do you think it's feasible or did I miss anything?
@weltenwort Thanks for breaking this down so thoroughly. This seems totally feasible to me using the methods you've stated. Do you think we should make some before / after metrics (probably just execution time) part of the ACs? (Tends to be easier to grab those things as we develop).
I agree we'll end up with something that looks quite different, but we should be able to come up with something "easy to follow" again.
That would be neat, but would require a pretty big test dataset. I'll try to see how difficult it would be to appropriate the synthtrace cli for that. Any thoughts on what characteristics that dataset needs to have?
Yeah, fair point. I know Chris did some good work to produce high cardinality datasets for metrics, but I appreciate that might not help here. For characteristics I'd say we need one or more fields to represent the high cardinality nature, and then it would also be handy to have some fields with varied types (text, keyword etc) so that we can test against different comparators (which have the largest influence on the eventual query).
After looking at the synthtrace architecture and discussing it with @miltonhultgren I'd say it should be pretty feasible to add support for generating log entries and setting up the correct mappings. It's not a quick thing, though, so I'd rather not make it a dependency of this issue. I'll try to come up with a simple bash script or so to use as a fallback in case this is prioritized higher than the synthtrace improvement.
Sorry for the delay in responding but this sounds like an excellent plan of action to me.
Pinging @elastic/obs-ux-logs-team (Team:obs-ux-logs)
📓 Summary
We want to optimize the way the log threshold rule type executor queries and processes the data so that it doesn't block the Node.js event loop and uses less memory.
part of #98010
ℹ️ Background
The log threshold rule type executor currently handles four particular cases based on the rule params:
- ungrouped with a single "count" criterion
  - `size: 0`
  - `size: 0`
- ungrouped with a "ratio" of two criteria
  - `size: 0`
  - `size: 0`
- grouped with a single "count" criterion
  - `composite` agg over all grouping criteria as `terms` with page size 2000
- grouped with a "ratio" of two criteria
  - `composite` agg over all grouping criteria as `terms` with page size 2000

💡 Optimizations
The grouped cases have particularly high optimization potential due to two factors:
Perform more computation in Elasticsearch
In order to ensure the alert doesn't miss any groups due to approximate `terms` results, the grouping is performed using a `composite` aggregation. That aggregation currently has the limitation that it can't be post-processed as a whole using a pipeline agg such as `bucket_selector`.
The individual pages can be processed that way, though. A `bucket_selector` in a composite agg can remove buckets from a page that don't match certain criteria (such as having a doc count above or below a threshold). As a consequence, pages would be partly empty, leading to smaller response sizes, and the threshold computation would be performed by Elasticsearch. This advantage would be largest for the most complex case of the grouped ratio rule, where a script could calculate the ratio before performing the filtering. That also means that the numerator/denominator filters probably need to be moved (or duplicated) beneath the `composite` agg.
If the `bucket_selector` script is written to take the threshold as `params`, its compilation result could even be cached between executions.

Process results incrementally
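As a concrete reference point for the per-page filtering discussed under "Perform more computation in Elasticsearch", the aggregation part of a grouped ratio request might look roughly like the sketch below. The function name, the parameter names, the aggregation names (`groups`, `numerator`, `denominator`, `matchesThreshold`) and the hard-coded "greater than" comparator are all assumptions made for illustration; only `composite`, `terms`, `bucket_selector` and the page size of 2000 come from the issue text.

```typescript
// Hypothetical parameter shape; not the actual rule params of the executor.
interface GroupedRatioAggParams {
  groupBy: string[]; // fields used as composite sources
  numeratorFilter: Record<string, unknown>; // query DSL for the numerator criteria
  denominatorFilter: Record<string, unknown>; // query DSL for the denominator criteria
  threshold: number; // ratio threshold, passed to the script as a param
  pageSize?: number; // defaults to the page size mentioned in the issue
  afterKey?: Record<string, unknown>; // after_key of the previous page, if any
}

// Builds the `aggs` part of a search request (illustrative sketch only).
function buildGroupedRatioAggs(p: GroupedRatioAggParams) {
  return {
    groups: {
      composite: {
        size: p.pageSize ?? 2000,
        sources: p.groupBy.map((field) => ({ [field]: { terms: { field } } })),
        ...(p.afterKey ? { after: p.afterKey } : {}),
      },
      aggs: {
        // The numerator/denominator filters live beneath the composite agg
        // so that each group bucket carries both doc counts.
        numerator: { filter: p.numeratorFilter },
        denominator: { filter: p.denominatorFilter },
        // Drops buckets of the current page whose ratio is not above the threshold.
        matchesThreshold: {
          bucket_selector: {
            buckets_path: {
              numerator: 'numerator>_count',
              denominator: 'denominator>_count',
            },
            script: {
              // The source stays constant; only params change between rules,
              // which is what would allow the compiled script to be reused.
              source:
                'params.denominator > 0 && params.numerator / params.denominator > params.threshold',
              params: { threshold: p.threshold },
            },
          },
        },
      },
    },
  };
}
```

A real implementation would need one script variant (or an extra param) per comparator and would have to decide how to treat groups with an empty denominator; this sketch simply filters them out.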
Assuming the ratio calculation and filtering are performed in a `bucket_selector` script as described above, the pages of groups could be processed immediately as they come in. This would interleave IO operations with the (much reduced) computation, which would allow the event loop to preempt execution in favor of other workloads. Currently, the code has a high degree of code reuse due to its carefully crafted decomposition (👏), which would be a bit harder to maintain in an interleaved execution model. But it's probably still possible to come up with an adequate structure, even if it looks a bit different.

✔️ Acceptance criteria
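As a rough illustration of the incremental model described under "Process results incrementally", the page loop could be expressed as an async generator so that each page is handled as soon as it arrives. All names here (`iterateGroupPages`, `processGroups`, `FetchPage`) are hypothetical, and the sketch assumes that the last page is signalled by a response without an `after_key`; it is not the executor's actual structure.

```typescript
type AfterKey = Record<string, unknown>;

// Minimal shape of one composite aggregation page (assumed for this sketch).
interface CompositePage<B> {
  buckets: B[];
  after_key?: AfterKey;
}

// Hypothetical callback that runs one search request and returns one page.
type FetchPage<B> = (afterKey?: AfterKey) => Promise<CompositePage<B>>;

// Yields the buckets of each non-empty page as soon as that page has arrived.
async function* iterateGroupPages<B>(fetchPage: FetchPage<B>): AsyncGenerator<B[]> {
  let afterKey: AfterKey | undefined;
  do {
    const page = await fetchPage(afterKey);
    // Pages may be partly or entirely empty once filtering happens in Elasticsearch.
    if (page.buckets.length > 0) {
      yield page.buckets;
    }
    afterKey = page.after_key;
  } while (afterKey !== undefined);
}

// Handles buckets page by page; every awaited request gives the event loop
// a chance to run other work between pages.
async function processGroups<B>(
  fetchPage: FetchPage<B>,
  handleBucket: (bucket: B) => void
): Promise<number> {
  let processed = 0;
  for await (const buckets of iterateGroupPages(fetchPage)) {
    for (const bucket of buckets) {
      handleBucket(bucket);
      processed++;
    }
  }
  return processed;
}
```

Because only one page of already filtered buckets is held at a time, memory use would be bounded by the page size rather than by the total number of groups.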