[Logs UI] Handle log threshold alert grouping fields with large cardinalities more robustly #98010
Comments
Pinging @elastic/logs-metrics-ui (Team:logs-metrics-ui)
It looks like the filter is applied as a child of the composite aggregation, which means that if the group by field has high cardinality, it will run into issues regardless of whether the user has specified a filter.
@dgieselaar That is true, but intentional. The reasoning is explained in #68250 (review): TL;DR: filtering too early means we can't detect disappearing groups
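For illustration, the aggregation shape being discussed looks roughly like the sketch below (field names, the time range, and the filter are made up, not taken from the executor): the condition is applied as a `filter` sub-aggregation beneath the `composite` aggregation, so every group is enumerated before the filter runs.

```ts
// Sketch of the aggregation shape under discussion; field names are examples only.
// The composite agg enumerates *all* values of the group-by field, page by page,
// and the condition filter only runs afterwards, as a child aggregation.
const lateFilteredRequest = {
  size: 0,
  query: {
    bool: {
      filter: [{ range: { '@timestamp': { gte: 'now-5m' } } }],
    },
  },
  aggs: {
    groups: {
      composite: {
        size: 40, // the page size mentioned later in this thread
        sources: [{ group: { terms: { field: 'host.name' } } }],
      },
      aggs: {
        // Filtering here (instead of in `query`) keeps zero-count groups visible,
        // which is what allows detecting groups that disappeared.
        filtered_docs: {
          filter: { term: { 'log.level': 'error' } },
        },
      },
    },
  },
};
```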
@weltenwort I don't see an option in the UI to alert on no data? Am I looking in the wrong place? (Looks like the popovers keep getting stuck as well, btw)
You can alert on a threshold of
I think this also arises when you query for "less than n documents", if I recall correctly (since the 0 scenario is always less than n).
Yes, that sounds correct. 👍 It would still be valuable to optimize the query for "more than" cases IMHO.
One solution would be to offer the user two filters: one that is applied as part of the query, and one that is applied as a filter aggregation.
If there's anything we can do at Platform level to make this kind of thing easier, please don't hesitate to pull us in :)
Yes, that would be one approach, but it might make it a lot harder for the user to understand. Maybe there's a way to structure this in the UI in such a way that the semantics of opting into an expensive execution strategy would become apparent. Those would all be breaking changes, though.
I notice we are using a composite.size of 40 for this query. Are we basing this number exactly on the number of buckets we expect? We bumped the metrics inventory query up to 2000 for this size (based on bucket calculations) and it had a massive effect on perf.
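To make the page-size trade-off concrete, here is a rough sketch of how composite pagination behaves with a configurable size; `runSearch` is a placeholder for however the executor actually issues the request, and the field name is illustrative. A larger size fetches the same buckets in fewer round trips, but everything still ends up in memory.

```ts
// Hypothetical pagination loop; `runSearch` stands in for the real search call.
async function fetchAllGroups(
  runSearch: (body: Record<string, unknown>) => Promise<any>,
  pageSize = 40 // bumping this (e.g. toward 2000) reduces the number of requests
) {
  const buckets: unknown[] = [];
  let afterKey: Record<string, unknown> | undefined;

  do {
    const response = await runSearch({
      size: 0,
      aggs: {
        groups: {
          composite: {
            size: pageSize,
            sources: [{ group: { terms: { field: 'host.name' } } }],
            ...(afterKey ? { after: afterKey } : {}),
          },
        },
      },
    });

    const agg = response.aggregations?.groups;
    buckets.push(...(agg?.buckets ?? []));
    afterKey = agg?.after_key; // undefined once there are no more pages
  } while (afterKey);

  return buckets; // everything is held in memory, which is part of the concern below
}
```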
FWIW, I'm not sure if this solves the issue where Kibana itself OOMs (as the responses are kept in memory).
Probably not, because it would only reduce the number of requests it performs to fetch the same amount of data.
I think it still likely makes the rule execution take longer than necessary unless we have a good reason to keep it at 40. But yes, we probably need to think broader than that.
I'm going to implement this quick win, which will at least dramatically improve the "more than" scenario. I'll also increase the composite size, and lastly add a note about high cardinality fields to the docs (more of a stop gap until we provide some UI options). Edit: I'll also add an in-UI warning that using a "less than" comparator together with a group by field can cause problems. We can add this without committing to options that will need to be persisted.
Hi, I ended up here tracing from https://github.com/elastic/sdh-kibana/issues/2266 and wondering why this isn't a top N query. For an alert like "count is more than 75 group by arn" I don't think I would expect an exhaustive list of all possible arns when there are thousands of them. I would have expected a top N query such that we could say "more than 75 error messages found for these top 10 arns". Perhaps with a note like "{{sum_other_doc_count}} documents were skipped after finding those top 10, which may contain additional arns".
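For reference, a top N variant could look roughly like the sketch below (field name and sizes are illustrative, not what the rule does today): a `terms` aggregation returns only the largest buckets, and `sum_other_doc_count` in the response reports how much was skipped.

```ts
// Illustrative only: a top N alternative using a terms aggregation instead of composite.
const topNRequest = {
  size: 0,
  query: {
    bool: {
      filter: [{ range: { '@timestamp': { gte: 'now-5m' } } }],
    },
  },
  aggs: {
    top_groups: {
      terms: {
        field: 'aws.arn',
        size: 10, // only the 10 largest groups come back
        order: { _count: 'desc' },
      },
    },
  },
};

// In the response, `aggregations.top_groups.sum_other_doc_count` reports how many
// documents fell outside the returned buckets, which could feed a note like the one
// suggested above.
```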
I guess "less than" might be a bit of a gotcha given that you should avoid |
@matschaffer There might be a misunderstanding here. Grouping doesn't create an alert with a list, but a separate alert for each group. That way a downstream notification tool can fan out the alerts to the proper channels.
Ahhh... so each one could potentially go to a different channel?
Yes, or be indexed using the
Dropping an idea @weltenwort mentioned in a chat: it could be nice if we had some way to simulate a rule execution while a rule is being created. If we could couple that with some profiling in the Alerts Framework to give timing information back to the users, we might be able to guide them to create more performant alerts.
IIRC @Kerry350 has been thinking about simulating the queries for a while.
👋 👋 Yeah, I've been wanting to add query simulation for a while (release one 😂). There are so many cases in SDHs and so on where a simple copy-pasta of the query ES is executing would have been enough. I planned to dig into this for my Space, Time. Being able to couple that with some Alerting profiling would be pretty cool. The Alerting team have a ticket for adding an "explain" functionality, which is what we want to aid in these debugging scenarios (
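As a rough illustration of the profiling side of this idea (not an existing Alerting framework API), a small timing wrapper around the executor's search calls could collect per-query durations to report back to the user:

```ts
// Generic sketch of collecting per-query timings during a (simulated) rule execution.
// This is not an existing Alerting framework API, just an illustration of the idea.
interface QueryTiming {
  label: string;
  durationMs: number;
}

async function withTiming<T>(
  label: string,
  timings: QueryTiming[],
  run: () => Promise<T>
): Promise<T> {
  const start = Date.now();
  try {
    return await run();
  } finally {
    timings.push({ label, durationMs: Date.now() - start });
  }
}

// Usage sketch: wrap each page fetch and report the totals back to the user.
// const timings: QueryTiming[] = [];
// const page = await withTiming('composite page 1', timings, () => runSearch(body));
```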
After working on "Check and warn about high cardinality of the grouping field when creating the job." for a few days and speaking with @weltenwort, my conclusion is that it is hard to offer meaningful up-front warnings, since the final performance will depend on the deployment size and load during execution (which also raises the question of when to perform such health checks). Simply having a high cardinality field that will create a lot of pages does not mean the performance will be degraded, because "high" needs to be put in relation to the cluster size and load. For a small deployment, 10 pages of 100 buckets each might kill it, while a large deployment chews right through it. While we can make progress by changing how we process the buckets to avoid going OOM or blocking the Kibana JavaScript thread for too long, it feels like we also need some way to alert if the performance of a rule itself is starting to degrade (Rule A takes on average longer to execute than its interval setting) or the cluster health is degrading as a result of a rule executing (when Rule A is executing, memory consumption spikes 9 out of 10 times).
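For completeness, the cardinality pre-check itself is straightforward; the hard part, as described above, is choosing a threshold that is meaningful for a given cluster. A minimal sketch, assuming a generic `runSearch` helper:

```ts
// Hypothetical pre-flight check; any threshold applied to this number is exactly the
// part that is hard to choose without knowing the cluster's size and load.
async function estimateGroupCardinality(
  runSearch: (body: Record<string, unknown>) => Promise<any>,
  groupByField: string
): Promise<number> {
  const response = await runSearch({
    size: 0,
    aggs: {
      group_cardinality: {
        cardinality: { field: groupByField }, // approximate distinct count
      },
    },
  });
  return response.aggregations?.group_cardinality?.value ?? 0;
}
```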
It's possible that the changes made by @simianhacker in this PR -> #121904 can be implemented here as well.
We'd likely have to check if 10k is a good number for the log threshold executor and do a similar performance investigation of that code to find the bottlenecks to remove. Should we update this issue for that or create a new issue for that work? @jasonrhodes
From my perspective, I think we need to start assuming that the size value should start much, much bigger unless we have a specific reason to suspect that it will cause "too many buckets" errors. But yes, we should do some spot checking for sure. I think a new ticket would be a good idea, although I think @simianhacker has already done some work here on implementing this across some of the other rules? The main things I think we should focus on for now are:
(1) needs perf evaluation. If someone can make some tickets to work on this, that'd be great, but if not I will put it on my list to break this down and we can plan it for an upcoming cycle.
I can create those tickets.
@jasonrhodes there already is an issue (#122550) for adding cancellation on the @elastic/actionable-observability board. We should clarify who tackles that and make sure it doesn't collide with the executor refactoring required for the
@jasonrhodes I created an issue for the optimized querying in #124130. Let me know what you think.
Yeah, I think #124130 in particular just replaces this one.
Problem description
When the user creates a log threshold alert with a "group by" field of large cardinality, the alert executor will paginate through a large number of `composite` aggregation pages. This can use and possibly exceed the resources available in Elasticsearch and Kibana and thereby negatively impact the availability of the service. Additionally, the alert execution might time out and miss alerts that should have been fired. On top of that, the query used for checking the condition prioritizes correctness over performance by filtering out non-matching groups as late as possible. This allows for checking zero-count thresholds, but prevents Elasticsearch from optimizing the query more aggressively.
Possible solutions
Optimize the query for comparators that don't need to detect zero-count groups (move the condition filter from below the `composite` agg to the global `bool` filter).
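As a sketch of that optimization for "more than" comparators (field names and sizes illustrative): the condition filter moves into the top-level `bool` query so Elasticsearch can discard non-matching documents before grouping, at the cost of no longer seeing zero-count groups.

```ts
// Sketch only: filter applied in the query instead of under the composite agg.
// Groups with no matching documents are never returned, so this only suits
// comparators that don't need to detect zero-count / disappearing groups.
const earlyFilteredRequest = {
  size: 0,
  query: {
    bool: {
      filter: [
        { range: { '@timestamp': { gte: 'now-5m' } } },
        { term: { 'log.level': 'error' } }, // previously a filter sub-aggregation
      ],
    },
  },
  aggs: {
    groups: {
      composite: {
        size: 2000, // larger page size as discussed above
        sources: [{ group: { terms: { field: 'host.name' } } }],
      },
    },
  },
};
```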