Add circuit breaker for aggregation memory limit #67474
Pinging @elastic/es-analytics-geo (Team:Analytics)
@not-napoleon A request circuit breaker trip is more of a final protective check and is slow to react. What's more, after the breaker trips it rejects all requests, no matter whether they are small or large. In most cases users need a rapid response from ES; if a request is too large, they still want to get the rejection message quickly. @williamrandolph Is this a cluster-level setting that would apply to each request context, like …?
I see. Essentially, you're asking for a way to cap a large aggregation's memory usage without exhausting the available cluster resources. For example, you might want to say that one aggregation can never use more than 10% of memory (just making up a number, that's not a recommendation). This would then kill large requests while allowing smaller ones to complete, which is not the case with the current request circuit breaker. I'm not sure this would give a rejection message faster than the current approach, though: in general, aggregation memory usage is highest during the reduce phase, which happens fairly late in the request life cycle, and adding a per-request breaker doesn't really change that.
@not-napoleon Thanks for reading my comment and replying. I think you totally get my point.

**Datanode breaker trip:** In our production, we have ES nodes with 30GB heaps. When we run a high-cardinality query and rely on the circuit breaker, it takes 90 seconds before the parent circuit breaker trips (real memory exceeds 29.4GB). If we rely on `search.max_buckets` instead, it only takes a few seconds to raise the exception.
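The two behaviours contrasted above map onto existing cluster settings; a minimal sketch, assuming the default setting names (the values are illustrative, not recommendations):

```
# Proactive limit: fail a search quickly once its aggregations
# would create more than N buckets
PUT _cluster/settings
{
  "persistent": {
    "search.max_buckets": 10000
  }
}

# Reactive limit: the parent breaker only trips after real heap
# usage crosses this fraction of the configured heap
PUT _cluster/settings
{
  "persistent": {
    "indices.breaker.total.limit": "95%"
  }
}
```

The parent breaker has to wait for actual heap pressure, which is why it can take tens of seconds to react, while `search.max_buckets` is checked as buckets are allocated.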
**Coordinator breaker trip:** As you said, aggregation memory usage is highest during the reduce phase, which happens fairly late in the request life cycle, so adding a per-request breaker doesn't really change that. According to our tests, though, the part that consumes the most memory is holding the shard responses from all shards, especially when `batched_reduce_size` is not tuned. #67478 is already trying to fix this issue, but it is not a per-request breaker. And if we could break only the big query and let small queries run smoothly, that would be a benefit too.
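On the coordinator side, the buffering described above can already be bounded with the `batched_reduce_size` search parameter, which controls how many shard responses are buffered before a partial reduce; a sketch (the index name, field, and sizes are made up):

```
GET my-index/_search?batched_reduce_size=64
{
  "size": 0,
  "aggs": {
    "by_user": {
      "terms": { "field": "user.id", "size": 10000 }
    }
  }
}
```

Lowering `batched_reduce_size` trades extra partial-reduce passes for a smaller peak number of buffered shard responses on the coordinating node.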
Thanks for taking the time to reply. We discussed this as a team, and currently we do not want to head in this direction. It's not at all clear that killing large queries in favor of smaller queries is a good general policy; in our opinion, the existing aggregation circuit breaker covers the majority of cases. If we were to take up something like this in the future, I think it would have to come from a different direction. Elasticsearch doesn't currently support (or have a plan to support) per-user usage limiting, but I could see that as a more reasonable direction for something like this. Overall, we want to support the ability to run very large queries as needed, and don't think killing a query just because it is large, when we might otherwise be able to serve it, is a good policy. The assumption that small queries are more important than large ones doesn't hold in all cases.
Add a circuit breaker that limits the amount of memory an aggregation can use, for example:

`breaker.search.aggregation_memory.limit`

This should work better than `search.max_buckets`, as it would be more dynamic.
This issue was first raised as part of #62457 .
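For illustration only: the setting proposed above does not exist in Elasticsearch, but if it did, usage might look something like this (the setting name comes from the proposal; the value is hypothetical):

```
PUT _cluster/settings
{
  "persistent": {
    "breaker.search.aggregation_memory.limit": "10%"
  }
}
```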