Understanding Field capabilities API performance limitations #76509
Comments
I think the most relevant issue here is #74648, which, if implemented, would reduce the request count massively in the problematic cases, thus also reducing auth overhead. I think we can close here since we have a path forward for a fix in that issue?
Thanks, that's extremely helpful and gives me some optimism. I'd prefer to keep this issue open, as I'd benefit from having a central location where this issue is discussed. I've seen multiple cases where Kibana isn't working due to slow field caps responses. It seems that there's been a reasonable amount of activity around resolving this problem that I've been completely unaware of.
We have been hit hard by this issue since we upgraded to 7.11.2 back in June. We had a customer with one cluster which had created an index pattern matching some 6000+ indices. Frequently, whenever the customer accessed this index pattern, we would see nodes leave the cluster, and in the audit logs we would see a huge spike of FieldCapabilitiesIndexRequest during the crash. That has led us to ask the customer to redesign their index strategy, but that is not easily done and takes a large amount of time. I can provide the Elastic support case ID if you are interested.
Pinging @elastic/es-search (Team:Search)
I've been intending to look into #74648 for some time -- I'll bump up its priority given the problems we've been facing.
In response to @mattkime's request for a model, here's our understanding of the cause of the slowness:
I think we need to look into both sources of slowness to resolve the issue. Here's a proposal:
For context, we've already merged these changes to prevent the slow field caps requests from causing cluster instability. They will be available in 7.14.1 and 7.15. I think we aren't planning more changes from a stability standpoint (@original-brownbear feel free to correct this):
I think this is going to be a big improvement. We've noticed this in particular on indices that use ECS, such as APM. ECS contains ~1200 fields, and with ILM rollover it creates quite a few indices (in particular, it's rolling over empty indices for older versions, similar to https://github.com/elastic/elasticsearch/issues/733490), so it adds up quickly. But the mapping is 99% the same.
@jtibshirani do you think we should open a spin-off meta issue that describes the plan you listed above?
@javanna this is a good question. I think we could track this work under our shard scalability efforts: #77466. @original-brownbear -- are there any unresolved items here that you think we should add to the shard scalability meta issue? If your benchmarks haven't shown that a change is important, we could consider those as "won't fix" for now. (We could still keep #78665 separate, since that's specific to the CCS case.)
A field caps request still consumes considerable memory when executed across a large number of indices. Also, for nodes holding a lot of indices with large mappings, the transport messages shipped around are still quite sizeable. As far as I understand it, we could improve both of these issues by merging the responses on the data nodes instead of aggregating all of them on the coordinating node. If that's still a reasonable plan, should I add an issue for it? I think this is not something to act on in the next couple of weeks, because at the moment cluster size is still practically gated at values that keep field caps response times under 10s, but I wouldn't consider it "won't fix".
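To make that concrete, here is a minimal sketch in Python of the difference between the two merge strategies. The function name and data shapes are simplified illustrations, not the actual Elasticsearch implementation:

```python
from collections import defaultdict

def merge_field_caps(per_index_caps):
    """Merge per-index field capabilities into a single response.

    per_index_caps: {index_name: {field_name: {"type": ..., "searchable": ...,
                                               "aggregatable": ...}}}
    Returns roughly the shape of a _field_caps response body (simplified:
    no metadata fields, no conflicts section).
    """
    merged = defaultdict(lambda: defaultdict(lambda: {
        "searchable": True, "aggregatable": True, "indices": []}))
    for index, fields in per_index_caps.items():
        for name, caps in fields.items():
            entry = merged[name][caps["type"]]
            # A field is only searchable/aggregatable overall if it is so in
            # every index that maps it with this type.
            entry["searchable"] &= caps["searchable"]
            entry["aggregatable"] &= caps["aggregatable"]
            entry["indices"].append(index)
    return {field: dict(types) for field, types in merged.items()}

# Today the coordinating node effectively runs a merge like this over one
# entry per matching index (potentially thousands). The proposal above is for
# each data node to run the same merge over its *local* indices first and ship
# one partial result per node, so the coordinating node combines N_nodes
# partials instead of N_indices responses -- which is where both the memory
# and transport-size savings would come from.
```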
I should also point out that
is now relatively trivial to do. We track the sha256 of each mapping (hashed via a relatively stable algorithm) and deduplicate identical mappings.
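For illustration, a minimal sketch of that hash-based deduplication idea (hypothetical helper names; not the actual Elasticsearch code):

```python
import hashlib
import json

def mapping_hash(mapping: dict) -> str:
    """Hash a mapping using a stable serialization (sorted keys) so that
    semantically identical mappings produce identical digests."""
    canonical = json.dumps(mapping, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def deduplicate_mappings(mappings_by_index: dict) -> tuple[dict, dict]:
    """Return (index -> hash, hash -> mapping): each distinct mapping is kept
    and processed once, however many rollover indices share it."""
    index_to_hash, unique = {}, {}
    for index, mapping in mappings_by_index.items():
        digest = mapping_hash(mapping)
        index_to_hash[index] = digest
        unique.setdefault(digest, mapping)
    return index_to_hash, unique
```

This is why near-identical ECS/rollover indices are cheap to handle once hashes are available: thousands of indices collapse into a handful of distinct mappings.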
@original-brownbear sounds good to add an issue for merging responses, thanks! Feel free to ping the search team when you'd like us to pick up the work. If everything is tracked in #77466, then I'll close this out.
Oops, I commented at the same time as you. Good to know about the mapping hash. I think we could put that to use when collecting and merging responses.
Closing in favor of these issues:
The first one is tracked in this meta issue: #77466.
Since 7.11, Kibana no longer caches field caps API responses for index patterns (elastic/kibana#82223); the field caps API is called whenever a Kibana index pattern is loaded, such as on page load or when navigating between Kibana apps.
This works well for the vast majority of our users, but there are cases where the field caps response can take >20s. It seems to occur when a large number (thousands) of indices are matched, there are a large number (thousands) of fields, and RBAC is in use. When the response takes too long, Kibana appears to be broken and the requests can overwhelm the cluster. We consider <1s responses to be performant.
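For reference, this is roughly the kind of request involved, sketched with the official Python client (the endpoint and the `logs-*` pattern are placeholders):

```python
from elasticsearch import Elasticsearch  # official Python client

es = Elasticsearch("http://localhost:9200")  # endpoint/auth are placeholders

# Roughly what Kibana triggers when an index pattern is loaded:
#   GET /logs-*/_field_caps?fields=*
# "logs-*" is a made-up pattern; with ILM rollover it can expand to thousands
# of concrete indices, each of which contributes its full field list before
# the per-field capabilities are merged into a single response.
resp = es.field_caps(index="logs-*", fields="*")
print(len(resp["fields"]), "distinct fields in the merged response")
```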
We need someone from the ES team to create a model that describes the performance implications of each of these factors (number of indices, number of fields, use of RBAC) in order to resolve these problems and prevent them from occurring.
Original attempt at this discussion - #59581
Real world kibana use case that drove decision not to cache field caps response - elastic/kibana#71787 (comment)