Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should we limit the number of fields that can be retrieved by the “fields” API #69983

Closed
cbuescher opened this issue Mar 4, 2021 · 4 comments
Labels
>enhancement :Search/Search Search-related issues that do not fall into other categories Team:Search Meta label for search team v7.13.0 v8.0.0-alpha1

Comments

@cbuescher
Copy link
Member

Currently we don’t limit the number of fields that can be retrieved using the “fields” API.
The original reasoning was that field values are retrieved from an already loaded “source”, so the actual lookup from the source map should come with relatively small cost.

In order to add of the ability to include unmapped fields, we are making use of Automata to match field patterns that have the “include_unmapped” option set. We do this because we don’t know which unmapped leave values the source contains and want to be able to efficiently match wildcard field paths while traversing the source. These Automata by default come with a limit on the number of states they can have (by default 10000) in order to prevent unexpected memory consumption. This is quite sufficient when we have a small number of “fields” pattern with “include _unmapped” set, as should normally be the case.

However, it is possible to exceed this size limit when using the “_fields” API with hundreds or thousands of field patterns that all have the "include_unmapped” option turned on, in which case the request will fail.
This led us to think about whether we should put a limit on the number of fields (or field patterns?) that the API can retrieve, which could be a dynamic index setting like the ones we e.g. have for doc value fields (index.max_docvalue_fields_search).

There are some questions here:

  • apart from the request and response size, retrieving tons of fields should not be considered as costly as e.g. doc_values lookup. The discussion was triggered by the discussion about the right sizing of the automaton used only in the context of field patterns with the “include_unmapped” option
  • would we want to limit the number of field pattern the users sends in the API request or the number of fields the patterns are resolved to? “*” is just one pattern that can return thousand of fields, but that should be not be the problem
  • with a limit on the number of field pattern, an estimation on the worst-case automaton size needed would still be a rough guess based on estimated length of field names.
  • we definitely want some limit of the automaton size because its better to error than to OOM

If the main motivation for introducing any limit here is the potential danger of reaching the size limit of the automaton used for “include_unmapped” fields, I think we can lower that risk even more by limiting its use to field patterns using wildcards. Concrete field paths (as in the case when enumerating known field names) can be directly looked up from source without using the automaton. I wonder if this leaves many non-esoteric use cases in which we would run into a size limit.

Relates to #60985

@cbuescher cbuescher added >enhancement :Search/Search Search-related issues that do not fall into other categories v8.0.0 v7.13.0 labels Mar 4, 2021
@elasticmachine elasticmachine added the Team:Search Meta label for search team label Mar 4, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search (Team:Search)

cbuescher pushed a commit to cbuescher/elasticsearch that referenced this issue Mar 9, 2021
Currently we use a CharacterRunAutomaton build from all field pattern that have
the "include_unmapped" option set. This can lead to unnecessarily large automata
for cases where the user unessesarily list all known fields and adds the
"include_unmapped" option. We only really need the automaton for pattern that
contain wildcards and can look up any other field path directly and only need to
do so if the field path isn't indeed mapped.

Relates to elastic#69983
@lukeelmers
Copy link
Member

lukeelmers commented Mar 10, 2021

This led us to think about whether we should put a limit on the number of fields (or field patterns?) that the API can retrieve, which could be a dynamic index setting like the ones we e.g. have for doc value fields (index.max_docvalue_fields_search).

Just want to mention that if we introduce a limit similar to docvalue fields, it will cause a regression in Kibana (specifically elastic/kibana#22897 will become an issue for Discover again). As @jtibshirani explained in elastic/kibana#75813, Kibana switched from using docvalue_fields to fields in order to avoid hitting this limit.

I'm not suggesting that there should be no limit whatsoever, but wanted to add this to your list of considerations as the team weighs the best path forward.

cc @timroes @ppisljar

Edit: Just saw this comment from the linked PR -- I guess this means no limit is planned for the time being?

@cbuescher
Copy link
Member Author

Just saw this comment from the linked PR -- I guess this means no limit is planned for the time being?

Correct, I think we tend to not introduce a soft or hard limit at the moment, however this issue is here to discuss this. I'm aware that should we introduce anything along these lines that it is important to consider Kibana and other users of the API.

cbuescher pushed a commit that referenced this issue Mar 16, 2021
Currently we use a CharacterRunAutomaton build from all field pattern that have
the "include_unmapped" option set. This can lead to unnecessarily large automata
for cases where the user unessesarily list all known fields and adds the
"include_unmapped" option. We only really need the automaton for pattern that
contain wildcards and can look up any other field path directly and only need to
do so if the field path isn't indeed mapped.

Relates to #69983
cbuescher pushed a commit that referenced this issue Mar 16, 2021
Currently we use a CharacterRunAutomaton build from all field pattern that have
the "include_unmapped" option set. This can lead to unnecessarily large automata
for cases where the user unessesarily list all known fields and adds the
"include_unmapped" option. We only really need the automaton for pattern that
contain wildcards and can look up any other field path directly and only need to
do so if the field path isn't indeed mapped.

Relates to #69983
@cbuescher
Copy link
Member Author

Closed by #69984

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>enhancement :Search/Search Search-related issues that do not fall into other categories Team:Search Meta label for search team v7.13.0 v8.0.0-alpha1
Projects
None yet
Development

No branches or pull requests

4 participants