Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve lookup for "include_unmapped" field pattern #69984

Merged

Conversation

cbuescher
Copy link
Member

Currently we use a CharacterRunAutomaton build from all field pattern that have
the "include_unmapped" option set. This can lead to unnecessarily large automata
for cases where the user unessesarily list all known fields and adds the
"include_unmapped" option. We only really need the automaton for pattern that
contain wildcards and can look up any other field path directly and only need to
do so if the field path isn't indeed mapped.

Relates to #69983

@cbuescher cbuescher added >enhancement :Search/Search Search-related issues that do not fall into other categories v8.0.0 v7.13.0 labels Mar 4, 2021
@cbuescher cbuescher requested a review from jtibshirani March 4, 2021 16:52
@elasticmachine elasticmachine added the Team:Search Meta label for search team label Mar 4, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search (Team:Search)

@cbuescher
Copy link
Member Author

Briefly talked to the team and @jimczi about this with respect to the question of whether we need any additional limit on the number of field patterns retrieved and we agreed that in addition to making the "inlcude_unmapped" option using the automaton less often we should also raise the default number of allowed states, but don't need an explicit limit (as in a setting etc...) with these to actions in place any more.

Currently we use a CharacterRunAutomaton build from all field pattern that have
the "include_unmapped" option set. This can lead to unnecessarily large automata
for cases where the user unessesarily list all known fields and adds the
"include_unmapped" option. We only really need the automaton for pattern that
contain wildcards and can look up any other field path directly and only need to
do so if the field path isn't indeed mapped.

Relates to elastic#69983
Copy link
Contributor

@jtibshirani jtibshirani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, just a couple small questions + comments.

/**
* Default maximum number of states in the automaton that looks up unmapped fields.
*/
private static final int AUTOMATON_MAX_DETERMINIZED_STATES = 100000;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For context, I'm wondering if there's reasoning behind choosing 100,000?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not in particular. It was one of the numbers brought up. Should provide ample headroom, fortunetely the automanon grabs memory lazily so this should never use up much space in normal cases.

@@ -846,6 +847,23 @@ public void testLastFormatWins() throws IOException {
assertThat(fields.get("date_field").getValues().get(1), equalTo("12"));
}

/**
* Field pattern retrieved with "inlcude_unmapped" use an automaton with a maximal allowed size internally.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple small typos: pattern -> patterns, inlcude_unmapped -> include_unmapped.

Map<Boolean, List<String>> partitions = unmappedFetchPattern.stream()
.collect(Collectors.partitioningBy((s -> Regex.isSimpleMatchPattern(s))));
List<String> unmappedWildcardPattern = partitions.get(true);
List<String> unmappedConcreteFields = partitions.get(false);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could add a short comment about why we are doing this? It is not so obvious, as it addresses a bit of an edge case (specifying many concrete field paths, but also requesting include_unmapped: true).

@cbuescher
Copy link
Member Author

@jtibshirani thanks for the review, I pushed some updates and hopefully answered your questions

@@ -148,6 +164,17 @@ private FieldFetcher(
if (this.unmappedFieldsFetchAutomaton != null) {
collectUnmapped(documentFields, sourceLookup.source(), "", 0);
}
if (this.unmappedConcreteFields != null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small comment, rereading this maybe it makes sense to move this into collectUnmapped? That way all the unmapped logic is consolidated into one method.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@cbuescher
Copy link
Member Author

@jtibshirani thanks, I pushed an update

@cbuescher cbuescher merged commit c2ae7cb into elastic:master Mar 16, 2021
cbuescher pushed a commit that referenced this pull request Mar 16, 2021
Currently we use a CharacterRunAutomaton build from all field pattern that have
the "include_unmapped" option set. This can lead to unnecessarily large automata
for cases where the user unessesarily list all known fields and adds the
"include_unmapped" option. We only really need the automaton for pattern that
contain wildcards and can look up any other field path directly and only need to
do so if the field path isn't indeed mapped.

Relates to #69983
@cbuescher
Copy link
Member Author

@jtibshirani thanks, I merged to master and 7.x

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>enhancement :Search/Search Search-related issues that do not fall into other categories Team:Search Meta label for search team v7.13.0 v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants