-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve lookup for "include_unmapped" field pattern #69984
Improve lookup for "include_unmapped" field pattern #69984
Conversation
Pinging @elastic/es-search (Team:Search) |
Briefly talked to the team and @jimczi about this with respect to the question of whether we need any additional limit on the number of field patterns retrieved and we agreed that in addition to making the "inlcude_unmapped" option using the automaton less often we should also raise the default number of allowed states, but don't need an explicit limit (as in a setting etc...) with these to actions in place any more. |
Currently we use a CharacterRunAutomaton build from all field pattern that have the "include_unmapped" option set. This can lead to unnecessarily large automata for cases where the user unessesarily list all known fields and adds the "include_unmapped" option. We only really need the automaton for pattern that contain wildcards and can look up any other field path directly and only need to do so if the field path isn't indeed mapped. Relates to elastic#69983
a7ccdfb
to
11a29c4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me, just a couple small questions + comments.
/** | ||
* Default maximum number of states in the automaton that looks up unmapped fields. | ||
*/ | ||
private static final int AUTOMATON_MAX_DETERMINIZED_STATES = 100000; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For context, I'm wondering if there's reasoning behind choosing 100,000?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not in particular. It was one of the numbers brought up. Should provide ample headroom, fortunetely the automanon grabs memory lazily so this should never use up much space in normal cases.
@@ -846,6 +847,23 @@ public void testLastFormatWins() throws IOException { | |||
assertThat(fields.get("date_field").getValues().get(1), equalTo("12")); | |||
} | |||
|
|||
/** | |||
* Field pattern retrieved with "inlcude_unmapped" use an automaton with a maximal allowed size internally. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple small typos: pattern -> patterns, inlcude_unmapped -> include_unmapped.
Map<Boolean, List<String>> partitions = unmappedFetchPattern.stream() | ||
.collect(Collectors.partitioningBy((s -> Regex.isSimpleMatchPattern(s)))); | ||
List<String> unmappedWildcardPattern = partitions.get(true); | ||
List<String> unmappedConcreteFields = partitions.get(false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we could add a short comment about why we are doing this? It is not so obvious, as it addresses a bit of an edge case (specifying many concrete field paths, but also requesting include_unmapped: true
).
@jtibshirani thanks for the review, I pushed some updates and hopefully answered your questions |
@@ -148,6 +164,17 @@ private FieldFetcher( | |||
if (this.unmappedFieldsFetchAutomaton != null) { | |||
collectUnmapped(documentFields, sourceLookup.source(), "", 0); | |||
} | |||
if (this.unmappedConcreteFields != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small comment, rereading this maybe it makes sense to move this into collectUnmapped
? That way all the unmapped logic is consolidated into one method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@jtibshirani thanks, I pushed an update |
Currently we use a CharacterRunAutomaton build from all field pattern that have the "include_unmapped" option set. This can lead to unnecessarily large automata for cases where the user unessesarily list all known fields and adds the "include_unmapped" option. We only really need the automaton for pattern that contain wildcards and can look up any other field path directly and only need to do so if the field path isn't indeed mapped. Relates to #69983
@jtibshirani thanks, I merged to master and 7.x |
Currently we use a CharacterRunAutomaton build from all field pattern that have
the "include_unmapped" option set. This can lead to unnecessarily large automata
for cases where the user unessesarily list all known fields and adds the
"include_unmapped" option. We only really need the automaton for pattern that
contain wildcards and can look up any other field path directly and only need to
do so if the field path isn't indeed mapped.
Relates to #69983