-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: Support unmapped fields in search 'fields' option #64651
Conversation
Currently, the 'fields' option only supports fetching mapped fields. Since 'fields' is meant to be the central place to retrieve document content, it should allow for loading unmapped values. This change adds implementation and tests for this addition. Closes elastic#63690
@jtibshirani maybe you can give some quick feedback if this goes into the right direction. I left out any details around enabling fetching the unmapped fields (whether by default or though a flag, or how the API looks like in detail). Also some decisions regarding edge cases I took that we might discuss again:
I'm still going to look at whether the XContentMapValues#filter functionality that is currently also used for source filtering can be used instead of the implementation here. I'm not sure that it would be more performant but will take a look. |
The failing test above shows another interesting edge case with "flattened" fields. The current mechanism assumes that we cannot retrieve the internals of the "flattened" fields content. With the ability to look up from "source" these can be returned though. See the wildcard test case for details. When searching for "flat*" we return both "flattened" and "flattened.some_field" in this case. Could be seen more as a feature than a bug though. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cbuescher this looks like a nice start to me. I added some early thoughts on the implementation.
|
||
public FetchFieldsContext(List<FieldAndFormat> fields) { | ||
public FetchFieldsContext(List<FieldAndFormat> fields, boolean includeUnmapped) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something to think about: if we end up making this configurable through a flag like include_unmapped
, we could introduce a top-level parameter as we do here, or instead support the flag alongside each field pattern (so it'd be part of FieldAndFormat
).
} | ||
Function<Map<String, ?>, Map<String, Object>> filter = XContentMapValues.filter( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re-using some source filtering logic seems to be a good fit. A couple ideas:
- Perhaps we shouldn't add the concrete field names as exclusions. The automata used for filtering have limits on the number of states, and I think we could run into these limits if the field patterns resolve to a large number of concrete fields. Here's an example exception from source filtering: No support for max_determinized_states in _source: includes #53739.
- A slightly different approach would be to filter and collect the field entries at the same time, instead of doing two passes. This would involve writing a similar method to
XContentMapValues#filter
, but that flattens and collects the key-value pairs too (maybe something likeXContentMapValues#filterAndFlatten
?) I'm not sure this will turn out more cleanly, but wanted to mention the idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the feedback, I'll need to better understand at how XContentMapValues#filter works to see if and how this could work.
if (value instanceof Map) { | ||
collectAllPaths(currentPath + ".", (Map<String, Object>) value, documentFields); | ||
} else { | ||
DocumentField f; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this might miss lists of objects? Looking at XContentMapValues#filter
, there's a special recursive case for lists.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will take a look but though I have a test for this (essentially array values?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, do you mean walking e.g. Lists of objects to their leaf values? e.g.
{
"foo" : [ { "f1" : "value1" }, { "f2" : "value2" } ]
}
How does the current "fields" lookup work for this, e.g. what does the path look like for the "f1" value= "foo.1.f1" ? Maybe I'm just confused here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the confusion, I read the logic too quickly before. We won't miss anything, but the result structure could be surprising.
For mapped fields, the 'fields' option always flattens arrays of objects. For example, given a document like
"object": [{ "field": "value1"}, {"field": "value2" }]
a request for "fields": ["object.field"]
will return
"fields": { "object.field": ["value1", "value2"]}
With this logic, for unmapped fields we'll return
"fields": { "object": [{"field": "value1"}, {"field": "value2"} ]}
Perhaps this behavior is inconsistent, I think we'll still want to flatten arrays of objects (especially if we already flatten objects when they're not in array?) I'm definitely up for discussing this though, it's a question we noted on the original issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand now and totally agree. I think I ran into this working on improvements on Friday anyway, should be part of my next update here.
@@ -154,5 +154,6 @@ | |||
|
|||
- match: { hits.total.value: 1 } | |||
- length: { hits.hits: 1 } | |||
- length: { hits.hits.0.fields: 1 } | |||
- length: { hits.hits.0.fields: 2 } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jtibshirani maybe you have an opinion on this. Seems like we currently see all substructure under a mapped flattened fields as a "leaf" object, regardless of what it contains. With the current implementation of fetching unmapped fields we'd be able to descend into those objects and return them. Or the other way round, if we don't do that, we somehow have to mark the mapped "flattened" nodes in "_source" and stop walking the tree there later when retrieving the unmapped values. Not sure what users would expect here typically, so I went with adapting the expectations in the test for now. Happy to discuss though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This approach seems okay to me for now -- in some sense the sub-fields of the flattened field are in fact 'unmapped'. However I agree the behavior of flattened fields is generally tricky, I've made a note on the meta issue (#60985) to think through how to best handle flattened data.
return documentFields; | ||
} | ||
|
||
private void collect(Map<String, DocumentField> documentFields, Map<String, Object> source, String parentPath, int lastState) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This turned out to be a bit complex, I'm not sure my idea to do it all in one pass was a good one :) Maybe we could start by re-using the source filtering logic, optimizing later if we see a performance reason ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After looking at what the source filtering code does I actually find this solution less complicated. We are doing less work here than in the source filtering code where whole sub-trees of the source can be filtered out an the exclusion logic more complicated. These two methods that are pulling out the data from the source map while keeping track of where they are in the source tree via the inclusion pattern automaton are not recursive but not that long, and they are implementational details that we don't need to expose. On the other hand, using the source filtering code that was written for another purpose initially on the other hand might lead to problems when that code needs to be altered in the future. Having our "own" logic here doesn't seem to much overhead to me tbh.
@jtibshirani I'm going to close this draft in favour of a cleaned up, squashed version that includes additional docs and some more test at #65386. Hope this is fine by you. Thanks for the reviews so far here. |
Currently, the 'fields' option only supports fetching mapped fields. Since
'fields' is meant to be the central place to retrieve document content, it
should allow for loading unmapped values.
This change adds implementation and tests for this addition.
Closes #63690
Opening an early WIP for feedback