-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow metadata fields in the _source #61590
Conversation
Pinging @elastic/es-search (:Search/Mapping) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that DocumentParser
will require some additional changes -- it looks like FieldMapper#parse
will never called on a metadata field, even if isAllowedInSource
is enabled and the value is present in the _source. It'd be great to add a test using a dummy MetadataFieldMapper
to check that everything is wired up correctly.
Added test testDocumentContainsAllowedMetadataField()
Fixed broken tests
@elasticmachine run elasticsearch-ci/packaging-sample-windows |
server/src/test/java/org/elasticsearch/index/mapper/MockMetadataMapperPlugin.java
Show resolved
Hide resolved
/** | ||
* @return Whether a metadata field can be included in the document _source. | ||
*/ | ||
default boolean isAllowedInSource() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if we really need this new flag isAllowedInSource
. Maybe we could avoid adding a new flag and instead do the following:
- For metadata fields that are not allowed in _source, make sure that
MetadataFieldMapper#parse
throws an descriptive error. - Many metadata fields currently have logic in
parse
that's unrelated to _source parsing. We could make sure to move it into the special methodspreParse
orpostParse
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jtibshirani I agree with you that delegating this to MetadataFieldMapper#parse()
would be a simpler solution. However I see two caveats:
-
The approach with
isAllowedInSource()
method detects that a metadata field is being parsed when parsing the field name. This is early in the parsing process. An exception is thrown and the field value parsing is skipped. -
As you mention, I have seen classes (such as
IgnoredFieldMapper
andVersionFieldMapper
) that overrideparse()
method to do nothing on one hand, while callingsuper.parse()
frompreParse()
/postParse()
methods. This approach is too complicated imho, but I am not sure that refactoring this is very simple. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is early in the parsing process. An exception is thrown and the field value parsing is skipped.
I don't think we need to optimize performance in this case because it's an error condition (and should be a rare one too).
This approach is too complicated imho, but I am not sure that refactoring this is very simple.
I agree that the logic is complex/ hard to read! I haven't deeply looked into the refactoring myself -- maybe you could try it out and we can rediscuss the approach if you see a roadblock ?
throw new MapperParsingException("Field [" + currentFieldName + "] is a metadata field and cannot be added inside" | ||
+ " a document. Use the index API request parameters."); | ||
if (context.mapperService().isFieldAllowedInSource(context.path().pathAsText(currentFieldName))) { | ||
// If token is a metadata field and is allowed in source, parse its value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the reasoning for rejecting null and non-leaf values here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jimczi and I had a discussion about the possible use cases we want to cover with this feature in the near future and agreed that for the sake of simplicity we should only accept non-null values (no objects or arrays).
I don't have any strong opinions and I am happy to revisit this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think null should be accepted but non-leaf values seemed more challenging. Now that I re-think about it, it shouldn't be an issue as long as we consume the value from the root (object or not).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So +1 to handle object, arrays and null values and let the metadata field mapper deals with it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, let me work on this again
Delegated this functionality to MetadataFieldMapper.parse()
@jtibshirani I pushed changes that address your comments. Please let me know what you think. :) |
} | ||
|
||
@Override | ||
protected void parseCreateField(ParseContext context) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of overriding parse
here, could we override parseCreateField
?
@Override
protected void parseCreateField(ParseContext context) throws IOException {
throw new MapperParsingException("Field [" + name() + "] is a metadata field and cannot be added inside"
+ " a document. Use the index API request parameters.");
}
Then we could do the following:
- For meta fields that cannot be specified in _source, move all the relevant logic out of
parseCreateField
and intopreParse
orpostParse
as appropriate. - Remove the new
doParse
method, as it's no longer needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing this out. I had a hard time thinking a simple and clean way to refactor this part.
Initially, I tried the approach you suggested, but FieldMapper#parse()
does all the exception handling by encapsulating all exceptions thrown by parseCreateField()
offering a preview of the parsed value.
(See
elasticsearch/server/src/main/java/org/elasticsearch/index/mapper/FieldMapper.java
Line 249 in 9a127ad
} catch (Exception e) { |
This means that the
throw new MapperParsingException("Field [" + name() + "] is a metadata field and cannot be added inside"
+ " a document. Use the index API request parameters.");
will be encapsulated in the following exception:
throw new MapperParsingException("failed to parse field [{}] of type [{}] in document with id '{}'. " +
"Preview of field's value: '{}'", e, fieldType().name(), fieldType().typeName(),
context.sourceToParse().id(), valuePreview);
On the other hand, methods preParse()
and postParse()
delegate parsing to parse()
for the same reason (to handle parsing exception). If we were to move all parsing in those two methods, we would have to replicate exception handling in those methods as well.
Since all FieldMapper#parse()
does is to delegate parsing to parseCreateField()
and only do the exception handling, I decided to redirect this functionality in 'MetadataFieldMapper#doParse()and call this method from
preParse()and
postParse()`.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To me that approach you tried/ I suggested is the cleanest way to go. My reasoning...
This means that the ... will be encapsulated in the following exception:
This seems like an okay compromise to me, all the exception information is there so the user can determine the cause. It's a little confusing that we wrap the message, but this should be a rare situation.
On the other hand, methods preParse() and postParse() delegate parsing to parse() for the same reason (to handle parsing exception).
I think that exception handling code is only really helpful when dealing with incorrect user-supplied values. Since preParse
and postParse
work with internal data such as context.sourceToParse().id()
, it would indicate a serious logic error. So I don't think we'd need the same general exception handling strategy where we create a nice user-facing message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking good to me, I have one last comment. Thanks for all the iterations on this!
private static Mapper getMapper(final ParseContext context, ObjectMapper objectMapper, String fieldName, String[] subfields) { | ||
String fieldPath = context.path().pathAsText(fieldName); | ||
if (context.mapperService().isMetadataField(fieldPath)) { | ||
for (MetadataFieldMapper metadataFieldMapper : context.docMapper().mapping().getMetadataMappers()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just noticed we're doing a linear scan through all the metadata mappers. Maybe we should make sure to store them as a map from name -> mapper to avoid this overhead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right. I had this point in mind for improvement. I just saw that linear scan happens at:
elasticsearch/server/src/main/java/org/elasticsearch/index/mapper/DocumentParser.java
Line 101 in b8b3a9d
for (MetadataFieldMapper metadataMapper : metadataFieldsMappers) { |
I understand that pre/postParse() internalParseDocument()
is called once per document, while the linear scan in getMapper()
is executed once per metadata field.
I will fix this asap!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed this last bit. Can you please have one last look?
Thank you for your patience and excellent guidance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The change makes sense to me.
It feels redundant that we store an array of metadata mappers, a map from class -> mapper, and a map from name -> mapper. This could be simplified, but we don't necessarily need to do it in this PR.
Instead of linear scan
Backports #61590 to 7.x So far we don't allow metadata fields in the document _source. However, in the case of the _doc_count field mapper (#58339) we want to be able to set This PR adds a method to the metadata field parsers that exposes if the field can be included in the document source or not. This way each metadata field can configure if it can be included in the document _source
Removed |
So far we don't allow metadata fields in the document
_source
. However, in the case of the_doc_count
field mapper (#58339) we want to be able to setThis PR adds a method to the metadata field parsers that exposes if the field can be included in the document source or not.
This way each metadata field can configure if it can be included in the document
_source