-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Validate whether a data stream timestamp has been specified in a document #58119
Validate whether a data stream timestamp has been specified in a document #58119
Conversation
…ment If the document is going to be index into a backing index of a data stream then check whether a timestamp field has been specified and that exactly one timestamp value has been specified. Currently there is no concept of a required field in the mapping code. To me the best place to add the data stream timestamp validation logic is in: `ParseContext#postParse(...)`. If there is a better place then I happily move this new logic elsewhere. In order to ParseContext to know whether an index is part of a data stream and what the timestamp field is, a `DataStream` instance had to be passed down this this place. This is why the change touches relatively many files compared to the actual added logic. However this is needed and I don't see another way to do this. Relates to elastic#53100
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a couple smaller comments.
I think we should also add a REST test to demonstrate that error handling of a bulk request and a single index requests works as intended when no timestamp is specified.
DocumentParser(IndexSettings indexSettings, DocumentMapperParser docMapperParser, DocumentMapper docMapper) { | ||
DocumentParser(IndexSettings indexSettings, | ||
DocumentMapperParser docMapperParser, | ||
DataStream dataStream, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Order or args look strange, move dataStream
to end?
@@ -233,14 +233,14 @@ PUT _index_template/template | |||
"template": { | |||
"mappings": { | |||
"properties": { | |||
"@timestamp": { | |||
"date": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find the original name better, since it complies with ECS and also date
opens up for a bit of confusion between field name and type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, but in the test data sets that is generated (huge twitter setup), has its timestamp in the date field. I think this should be changed in a followup change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a good start to me. I left a idea on how we could restructure the check to avoid counting the Lucene fields.
Earlier we brainstormed whether the concept of a 'singleton' field would be useful more generally, for example as a mapping option that could apply to any type. This is still on my radar, but I think it's good we're not blocking on that. I agree with your approach of just making a targeted change for this important validation.
server/src/main/java/org/elasticsearch/index/mapper/MapperService.java
Outdated
Show resolved
Hide resolved
} | ||
} | ||
|
||
if (numStoredFields > 1 || numPointFields > 1 || numDocValuesFields > 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It feels a little fragile to be checking the Lucene fields that the timestamp field produces. Sometimes field mappers decide to produce multiple Lucene fields given a single value in the _source
. Or the mapping could have doc values and indexing disabled.
Instead I think we could check that the timestamp is a 'singleton' during document parsing:
- We could add a boolean flag to
DateFieldMapper
likeisSingletonTimestamp
, based on whether its field name matches the datastream timestamp field. - In
DateFieldMapper#parseCreateField
we would check + update a flag onParseContext
likealreadyParsedTimestamp
. If it's already true, we throw an error. - In
ParseContext#postParse
, we verify thatalreadyParsedTimestamp
is true.
I like this because it avoids the fragility of checking Lucene fields, but still correctly handle cases where data is copied into the field like copy_to
and multi-fields.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this suggestion @jtibshirani. I also found counting of Lucene fields to be fragile and the isSingletonTimestamp
and alreadyParsedTimestamp
flags will make this check more robust. I will try and adjust the code.
This idea also crossed my mind. This pr kind of creates an implicit singleton field based on whether the index is part of a data stream. If/when we change this to be a field mapper attribute we can force this setting when creating backing index. This new singleton attribute should be immutable like most of the other field mapping attributes. |
since these documents have an empty source.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The overall structure looks good to me! I left some more detailed comments.
It's too bad we need to pass through an extra parameter in so many places. It's generally unusual that we have mapping information passed through externally -- usually all information affecting the schema/ document parsing can be found in the index metadata or settings. I don't really see a way around this though, I don't think we want to add dataStreamTimestampField
to the index metadata, since it's a property of the 'index abstraction' and not the index?
If/when we change this to be a field mapper attribute we can force this setting when creating backing index. This new singleton attribute should be immutable like most of the other field mapping attributes.
I'll create an issue about this to start a discussion. I think this would help with my concern above, since most of the logic will be moved into an actual mapping attribute like singleton
.
@@ -581,6 +598,13 @@ protected DateFieldMapper clone() { | |||
|
|||
@Override | |||
protected void parseCreateField(ParseContext context) throws IOException { | |||
if (singletonDataStreamTimestamp) { | |||
if (context.isDataStreamTimestampParsed()) { | |||
throw new IllegalArgumentException("timestamp field has multiple values, only a single value is allowed"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small comment, this could mention 'data stream timestamp' for clarity. We also try to include the field name when possible to help with debugging: "Encountered data stream timestamp field [my-timestamp] with multiple values ..."
idFieldDataEnabled, null); | ||
} | ||
|
||
public MapperService(IndexSettings indexSettings, IndexAnalyzers indexAnalyzers, NamedXContentRegistry xContentRegistry, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we delete the extra MapperService
constructor above, to make it harder to forget to pass the timestamp field? It looks like it's only used in tests and for simulating a merge.
this(indexSettings, mapperService, xContentRegistry, similarityService, mapperRegistry, queryShardContextSupplier, null); | ||
} | ||
|
||
public DocumentMapperParser(IndexSettings indexSettings, MapperService mapperService, NamedXContentRegistry xContentRegistry, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same thought here, it'd be nice to delete the extra constructor above.
static class MultiFieldParserContext extends ParserContext { | ||
MultiFieldParserContext(ParserContext in) { | ||
super(in.similarityLookupService(), in.mapperService(), in.typeParsers(), | ||
in.indexVersionCreated(), in.queryShardContextSupplier()); | ||
in.indexVersionCreated(), in.queryShardContextSupplier(), null); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should pass through the timestamp field here. I guess the timestamp could happen to be a multi-field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will pass down the timestamp field, a small note, the timestamp field can only be a field that is part of the _source.
There is validation that checks whether a field mapping exists of type date or date_nanos when creating the composable index template and this is also asserted when a backing index of a data stream is created.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the timestamp field can only be a field that is part of the _source.
Interesting! I am generally curious to catch up on timestamp mapping validation, I'll ping the team offline about this.
@@ -441,6 +455,11 @@ public void addDynamicMapper(Mapper mapper) { | |||
} | |||
|
|||
void postParse() { | |||
if (dataStreamTimestampField != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small comment, can collapse these two 'if' checks.
Also, same thought as above about including 'data stream' and the field name in the error message.
@@ -464,6 +470,33 @@ public void testAliasActionsFailOnDataStreamBackingIndices() throws Exception { | |||
"support aliases.")); | |||
} | |||
|
|||
public void testNoTimestampInDocument() throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have good integration test coverage, but no unit tests. It would be great to add a unit test at the level of the mapping code that checks the document validation. Perhaps DateFieldMapperTests
would be a good place for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ignore this comment, @martijnvg explained the unit tests are missing because this is a 'draft' :)
It is a property of We avoided adding data stream information to index metadata, because then the information that indicates whether an index is part of a data stream is in two places and then there is a risk of certain type of bugs if a data stream instance and index metadata instance go for some reason out of sync. An example would be if a backing index gets shrunken. The new index with less shards would need to be added to the data stream and the original index would need to be removed from the data stream. In this case both data stream instance and an index metadata instance would need to be updated. |
Pinging @elastic/es-core-features (:Core/Features/Data streams) |
I think if the singleton feature existed today, then the approach taken in this pr would never have been done. I think with the |
I chatted with @jtibshirani via another channel and it is unsure whether something like a singleton field will be added and if so then then it is unsure how this should be exposed. So in the meantime, for data streams, the best way forward seems to be moving forward with this PR. When something like singleton field is added, then the migration can be easy, since the singleton attribute can be enabled automatically when creating a new backing index by ES. This way the migration will be easy. |
I opened #58523 to discuss the idea of adding a 'singleton' flag to field mappers. I'll do a final review shortly! |
Sorry, I am late in the discussion but I wonder if this could be implemented as a "mappings": {
"_timestamp": {
"enabled": true
}
} Today the timestamp field name is set when the data stream is created. This is flexible but I wonder how we plan to handle multiple data streams that don't share the same timestamp field name in a search request. In other words, will it be possible to sort documents by timestamp if I target more than one data stream ? |
I think that could work. The
We have not discussed that yet. We focussed on at least ensuring that each document has a timestamp value. Right now if a data stream has different timestamp fields then sorting is like if you try to sort over non uniform indices.
I like this idea. But this could also be resolved at query parse time? If we know that we sort over the primary timestamp field of data streams then at query parse time we could resolve to the right field? |
I chatted with @jimczi via another channel and we see benefits in developing the timestamp field validation as metadata field mapper. The metadata field mapper implementation would indicate what the timestamp field is. In the |
if (context.isDataStreamTimestampParsed()) { | ||
throw new IllegalArgumentException("data stream timestamp field [" + name() + "] encountered multiple values"); | ||
} | ||
context.setDataStreamTimestampParsed(true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just noticed that we should probably only set this after the date has been successfully parsed? Just writing this down to not forget, I see that we may be changing the strategy and moving to a metadata field mapper.
One aspect I like about the metadata field approach is that it consolidates the information into the mapping itself, as opposed to passing it down externally from the datastream definition. This is a good property to maintain -- that all information affecting schema/ document parsing can be found in the index mappings or settings. |
Yes, I agree and it can make sorting on data streams with different timestamp fields easier (a user would sort by the meta field instead of the actual data stream timestamp field). |
Closing this pr in favour of #58582 |
If the document is going to be index into a backing index of a data stream then
check whether a timestamp field has been specified and that exactly one timestamp
value has been specified.
Currently there is no concept of a required field in the mapping code. To me
the best place to add the data stream timestamp validation logic is in:
ParseContext#postParse(...)
line 481If there is a better place then I happily move this new logic elsewhere.
In order to ParseContext to know whether an index is part of a data stream and
what the timestamp field is, a
DataStream
instance had to be passed down thisthis place. This is why the change touches relatively many files compared to the
actual added logic. However this is needed and I don't see another way to do this.
Specifically looking for feedback from the @elastic/es-search team.
Relates to #53100