Validate whether a data stream timestamp has been specified in a document #58119

Closed

@martijnvg martijnvg commented Jun 15, 2020

If the document is going to be indexed into a backing index of a data stream, then
check whether a timestamp field has been specified and that exactly one timestamp
value has been specified.

Currently there is no concept of a required field in the mapping code. To me
the best place to add the data stream timestamp validation logic is in: ParseContext#postParse(...) line 481.
If there is a better place then I will happily move this new logic elsewhere.

In order for ParseContext to know whether an index is part of a data stream and
what the timestamp field is, a DataStream instance had to be passed down to
this place. This is why the change touches relatively many files compared to the
actual added logic. However, this is needed and I don't see another way to do this.

Specifically looking for feedback from the @elastic/es-search team.

Relates to #53100

@martijnvg martijnvg added >non-issue :Search Foundations/Mapping Index mappings, including merging and defining field types v8.0.0 :Data Management/Data streams Data streams and their lifecycles v7.9.0 labels Jun 15, 2020
@henningandersen henningandersen left a comment

Left a couple smaller comments.

I think we should also add a REST test to demonstrate that error handling of a bulk request and a single index request works as intended when no timestamp is specified.

-    DocumentParser(IndexSettings indexSettings, DocumentMapperParser docMapperParser, DocumentMapper docMapper) {
+    DocumentParser(IndexSettings indexSettings,
+                   DocumentMapperParser docMapperParser,
+                   DataStream dataStream,
Contributor

Order of args looks strange, move dataStream to end?

@@ -233,14 +233,14 @@ PUT _index_template/template
 "template": {
   "mappings": {
     "properties": {
-      "@timestamp": {
+      "date": {
Contributor

I find the original name better, since it complies with ECS and also date opens up for a bit of confusion between field name and type.

Member Author

Agreed, but the generated test data set (the huge twitter setup) has its timestamp in the date field. I think this should be changed in a follow-up change?

@jtibshirani jtibshirani self-requested a review June 17, 2020 17:34
@jtibshirani jtibshirani left a comment

This looks like a good start to me. I left an idea on how we could restructure the check to avoid counting the Lucene fields.

Earlier we brainstormed whether the concept of a 'singleton' field would be useful more generally, for example as a mapping option that could apply to any type. This is still on my radar, but I think it's good we're not blocking on that. I agree with your approach of just making a targeted change for this important validation.

}
}

if (numStoredFields > 1 || numPointFields > 1 || numDocValuesFields > 1) {
@jtibshirani jtibshirani Jun 17, 2020

It feels a little fragile to be checking the Lucene fields that the timestamp field produces. Sometimes field mappers decide to produce multiple Lucene fields given a single value in the _source. Or the mapping could have doc values and indexing disabled.

Instead I think we could check that the timestamp is a 'singleton' during document parsing:

  • We could add a boolean flag to DateFieldMapper like isSingletonTimestamp, based on whether its field name matches the datastream timestamp field.
  • In DateFieldMapper#parseCreateField we would check + update a flag on ParseContext like alreadyParsedTimestamp. If it's already true, we throw an error.
  • In ParseContext#postParse, we verify that alreadyParsedTimestamp is true.

I like this because it avoids the fragility of checking Lucene fields, but still correctly handles cases where data is copied into the field, like copy_to and multi-fields.
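
The three steps in the list above can be sketched in isolation. This is a minimal sketch, not the PR's code: `SketchParseContext` and `SketchDateFieldMapper` are hypothetical stand-ins for ParseContext and DateFieldMapper, `parseDocument` is a hypothetical driver, and the exception messages are illustrative only.

```java
import java.util.List;

// Hypothetical, simplified stand-ins; these are not the real Elasticsearch classes.
final class SingletonTimestampSketch {

    /** Per-document state, standing in for the flag proposed for ParseContext. */
    static final class SketchParseContext {
        private boolean alreadyParsedTimestamp;

        boolean isAlreadyParsedTimestamp() {
            return alreadyParsedTimestamp;
        }

        void markTimestampParsed() {
            alreadyParsedTimestamp = true;
        }
    }

    /** Stand-in for DateFieldMapper carrying the proposed isSingletonTimestamp flag. */
    static final class SketchDateFieldMapper {
        private final String name;
        private final boolean isSingletonTimestamp;

        SketchDateFieldMapper(String name, String dataStreamTimestampField) {
            this.name = name;
            // True only when this mapper maps the data stream's timestamp field.
            this.isSingletonTimestamp = name.equals(dataStreamTimestampField);
        }

        /** Invoked once per value, whatever route the value took into the field. */
        void parseCreateField(SketchParseContext context, String value) {
            if (isSingletonTimestamp) {
                if (context.isAlreadyParsedTimestamp()) {
                    throw new IllegalArgumentException(
                        "data stream timestamp field [" + name + "] has multiple values");
                }
                context.markTimestampParsed();
            }
            // A real mapper would parse the value and create Lucene fields here.
        }
    }

    /** Stand-in for the final check in ParseContext#postParse. */
    static void postParse(SketchParseContext context, String dataStreamTimestampField) {
        if (dataStreamTimestampField != null && context.isAlreadyParsedTimestamp() == false) {
            throw new IllegalArgumentException(
                "data stream timestamp field [" + dataStreamTimestampField + "] is missing");
        }
    }

    /** Parses one document, given every value supplied for the timestamp field. */
    static void parseDocument(String timestampField, List<String> timestampValues) {
        SketchParseContext context = new SketchParseContext();
        SketchDateFieldMapper mapper = new SketchDateFieldMapper(timestampField, timestampField);
        for (String value : timestampValues) {
            mapper.parseCreateField(context, value);
        }
        postParse(context, timestampField);
    }

    public static void main(String[] args) {
        parseDocument("@timestamp", List.of("2020-06-15T00:00:00Z"));
        System.out.println("single value accepted");
    }
}
```

Because the flag lives on the per-document context rather than on the Lucene document, the check does not depend on how many Lucene fields a single value produces.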

Member Author

Thanks for this suggestion @jtibshirani. I also found counting of Lucene fields to be fragile and the isSingletonTimestamp and alreadyParsedTimestamp flags will make this check more robust. I will try and adjust the code.

@martijnvg
Member Author

> Earlier we brainstormed whether the concept of a 'singleton' field would be useful more generally, for example as a mapping option that could apply to any type

This idea also crossed my mind. This pr kind of creates an implicit singleton field based on whether the index is part of a data stream. If/when we change this to be a field mapper attribute we can force this setting when creating backing index. This new singleton attribute should be immutable like most of the other field mapping attributes.

@jtibshirani jtibshirani left a comment

The overall structure looks good to me! I left some more detailed comments.

It's too bad we need to pass through an extra parameter in so many places. It's generally unusual that we have mapping information passed through externally -- usually all information affecting the schema/ document parsing can be found in the index metadata or settings. I don't really see a way around this though, I don't think we want to add dataStreamTimestampField to the index metadata, since it's a property of the 'index abstraction' and not the index?

> If/when we change this to be a field mapper attribute we can force this setting when creating backing index. This new singleton attribute should be immutable like most of the other field mapping attributes.

I'll create an issue about this to start a discussion. I think this would help with my concern above, since most of the logic will be moved into an actual mapping attribute like singleton.

@@ -581,6 +598,13 @@ protected DateFieldMapper clone() {

@Override
protected void parseCreateField(ParseContext context) throws IOException {
    if (singletonDataStreamTimestamp) {
        if (context.isDataStreamTimestampParsed()) {
            throw new IllegalArgumentException("timestamp field has multiple values, only a single value is allowed");
Contributor

Small comment, this could mention 'data stream timestamp' for clarity. We also try to include the field name when possible to help with debugging: "Encountered data stream timestamp field [my-timestamp] with multiple values ..."

idFieldDataEnabled, null);
}

public MapperService(IndexSettings indexSettings, IndexAnalyzers indexAnalyzers, NamedXContentRegistry xContentRegistry,
Contributor

Could we delete the extra MapperService constructor above, to make it harder to forget to pass the timestamp field? It looks like it's only used in tests and for simulating a merge.

this(indexSettings, mapperService, xContentRegistry, similarityService, mapperRegistry, queryShardContextSupplier, null);
}

public DocumentMapperParser(IndexSettings indexSettings, MapperService mapperService, NamedXContentRegistry xContentRegistry,
Contributor

Same thought here, it'd be nice to delete the extra constructor above.

 static class MultiFieldParserContext extends ParserContext {
     MultiFieldParserContext(ParserContext in) {
         super(in.similarityLookupService(), in.mapperService(), in.typeParsers(),
-            in.indexVersionCreated(), in.queryShardContextSupplier());
+            in.indexVersionCreated(), in.queryShardContextSupplier(), null);
Contributor

I think we should pass through the timestamp field here. I guess the timestamp could happen to be a multi-field.

Member Author

I will pass down the timestamp field. A small note: the timestamp field can only be a field that is part of the _source.
There is validation that checks whether a field mapping of type date or date_nanos exists when creating the composable index template, and this is also asserted when a backing index of a data stream is created.
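
As a rough illustration of the template-time validation described above, the following sketch checks that a mapping defines the timestamp field with type date or date_nanos. The class name, the method name, the error message, and the nested-map representation of the mappings are all assumptions made for illustration; only top-level fields under properties are handled, and this is not the validation code that exists in Elasticsearch.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; the real validation in Elasticsearch is structured differently.
final class TimestampMappingCheckSketch {

    private static final Set<String> ALLOWED_TYPES = Set.of("date", "date_nanos");

    /**
     * Throws if the mappings do not define timestampField as date or date_nanos.
     * Only top-level fields under "properties" are considered in this sketch.
     */
    static void validateTimestampMapping(String timestampField, Map<String, Object> mappings) {
        Object properties = mappings.get("properties");
        Object fieldMapping = properties instanceof Map
            ? ((Map<?, ?>) properties).get(timestampField)
            : null;
        Object type = fieldMapping instanceof Map
            ? ((Map<?, ?>) fieldMapping).get("type")
            : null;
        // The null check comes first because immutable sets reject null lookups.
        if (type == null || ALLOWED_TYPES.contains(type) == false) {
            throw new IllegalArgumentException(
                "timestamp field [" + timestampField + "] must be mapped as date or date_nanos");
        }
    }

    public static void main(String[] args) {
        Map<String, Object> mappings =
            Map.of("properties", Map.of("@timestamp", Map.of("type", "date")));
        validateTimestampMapping("@timestamp", mappings);
        System.out.println("mapping accepted");
    }
}
```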

Contributor

> the timestamp field can only be a field that is part of the _source.

Interesting! I am generally curious to catch up on timestamp mapping validation, I'll ping the team offline about this.

@@ -441,6 +455,11 @@ public void addDynamicMapper(Mapper mapper) {
}

void postParse() {
if (dataStreamTimestampField != null) {
Contributor

Small comment, can collapse these two 'if' checks.

Also, same thought as above about including 'data stream' and the field name in the error message.

@@ -464,6 +470,33 @@ public void testAliasActionsFailOnDataStreamBackingIndices() throws Exception {
"support aliases."));
}

public void testNoTimestampInDocument() throws Exception {
Contributor

We have good integration test coverage, but no unit tests. It would be great to add a unit test at the level of the mapping code that checks the document validation. Perhaps DateFieldMapperTests would be a good place for this?

@jtibshirani jtibshirani Jun 23, 2020

Ignore this comment, @martijnvg explained the unit tests are missing because this is a 'draft' :)

@martijnvg
Member Author

> I don't think we want to add dataStreamTimestampField to the index metadata, since it's a property of the 'index abstraction' and not the index?

It is a property of DataStream. The Metadata#indicesLookup sorted set with index abstractions is built from the information available in the Metadata class, which contains both the index metadata instances and the data stream instances.

We avoided adding data stream information to index metadata, because then the information that indicates whether an index is part of a data stream would be in two places, and there is a risk of certain types of bugs if a data stream instance and an index metadata instance go out of sync for some reason. An example would be if a backing index gets shrunk. The new index with fewer shards would need to be added to the data stream and the original index would need to be removed from the data stream. In this case both the data stream instance and an index metadata instance would need to be updated.

@martijnvg martijnvg requested a review from jtibshirani June 24, 2020 11:55
@martijnvg martijnvg marked this pull request as ready for review June 24, 2020 12:00
@elasticmachine
Collaborator

Pinging @elastic/es-core-features (:Core/Features/Data streams)

@elasticmachine elasticmachine added Team:Data Management Meta label for data/management team Team:Search Meta label for search team labels Jun 24, 2020
@martijnvg
Member Author

> I think this would help with my concern above, since most of the logic will be moved into an actual mapping attribute like singleton.

I think if the singleton feature existed today, then the approach taken in this pr would never have been done. I think with the singleton feature, this change will be much cleaner and less intrusive, because composable index templates with data stream definition would set singleton=true on the appropriate field and there is then no need to pass down the timestamp field all the way down to where it is now in the pr. Maybe we should try to introduce a singleton attribute? I did some exploring and I think it is doable: c830711#diff-d72103d748a7ab089c4a87707755fe3dR449

@martijnvg
Member Author

> I think if the singleton feature existed today, then the approach taken in this pr would never have been done. I think with the singleton feature, this change will be much cleaner and less intrusive, because composable index templates with data stream definition would set singleton=true on the appropriate field and there is then no need to pass down the timestamp field all the way down to where it is now in the pr. Maybe we should try to introduce a singleton attribute? I did some exploring and I think it is doable: c830711#diff-d72103d748a7ab089c4a87707755fe3dR449

I chatted with @jtibshirani via another channel and it is unclear whether something like a singleton field will be added and, if so, how it should be exposed. So in the meantime, for data streams, the best way forward seems to be moving forward with this PR. When something like a singleton field is added, the migration can be easy, since ES can enable the singleton attribute automatically when creating a new backing index.

@jtibshirani
Contributor

I opened #58523 to discuss the idea of adding a 'singleton' flag to field mappers. I'll do a final review shortly!

@jimczi
Contributor

jimczi commented Jun 25, 2020

Sorry, I am late in the discussion but I wonder if this could be implemented as a MetadataFieldMapper?
Currently we do not allow metadata fields in the _source, but this is something that we could revisit for this new field.
The requirements described here are easy to implement in a metadata field mapper: metadata fields are unique and we can constrain the values to be present once and only once. The main advantage I see is that mappings would have a consistent and unified view of a timestamp field when enabled.
This could look like this:

"mappings": {
    "_timestamp": {
      "enabled": true 
    }
}

Today the timestamp field name is set when the data stream is created. This is flexible but I wonder how we plan to handle multiple data streams that don't share the same timestamp field name in a search request. In other words, will it be possible to sort documents by timestamp if I target more than one data stream?
It seems that this flexibility in the naming is only required at ingest time? If that's true then I wonder if we could use a unique metadata field and also create an alias field that would point to the metadata field when the data stream is created?

@martijnvg
Member Author

> Sorry, I am late in the discussion but I wonder if this could be implemented as a MetadataFieldMapper?

I think that could work. The postParse() method can check whether the value has been specified, but how would we enforce that a single value has been specified? In this PR that is the responsibility of the date field mapper: if it sees a value twice, it fails. How would that work if we have a data stream timestamp meta field? In a previous iteration, ParseContext#postParse() counted the number of Lucene fields, but this logic is a bit fragile, as the document could contain a stored field, a doc values field and a points field for the same value. So we decided to move away from that.

> This is flexible but I wonder how we plan to handle multiple data streams that don't share the same timestamp field name in a search request. In other words, will it be possible to sort documents by timestamp if I target more than one data stream?

We have not discussed that yet. We focussed on at least ensuring that each document has a timestamp value. Right now, if data streams have different timestamp fields, then sorting behaves as if you were sorting over non-uniform indices.

> If that's true then I wonder if we could use a unique metadata field and also create an alias field that would point to the metadata field when the data stream is created?

I like this idea. But this could also be resolved at query parse time? If we know that we sort over the primary timestamp field of data streams then at query parse time we could resolve to the right field?

@martijnvg
Member Author

I chatted with @jimczi via another channel and we see benefits in developing the timestamp field validation as a metadata field mapper. The metadata field mapper implementation would indicate what the timestamp field is. In the postParse() method it would check whether there is exactly one points field in the captured Lucene document. We only need to check for point fields, because it doesn't make sense for indexing to be disabled on this field. Validation is going to be added that will disallow setting the index attribute to no, so the validation logic wouldn't be fragile as it was in the initial commit of this PR. I will work on a draft PR to see whether the implementation with a metadata field mapper will be cleaner than this PR.
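
The point-field check described above could look roughly like the following. This is a sketch under stated assumptions: `SketchField` and `FieldKind` are hypothetical stand-ins for the fields of a parsed Lucene document, and the method names and messages are invented for illustration rather than taken from the follow-up PR.

```java
import java.util.List;

// Hypothetical stand-ins for a parsed document; not the real Lucene or Elasticsearch types.
final class TimestampPointCountSketch {

    enum FieldKind { POINT, DOC_VALUES, STORED }

    static final class SketchField {
        final String name;
        final FieldKind kind;

        SketchField(String name, FieldKind kind) {
            this.name = name;
            this.kind = kind;
        }
    }

    /** Counts only point fields, so doc values and stored fields never inflate the count. */
    static long countTimestampPoints(String timestampField, List<SketchField> document) {
        return document.stream()
            .filter(field -> field.kind == FieldKind.POINT && field.name.equals(timestampField))
            .count();
    }

    /** Stand-in for the metadata field mapper's postParse check: exactly one point field. */
    static void postParse(String timestampField, List<SketchField> document) {
        long points = countTimestampPoints(timestampField, document);
        if (points == 0) {
            throw new IllegalArgumentException(
                "data stream timestamp field [" + timestampField + "] is missing");
        }
        if (points > 1) {
            throw new IllegalArgumentException(
                "data stream timestamp field [" + timestampField + "] encountered multiple values");
        }
    }

    public static void main(String[] args) {
        List<SketchField> document = List.of(
            new SketchField("@timestamp", FieldKind.POINT),
            new SketchField("@timestamp", FieldKind.DOC_VALUES));
        postParse("@timestamp", document);
        System.out.println("exactly one point field found");
    }
}
```

Counting a single field kind is only sound if that kind is guaranteed to be present, which is why the comment pairs the check with validation that forbids disabling indexing on the timestamp field.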

if (context.isDataStreamTimestampParsed()) {
    throw new IllegalArgumentException("data stream timestamp field [" + name() + "] encountered multiple values");
}
context.setDataStreamTimestampParsed(true);
Contributor

I just noticed that we should probably only set this after the date has been successfully parsed? Just writing this down to not forget, I see that we may be changing the strategy and moving to a metadata field mapper.
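
The ordering concern can be shown with a small sketch that sets the flag only after parsing succeeds. It assumes ISO-8601 input parsed with `java.time.Instant.parse` in place of the mapper's real date parsing; `SketchContext` and `parseTimestamp` are hypothetical names, not the PR's API.

```java
import java.time.Instant;

// Hypothetical sketch; Instant.parse stands in for the mapper's real date parsing.
final class MarkAfterParseSketch {

    /** Per-document state, standing in for the flag kept on the parse context. */
    static final class SketchContext {
        boolean timestampParsed;
    }

    static Instant parseTimestamp(SketchContext context, String fieldName, String rawValue) {
        if (context.timestampParsed) {
            throw new IllegalArgumentException(
                "data stream timestamp field [" + fieldName + "] encountered multiple values");
        }
        // Throws DateTimeParseException on malformed input, before the flag is touched.
        Instant parsed = Instant.parse(rawValue);
        // Reached only when parsing succeeded.
        context.timestampParsed = true;
        return parsed;
    }

    public static void main(String[] args) {
        SketchContext context = new SketchContext();
        Instant parsed = parseTimestamp(context, "@timestamp", "2020-06-25T00:00:00Z");
        System.out.println("parsed " + parsed);
    }
}
```

With this ordering, a value that fails to parse does not count as the document's timestamp, so the flag reflects only values that were actually accepted.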

@jtibshirani
Contributor

jtibshirani commented Jun 25, 2020

> chatted with @jimczi via another channel and we see benefits in developing the timestamp field validation as metadata field mapper. The metadata field mapper implementation would indicate what the timestamp field is.

One aspect I like about the metadata field approach is that it consolidates the information into the mapping itself, as opposed to passing it down externally from the datastream definition. This is a good property to maintain -- that all information affecting schema/ document parsing can be found in the index mappings or settings.

@martijnvg
Member Author

> One aspect I like about the metadata field approach is that it consolidates the information into the mapping itself, as opposed to passing it down externally from the datastream definition. This is a good property to maintain -- that all information affecting schema/ document parsing can be found in the index mappings or settings

Yes, I agree and it can make sorting on data streams with different timestamp fields easier (a user would sort by the meta field instead of the actual data stream timestamp field).

@martijnvg
Member Author

Closing this pr in favour of #58582

Labels
:Data Management/Data streams Data streams and their lifecycles >non-issue :Search Foundations/Mapping Index mappings, including merging and defining field types Team:Data Management Meta label for data/management team Team:Search Meta label for search team v7.9.0 v8.0.0-alpha1
6 participants