Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add data stream timestamp validation via metadata field mapper #58582

Merged
merged 21 commits into from
Jul 2, 2020

Conversation

martijnvg
Copy link
Member

This commit adds a new metadata field mapper that validates,
that a document has exactly a single timestamp value in the data stream timestamp field.

The MetadataCreateIndexService inserts a data stream timestamp field mapper whenever
a new backing index of a data stream is created.

This is a different variant of #58119, which also tries to add timestamp field validation when
ingesting documents. This implementation is cleaner than in #58119, because the validation
is encapsulated in the DataStreamTimestampFieldMapper class and the data stream
timestamp field name is provided to it as mapping configuration. Whereas in #58119
hard coded validation logic was added in ParseContext#postParse(...) and DateFieldMapper,
also a data stream timestamp fieldname field/variable was added in many other classes in
order to get the right information in the right place.

Relates to #53100

This commit adds a new metadata field mapper that validates,
that a document has exactly a single timestamp value in the data stream timestamp field.

The MetadataCreateIndexService inserts a data stream timestamp field mapper whenever
a new backing index of a data stream is created.

Relates to elastic#53100
@martijnvg martijnvg added :Search Foundations/Mapping Index mappings, including merging and defining field types v8.0.0 :Data Management/Data streams Data streams and their lifecycles v7.9.0 labels Jun 26, 2020
…sts were failing

because these tests assumed any field longer then 20 chars is too long.
Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for trying this road @martijnvg . I left some questions and comments but I like where this is going.

@Override
public XContentBuilder toXContent(XContentBuilder builder, Params params) throws IOException {
boolean includeDefaults = params.paramAsBoolean("include_defaults", false);
if (includeDefaults == false && fieldName == null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't we always return when the fieldName is null ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok with doing that, I just thought that if includeDefaults is true then that the fieldName should be returned anyway, irregardless whether it has a value. But I guess it is ok to ignore includeDefaults if fieldName is null?

}

// For now the field shouldn't be useable in searches.
// In the future it should act as an alias to the actual data stream timestamp field.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder what this would look like though. It's very similar to the alias field so I wonder what @jtibshirani thinks of how this should work ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be similar to the code in FieldTypeLookup that checks FieldAliasMapper#path(). I think we can maybe piggyback on that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The field alias framework is narrow in scope and unfortunately isn't very extensible. @romseygeek is working on some refactors that will let field mappers delegate to other field types to expose search functionality. I think it will be cleaner and more forward-looking to collaborate on that.

Stepping back though, I don't know if this timestamp mapper should serve as a search-time alias for the delegate field. To me it'd be fine if it were just a hidden mapper that performed validation and recorded the data stream timestamp in the mappings. Generally we lean heavily on ECS to provide a consistent set of field names across individual schemas, and ECS has a convenient @timestamp field at the top level. Users can also define field aliases to allow for searching across different data streams. Does the 'timestamp' field need special treatment?

One advantage I can think of is that we could allow the delegate timestamp field to change for future indices in a data stream, and searches keep working on _timestamp. This doesn't seem like a common case though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally we lean heavily on ECS to provide a consistent set of field names across individual schemas, and ECS has a convenient @timestamp field at the top level. Users can also define field aliases to allow for searching across different data streams. Does the 'timestamp' field need special treatment?

In this case, just using @timestamp is imo preferable and sorting by meta field _timestamp would not be needed.

One advantage I can think of is that we could allow the delegate timestamp field to change for future indices in a data stream, and searches keep working on _timestamp.

I think this becomes more useful, in cases existing indices get imported into a data stream, which is something that we would want to support down the line.

However for now, I think there is no need for this, as the initial targeted users use ECS. I think we will be looking into this somewhere in the not so distant future.

import java.util.Iterator;
import java.util.Map;

public class DataStreamTimestampFieldMapper extends MetadataFieldMapper {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to reference the data stream naming ? It's a generic timestamp metadata field, right ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dakrone brought up the good point that we used to have a _timestamp metadata field (#18980), so there may be some confusion over naming. For me the right naming depends some open questions like if we'll allow searches on this field, and if we see a use for it outside of datastreams.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and if we see a use for it outside of datastreams.

Data steams targets append-only time based data sources. I think use cases that are not append-only, will keep on using indices + alias pattern. I think this usecase may also benefit from having a _timestamp metadata field.

// not configured, so skip the validation
return;
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to add validation in the MapperService that the target field is of type date and that it is indexed and has doc values. I know that you validate this in the data stream code but it would be cleaner/safer to do it directly in the MapperService since the metadata field could be used outside of data streams ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do it directly in the MapperService since the metadata field could be used outside of data streams ?

If this metadata field mapper has value outside data streams then yes it make sense to add the validation in MapperService.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of moving the validation down into the mapping code even if the field mapper isn't used outside of data streams. It's really nice to have all mapping validation embedded in the mappers/ mapping logic instead of some living outside.

Copy link
Contributor

@jtibshirani jtibshirani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a nice idea from @jimczi!

throw new IllegalArgumentException("data stream timestamp field [" + fieldName + "] is missing");
}

long numberOfValues =
Copy link
Contributor

@jtibshirani jtibshirani Jun 26, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could instead assert that the field type has at least one docvalues field? This would be more robust if we decide to stop indexing points (#48665).

// not configured, so skip the validation
return;
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of moving the validation down into the mapping code even if the field mapper isn't used outside of data streams. It's really nice to have all mapping validation embedded in the mappers/ mapping logic instead of some living outside.

import java.util.Iterator;
import java.util.Map;

public class DataStreamTimestampFieldMapper extends MetadataFieldMapper {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dakrone brought up the good point that we used to have a _timestamp metadata field (#18980), so there may be some confusion over naming. For me the right naming depends some open questions like if we'll allow searches on this field, and if we see a use for it outside of datastreams.

}

// For now the field shouldn't be useable in searches.
// In the future it should act as an alias to the actual data stream timestamp field.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The field alias framework is narrow in scope and unfortunately isn't very extensible. @romseygeek is working on some refactors that will let field mappers delegate to other field types to expose search functionality. I think it will be cleaner and more forward-looking to collaborate on that.

Stepping back though, I don't know if this timestamp mapper should serve as a search-time alias for the delegate field. To me it'd be fine if it were just a hidden mapper that performed validation and recorded the data stream timestamp in the mappings. Generally we lean heavily on ECS to provide a consistent set of field names across individual schemas, and ECS has a convenient @timestamp field at the top level. Users can also define field aliases to allow for searching across different data streams. Does the 'timestamp' field need special treatment?

One advantage I can think of is that we could allow the delegate timestamp field to change for future indices in a data stream, and searches keep working on _timestamp. This doesn't seem like a common case though.

@martijnvg martijnvg marked this pull request as ready for review June 29, 2020 14:10
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-features (:Core/Features/Data streams)

@elasticmachine elasticmachine added the Team:Data Management Meta label for data/management team label Jun 29, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search (:Search/Mapping)

Copy link
Contributor

@jtibshirani jtibshirani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This approach is looking good to me. I added some detailed review comments.

I had a couple last high-level questions:

  • Currently a user could configure a _timestamp field directly in the mappings. It's confusing, but I don't see a way around it. I also don't think it would cause problems, since we make sure to apply the data stream _timestamp definition last when creating an index.
  • I'm okay with the _timestamp naming but I'm curious what others think. Related to the first point, users could see this and think it looks interesting, like something they should be able to configure and use in searches.

return;
}

MappedFieldType fieldType = lookup.get(fieldName);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we could validate the actual mapping config itself, instead of the field type (which is the object used for searches/ aggregations). That way we could ensure the mapping doesn't contain other options that would be harmful: null_value and ignore_malformed.

}

@Override
protected void parseCreateField(ParseContext context) throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should throw an error here, no one should try to write to the _timestamp field.

}
}

private final String fieldName;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could use the name path instead of fieldName. This would match the naming for field aliases.


private final String fieldName;

private TimestampFieldMapper(FieldType fieldType, MappedFieldType mappedFieldType, String fieldName) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could override mergeOptions and throw an error if the new field_name doesn't match. This includes the case of updating a null source path to a non-null one. I think currently we won't throw an error but just silently ignore the new mapping definition.


// For now the field shouldn't be useable in searches.
// In the future it should act as an alias to the actual data stream timestamp field.
public static final class DataStreamTimestampFieldFieldType extends MappedFieldType {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could just be TimestampFieldType ?

dataStream.getTimeStampField().insertTimestampFieldMapping(mappings);
} else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a case where the data stream does not exist, but we are still creating a new index within it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This in the case when data stream doesn't exist yet (has to be created with the first backing index), then the timestamp field should be resolved from composable index template. I added comments here to make this clearer.

@martijnvg
Copy link
Member Author

Currently a user could configure a _timestamp field directly in the mappings. It's confusing, but I don't see a way around it. I also don't think it would cause problems, since we make sure to apply the data stream _timestamp definition last when creating an index.

Agreed, we manage this meta field mapper, so it is hidden to the user. Also I think for now, let's wait with documenting this new meta field mapper until we find a real world use case for this.

@martijnvg martijnvg added :Security/Authorization Roles, Privileges, DLS/FLS, RBAC/ABAC and removed :Search Foundations/Mapping Index mappings, including merging and defining field types labels Jun 30, 2020
@elasticmachine elasticmachine added Team:Security Meta label for security team and removed Team:Search Meta label for search team labels Jun 30, 2020
@martijnvg martijnvg added :Search Foundations/Mapping Index mappings, including merging and defining field types and removed :Security/Authorization Roles, Privileges, DLS/FLS, RBAC/ABAC labels Jun 30, 2020
@elasticmachine elasticmachine added Team:Search Meta label for search team and removed Team:Security Meta label for security team labels Jun 30, 2020
@elastic elastic deleted a comment from elasticmachine Jun 30, 2020
Copy link
Contributor

@jtibshirani jtibshirani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks okay to me from a code level. I am still unsure about the fact that _timestamp can be accidentally configured, and also the naming.

}

// Validate whether disallowed mapping attributes have been specified on the field this meta field refers to:
try (XContentBuilder builder = jsonBuilder()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me it'd be cleaner to avoid working with the raw JSON and just check specific attributes like 'null value' and 'ignore malformed'. We can add getters to DateFieldMapper if it makes it easier.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a catch all validation, if there is any attribute specified but type, meta or format then fail.
I prefer to keep this logic in the case we ever forget to check an (future new) attribute.
The idea is that for least for now in data streams we want to lock down what can be configured on the
timestamp field.

However I can add specific checks for null value and ignore malformed for clearer error messages.
Like the check for whether the field is searchable and has doc values.


Mapper mapper = lookup.getMapper(path);
if (mapper == null) {
throw new IllegalArgumentException("timestamp meta field's field_name [" + path + "] points to a non existing field");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these messages are intended to be end-user friendly? So something like this could be clearer: "The configured timestamp field does not exist."


if (DateFieldMapper.CONTENT_TYPE.equals(mapper.typeName()) == false &&
DateFieldMapper.DATE_NANOS_CONTENT_TYPE.equals(mapper.typeName()) == false) {
throw new IllegalArgumentException("timestamp meta field's field_name [" + path + "] is of type [" +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same thoughts for these other messages, users probably won't understand what "timestamp meta field's field_name ..." means.

@@ -391,6 +391,12 @@ private synchronized DocumentMapper internalMerge(DocumentMapper mapper, MergeRe
}
checkIndexSortCompatibility(indexSettings.getIndexSortConfig(), hasNested);

for (MetadataFieldMapper metadataFieldMapper : metadataMappers) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could move this logic into MapperMergeValidator#validateFieldReferences to keep all the cross-field validation in one place.

@martijnvg
Copy link
Member Author

This looks okay to me!

Thanks for reviewing @jtibshirani!

I am still unsure about the fact that _timestamp can be accidentally configured, and also the naming, but am happy to go with what you + others think is best.

@jimczi Do you have an opinion about the current name of this new meta field and that it can be configured manually outside of data streams?

@jimczi
Copy link
Contributor

jimczi commented Jul 1, 2020

Do you have an opinion about the current name of this new meta field and that it can be configured manually outside of data streams?

I don't think naming is an issue since this field is disabled by default. It is unclear if this field could be useful outside of data streams but imo it is used as a building block to implement a high level feature (we do the same for rollups) so it's ok to expose it in the mapping directly.

@martijnvg martijnvg merged commit 001b3fb into elastic:master Jul 2, 2020
martijnvg added a commit to martijnvg/elasticsearch that referenced this pull request Jul 3, 2020
Backport of elastic#58582 to 7.x branch.

This commit adds a new metadata field mapper that validates,
that a document has exactly a single timestamp value in the data stream timestamp field and
that the timestamp field mapping only has `type`, `meta` or `format` attributes configured.
Other attributes can affect the guarantee that an index with this meta field mapper has a
useable timestamp field.

The MetadataCreateIndexService inserts a data stream timestamp field mapper whenever
a new backing index of a data stream is created.

Relates to elastic#53100
martijnvg added a commit that referenced this pull request Jul 6, 2020
Backport of #58582 to 7.x branch.

This commit adds a new metadata field mapper that validates,
that a document has exactly a single timestamp value in the data stream timestamp field and
that the timestamp field mapping only has `type`, `meta` or `format` attributes configured.
Other attributes can affect the guarantee that an index with this meta field mapper has a
useable timestamp field.

The MetadataCreateIndexService inserts a data stream timestamp field mapper whenever
a new backing index of a data stream is created.

Relates to #53100
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Data Management/Data streams Data streams and their lifecycles >non-issue :Search Foundations/Mapping Index mappings, including merging and defining field types Team:Data Management Meta label for data/management team Team:Search Meta label for search team v7.9.0 v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants