-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add data stream timestamp validation via metadata field mapper #58582
Add data stream timestamp validation via metadata field mapper #58582
Conversation
This commit adds a new metadata field mapper that validates, that a document has exactly a single timestamp value in the data stream timestamp field. The MetadataCreateIndexService inserts a data stream timestamp field mapper whenever a new backing index of a data stream is created. Relates to elastic#53100
…sts were failing because these tests assumed any field longer then 20 chars is too long.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for trying this road @martijnvg . I left some questions and comments but I like where this is going.
@Override | ||
public XContentBuilder toXContent(XContentBuilder builder, Params params) throws IOException { | ||
boolean includeDefaults = params.paramAsBoolean("include_defaults", false); | ||
if (includeDefaults == false && fieldName == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't we always return when the fieldName is null ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm ok with doing that, I just thought that if includeDefaults
is true
then that the fieldName should be returned anyway, irregardless whether it has a value. But I guess it is ok to ignore includeDefaults
if fieldName
is null?
} | ||
|
||
// For now the field shouldn't be useable in searches. | ||
// In the future it should act as an alias to the actual data stream timestamp field. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder what this would look like though. It's very similar to the alias field so I wonder what @jtibshirani thinks of how this should work ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be similar to the code in FieldTypeLookup
that checks FieldAliasMapper#path()
. I think we can maybe piggyback on that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The field alias framework is narrow in scope and unfortunately isn't very extensible. @romseygeek is working on some refactors that will let field mappers delegate to other field types to expose search functionality. I think it will be cleaner and more forward-looking to collaborate on that.
Stepping back though, I don't know if this timestamp mapper should serve as a search-time alias for the delegate field. To me it'd be fine if it were just a hidden mapper that performed validation and recorded the data stream timestamp in the mappings. Generally we lean heavily on ECS to provide a consistent set of field names across individual schemas, and ECS has a convenient @timestamp
field at the top level. Users can also define field aliases to allow for searching across different data streams. Does the 'timestamp' field need special treatment?
One advantage I can think of is that we could allow the delegate timestamp field to change for future indices in a data stream, and searches keep working on _timestamp
. This doesn't seem like a common case though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally we lean heavily on ECS to provide a consistent set of field names across individual schemas, and ECS has a convenient @timestamp field at the top level. Users can also define field aliases to allow for searching across different data streams. Does the 'timestamp' field need special treatment?
In this case, just using @timestamp
is imo preferable and sorting by meta field _timestamp
would not be needed.
One advantage I can think of is that we could allow the delegate timestamp field to change for future indices in a data stream, and searches keep working on _timestamp.
I think this becomes more useful, in cases existing indices get imported into a data stream, which is something that we would want to support down the line.
However for now, I think there is no need for this, as the initial targeted users use ECS. I think we will be looking into this somewhere in the not so distant future.
import java.util.Iterator; | ||
import java.util.Map; | ||
|
||
public class DataStreamTimestampFieldMapper extends MetadataFieldMapper { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to reference the data stream naming ? It's a generic timestamp metadata field, right ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and if we see a use for it outside of datastreams.
Data steams targets append-only time based data sources. I think use cases that are not append-only, will keep on using indices + alias pattern. I think this usecase may also benefit from having a _timestamp
metadata field.
// not configured, so skip the validation | ||
return; | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to add validation in the MapperService
that the target field is of type date
and that it is indexed and has doc values. I know that you validate this in the data stream code but it would be cleaner/safer to do it directly in the MapperService since the metadata field could be used outside of data streams ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do it directly in the MapperService since the metadata field could be used outside of data streams ?
If this metadata field mapper has value outside data streams then yes it make sense to add the validation in MapperService
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea of moving the validation down into the mapping code even if the field mapper isn't used outside of data streams. It's really nice to have all mapping validation embedded in the mappers/ mapping logic instead of some living outside.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a nice idea from @jimczi!
throw new IllegalArgumentException("data stream timestamp field [" + fieldName + "] is missing"); | ||
} | ||
|
||
long numberOfValues = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we could instead assert that the field type has at least one docvalues field? This would be more robust if we decide to stop indexing points (#48665).
// not configured, so skip the validation | ||
return; | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea of moving the validation down into the mapping code even if the field mapper isn't used outside of data streams. It's really nice to have all mapping validation embedded in the mappers/ mapping logic instead of some living outside.
import java.util.Iterator; | ||
import java.util.Map; | ||
|
||
public class DataStreamTimestampFieldMapper extends MetadataFieldMapper { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
} | ||
|
||
// For now the field shouldn't be useable in searches. | ||
// In the future it should act as an alias to the actual data stream timestamp field. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The field alias framework is narrow in scope and unfortunately isn't very extensible. @romseygeek is working on some refactors that will let field mappers delegate to other field types to expose search functionality. I think it will be cleaner and more forward-looking to collaborate on that.
Stepping back though, I don't know if this timestamp mapper should serve as a search-time alias for the delegate field. To me it'd be fine if it were just a hidden mapper that performed validation and recorded the data stream timestamp in the mappings. Generally we lean heavily on ECS to provide a consistent set of field names across individual schemas, and ECS has a convenient @timestamp
field at the top level. Users can also define field aliases to allow for searching across different data streams. Does the 'timestamp' field need special treatment?
One advantage I can think of is that we could allow the delegate timestamp field to change for future indices in a data stream, and searches keep working on _timestamp
. This doesn't seem like a common case though.
Pinging @elastic/es-core-features (:Core/Features/Data streams) |
Pinging @elastic/es-search (:Search/Mapping) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This approach is looking good to me. I added some detailed review comments.
I had a couple last high-level questions:
- Currently a user could configure a
_timestamp
field directly in the mappings. It's confusing, but I don't see a way around it. I also don't think it would cause problems, since we make sure to apply the data stream_timestamp
definition last when creating an index. - I'm okay with the
_timestamp
naming but I'm curious what others think. Related to the first point, users could see this and think it looks interesting, like something they should be able to configure and use in searches.
return; | ||
} | ||
|
||
MappedFieldType fieldType = lookup.get(fieldName); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we could validate the actual mapping config itself, instead of the field type (which is the object used for searches/ aggregations). That way we could ensure the mapping doesn't contain other options that would be harmful: null_value
and ignore_malformed
.
} | ||
|
||
@Override | ||
protected void parseCreateField(ParseContext context) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should throw an error here, no one should try to write to the _timestamp
field.
} | ||
} | ||
|
||
private final String fieldName; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could use the name path
instead of fieldName
. This would match the naming for field aliases.
|
||
private final String fieldName; | ||
|
||
private TimestampFieldMapper(FieldType fieldType, MappedFieldType mappedFieldType, String fieldName) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could override mergeOptions
and throw an error if the new field_name
doesn't match. This includes the case of updating a null source path to a non-null one. I think currently we won't throw an error but just silently ignore the new mapping definition.
|
||
// For now the field shouldn't be useable in searches. | ||
// In the future it should act as an alias to the actual data stream timestamp field. | ||
public static final class DataStreamTimestampFieldFieldType extends MappedFieldType { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could just be TimestampFieldType
?
dataStream.getTimeStampField().insertTimestampFieldMapping(mappings); | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a case where the data stream does not exist, but we are still creating a new index within it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This in the case when data stream doesn't exist yet (has to be created with the first backing index), then the timestamp field should be resolved from composable index template. I added comments here to make this clearer.
Agreed, we manage this meta field mapper, so it is hidden to the user. Also I think for now, let's wait with documenting this new meta field mapper until we find a real world use case for this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks okay to me from a code level. I am still unsure about the fact that _timestamp
can be accidentally configured, and also the naming.
} | ||
|
||
// Validate whether disallowed mapping attributes have been specified on the field this meta field refers to: | ||
try (XContentBuilder builder = jsonBuilder()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To me it'd be cleaner to avoid working with the raw JSON and just check specific attributes like 'null value' and 'ignore malformed'. We can add getters to DateFieldMapper
if it makes it easier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a catch all validation, if there is any attribute specified but type, meta or format then fail.
I prefer to keep this logic in the case we ever forget to check an (future new) attribute.
The idea is that for least for now in data streams we want to lock down what can be configured on the
timestamp field.
However I can add specific checks for null value and ignore malformed for clearer error messages.
Like the check for whether the field is searchable and has doc values.
|
||
Mapper mapper = lookup.getMapper(path); | ||
if (mapper == null) { | ||
throw new IllegalArgumentException("timestamp meta field's field_name [" + path + "] points to a non existing field"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these messages are intended to be end-user friendly? So something like this could be clearer: "The configured timestamp field does not exist."
|
||
if (DateFieldMapper.CONTENT_TYPE.equals(mapper.typeName()) == false && | ||
DateFieldMapper.DATE_NANOS_CONTENT_TYPE.equals(mapper.typeName()) == false) { | ||
throw new IllegalArgumentException("timestamp meta field's field_name [" + path + "] is of type [" + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same thoughts for these other messages, users probably won't understand what "timestamp meta field's field_name ..." means.
@@ -391,6 +391,12 @@ private synchronized DocumentMapper internalMerge(DocumentMapper mapper, MergeRe | |||
} | |||
checkIndexSortCompatibility(indexSettings.getIndexSortConfig(), hasNested); | |||
|
|||
for (MetadataFieldMapper metadataFieldMapper : metadataMappers) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could move this logic into MapperMergeValidator#validateFieldReferences
to keep all the cross-field validation in one place.
Thanks for reviewing @jtibshirani!
@jimczi Do you have an opinion about the current name of this new meta field and that it can be configured manually outside of data streams? |
I don't think naming is an issue since this field is disabled by default. It is unclear if this field could be useful outside of data streams but imo it is used as a building block to implement a high level feature (we do the same for rollups) so it's ok to expose it in the mapping directly. |
Backport of elastic#58582 to 7.x branch. This commit adds a new metadata field mapper that validates, that a document has exactly a single timestamp value in the data stream timestamp field and that the timestamp field mapping only has `type`, `meta` or `format` attributes configured. Other attributes can affect the guarantee that an index with this meta field mapper has a useable timestamp field. The MetadataCreateIndexService inserts a data stream timestamp field mapper whenever a new backing index of a data stream is created. Relates to elastic#53100
Backport of #58582 to 7.x branch. This commit adds a new metadata field mapper that validates, that a document has exactly a single timestamp value in the data stream timestamp field and that the timestamp field mapping only has `type`, `meta` or `format` attributes configured. Other attributes can affect the guarantee that an index with this meta field mapper has a useable timestamp field. The MetadataCreateIndexService inserts a data stream timestamp field mapper whenever a new backing index of a data stream is created. Relates to #53100
This commit adds a new metadata field mapper that validates,
that a document has exactly a single timestamp value in the data stream timestamp field.
The MetadataCreateIndexService inserts a data stream timestamp field mapper whenever
a new backing index of a data stream is created.
This is a different variant of #58119, which also tries to add timestamp field validation when
ingesting documents. This implementation is cleaner than in #58119, because the validation
is encapsulated in the
DataStreamTimestampFieldMapper
class and the data streamtimestamp field name is provided to it as mapping configuration. Whereas in #58119
hard coded validation logic was added in
ParseContext#postParse(...)
andDateFieldMapper
,also a data stream timestamp fieldname field/variable was added in many other classes in
order to get the right information in the right place.
Relates to #53100