Data Streams #53100
Comments
Pinging @elastic/es-core-features (:Core/Features/Indices APIs)
* Initial data stream commit

This commit adds a data stream feature flag, an initial definition of a data stream, and stubs for the data stream create, delete and get APIs. Simple serialization tests and a REST test for the data stream API stubs are also added. This is a large amount of code and mainly mechanical, but this commit should be straightforward to review, because there isn't any real logic. The data stream transport and REST actions are behind the data stream feature flag and are only initialized if the feature flag is enabled. The feature flag is enabled if Elasticsearch is built as a snapshot, or in a release build when the 'es.datastreams_feature_flag_registered' system property is set. The integ-test-zip sets the feature flag when building a release build; otherwise the REST tests would fail. Relates to #53100

* fixed hlrc test
* ignore bwc until this change has been backported to 7.x branch
* changed the data stream APIs to be cluster-based actions. Before this commit the data stream APIs were indices-based actions, but data streams aren't indices: data streams encapsulate indices but are not indices themselves. A data stream is a cluster-level attribute, and therefore a cluster-based action fits best for now. Perhaps in the future we will have data stream based actions, and then that would be the right fit for the data stream CRUD APIs.
* this should have been part of the previous commit
* fixed yaml test
* Also add the feature flag in other modules that run the yaml tests when a release build is executed
* Reverted the commits that make data streams a cluster-based API. This reverts commit e362eeb.
* Make the data stream CRUD APIs work like indices-based APIs.
* renamed timestamp field
* fixed compile error after merging in master
* fixed merge mistake
* moved setting system property
* applied review comments
In order to prepare the `AliasOrIndex` abstraction for the introduction of data streams, the abstraction needs to be made more flexible, because currently it really can only be an alias or an index.

* Introduced an `AliasOrIndex.Type` enum to indicate what an `AliasOrIndex` instance is.
* Replaced the `isAlias()` method that returns a boolean with the `getType()` method that returns the new Type enum.
* Moved `getWriteIndex()` up from `AliasOrIndex.Alias` to the `AliasOrIndex` interface.
* Moved `getAliasName()` up from `AliasOrIndex.Alias` to the `AliasOrIndex` interface and renamed it to `getName()`.
* Removed unnecessary casting to `AliasOrIndex.Alias` by just checking the `getType()` method.

Finally, `AliasOrIndex` should be renamed to reflect that it can be more than just an index or alias, since in the near future it can also be a data stream. The name AliasOrIndexOrDataStream is not appealing to me. We could rename it to `Namespace`, but that sounds too generic to me. `ResolvedIndicesExpression` sounds better to me, since it reflects more what it is (an expression from an API that has been resolved to an alias/index/data stream), but the name itself is a bit on the long side. Relates to elastic#53100
I've updated the create data stream API section. The first backing index is now created as part of the create data stream API call and is no longer created lazily. The assumption for this API is that if data streams are explicitly created, then the new data stream should be ready to be used. Index templates v2 will provide the ability to lazily create a data stream.
In order to prepare the `AliasOrIndex` abstraction for the introduction of data streams, the abstraction needs to be made more flexible, because currently it really can only be an alias or an index.

* Renamed `AliasOrIndex` to `IndexAbstraction`.
* Introduced an `IndexAbstraction.Type` enum to indicate what an `IndexAbstraction` instance is.
* Replaced the `isAlias()` method that returns a boolean with the `getType()` method that returns the new Type enum.
* Moved `getWriteIndex()` up from `IndexAbstraction.Alias` to the `IndexAbstraction` interface.
* Moved `getAliasName()` up from `IndexAbstraction.Alias` to the `IndexAbstraction` interface and renamed it to `getName()`.
* Removed unnecessary casting to `IndexAbstraction.Alias` by just checking the `getType()` method.

Relates to #53100
Backport of #53982. Relates to #53100
so that it can be used in the next minor release (7.9.0). Closes #53100
hi, @martijnvg, since @timestamp is a part of a data stream, when writing data to a data stream, can the backing index be chosen by the actual value of @timestamp instead of always the latest index? For example, a doc with @timestamp=1625400000000 (2021-07-04 20:00:00) would be written into .ds-my-data-stream-2021.07.04-000001 rather than the latest index, e.g. .ds-my-data-stream-2021.07.05-000002.
@weizijun, that could be nice for certain use cases, but there are a number of difficulties that arise when attempting to determine which backing index should receive the document.
Just chiming in here because I have a personal interest in Elasticsearch and this feature could be of great use to the organization I currently work for 😄
Is this something the existing can_match index metadata could help with, as opposed to a full regular search? E.g., keep running statistics cached for the min/max @timestamp per index to determine the correct backing index for a document during indexing. Net-new functionality, I imagine, but the gains in search performance from correctly picking one of potentially dozens of ILM rolled-over data stream indices could be substantial under optimal circumstances! Those being: minimal @timestamp lag / range overlap per index during regular operating conditions.
That's a wrench... I imagine if the user chose that behavior they'd probably expect a...

Anyways, thanks for your contributions to Elasticsearch!
If I understand your "index time" suggestion above correctly, that's how data streams work now, since all documents are indexed into the current write index regardless of the value of their @timestamp field.

I can see the value in what you are proposing for data streams, and you might open an enhancement request for it. I just wanted to be clear that we had considered this behavior for data streams and chose not to implement it (at least, not so far) because of the difficulties I described above.
@danhermann Thank you for your reply!
Can this be designed like this? Because it is append-only, only the latest index among each day's indices supports writing.

Documents can be written to the latest index every day.

The rollover rule for each day's index is max_docs or max_size.

The index can be set with an expiry, such as 7 days. If a document is written with a timestamp from more than 7 days ago, it is directly discarded or an exception is thrown.
Thank you, @weizijun. My intent in responding above was to communicate that we had considered that behavior for data streams and chose not to build it in the initial version due to its complexity. With sufficient effort, I expect it could be added. If that is of interest to you, I would suggest opening a feature request on GitHub for it.
Thank you, I will try!
Update: this description is outdated; for more information, take a look at the unreleased data streams docs.
This meta issue tracks the development of a new feature named data streams.
Background
Data streams are targeted towards time-based data sources and enable us to solve bootstrap problems when indexing via a write alias (logstash-plugins/logstash-output-elasticsearch#858). Today, aliases have some deficiencies in how they are implemented in Elasticsearch; namely, they are not first-class concepts, making them confusing to use. Aliases serve a broad range of use cases, whereas data streams will be a solution focused on time-based data sources. Data streams should be a first-class concept, but should also be non-intrusive.
Concept
A data stream formalizes the notion of a stream of data for time-based data sources. A data stream groups indices from the same time-based data source together as an opaque container. A data stream keeps track of a list of indices ordered by generation. A generation starts at 0, and each time the stream is rolled over the generation is incremented by one. Writes are forwarded to the index with the highest generation (the last index). Searches and other multi-index APIs forward requests to all indices that are part of a data stream (this is similar to how aliases are resolved in these APIs).
Because data streams are aimed at time-series data sources, a date field is required and must be identified as the "timestamp" for the documents in the data stream. This will enable Kibana to automatically detect that it is dealing with time-series data, and we can internally apply some optimizations (e.g., automatically sorting on the timestamp field).
Indices that are contained within a data stream are hidden. The idea is that users interact with the data stream as much as possible and not directly with the indices contained within it.
Data streams only accept append-only writes (index requests with op_type=create). Deletes and updates are rejected. If specific documents need to be updated or deleted, then those operations should happen via the index the documents reside in. The reason these write operations are rejected via a data stream is that they would work as expected only until the data stream is rolled over, after which they would result in 404 errors. It is therefore better to reject these operations consistently.
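As an illustration, here is a sketch of an accepted append-only write and a rejected update addressed to a data stream (the data stream name is made up, and the exact request shapes are assumptions based on this description rather than a final API):

```
# Accepted: an append-only write (op_type=create) via the data stream name.
PUT /my-data-stream/_doc/1?op_type=create
{
  "@timestamp": "2020-03-26T12:00:00Z",
  "message": "some log line"
}

# Rejected: an update via the data stream name. Updates and deletes must
# target the concrete backing index the document resides in instead.
POST /my-data-stream/_update/1
{
  "doc": { "message": "edited log line" }
}
```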
The rollover API needs to understand how to handle a data stream. The data stream’s generation needs to be incremented and a new hidden index needs to be created atomically. The name of the index is based on the name of the data stream and its current generation, which looks like this: [datastream-name]-[datastream-generation].
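For example (a sketch, with the zero-padded generation format taken from the create data stream API section below), rolling over a data stream named my-data-stream at generation 0 would atomically create the next hidden backing index and bump the generation:

```
POST /my-data-stream/_rollover

# Before: generation 0, writes go to my-data-stream-000000
# After:  generation 1, a new hidden index my-data-stream-000001 is
#         created atomically and now receives all writes
```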
A data stream can be created with the create data stream API and removed via the delete data stream API.
It should also be possible to reserve a namespace to be a data stream before actually creating the data stream. For this, data streams will depend on index templates v2. Index templates will get an additional setting, named data_stream, which will create a data stream with the name of what would otherwise be the concrete index, along with a hidden backing index:
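A sketch of what such a template might look like (the field layout is an assumption based on this description; index templates v2 were not final at the time this was written):

```
PUT /_index_template/my-template
{
  "index_patterns": ["my-data-stream*"],
  "data_stream": {
    "timestamp_field": "@timestamp"
  }
}
```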
Additionally, data streams will be integrated into Kibana. For example, Kibana can automatically generate index patterns based on data streams and identify the timestamp field.
Data Streams and ILM
The main change will be that it will no longer be required to configure the index.lifecycle.rollover_alias setting for ILM. ILM can automatically figure out whether an index is part of a data stream and act accordingly. An index can only be part of a single data stream, the user doesn't create the index, and the index is hidden; because of this clear structure, ILM can make assumptions and doesn't need additional configuration. With aliases (even with alias templates) none of this would hold, and this is a big upside of using data streams.
ILM should also be able to update data streams atomically. For example, in the context of ILM's shrink action, the data stream should be updated to refer to the shrunken index instead of the un-shrunken one. For the rest, ILM should be able to work as it does today.
Integration with security
TBD
APIs
The APIs, and the way data streams are used from other APIs, are not final and may change in the future.
Index expressions and data streams in APIs
Data streams should be taken into account when resolving an index expression. The API that resolves an index expression also matters: if a multi-index API resolves the name of a data stream, then all the indices of the stream should be resolved, and if a write API resolves the name of a data stream, then only the latest index should be resolved.
The following write APIs will resolve a data stream to the latest index: bulk API and index API.
The following multi-index APIs should resolve a data stream to all indices it contains: search, msearch, field caps and EQL search.
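For instance (a sketch; the request itself is just a standard search), a search addressed to the data stream name would fan out to all of its backing indices:

```
# Resolves to every backing index of my-data-stream, across all generations.
GET /my-data-stream/_search
{
  "query": {
    "range": { "@timestamp": { "gte": "now-1d" } }
  }
}
```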
Single document read APIs should fail when being used via a data stream. The following APIs fall into this category: explain, get, mget, termvector.
There are many admin APIs that are multi-index. These APIs should be able to resolve data streams, resolving a data stream to its latest hidden backing index. Examples of these APIs are: put mapping, get mapping, get settings and get index.
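A sketch of that resolution rule (assuming the behavior described above; not a final API decision): a put mapping request addressed to the data stream would apply only to its latest backing index:

```
# Applies the mapping change to the latest backing index of my-data-stream.
PUT /my-data-stream/_mapping
{
  "properties": {
    "message": { "type": "text" }
  }
}
```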
The rollover API accepts both a write alias and a data stream.
Get index API
The get index API should, if an index is part of a data stream, include which data stream it is part of.
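A sketch of what that could look like in the response (the field name is an assumption based on this description):

```
GET /my-data-stream-000000

{
  "my-data-stream-000000": {
    "settings": { ... },
    "mappings": { ... },
    "data_stream": "my-data-stream"
  }
}
```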
Create data stream API
Request:
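A sketch of the request (the endpoint and body shape are assumptions based on the description below):

```
PUT /_data_stream/my-data-stream
{
  "timestamp_field": "@timestamp"
}
```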
The create data stream API allows creating a new data stream and its first backing index. This API creates a new data stream with the provided name and the provided timestamp field. The generation is set to 0. The backing index is created with the following name: '[data-stream-name]-000000'. The settings and mappings originate from any matching index template.
If a data stream, index or alias already exists with the same name as the provided data stream name, then the create data stream API returns an error. An error is also returned if indices or aliases exist whose names start with the provided data stream name as a prefix.
Get data streams API
Request:
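A sketch of the request (the endpoint is an assumption based on the description below; wildcard expressions are supported):

```
GET /_data_stream/my-data-stream
GET /_data_stream/logs-*
```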
Returns the list of data streams matching the specified name; for each data stream, additional metadata is included (for example, the list of backing indices and the current generation). If no name is provided, then all data streams are returned. Wildcard expressions are also supported.
Delete data stream API
Request:
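A sketch of the request (the endpoint is an assumption, mirroring the create API above):

```
DELETE /_data_stream/my-data-stream
```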
Deletes the specified data stream. The indices that are part of the data stream are removed as well.
Updating a data stream
TBD
A data stream cannot be updated to include system indices or indices that are already part of another data stream.