Data Streams #53100

Closed
martijnvg opened this issue Mar 4, 2020 · 14 comments · Fixed by #59504
Labels
:Data Management/Indices APIs · Dependency:Endpoint · Meta · release highlight · Team:Data Management · Top Ask · v7.9.0

Comments

@martijnvg
Member

martijnvg commented Mar 4, 2020

Update: this description is outdated; for more information, take a look at the unreleased data streams docs.

This meta issue tracks the development of a new feature named data streams.

Background

Data streams are targeted towards time-based data sources and enable us to solve bootstrap problems when indexing via a write alias (logstash-plugins/logstash-output-elasticsearch#858). Today, aliases have some deficiencies in how they are implemented in Elasticsearch; namely, they are not first-class concepts, which makes them confusing to use. Aliases serve a broad range of use cases, whereas data streams will be a focused solution for time-based data sources. Data streams should be a first-class concept, but should also be non-intrusive.

Concept

A data stream formalizes the notion of a stream of data for time-based data sources. A data stream groups indices from the same time-based data source together as an opaque container. A data stream keeps track of a list of indices ordered by generation. A generation starts at 0, and each time the stream is rolled over the generation is incremented by one. Writes are forwarded to the index with the highest generation (the last index). Searches and other multi-index APIs forward requests to all indices that are part of a data stream (this is similar to how aliases are resolved in these APIs).

Because data streams are aimed at time-series data sources, a date field is required and must be identified as the "timestamp" for the documents in the data stream. This will enable Kibana to detect automatically that it is dealing with time-series data, and we can internally apply some optimizations (e.g., automatically sorting on the timestamp field).
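
For example, a backing index mapping could designate the timestamp field like this (a minimal sketch; the field name @timestamp is an assumption here, the design only requires that some date field be designated):

{
    "mappings": {
        "properties": {
            "@timestamp": { "type": "date" }
        }
    }
}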

Indices that are contained within a data stream are hidden. The idea is that users interact with the data stream as much as possible, not directly with the indices contained within it.

Data streams only accept append-only writes (index requests with op_type=create). Deletes and updates are rejected. If specific documents need to be updated or deleted, then these operations should happen via the index these documents reside in. The reason these write operations are rejected when sent via a data stream is that they would work as expected only until the data stream is rolled over, after which they would start returning 404 errors. It is therefore better to reject these operations consistently.
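
A hedged sketch of what this means in practice (the stream and index names are hypothetical):

POST /my-data-stream/_bulk
{ "create": {} }
{ "@timestamp": "2020-03-04T12:00:00Z", "message": "accepted, op_type is create" }

DELETE /my-data-stream/_doc/1      (rejected: not an append-only write)
DELETE /my-data-stream-0/_doc/1    (accepted: targets the backing index directly)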

The rollover API needs to understand how to handle a data stream. The data stream’s generation needs to be incremented and a new hidden index needs to be created atomically. The name of the index is based on the name of the data stream and its current generation, which looks like this: [datastream-name]-[datastream-generation].
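
As a sketch (names and conditions are hypothetical), rolling over a data stream would look like rolling over an alias, but without a target index name:

POST /my-data-stream/_rollover
{
    "conditions": {
        "max_docs": 100000000
    }
}

If the data stream were at generation 3, this would atomically increment the generation and create the hidden index my-data-stream-4 as the new write index.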

A data stream can be created with the create data stream API and removed via the delete data stream API.

It should also be possible to reserve a namespace to be a data stream before actually creating the data stream. For this, data streams will depend on index templates v2. Index templates will get an additional setting, named data_stream; a matching template with this setting will create a data stream with the name of what would otherwise be the concrete index, plus a hidden backing index (see the sketch after the following list):

  • A user creates an index template (in v2 format) with a desired index pattern, mappings, and index settings, and sets the data_stream setting to true.
  • The user starts ingesting data and the auto-create-index functionality kicks in. The previously created template matches, but instead of creating an index, the following is created:
    • A data stream with the targeted index name as its name.
    • A hidden index with the following name: [data-stream-name]-0.
    • The hidden index is added to the list of indices of the data stream and the current generation is set to 0.
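
A minimal sketch of this flow under the design above (the template name, index pattern, and the boolean form of the data_stream setting are assumptions; the final syntax may differ):

PUT /_index_template/logs-template
{
    "index_patterns": ["logs-*"],
    "data_stream": true,
    "template": {
        "mappings": {
            "properties": {
                "@timestamp": { "type": "date" }
            }
        }
    }
}

POST /logs-app/_doc
{ "@timestamp": "2020-03-04T12:00:00Z", "message": "first document" }

Instead of auto-creating a concrete logs-app index, this would create the data stream logs-app at generation 0, backed by the hidden index logs-app-0.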

Additionally, data streams will be integrated into Kibana. For example, Kibana can automatically generate index patterns based on data streams and identify the timestamp field.

Data Streams and ILM

The main change is that it will no longer be required to configure the index.lifecycle.rollover_alias setting for ILM. ILM can automatically figure out whether an index is part of a data stream and act accordingly. An index can only be part of a single data stream, the user doesn't create the index, and the index is hidden; because of this clear structure, ILM can make assumptions and doesn't need additional configuration. The same would not be true when using aliases (even with alias templates), and this is a big upside of using data streams.

ILM should also be able to update data streams atomically. For example, in the context of ILM's shrink action, the data stream should be updated to refer to the shrunken index instead of the un-shrunken index. For the rest, ILM should be able to work as it does today.
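
A hedged sketch of what ILM configuration could then reduce to (the template and policy names are hypothetical): the template only needs to name a policy, with no index.lifecycle.rollover_alias setting, because ILM infers the rollover target from the data stream:

PUT /_index_template/logs-template
{
    "index_patterns": ["logs-*"],
    "data_stream": true,
    "template": {
        "settings": {
            "index.lifecycle.name": "logs-policy"
        }
    }
}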

Integration with security

TBD

APIs

The APIs and the way data streams are used from other APIs are not final and may change in the future.

Index expressions and data streams in APIs

Data streams should be taken into account when resolving an index expression. Which API resolves the expression also matters: if a multi-index API resolves the name of a data stream, then all the indices of the stream should be resolved; if a write API resolves the name of a data stream, then only the latest index should be resolved.

The following write APIs will resolve a data stream to the latest index: the bulk API and the index API.
The following multi-index APIs should resolve a data stream to all indices it contains: search, msearch, field caps, and EQL search.
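
A hedged illustration of the two resolution behaviors (names and generation numbers are hypothetical):

POST /my-data-stream/_doc        (write API: resolves to my-data-stream-2 only)
{ "@timestamp": "2020-03-04T12:00:00Z", "message": "routed to the write index" }

GET /my-data-stream/_search      (multi-index API: resolves to my-data-stream-0, my-data-stream-1, and my-data-stream-2)
{ "query": { "match_all": {} } }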

Single document read APIs should fail when being used via a data stream. The following APIs fall into this category: explain, get, mget, termvector.

There are many admin APIs that are multi-index. These APIs should be able to resolve data streams, resolving to the latest hidden index backing a data stream. Examples of these APIs are: put mapping, get mapping, get settings, and get index.

The rollover API accepts both a write alias and a data stream.

Get index API

The get index API should, if an index is part of a data stream, include which data stream it is part of.
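
A hedged sketch of what this could look like (the data_stream response field is an assumption):

GET /my-data-stream-0

{
    "my-data-stream-0": {
        "data_stream": "my-data-stream",
        "settings": { ... },
        "mappings": { ... }
    }
}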

Create data stream API

Request:

PUT /_data_stream/[name]
{
    "timestamp_field": ...
}

The create data stream API allows creating a new data stream and its first backing index. This API creates a new data stream with the provided name and the provided timestamp field. The generation is set to 0. The backing index is created with the following name: '[data-stream-name]-000000'. The settings and mappings originate from any index templates that match.
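
A concrete usage sketch (the stream name and timestamp field are hypothetical):

PUT /_data_stream/my-data-stream
{
    "timestamp_field": "@timestamp"
}

This would create the data stream my-data-stream at generation 0, backed by the hidden index my-data-stream-000000.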

If a data stream, index, or alias already exists with the same name as the provided data stream name, then the create data stream API returns an error. An error is also returned if indices or aliases exist whose names share the provided data stream name as a prefix.

Get data streams API

Request:

GET /_data_streams/[name]

Returns the data streams matching the specified name; for each data stream, additional metadata is included (for example, the list of backing indices and the current generation). If no name is provided, all data streams are returned. Wildcard expressions are also supported.
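
A hedged sketch of a possible response (the exact response format is not final; the field names here are assumptions):

GET /_data_streams/my-data-stream

[
    {
        "name": "my-data-stream",
        "timestamp_field": "@timestamp",
        "indices": ["my-data-stream-000000", "my-data-stream-000001"],
        "generation": 1
    }
]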

Delete data stream API

Request:

DELETE /_data_stream/[name]

Deletes the specified data stream. The indices that are part of the data stream are removed as well.

Updating a data stream

TBD

A data stream cannot be updated to include system indices or indices that are already part of another data stream.

@martijnvg added the Meta and :Data Management/Indices APIs labels on Mar 4, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-core-features (:Core/Features/Indices APIs)

martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Mar 12, 2020
This commit adds a data stream feature flag, the initial definition of a data stream, and
the stubs for the data stream create, delete, and get APIs. Simple serialization
tests are added, as well as a REST test for the data stream API stubs.

This is a large amount of code and mainly mechanical, but this commit should be
straightforward to review, because there isn't any real logic.

The data stream transport and REST actions are behind the data stream feature flag and
are only initialized if the feature flag is enabled. The feature flag is enabled if
Elasticsearch is built as a snapshot, or in a release build when the
'es.datastreams_feature_flag_registered' system property is enabled.

The integ-test-zip sets the feature flag when building a release build; otherwise
REST tests would fail.

Relates to elastic#53100
martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Mar 17, 2020
martijnvg added a commit that referenced this issue Mar 20, 2020
* Initial data stream commit

* fixed hlrc test

* ignore bwc until this change has been backported to 7.x branch

* changed data stream APIs to be a cluster-based action.

Before this commit, the data stream APIs were indices-based actions, but data
streams aren't indices; data streams encapsulate indices without being indices
themselves. A data stream is a cluster-level attribute, and therefore a
cluster-based action fits best for now.

Perhaps in the future we will have data-stream-based actions, and then those
would be the right fit for the data stream CRUD APIs.

* this should have been part of the previous commit

* fixed yaml test

* Also add feature flag in other modules that run the yaml test if a release build is executed

* Reverted the commits that made data streams a cluster-based API

This reverts commit e362eeb.

* Make data stream CRUD APIs work like an indices-based API.

* renamed timestamp field

* fixed compile error after merging in master

* fixed merge mistake

* moved setting system property

* applied review comments
martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Mar 23, 2020
martijnvg added a commit that referenced this issue Mar 23, 2020
martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Mar 23, 2020
In order to prepare the `AliasOrIndex` abstraction for the introduction of data streams,
the abstraction needs to be made more flexible, because currently it really can be only
an alias or an index.

* Introduced an `AliasOrIndex.Type` enum to indicate what an `AliasOrIndex` instance is.
* Replaced the `isAlias()` method that returns a boolean with the `getType()` method that returns the new Type enum.
* Moved `getWriteIndex()` up from `AliasOrIndex.Alias` to the `AliasOrIndex` interface.
* Moved `getAliasName()` up from `AliasOrIndex.Alias` to the `AliasOrIndex` interface and renamed it to `getName()`.
* Removed unnecessary casting to `AliasOrIndex.Alias` by just checking the `getType()` method.

Finally, `AliasOrIndex` should be renamed to reflect that it can be more than just an index or alias, since
in the near future it can also be a data stream. The name AliasOrIndexOrDataStream is not appealing to me.
We could rename it to `Namespace`, but that sounds too generic to me. `ResolvedIndicesExpression` sounds better
to me, since it reflects more of what it is (an expression from an API that has been resolved to an alias/index/data stream),
but the name itself is a bit on the long side.

Relates to elastic#53100
@martijnvg
Member Author

I've updated the create data stream API section. The first backing index is now created as part of the create data stream API call and is no longer created lazily. The assumption for this API is that if a data stream is explicitly created, then the new data stream should be ready to be used. Index templates v2 will provide the ability to lazily create a data stream.

martijnvg added a commit that referenced this issue Mar 30, 2020
In order to prepare the `AliasOrIndex` abstraction for the introduction of data streams,
the abstraction needs to be made more flexible, because currently it really can be only
an alias or an index.

* Renamed `AliasOrIndex` to `IndexAbstraction`.
* Introduced an `IndexAbstraction.Type` enum to indicate what an `IndexAbstraction` instance is.
* Replaced the `isAlias()` method that returns a boolean with the `getType()` method that returns the new Type enum.
* Moved `getWriteIndex()` up from the `IndexAbstraction.Alias` to the `IndexAbstraction` interface.
* Moved `getAliasName()` up from the `IndexAbstraction.Alias` to the `IndexAbstraction` interface and renamed it to `getName()`.
* Removed unnecessary casting to `IndexAbstraction.Alias` by just checking the `getType()` method.

Relates to #53100
martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Mar 30, 2020
martijnvg added a commit that referenced this issue Mar 30, 2020
Backport of #53982

Relates to #53100
martijnvg added a commit that referenced this issue Jul 14, 2020
so that it can be used in the next minor release (7.9.0).

Closes #53100
martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Jul 14, 2020
so that it can be used in the next minor release (7.9.0).

Closes elastic#53100
martijnvg added a commit that referenced this issue Jul 14, 2020
so that it can be used in the next minor release (7.9.0).

Backport of #59504 to 7.x branch.
Closes #53100
@weizijun
Contributor

weizijun commented Jul 5, 2021

Hi @martijnvg, since @timestamp is part of a data stream, when writing data to a data stream, could the backing index be chosen by the actual value of @timestamp instead of always using the latest index?

For example, with @timestamp=1625400000000 (2021-07-04 20:00:00), the document would be written into .ds-my-data-stream-2021.07.04-000001 rather than the latest index, e.g. .ds-my-data-stream-2021.07.05-000002.

@danhermann
Contributor

@weizijun, that could be nice for certain use cases, but there are a number of difficulties that arise when attempting to determine which backing index should receive the document. Two of the biggest issues are:

  • The @timestamp does not uniquely identify the correct backing index. In your example above, the timestamp of 2021-07-04 20:00:00 could potentially go into any of the .ds-my-data-stream-2021.07.04-* indices as well as the .ds-my-data-stream-2021.07.03-* or even earlier indices since backing indices are rolled based on size or document count, not on day boundaries. This would require a search to determine the correct backing index for the document and that could significantly reduce indexing performance.
  • The backing index with the time range for the @timestamp value may have been deleted or set to read-only by an ILM policy. This would mean that a new index would need to be created (undesirable because users usually expect historical data to be managed by ILM), the document would need to be indexed into the current write index (undesirable since the point of this feature would be to group documents by their @timestamp values), or the indexing request for the document would need to be rejected (undesirable because documents with older timestamps should still be accepted).

@obogobo

obogobo commented Jul 6, 2021

Just chiming in here because I have a personal interest in Elasticsearch and this feature could be of great use to the organization I currently work for 😄

  • This would require a search to determine the correct backing index for the document and that could significantly reduce indexing performance.

Is this something the existing can_match index metadata could help with, as opposed to a full regular search? e.g. keep running statistics cached for min/max @timestamp per index to determine the correct backing index for the document during indexing. Net-new functionality I imagine, but the gains in search performance due to correctly picking one of potentially dozens of ILM rolled-over data stream indices could be substantial under optimal circumstances! Those being: minimal @timestamp lag / range overlap per index, during regular operating conditions.

  • The backing index with the time range for the @timestamp value may have been deleted or set to read-only by an ILM policy.

That's a wrench... I imagine if the user chose that behavior they'd probably expect a DocumentRejectedException and could handle the response application-side, with a couple of default strategies (discard, reindex @current, etc.) as options for Logstash/Filebeat, etc.? Not an easy lift though. Alternatively, it would be cool if this feature could have an option to use "ingest time" instead of the user-specified document @timestamp, as the backing index value would then be monotonically increasing. Maybe a worthwhile tradeoff in some circumstances, maybe not. Applications could expand the search range to account for indexing latency as needed, and realize greater search performance when that isn't necessary.

Anyways, thanks for your contributions to Elasticsearch!

@danhermann
Contributor

  • This would require a search to determine the correct backing index for the document and that could significantly reduce indexing performance.

Is this something the existing can_match index metadata could help with, as opposed to a full regular search? e.g. keep running statistics cached for min/max @timestamp per index to determine the correct backing index for the document during indexing. Net-new functionality I imagine, but the gains in search performance due to correctly picking one of potentially dozens of ILM rolled-over data stream indices could be substantial under optimal circumstances! Those being: minimal @timestamp lag / range overlap per index, during regular operating conditions.

The can_match phase does help a lot, but there's still a significant cost involved in searching for the appropriate backing index rather than indexing directly into the current write index.

  • The backing index with the time range for the @timestamp value may have been deleted or set to read-only by an ILM policy.

That's a wrench... I imagine if the user chose that behavior they'd probably expect a DocumentRejectedException and could handle the response application-side, with a couple default strategies (discard, reindex @current, etc.) as options for logstash/filebeat,etc..? Not an easy lift though. It would be cool if alternatively, this feature could have an option to use "ingest time" instead of user-specified document @timestamp, as the backing index value would then be monotonically increasing. Maybe a worthwhile tradeoff in some circumstances, maybe not. Applications could expand the search range to account for indexing latency as needed, and realize greater search performance when not necessary.

If I understand your "ingest time" suggestion above correctly, that's how data streams work now, since all documents are indexed into the current write index regardless of the value of their @timestamp fields.

I can see the value in what you are proposing for data streams, and you might consider opening an enhancement request for it. I just wanted to be clear that we had considered this behavior for data streams and chose not to implement it (at least, not so far) because of the difficulties I described above.

@weizijun
Contributor

@danhermann Thank you for your reply!

  • The @timestamp does not uniquely identify the correct backing index. In your example above, the timestamp of 2021-07-04 20:00:00 could potentially go into any of the .ds-my-data-stream-2021.07.04-* indices as well as the .ds-my-data-stream-2021.07.03-* or even earlier indices since backing indices are rolled based on size or document count, not on day boundaries. This would require a search to determine the correct backing index for the document and that could significantly reduce indexing performance.

Could this be designed as follows? Because writes are append-only, only the latest index among each day's indices would support writing, such as:

  • .ds-my-data-stream-2021.07.04-000001
  • .ds-my-data-stream-2021.07.04-000002
  • .ds-my-data-stream-2021.07.04-000003
  • .ds-my-data-stream-2021.07.05-000001
  • .ds-my-data-stream-2021.07.05-000002
  • .ds-my-data-stream-2021.07.06-000001

Documents can be written to the latest index for each day:

  • .ds-my-data-stream-2021.07.04-000003
  • .ds-my-data-stream-2021.07.05-000002
  • .ds-my-data-stream-2021.07.06-000001

The rollover rule for each day's indices would be max_docs or max_size.

  • The backing index with the time range for the @timestamp value may have been deleted or set to read-only by an ILM policy. This would mean that a new index would need to be created (undesirable because users usually expect historical data to be managed by ILM), the document would need to be indexed into the current write index (undesirable since the point of this feature would be to group documents by their @timestamp values), or the indexing request for the document would need to be rejected (undesirable because documents with older timestamps should still be accepted).

The index could be configured with an expiry, such as a 7-day expiry. If a document is written with a timestamp more than 7 days old, it is directly discarded or an exception is thrown.
Alternatively, an expire-before index could be added, and expired data would be written into it; for example, if today is 07.12 and the expiration day is 07.05, the expired data could be written to a .ds-my-data-stream-2021.07.05-before index.
Data streams could also consider supporting custom date-format suffixes, such as day, month, and year formats, e.g. .ds-my-data-stream-2021.07 or .ds-my-data-stream-2021.

@danhermann
Contributor

Thank you, @weizijun. My intent in responding above was to communicate that we had considered that behavior for data streams and chose not to build it in the initial version due to its complexity. With sufficient effort, I expect it could be added. If that is of interest to you, I would suggest opening a feature request in Github for it.

@weizijun
Contributor

Thank you, @weizijun. My intent in responding above was to communicate that we had considered that behavior for data streams and chose not to build it in the initial version due to its complexity. With sufficient effort, I expect it could be added. If that is of interest to you, I would suggest opening a feature request in Github for it.

Thank you, I will try!
