Data Streams #53100
Comments
Pinging @elastic/es-core-features (:Core/Features/Indices APIs)
* Initial data stream commit

This commit adds a data stream feature flag, an initial definition of a data stream, and stubs for the data stream create, delete and get APIs. Simple serialization tests and a REST test for the data stream API stubs are also added. This is a large amount of code and mainly mechanical, but this commit should be straightforward to review, because there isn't any real logic. The data stream transport and REST actions are behind the data stream feature flag and are only initialized if the feature flag is enabled. The feature flag is enabled if Elasticsearch is built as a snapshot, or in a release build when the 'es.datastreams_feature_flag_registered' system property is set. The integ-test-zip sets the feature flag when building a release build; otherwise the REST tests would fail. Relates to #53100

* fixed hlrc test
* ignore bwc until this change has been backported to 7.x branch
* changed the data stream APIs to be cluster-based actions. Before this commit the data stream APIs were indices-based actions, but data streams aren't indices: data streams encapsulate indices but are not indices themselves. A data stream is a cluster-level attribute, and therefore a cluster-based action fits best for now. Perhaps in the future we will have data stream based actions, and then that would be the right fit for the data stream CRUD APIs.
* this should have been part of the previous commit
* fixed yaml test
* Also add the feature flag in other modules that run the yaml tests when a release build is executed
* Reverted the commits that make data streams a cluster-based API. This reverts commit e362eeb.
* Make the data stream CRUD APIs work like indices-based APIs.
* renamed timestamp field
* fixed compile error after merging in master
* fixed merge mistake
* moved setting system property
* applied review comments
In order to prepare the `AliasOrIndex` abstraction for the introduction of data streams, the abstraction needs to be made more flexible, because currently it really can only be an alias or an index.

* Introduced an `AliasOrIndex.Type` enum to indicate what an `AliasOrIndex` instance is.
* Replaced the `isAlias()` method that returns a boolean with the `getType()` method that returns the new Type enum.
* Moved `getWriteIndex()` up from `AliasOrIndex.Alias` to the `AliasOrIndex` interface.
* Moved `getAliasName()` up from `AliasOrIndex.Alias` to the `AliasOrIndex` interface and renamed it to `getName()`.
* Removed unnecessary casting to `AliasOrIndex.Alias` by just checking the `getType()` method.

Finally, `AliasOrIndex` should be renamed to reflect that it can be more than just an index or alias, since in the near future it can also be a data stream. The name AliasOrIndexOrDataStream is not appealing to me. We could rename it to `Namespace`, but that sounds too generic to me. `ResolvedIndicesExpression` sounds better to me, since it reflects more what it is (an expression from an API that has been resolved to an alias/index/data stream), but the name itself is a bit on the long side. Relates to elastic#53100
I've updated the create data stream API section. The first backing index is now created as part of the create data stream API call and is no longer created lazily. The assumption for this API is that if data streams are explicitly created, then the new data stream should be ready to be used. Index templates v2 will provide the ability to lazily create a data stream.
In order to prepare the `AliasOrIndex` abstraction for the introduction of data streams, the abstraction needs to be made more flexible, because currently it really can only be an alias or an index.

* Renamed `AliasOrIndex` to `IndexAbstraction`.
* Introduced an `IndexAbstraction.Type` enum to indicate what an `IndexAbstraction` instance is.
* Replaced the `isAlias()` method that returns a boolean with the `getType()` method that returns the new Type enum.
* Moved `getWriteIndex()` up from `IndexAbstraction.Alias` to the `IndexAbstraction` interface.
* Moved `getAliasName()` up from `IndexAbstraction.Alias` to the `IndexAbstraction` interface and renamed it to `getName()`.
* Removed unnecessary casting to `IndexAbstraction.Alias` by just checking the `getType()` method.

Relates to #53100
Backport of #53982. Relates to #53100
so that it can be used in the next minor release (7.9.0). Closes #53100
hi, @martijnvg, since @timestamp is a part of a data stream, when writing data to a data stream, can the backing index be chosen by the actual value of @timestamp instead of always the latest index? For example, a doc with @timestamp=1625400000000 (2021-07-04 20:00:00) would be written into .ds-my-data-stream-2021.07.04-000001 rather than the latest index, e.g. .ds-my-data-stream-2021.07.05-000002.
@weizijun, that could be nice for certain use cases, but there are a number of difficulties that arise when attempting to determine which backing index should receive the document.
Just chiming in here because I have a personal interest in Elasticsearch and this feature could be of great use to the organization I currently work for 😄
Is this something the existing can_match index metadata could help with, as opposed to a full regular search? E.g., keep running statistics cached for the min/max @timestamp per index to determine the correct backing index for a document during indexing. Net-new functionality, I imagine, but the gains in search performance from correctly picking one of potentially dozens of ILM rolled-over data stream indices could be substantial under optimal circumstances! Those being: minimal @timestamp lag / range overlap per index during regular operating conditions.
That's a wrench... I imagine if the user chose that behavior they'd probably expect a...

Anyways, thanks for your contributions to Elasticsearch!
If I understand your "index time" suggestion above correctly, that's how data streams work now, since all documents are indexed into the current write index regardless of the value of their @timestamp field.

I can see the value in what you are proposing for data streams, and you might open an enhancement request for it. I just wanted to be clear that we had considered this behavior for data streams and chose not to implement it (at least, not so far) because of the difficulties I described above.
@danhermann Thank you for your reply!
Can this be designed like this? Because it is append-only, only the latest index among each day's indices supports writing.

Documents can be written to the latest index every day.

The rollover rule for each day's index is max_docs or max_size.

The index can be set with an expiry, such as 7 days. If a document is written with a timestamp from more than 7 days ago, it is directly discarded or an exception is thrown.
Thank you, @weizijun. My intent in responding above was to communicate that we had considered that behavior for data streams and chose not to build it in the initial version due to its complexity. With sufficient effort, I expect it could be added. If that is of interest to you, I would suggest opening a feature request on GitHub for it.
Thank you, I will try!
Update: this description is outdated; for more information, take a look at the unreleased data streams docs.
This meta issue tracks the development of a new feature named data streams.
Background
Data streams are targeted towards time-based data sources and enable us to solve bootstrap problems when indexing via a write alias (logstash-plugins/logstash-output-elasticsearch#858). Today, aliases have some deficiencies in how they are implemented in Elasticsearch; namely, they are not first-class concepts, making them confusing to use. Aliases serve a broad range of use cases, whereas data streams will be a solution focused on time-based data sources. Data streams should be a first-class concept, but should also be non-intrusive.
Concept
A data stream formalizes the notion of a stream of data for time-based data sources. A data stream groups indices from the same time-based data source together as an opaque container. A data stream keeps track of a list of indices ordered by generation. A generation starts at 0, and each time the stream is rolled over the generation is incremented by one. Writes are forwarded to the index with the highest generation (the last index). Searches and other multi-index APIs forward requests to all indices that are part of a data stream (this is similar to how aliases are resolved in these APIs).
Because data streams are aimed at time-series data sources, a date field is required and must be identified as the "timestamp" for the documents in the data stream. This will enable Kibana to automatically detect that it is dealing with time-series data, and we can internally apply some optimizations (e.g., automatically sorting on the timestamp field).
Indices that are contained within a data stream are hidden. The idea is that users interact with the data stream as much as possible and not directly with the indices contained within it.
Data streams only accept append-only writes (index requests with op_type=create). Deletes and updates are rejected. If specific documents need to be updated or deleted, then those operations should happen via the index the documents reside in. The reason these write operations are rejected via a data stream is that they would work as expected only until the data stream is rolled over, after which they would result in 404 errors. It is therefore better to reject these operations consistently.
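As an illustration, here is a sketch of an accepted append-only write and a rejected update addressed to a data stream (the data stream name is made up, and the exact request shapes are assumptions based on this description rather than a final API):

```
# Accepted: an append-only write (op_type=create) via the data stream name.
PUT /my-data-stream/_doc/1?op_type=create
{
  "@timestamp": "2020-03-26T12:00:00Z",
  "message": "some log line"
}

# Rejected: an update via the data stream name. Updates and deletes must
# target the concrete backing index the document resides in instead.
POST /my-data-stream/_update/1
{
  "doc": { "message": "edited log line" }
}
```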
The rollover API needs to understand how to handle a data stream. The data stream’s generation needs to be incremented and a new hidden index needs to be created atomically. The name of the index is based on the name of the data stream and its current generation, which looks like this: [datastream-name]-[datastream-generation].
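For example (a sketch, with the zero-padded generation format taken from the create data stream API section below), rolling over a data stream named my-data-stream at generation 0 would atomically create the next hidden backing index and bump the generation:

```
POST /my-data-stream/_rollover

# Before: generation 0, writes go to my-data-stream-000000
# After:  generation 1, a new hidden index my-data-stream-000001 is
#         created atomically and now receives all writes
```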
A data stream can be created with the create data stream API and removed via the delete data stream API.
It should also be possible to reserve a namespace to be a data stream before actually creating the data stream. For this, data streams will depend on index templates v2. Index templates will get an additional setting, named data_stream, which will create a data stream with the name of what would otherwise be the concrete index, along with a hidden backing index:
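A sketch of what such a template might look like (the field layout is an assumption based on this description; index templates v2 were not final at the time this was written):

```
PUT /_index_template/my-template
{
  "index_patterns": ["my-data-stream*"],
  "data_stream": {
    "timestamp_field": "@timestamp"
  }
}
```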
Additionally, data streams will be integrated into Kibana. For example, Kibana can automatically generate index patterns based on data streams and identify the timestamp field.
Data Streams and ILM
The main change will be that it will no longer be required to configure the index.lifecycle.rollover_alias setting for ILM. ILM can automatically figure out whether an index is part of a data stream and act accordingly. An index can only be part of a single data stream, the user doesn't create the index, and the index is hidden; because of this clear structure, ILM can make assumptions and doesn't need additional configuration. With aliases (even with alias templates) none of this would hold, and this is a big upside of using data streams.
ILM should also be able to update data streams atomically. For example, in the context of ILM's shrink action, the data stream should be updated to refer to the shrunken index instead of the un-shrunken one. For the rest, ILM should be able to work as it does today.
Integration with security
TBD
APIs
The APIs, and the way data streams are used from other APIs, are not final and may change in the future.
Index expressions and data streams in APIs
Data streams should be taken into account when resolving an index expression. The API that resolves an index expression also matters: if a multi-index API resolves the name of a data stream, then all the indices of the stream should be resolved, and if a write API resolves the name of a data stream, then only the latest index should be resolved.
The following write APIs will resolve a data stream to the latest index: bulk API and index API.
The following multi-index APIs should resolve a data stream to all indices it contains: search, msearch, field caps and EQL search.
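For instance (a sketch; the request itself is just a standard search), a search addressed to the data stream name would fan out to all of its backing indices:

```
# Resolves to every backing index of my-data-stream, across all generations.
GET /my-data-stream/_search
{
  "query": {
    "range": { "@timestamp": { "gte": "now-1d" } }
  }
}
```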
Single document read APIs should fail when being used via a data stream. The following APIs fall into this category: explain, get, mget, termvector.
There are many admin APIs that are multi-index. These APIs should be able to resolve data streams, resolving a data stream to its latest hidden backing index. Examples of these APIs are: put mapping, get mapping, get settings and get index.
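A sketch of that resolution rule (assuming the behavior described above; not a final API decision): a put mapping request addressed to the data stream would apply only to its latest backing index:

```
# Applies the mapping change to the latest backing index of my-data-stream.
PUT /my-data-stream/_mapping
{
  "properties": {
    "message": { "type": "text" }
  }
}
```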
The rollover API accepts both a write alias and a data stream.
Get index API
The get index API should, if an index is part of a data stream, include which data stream it is part of.
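A sketch of what that could look like in the response (the field name is an assumption based on this description):

```
GET /my-data-stream-000000

{
  "my-data-stream-000000": {
    "settings": { ... },
    "mappings": { ... },
    "data_stream": "my-data-stream"
  }
}
```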
Create data stream API
Request:
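A sketch of the request (the endpoint and body shape are assumptions based on the description below):

```
PUT /_data_stream/my-data-stream
{
  "timestamp_field": "@timestamp"
}
```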
The create data stream API allows creating a new data stream and its first backing index. This API creates a new data stream with the provided name and the provided timestamp field. The generation is set to 0. The backing index is created with the following name: '[data-stream-name]-000000'. The settings and mappings originate from any matching index template.
If a data stream, index or alias already exists with the same name as the provided data stream name, then the create data stream API returns an error. An error is also returned if indices or aliases exist whose names start with the provided data stream name as a prefix.
Get data streams API
Request:
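A sketch of the request (the endpoint is an assumption based on the description below; wildcard expressions are supported):

```
GET /_data_stream/my-data-stream
GET /_data_stream/logs-*
```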
Returns the list of data streams matching the specified name; for each data stream, additional metadata is included (for example, the list of backing indices and the current generation). If no name is provided, then all data streams are returned. Wildcard expressions are also supported.
Delete data stream API
Request:
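A sketch of the request (the endpoint is an assumption, mirroring the create API above):

```
DELETE /_data_stream/my-data-stream
```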
Deletes the specified data stream. The indices that are part of the data stream are removed as well.
Updating a data stream
TBD
A data stream cannot be updated to include system indices or indices that are already part of another data stream.