Remove dimension limit for time series data streams #93564
Pinging @elastic/es-analytics-geo (Team:Analytics)
IIRC, even fields that don't have a value are added to the _tsid. In the context of this issue, I think it makes sense to change this so that unset fields don't impact the _tsid. I guess it matters somewhat less how many fields are added to the _tsid if it's just a hash, as the hash doesn't increase in size when more fields are added. But unnecessarily adding fields to the _tsid might still impact performance. We might end up with a default dynamic mapping where every keyword field, or every non-metric field (everything except counter, gauge, or histogram), is mapped as a dimension in order to support dynamic user-defined metrics. This would also be in line with OTel's definition of a time series. Not adding fields that are defined in the data stream but not necessarily used in every time series (such as the …)
Loading dimensions through dynamic templates is starting to happen with the effort we are driving here: elastic/integrations#5055. All the ECS fields will be added as dynamic templates to the data streams, to ensure only the fields that are actually used get mapped. This will also include the dimensions that are part of ECS.
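For illustration, here is a minimal sketch of what mapping dimensions through a dynamic template can look like, using the Elasticsearch low-level Java REST client. The component template name, the `labels.*` path match, and the client setup are made up for this example; `time_series_dimension` is the mapping parameter that marks a keyword field as a dimension.

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

public class LabelsAsDimensionsTemplate {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            // Hypothetical component template: any string field that shows up
            // dynamically under "labels" is mapped as a keyword dimension, so
            // user-defined labels become part of the _tsid without a prior
            // schema change.
            Request request = new Request("PUT", "/_component_template/labels-as-dimensions");
            request.setJsonEntity("""
                {
                  "template": {
                    "mappings": {
                      "dynamic_templates": [
                        {
                          "labels_as_dimensions": {
                            "path_match": "labels.*",
                            "match_mapping_type": "string",
                            "mapping": {
                              "type": "keyword",
                              "time_series_dimension": true
                            }
                          }
                        }
                      ]
                    }
                  }
                }
                """);
            client.performRequest(request);
        }
    }
}
```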
I just checked and this seems not to be the case. Only fields with values are added to the _tsid.
Ah, nice. Thanks for checking. So after this issue gets resolved, is there a good reason why we shouldn't make all non-metric fields a dimension?
A Lucene limitation on doc values for UTF-8 fields does not allow us to write keyword fields whose size is larger than 32K. This limits our ability to map more than a certain number of dimension fields for time series indices. Before this change, the tsid is created as a concatenation of dimension field names and values into a keyword field. To overcome this limitation, we hash the tsid. This PR is intended to be used as a draft to test different options. Note that, as a side effect, this reduces the size of the tsid field, since far less data is stored when the tsid is hashed. That said, we expect tsid hashing to affect compression of doc values, resulting in a larger storage footprint. The effect on query latency needs to be evaluated too. Resolves #93564
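As a rough illustration of the hashing approach described in this PR (a sketch only: the SHA-256 digest, zero-byte separators, and Base64 encoding are placeholder choices, not the actual Elasticsearch implementation), the _tsid can be derived by hashing the sorted dimension name/value pairs instead of storing their concatenation, so its size stays fixed no matter how many dimensions a document has:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Base64;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

public class TsidHashSketch {

    // Derive a fixed-size _tsid from the sorted dimension name/value pairs.
    static String hashedTsid(SortedMap<String, String> dimensions) throws Exception {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        for (Map.Entry<String, String> entry : dimensions.entrySet()) {
            digest.update(entry.getKey().getBytes(StandardCharsets.UTF_8));
            digest.update((byte) 0); // separate names and values unambiguously
            digest.update(entry.getValue().getBytes(StandardCharsets.UTF_8));
            digest.update((byte) 0);
        }
        // The digest length is constant no matter how many dimensions were added.
        return Base64.getUrlEncoder().withoutPadding().encodeToString(digest.digest());
    }

    public static void main(String[] args) throws Exception {
        SortedMap<String, String> dims = new TreeMap<>();
        dims.put("host.name", "web-01");
        dims.put("k8s.pod.name", "checkout-5d9f7");
        System.out.println(hashedTsid(dims)); // identical dimensions always produce the same _tsid
    }
}
```

Since only a fixed-size digest is stored, the 32K doc-values limit no longer bounds the number of dimensions; the trade-off, as noted above, is that the _tsid is no longer human-readable and may compress differently.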
Description
Currently, there are several limits around the number of dimensions. These limits make it difficult to adopt time series data streams for a couple of reasons:
- Properly marking the right set of fields as dimensions is a fairly labor-intensive and error-prone process. It isn't always trivial, as some metadata is only available in certain conditions (for example, when the application is running on k8s or in the cloud). If we over-index and mark too many fields as dimensions, we risk hitting the limit. If we mark too few fields as dimensions, documents get rejected when we try to index multiple documents with the same timestamp that end up having the same _tsid.
- We'll want to provide users of metric libraries like Micrometer or the OpenTelemetry metrics SDK with an easy way to add new metrics, without having to change the schema in ES beforehand. Metric libraries usually don't distinguish between dimensions and metadata; there's typically only a way to set the metric name, attributes (aka labels, tags, dimensions), and a value (see the Micrometer sketch after this list). So we'll need to map all dynamic labels as dimensions, and the dimension limit gets in the way of that.
- This will make it harder to move from other TSDBs to Elasticsearch.
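To illustrate the metric-library point above, here is a small Micrometer sketch (the metric name and tag keys are made up for illustration): the API only exposes a metric name, tags, and a value, with no way to declare which tags are dimensions and which are metadata.

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

public class MetricLabelsSketch {
    public static void main(String[] args) {
        MeterRegistry registry = new SimpleMeterRegistry();

        // Micrometer only knows about a metric name, tags, and a value.
        // Nothing marks a tag as a "dimension" vs. "metadata", so when these
        // metrics land in a time series data stream, every tag would need to
        // be mapped as a dimension.
        Counter requests = Counter.builder("http.server.requests")
            .tag("host.name", "web-01")
            .tag("k8s.pod.name", "checkout-5d9f7")
            .register(registry);

        requests.increment();
    }
}
```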
I don't want to go too much into implementation details here, but we had discussions about potentially turning the _tsid into a hash, which would enable us to completely remove any limits on the number of dimensions.