Remove dimension limit for time series data streams #93564

Closed · felixbarny opened this issue Feb 7, 2023 · 5 comments · Fixed by #98023
Labels: >enhancement, :StorageEngine/TSDB, Team:Analytics

Comments

felixbarny (Member) commented Feb 7, 2023

Description

Currently, there are several limits around the number of dimensions:

  • Dimension keys have a hard limit of 512 bytes. Documents are rejected if this limit is exceeded.
  • Dimension values have a hard limit of 1024 bytes. Documents are rejected if this limit is exceeded.
  • The _tsid consists of all dimension keys and values and has a hard limit of 32 KB. Documents are rejected if this limit is exceeded.
  • To avoid rejecting documents at ingest time due to the hard limit on the _tsid, only 16 fields can be marked as dimensions in the mapping by default. This limit can be raised with an index setting (see the sketch below), but doing so can lead to document rejections if the hard limit on the _tsid is reached.

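For context, here is a minimal sketch (Python with the elasticsearch-py client; the template, data stream, and field names are hypothetical) of a time series index template that marks fields as dimensions and raises the default 16-field limit via the index.mapping.dimension_fields.limit setting:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Hypothetical template for a time series data stream with explicit dimensions.
es.indices.put_index_template(
    name="metrics-demo",
    index_patterns=["metrics-demo-*"],
    data_stream={},
    template={
        "settings": {
            "index.mode": "time_series",
            "index.routing_path": ["host.name", "container.id"],
            # Default is 16; raising it increases the risk of hitting the
            # 32 KB hard limit on the concatenated _tsid described above.
            "index.mapping.dimension_fields.limit": 32,
        },
        "mappings": {
            "properties": {
                "host.name": {"type": "keyword", "time_series_dimension": True},
                "container.id": {"type": "keyword", "time_series_dimension": True},
                "cpu.usage": {"type": "double", "time_series_metric": "gauge"},
            }
        },
    },
)
```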
This dimension limit makes it difficult to adopt time series data streams for several reasons:

  • Before onboarding a metric, integration developers need to carefully consider whether a field is a dimension or just metadata/a tag.
    This isn't always trivial, as some metadata is only available under certain conditions (e.g. when the application is running on k8s or in the cloud). If we mark too many fields as dimensions, we risk hitting the limit. If we mark too few, documents are rejected when multiple documents with the same timestamp end up having the same _tsid. Properly marking the right set of fields as dimensions is a fairly labor-intensive and error-prone process.
  • It prevents the ingestion of ad-hoc metrics whose schema isn't known up front.
    We'll want to provide users of metric libraries like Micrometer or the OpenTelemetry metrics SDK with an easy way to add new metrics without first having to change the schema in ES. Metric libraries usually don't differentiate between dimensions and metadata: there's typically only a way to set the metric name, attributes (aka labels, tags, dimensions), and a value. So we'll need to map all dynamic labels as dimensions, and the dimension limit gets in the way of that.
  • Other TSDBs don't have such a limit.
    This will make it harder to move from other TSDBs to Elasticsearch.

I don't want to go too much into implementation details here, but we've had discussions about potentially turning the _tsid into a hash, which would make it possible to remove any limit on the number of dimensions entirely.
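As an illustration of the idea only (not the actual Elasticsearch implementation): hashing would derive a fixed-size series identifier from the dimension key/value pairs, so the identifier no longer grows with the number of dimensions.

```python
import hashlib

def hashed_tsid(dimensions: dict[str, str]) -> str:
    """Hash sorted dimension key/value pairs into a fixed-size identifier."""
    # Sort so that field order in the source document does not change the hash.
    canonical = "\x00".join(f"{k}={v}" for k, v in sorted(dimensions.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Documents with the same dimensions map to the same series id, and the id
# stays a fixed size no matter how many dimensions there are.
print(hashed_tsid({"host.name": "web-1", "container.id": "abc123"}))
```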

felixbarny added the >enhancement, needs:triage, and :StorageEngine/TSDB labels on Feb 7, 2023
elasticsearchmachine added the Team:Analytics label on Feb 7, 2023
elasticsearchmachine (Collaborator) commented:

Pinging @elastic/es-analytics-geo (Team:Analytics)

felixbarny (Member, Author) commented:

IIRC, even fields that don't have a value are added to the _tsid. In the context of this issue, I think it makes sense to change this so that unset fields don't impact the _tsid. I guess it matters somewhat less how many fields are added to the _tsid if it's just a hash, as it doesn't increase in size if more fields are added. But unnecessarily adding fields to the _tsid might impact performance.

We might end up with a default dynamic mapping where every keyword field, or every non-metric field (everything except counter, gauge, or histogram), is mapped as a dimension in order to support dynamic user-defined metrics. This would also be in line with OTel's definition of a time series. Not adding fields that are defined in the data stream but not necessarily used in every time series (such as container.id, which might not be in every doc in a data stream that has mixed data from bare-metal hosts and k8s containers) seems sensible to me.
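A minimal sketch of what such a default dynamic mapping could look like (the labels.* prefix and template name are hypothetical), mapping every dynamically added string field as a dimension keyword:

```python
# Mapping fragment for a time series data stream: any string field arriving
# under "labels." is dynamically mapped as a dimension keyword.
mappings = {
    "dynamic_templates": [
        {
            "labels_as_dimensions": {
                "path_match": "labels.*",
                "match_mapping_type": "string",
                "mapping": {"type": "keyword", "time_series_dimension": True},
            }
        }
    ]
}
```

A fragment like this could be dropped into the mappings of an index template such as the one sketched earlier in this issue.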

ruflin (Member) commented Feb 22, 2023

Loading dimensions through dynamic templates is starting to happen with the effort we are driving here: elastic/integrations#5055. All the ECS fields will be added as dynamic templates to the data streams to ensure that only the fields that are actually used get mapped. This will also include the dimensions that are part of ECS.

martijnvg (Member) commented Feb 22, 2023

> even fields that don't have a value are added to the _tsid.

I just checked, and this seems not to be the case. Only fields with values are added to the _tsid.

felixbarny (Member, Author) commented:

Ah, nice. Thanks for checking.

So after this issue gets resolved, is there a good reason why we shouldn't make all non-metric fields a dimension?

salvatore-campagna pinned this issue Sep 21, 2023
DaveCTurner unpinned this issue Sep 23, 2023
elasticsearchmachine pushed a commit that referenced this issue Feb 1, 2024
A Lucene limitation on doc values for UTF-8 fields does not allow us to write keyword fields whose size is larger than 32K. This limits our ability to map more than a certain number of dimension fields for time series indices. Before this change, the _tsid was created as a concatenation of dimension field names and values into a keyword field.

To overcome this limitation we hash the _tsid. This PR is intended to be used as a draft to test different options.

Note that, as a side effect, this reduces the size of the _tsid field, since far less data is stored when the _tsid is hashed. However, we expect _tsid hashing to affect doc-values compression, resulting in a larger storage footprint. The effect on query latency needs to be evaluated too.

Resolves #93564