Remove dimension limit for time series data streams #93564

Closed · felixbarny opened this issue Feb 7, 2023 · 5 comments · Fixed by #98023
Labels: >enhancement, :StorageEngine/TSDB, Team:Analytics

Comments

felixbarny (Member) commented Feb 7, 2023

Description

Currently, there are several limits around the number of dimensions:

  • Dimension keys have a hard limit of 512 bytes. Documents are rejected if this limit is exceeded.
  • Dimension values have a hard limit of 1024 bytes. Documents are rejected if this limit is exceeded.
  • The _tsid consists of all dimension keys and values and has a hard limit of 32 KB. Documents are rejected if this limit is exceeded.
  • To avoid rejecting documents at ingest time due to the hard limit on the _tsid, only 16 fields can be marked as dimensions in the mapping by default. This limit can be raised with an index setting (see the sketch below), but doing so can lead to document rejections if the hard limit on the _tsid is reached.

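For context, here is a minimal sketch (Python with the elasticsearch-py client; the template, data stream, and field names are hypothetical) of a time series index template that marks fields as dimensions and raises the default 16-field limit via the index.mapping.dimension_fields.limit setting:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Hypothetical template for a time series data stream with explicit dimensions.
es.indices.put_index_template(
    name="metrics-demo",
    index_patterns=["metrics-demo-*"],
    data_stream={},
    template={
        "settings": {
            "index.mode": "time_series",
            "index.routing_path": ["host.name", "container.id"],
            # Default is 16; raising it increases the risk of hitting the
            # 32 KB hard limit on the concatenated _tsid described above.
            "index.mapping.dimension_fields.limit": 32,
        },
        "mappings": {
            "properties": {
                "host.name": {"type": "keyword", "time_series_dimension": True},
                "container.id": {"type": "keyword", "time_series_dimension": True},
                "cpu.usage": {"type": "double", "time_series_metric": "gauge"},
            }
        },
    },
)
```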
This dimension limit makes it difficult to adopt time series data streams for several reasons:

  • Before onboarding a metric, integration developers need to carefully consider whether a field is a dimension or just metadata/a tag.
    This isn't always trivial, as some metadata is only available under certain conditions (e.g. when the application is running on k8s or in the cloud). If we mark too many fields as dimensions, we risk hitting the limit. If we mark too few, documents are rejected when multiple documents with the same timestamp end up having the same _tsid. Properly marking the right set of fields as dimensions is a fairly labor-intensive and error-prone process.
  • It prevents the ingestion of ad-hoc metrics whose schema isn't known up front.
    We'll want to provide users of metric libraries like Micrometer or the OpenTelemetry metrics SDK with an easy way to add new metrics without first having to change the schema in ES. Metric libraries usually don't differentiate between dimensions and metadata: there's typically only a way to set the metric name, attributes (aka labels, tags, dimensions), and a value. So we'll need to map all dynamic labels as dimensions, and the dimension limit gets in the way of that.
  • Other TSDBs don't have such a limit.
    This will make it harder to move from other TSDBs to Elasticsearch.

I don't want to go too much into implementation details here, but we've had discussions about potentially turning the _tsid into a hash, which would make it possible to remove any limit on the number of dimensions entirely.
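As an illustration of the idea only (not the actual Elasticsearch implementation): hashing would derive a fixed-size series identifier from the dimension key/value pairs, so the identifier no longer grows with the number of dimensions.

```python
import hashlib

def hashed_tsid(dimensions: dict[str, str]) -> str:
    """Hash sorted dimension key/value pairs into a fixed-size identifier."""
    # Sort so that field order in the source document does not change the hash.
    canonical = "\x00".join(f"{k}={v}" for k, v in sorted(dimensions.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Documents with the same dimensions map to the same series id, and the id
# stays a fixed size no matter how many dimensions there are.
print(hashed_tsid({"host.name": "web-1", "container.id": "abc123"}))
```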

felixbarny added the >enhancement, needs:triage, and :StorageEngine/TSDB labels on Feb 7, 2023
elasticsearchmachine added the Team:Analytics label on Feb 7, 2023
elasticsearchmachine (Collaborator) commented:

Pinging @elastic/es-analytics-geo (Team:Analytics)

felixbarny (Member, Author) commented:

IIRC, even fields that don't have a value are added to the _tsid. In the context of this issue, I think it makes sense to change this so that unset fields don't impact the _tsid. I guess it matters somewhat less how many fields are added to the _tsid if it's just a hash, as it doesn't increase in size if more fields are added. But unnecessarily adding fields to the _tsid might impact performance.

We might end up with a default dynamic mapping where every keyword field, or every non-metric field (everything except counter, gauge, or histogram), is mapped as a dimension in order to support dynamic user-defined metrics. This would also be in line with OTel's definition of a time series. Not adding fields that are defined in the data stream but not necessarily used in every time series (such as container.id, which might not be in every doc in a data stream that has mixed data from bare-metal hosts and k8s containers) seems sensible to me.
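A minimal sketch of what such a default dynamic mapping could look like (the labels.* prefix and template name are hypothetical), mapping every dynamically added string field as a dimension keyword:

```python
# Mapping fragment for a time series data stream: any string field arriving
# under "labels." is dynamically mapped as a dimension keyword.
mappings = {
    "dynamic_templates": [
        {
            "labels_as_dimensions": {
                "path_match": "labels.*",
                "match_mapping_type": "string",
                "mapping": {"type": "keyword", "time_series_dimension": True},
            }
        }
    ]
}
```

A fragment like this could be dropped into the mappings of an index template such as the one sketched earlier in this issue.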

ruflin (Member) commented Feb 22, 2023

Loading dimensions through dynamic templates is starting to happen with the effort we are driving here: elastic/integrations#5055. All the ECS fields will be added as dynamic templates to the data streams to ensure that only the fields that are actually used get mapped. This will also include the dimensions that are part of ECS.

martijnvg (Member) commented Feb 22, 2023

> even fields that don't have a value are added to the _tsid.

I just checked, and this seems not to be the case. Only fields with values are added to the _tsid.

felixbarny (Member, Author) commented:

Ah, nice. Thanks for checking.

So after this issue gets resolved, is there a good reason why we shouldn't make all non-metric fields a dimension?

salvatore-campagna pinned this issue Sep 21, 2023
DaveCTurner unpinned this issue Sep 23, 2023
elasticsearchmachine pushed a commit that referenced this issue Feb 1, 2024
A Lucene limitation on doc values for UTF-8 fields does not allow us to write keyword fields whose size is larger than 32K. This limits our ability to map more than a certain number of dimension fields for time series indices. Before this change, the _tsid was created as a concatenation of dimension field names and values into a keyword field.

To overcome this limitation we hash the _tsid. This PR is intended to be used as a draft to test different options.

Note that, as a side effect, this reduces the size of the _tsid field, since far less data is stored when the _tsid is hashed. However, we expect _tsid hashing to affect doc-values compression, resulting in a larger storage footprint. The effect on query latency needs to be evaluated too.

Resolves #93564