
[Rollup] Support for data-structure based metrics (Cardinality, Percentiles, etc) #33214

Closed
polyfractal opened this issue Aug 28, 2018 · 13 comments
Labels
>enhancement :StorageEngine/Rollup Turn fine-grained time-based data into coarser-grained data Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo)

Comments

@polyfractal
Contributor

We would like to support more complex metrics in Rollup such as cardinality, percentiles and percentile ranks. These are trickier since they are calculated from data sketches rather than simple numerics.

They also introduce issues with backwards compatibility. If the algorithm powering the sketch changes in the future (improvements, bug-fixes, etc) we will likely have to continue supporting the old versions of the algorithm. It's unlikely that these sketches will be "upgradable" to the new version since they are lossy by nature.

I see two approaches to implementing these types of metrics:

New data types

In the first approach, we implement new data types in the Rollup plugin. Similar to the hash, geo or completion data types, these would expect input data to adhere to some kind of complex format. Internally, it would be stored as a compressed representation that could be used to rebuild the sketch (e.g. a long[] which could be used to build an HLL sketch).

The pros are strong validation and making it easier for aggregations to work with the data. Another large positive is that it allows external clients to provide pre-built sketches, as long as they follow the correct format. For example, edge nodes may be collecting and aggregating data locally and just want to send the sketch.

The cons are considerably more work to implement the data types. It may also not be ideal to expose these data structures outside Rollup, since they carry the aforementioned bwc baggage.
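
To make the pre-built-sketch idea concrete, here is a minimal illustration (not the actual Rollup design) of what an edge client might do, assuming the Apache DataSketches HLL implementation; the payload would be whatever compact representation the new data type ends up validating:

```java
import org.apache.datasketches.hll.HllSketch;
import java.util.Base64;

public class PrebuiltSketchExample {
    public static void main(String[] args) {
        // An edge node aggregates locally into an HLL sketch...
        HllSketch sketch = new HllSketch(12); // lgK = 12 -> roughly 1.6% relative error
        for (String userId : new String[] {"u1", "u2", "u3", "u1"}) {
            sketch.update(userId);
        }
        // ...then ships only the compact serialized form to the cluster,
        // where a dedicated field type could validate and store it.
        byte[] payload = sketch.toCompactByteArray();
        System.out.println("payload bytes: " + payload.length);
        System.out.println("base64 for the JSON document: "
            + Base64.getEncoder().encodeToString(payload));
    }
}
```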

Convention-only types

Alternatively, we could implement these entirely by convention (like the rest of Rollup). E.g. a binary field can be used to hold the appropriate data sketch, and we just use field naming to convey the meaning. Versioning can be done with a secondary field.

The advantage is much less upfront work: we can just serialize into fields and we're off. It also limits the impact of these data types, since only Rollup will be equipped to deal with the convention (making it less likely for a user to accidentally use one and run into trouble later).

The big downside is that external clients will have a harder time providing pre-built sketches, since the format is just a convention and won't be validated until search time. It also feels a bit more fragile, since it is another convention to maintain.
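
For illustration, a convention-only document might be built roughly like the sketch below, using Elasticsearch's XContentBuilder (package locations as of the 6.x/7.x codebase); the latency.hdr_sketch / latency.hdr_sketch_version field names are hypothetical, purely to show the naming-plus-version convention:

```java
import org.elasticsearch.common.Strings;
import org.elasticsearch.common.xcontent.XContentBuilder;
import org.elasticsearch.common.xcontent.XContentFactory;

public class ConventionFieldsExample {
    public static void main(String[] args) throws Exception {
        // Stand-in for a real serialized sketch produced elsewhere.
        byte[] sketchBytes = new byte[] {1, 2, 3};

        // Hypothetical convention: "<field>.hdr_sketch" holds the binary
        // sketch, "<field>.hdr_sketch_version" records the algorithm version.
        XContentBuilder doc = XContentFactory.jsonBuilder()
            .startObject()
            .field("latency.hdr_sketch", sketchBytes)      // base64-encoded in JSON
            .field("latency.hdr_sketch_version", 1)
            .endObject();

        System.out.println(Strings.toString(doc));
    }
}
```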

BWC

In both cases, Rollup will probably have to maintain a catalog of "old" algorithms so that historical rollup indices can continue to function. Not ideal, but given that these algos don't change super often, it's probably an OK burden to bear.
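
A hedged sketch of what such a catalog could look like, with entirely hypothetical names, keyed by the version number stored alongside each sketch:

```java
import java.util.Map;
import java.util.function.Function;

// Hypothetical: dispatch serialized sketches to the algorithm version
// that wrote them, so old rollup indices keep working after upgrades.
public class SketchCatalog {
    interface CardinalityEstimator { double estimate(); }

    static final Map<Integer, Function<byte[], CardinalityEstimator>> DECODERS = Map.of(
        1, SketchCatalog::decodeV1,   // original algorithm, frozen forever
        2, SketchCatalog::decodeV2    // current algorithm
    );

    static CardinalityEstimator decode(int version, byte[] payload) {
        Function<byte[], CardinalityEstimator> decoder = DECODERS.get(version);
        if (decoder == null) {
            throw new IllegalArgumentException("Unknown sketch version: " + version);
        }
        return decoder.apply(payload);
    }

    private static CardinalityEstimator decodeV1(byte[] payload) { return () -> 0.0; } // stub
    private static CardinalityEstimator decodeV2(byte[] payload) { return () -> 0.0; } // stub
}
```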

@polyfractal polyfractal added >enhancement :StorageEngine/Rollup Turn fine-grained time-based data into coarser-grained data labels Aug 28, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-search-aggs

@polyfractal
Contributor Author

Small update.

  • We discussed this some more and decided the likely approach will be explicit data types (implementing field mappers, etc). This allows stronger validation and better control over the experience.
  • We will probably start with percentiles, HDRHistogram in particular, because it is the simplest and most robust sketch to implement. There are also fewer forward-compatibility concerns, since the sketch itself could be translated into a new algorithm with some bounded error (the sketch is really just a histogram of buckets arranged in a special manner; see the example after this list).
  • A "pre-aggregated percentiles" query will need to be implemented at the same time, otherwise the data structure is useless.

@pmoust
Member

pmoust commented Dec 16, 2018

Relates: #24468

@painslie

painslie commented Mar 29, 2019

Hi @polyfractal, do you know when this is slated to go into production?

@polyfractal
Contributor Author

Hi @painslie, I'm afraid I do not have an update. We'll update this issue when there's more information, or link to it from a PR.

@pcsanwald
Contributor

@polyfractal I'm curious how well the Prometheus histogram would line up with what you're thinking?

@polyfractal
Contributor Author

HDRHistogram is essentially just a clever layout of different-sized intervals: a set of exponentially-sized intervals, with a fixed number of linear intervals inside each exponential "level". But at its heart, it's still just a histogram of counts, like Prometheus histos (and unlike algos like TDigest, which are weighted centroids, etc).

So it should be possible to translate a Prometheus histogram into an HDRHisto. Prometheus histos have user-definable intervals, which means the accuracy of translation will depend on how nicely the Prometheus histos line up with the HDRHisto intervals. I think any Prometheus histo should be convertible, and the accuracy of that conversion depends on the exact layout.
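
As an illustration of that conversion (not an official implementation), here is one way to feed cumulative Prometheus bucket counts into an org.HdrHistogram.Histogram; the bounds and units are made up, and the +Inf bucket is ignored for simplicity:

```java
import org.HdrHistogram.Histogram;

public class PromToHdr {
    // Illustrative only: Prometheus exposes cumulative counts per upper
    // bound ("le"), so we record each bucket's delta at its upper bound.
    // Accuracy depends on how the bounds line up with HDR's intervals.
    static Histogram fromPrometheus(long[] upperBoundsMicros, long[] cumulativeCounts) {
        Histogram hdr = new Histogram(3_600_000_000L, 3);
        long previous = 0;
        for (int i = 0; i < upperBoundsMicros.length; i++) {
            long delta = cumulativeCounts[i] - previous;
            if (delta > 0) {
                hdr.recordValueWithCount(upperBoundsMicros[i], delta);
            }
            previous = cumulativeCounts[i];
        }
        return hdr;
    }

    public static void main(String[] args) {
        long[] bounds = {5_000, 10_000, 25_000, 100_000}; // "le" bounds in micros
        long[] counts = {12, 30, 41, 45};                 // cumulative counts
        Histogram hdr = fromPrometheus(bounds, counts);
        System.out.println("median ~= " + hdr.getValueAtPercentile(50.0));
    }
}
```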

Prometheus Summaries are an implementation of Targeted Quantiles and will be much harder to use. The output of a summary is just a percentile estimation at that point in time, which is mostly useless to us. It might be possible to convert the underlying Targeted Quantiles sketch into a TDigest since the algos share some similarities, but I suspect it won't give great accuracy. I've been told summaries aren't as common either compared to Histos, so also probably not a priority.

With all that said, it's still not entirely clear how a user would convert a Prometheus histogram (or any other system's histogram output) into our data structure. I'm kinda thinking an ingest processor would make the most sense, slurping up a Prometheus histo and emitting a compatible HDRHisto field. But I haven't spent a lot of time thinking about the ergonomics of that yet. :)

@vipul657

vipul657 commented May 8, 2019

Hi @polyfractal, is there any ticket for adding weighted average support in X-Pack rollups?

@amontalenti

@polyfractal A quick update here. @kbourgoin and I have implemented a custom field type for serialized HLL rollups in the ES index, along with a corresponding aggregation query that works much like cardinality, but de-serializes and merges multiple serialized document-stored HLL blobs. We've built it as a proper Elasticsearch plugin and presented it in NYC at a local Elasticsearch meetup yesterday, and it's almost ready for review by Elastic folks. I'll be writing up my slides into a technical blog post, as well, so people can try it out. It's not quite ready for production, but it's getting there. Would be good to sync up about this, as I'm sure it can help inform the similar approach for HDRHistogram and percentiles.
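
(Not the plugin's actual code, but for readers following along, the merge-at-query-time idea boils down to something like this sketch, assuming the Apache DataSketches HLL implementation:)

```java
import org.apache.datasketches.hll.HllSketch;
import org.apache.datasketches.hll.Union;

public class HllMergeExample {
    public static void main(String[] args) {
        // Two "documents", each storing a serialized HLL blob.
        byte[] blobA = sketchOf("u1", "u2", "u3");
        byte[] blobB = sketchOf("u3", "u4");

        // What a cardinality-style aggregation over stored blobs boils
        // down to: deserialize each blob and union the sketches.
        Union union = new Union(12);
        union.update(HllSketch.heapify(blobA));
        union.update(HllSketch.heapify(blobB));
        System.out.println("estimated distinct values: " + union.getEstimate());
    }

    static byte[] sketchOf(String... values) {
        HllSketch sketch = new HllSketch(12);
        for (String v : values) sketch.update(v);
        return sketch.toCompactByteArray();
    }
}
```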

@jpountz
Contributor

jpountz commented Oct 23, 2019

Excellent. We have had some discussions on our end as well on what the API and implementation could look like for a histogram field for percentile aggregations and a HLL++ field for cardinality aggregations. I suspect both impls will end up looking similar. :) cc @iverase

@polyfractal
Contributor Author

polyfractal commented Jan 10, 2020

Small note: histograms have been implemented in #48580 (:tada:). Support in Rollup is still pending... we may want to wait for #42720

@rjernst rjernst added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label May 4, 2020
@wchaparro
Member

We plan to build this support in downsampling. Support for histograms in downsampling is pending; the design is in place and ready to be prioritized as soon as we have availability.

With the 8.7 release of Elasticsearch, we have made generally available (GA) a new downsampling capability associated with the new time series data streams functionality. This capability had been in tech preview in ILM since 8.5. Downsampling provides a method to reduce the footprint of your time series data by storing it at reduced granularity. The downsampling process rolls up documents within a fixed time interval into a single summary document. Each summary document includes statistical representations of the original data: the min, max, sum, value_count, and average for each metric. Data stream time series dimensions are stored unchanged.
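
Illustrative only (the real implementation lives in Elasticsearch's TSDB code): the summary documents described above amount to grouping raw samples by a fixed interval and computing the listed statistics, roughly like this:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class DownsampleSketch {
    record Sample(long timestampMillis, double value) {}

    // Group raw samples into fixed intervals and emit one summary per
    // interval with min, max, sum, and value_count (average = sum / count).
    static void downsample(List<Sample> samples, long intervalMillis) {
        Map<Long, List<Sample>> byBucket = new TreeMap<>();
        for (Sample s : samples) {
            long bucket = s.timestampMillis() - (s.timestampMillis() % intervalMillis);
            byBucket.computeIfAbsent(bucket, k -> new ArrayList<>()).add(s);
        }
        for (Map.Entry<Long, List<Sample>> e : byBucket.entrySet()) {
            double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY, sum = 0;
            for (Sample s : e.getValue()) {
                min = Math.min(min, s.value());
                max = Math.max(max, s.value());
                sum += s.value();
            }
            long count = e.getValue().size();
            System.out.printf("bucket=%d min=%.1f max=%.1f sum=%.1f value_count=%d%n",
                e.getKey(), min, max, sum, count);
        }
    }
}
```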

Downsampling is superior to rollup because:

  • Downsampled indices are searched through the _search API
  • It is possible to query multiple downsampled indices together with raw data indices
  • The pre-aggregation is based on the metrics and time series definitions in the index mapping, so very little configuration is required (i.e. it is much easier to add new time series)
  • Downsampling is managed as an action in ILM
  • It is possible to downsample a downsampled index, and reduce granularity as the index ages
  • The performance of the pre-aggregation process is superior in downsampling, as it builds on the time_series index mode infrastructure

Because of the introduction of this new capability, we are deprecating the rollup functionality, which never left Tech Preview/Experimental status, in favor of downsampling, and thus we are closing this issue. We encourage you to migrate your solution to downsampling and take advantage of the new TSDB functionality.

@wchaparro closed this as not planned Jun 23, 2023
@lasseschou

@wchaparro the new downsampling feature looks great, but it still doesn't support percentiles. Downsampling a fixed set of percentiles, such as the median, 75th, 90th, 95th, and 99th, is a very common use case for reporting latencies, so I bet a lot of Elasticsearch users could benefit from having percentiles in the downsample feature.
