Give a weight for documents in aggregations #8094

Closed
rnonnon-ebiz opened this issue Oct 15, 2014 · 2 comments

@rnonnon-ebiz

Hi all,

I have a question about aggregations, taking "Percentiles" as an example (although it would be useful for other aggregations too).
When a percentiles aggregation runs, it uses a specific field as a reference. If the 50th percentile for field 'f' is 10, it means that 50% of documents have 'f' below 10.
=> Each document has the same weight in the aggregation (=> 1)

I'm wondering whether it would be possible to give each document a different weight, taken from another field in the document.
The following Gist gives an example of what I'd like to do: https://gist.github.com/rnonnon/093c111014bd14a46efe

I'd like to compute some percentiles on the "age" field, but each document also has an associated "count" field.
For example, there are 5 people who are 10 years old, 1 who is 20 years old...
When the percentiles agg runs, it doesn't use my factor (the number of people) to compute the percentile; it just counts the number of documents...
I don't think this feature is natively supported, but do you think it could be easily supported? Do you think it makes sense to implement it?
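
To make the request concrete, here is a minimal Python sketch of a weighted percentile computed outside Elasticsearch. It is not an Elasticsearch API; the function name `weighted_percentile`, the nearest-rank definition, and the third (age 30) entry in the sample data are assumptions made for illustration.

```python
def weighted_percentile(pairs, p):
    """Return the smallest value whose cumulative weight reaches p percent.

    pairs: iterable of (value, weight) tuples with positive weights.
    p: percentile in the range (0, 100].
    """
    ordered = sorted(pairs)
    total = sum(weight for _, weight in ordered)
    threshold = total * p / 100.0
    running = 0
    for value, weight in ordered:
        running += weight
        if running >= threshold:
            return value
    return ordered[-1][0]


# Hypothetical grouped documents: (age, count).
people = [(10, 5), (20, 1), (30, 2)]

# Median when "count" is used as the weight.
weighted_median = weighted_percentile(people, 50)

# Median when every document counts once, which is what the agg does today.
per_document_median = weighted_percentile([(age, 1) for age, _ in people], 50)
```

With this sample data the two medians differ, which is the discrepancy described above.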

Why am I asking this?
I'm running percentiles (and other) aggregations over around 70,000,000 documents on a single node. ES uses my 8 cores at 100% for a while :s... So I tried to reduce the number of documents by grouping them, but then I can't use aggregations in the same way...

Thanks.

@jpountz
Contributor

jpountz commented Feb 20, 2015

Your suggestion is achievable with a script that creates an array containing "count" occurrences of the age value, but I'm afraid it would not make things faster.
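
As a rough illustration of that idea, the Python sketch below expands each grouped document into "count" copies of its value before computing a percentile. It is not Elasticsearch script syntax; the helper names `expand` and `nearest_rank_percentile` and the sample data are hypothetical.

```python
import math


def expand(pairs):
    """Repeat each value `count` times, mimicking the array-building script."""
    expanded = []
    for value, count in pairs:
        expanded.extend([value] * count)
    return expanded


def nearest_rank_percentile(values, p):
    """Plain (unweighted) nearest-rank percentile over a list of values."""
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * p / 100))
    return ordered[rank - 1]


# Hypothetical grouped documents: (age, count).
grouped = [(10, 5), (20, 1), (30, 2)]

values = expand(grouped)
median = nearest_rank_percentile(values, 50)

# The percentile step still receives one value per original person,
# so grouping the documents does not reduce the work it has to do.
values_seen = len(values)
```

The number of values fed to the percentile computation equals the sum of the counts, which is why this approach gives the weighted result without saving time.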

The algorithm that we use for percentiles (t-digest) is not especially fast because it is designed to work on all kinds of data.

We have another issue open to add support for HdrHistogram: #8324. It is faster but has relative accuracy: the absolute error is small for values close to 0 and grows as values get larger. This typically works very well with e.g. response times (since you care about microsecond precision for millisecond response times, but usually only about second precision for hour-long response times). Would it work in your case too?
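
To show what relative accuracy means in practice, here is a small Python sketch that keeps a fixed number of significant digits per value. It does not use HdrHistogram and is not its implementation; the function `round_to_significant`, the choice of 2 digits, and the sample response times are assumptions for illustration only.

```python
import math


def round_to_significant(value, digits=2):
    """Quantize a positive value to `digits` significant decimal digits."""
    exponent = math.floor(math.log10(value))
    step = 10 ** (exponent - digits + 1)
    return round(value / step) * step


# Hypothetical response times in milliseconds.
fast = 1.234          # about a millisecond
slow = 3_601_234      # about an hour

fast_error = abs(round_to_significant(fast) - fast)
slow_error = abs(round_to_significant(slow) - slow)

# The absolute error grows with the value...
absolute_error_grows = fast_error < slow_error

# ...while the error relative to the value stays bounded (5% for 2 digits).
fast_relative = fast_error / fast
slow_relative = slow_error / slow
```

The absolute error on the hour-long value is far larger than on the millisecond value, yet both stay within the same relative bound, which is the trade-off described above.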

@clintongormley
Contributor

No further feedback. Closing
