Add ability to perform computations on aggregations #9876
+1
+1
+1 I want to have a histogram aggregation that can use the
Big +1 on this!
+1
👍
Is it possible to do these awkwardly now using scripted aggregations? Is that something Kibana 4 can take advantage of if they are there?
@aschokking Nope, there's no way to hack this right now; if you want this functionality, you currently have to build it client-side yourself. This new functionality essentially adds one or more extra reduction steps on top of the normal aggregation process. From a high level, it looks like this:

1. The query executes on each shard.
2. Each shard builds its own partial aggregation tree.
3. The shard-level trees are merged ("reduced") into a single aggregation tree on the coordinating node.

The new functionality introduces a fourth step: after the reduce phase, the new aggs run over the final aggregation tree and compute additional values from the already-aggregated results.

We are keeping in close communication with the Kibana team, since they want to use a lot of this functionality. And none of this will "break" existing aggregations; in fact, all the new aggs look just like the old aggs. So Kibana will be able to implement them as they arrive in Elasticsearch, no need for a new major version or anything.
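For concreteness, a minimal sketch of the request shape this work eventually produced: a `derivative` pipeline agg embedded in a `date_histogram`, computed during the fourth (reduce-time) step described above. Index, field, and agg names here are illustrative, not from the thread:

```json
{
  "size": 0,
  "aggs": {
    "sales_per_month": {
      "date_histogram": { "field": "date", "interval": "month" },
      "aggs": {
        "sales": { "sum": { "field": "price" } },
        "sales_deriv": {
          "derivative": { "buckets_path": "sales" }
        }
      }
    }
  }
}
```

Each monthly bucket in the response then carries an extra `sales_deriv` value, computed from the already-reduced `sales` values rather than from documents.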
Thanks for clarifying @polyfractal, that makes sense.
Very nice!
+1 on adding the secondary reduces. Will it be limited to two-level aggregations, or will more levels be possible? I'd suggest modifying "Aggregation to calculate the (mean) average value of the buckets in a given aggregation" to "Aggregation to calculate any/all of the extended_stats values of the buckets in a given aggregation, e.g. after a terms aggregation". This allows each bucket to be given equal weight regardless of the number of documents in the underlying buckets.
@lewchuk The new functionality should be able to work in multi-level aggregations; e.g. you can embed these new aggs at multiple levels in the aggregation tree. Depending on the agg, certain requirements may have to be satisfied (e.g. a derivative must be embedded inside a histogram). Most of these new aggs also support "chaining". For example, you could calculate acceleration by taking the derivative of a derivative of position. Or do something like take the moving average of the derivative of the position. Etc etc :)
I believe the plan is to support all the basic "arithmetic" functions, not just mean. So mean/min/max/sum/etc. Basically mirroring the existing set of metrics, but for agg values instead of document values.
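A sketch of the chaining described above, using the acceleration example: a derivative of a derivative, where the second `buckets_path` points at the first pipeline agg by name (field and agg names are illustrative):

```json
{
  "size": 0,
  "aggs": {
    "per_second": {
      "date_histogram": { "field": "timestamp", "interval": "second" },
      "aggs": {
        "position": { "avg": { "field": "position" } },
        "velocity": { "derivative": { "buckets_path": "position" } },
        "acceleration": { "derivative": { "buckets_path": "velocity" } }
      }
    }
  }
}
```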
@polyfractal Thanks for the clarification! Will be very excited to unleash the power of these new aggregations.
Periodicity/seasonality stuff sounds interesting. We would like to detect customer attrition, and many of our customers have seasonal behaviour based on their industry vertical. This feature sounds like it could help eliminate false positives.
Adds a new type of aggregation called 'reducers', which act on the output of aggregations and compute extra information that they add to the aggregation tree. Reducers look much like any other aggregation in the request but have a buckets_path parameter which references the aggregation(s) to use.

Internally there are two types of reducer: the first is given the output of its parent aggregation and computes new aggregations to add to the buckets of its parent; the second (a specialisation of the first) is given a sibling aggregation and outputs an aggregation to be a sibling at the same level as that aggregation.

This PR includes the framework for the reducers, the derivative reducer (#9293), the moving average reducer (#10002), and the maximum bucket reducer (#10000). These reducer implementations are not all yet fully complete. Known work left to do (these points will be done once this PR is merged into the master branch):

- Add x-axis normalisation to the derivative reducer
- Add lots more JUnit tests for all reducers

Contributes to #9876
Closes #10002
Closes #9293
Closes #10000
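To illustrate the two reducer types from the PR description: `derivative` is the parent type (embedded in the histogram, it adds a value to each of its parent's buckets), while `max_bucket` is the sibling type (it sits next to the histogram and emits a single sibling result). A sketch with illustrative names:

```json
{
  "size": 0,
  "aggs": {
    "sales_per_month": {
      "date_histogram": { "field": "date", "interval": "month" },
      "aggs": {
        "sales": { "sum": { "field": "price" } },
        "sales_deriv": { "derivative": { "buckets_path": "sales" } }
      }
    },
    "max_monthly_sales": {
      "max_bucket": { "buckets_path": "sales_per_month>sales" }
    }
  }
}
```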
+1
I've had a close look at the documentation of the upcoming pipeline aggregations. Quite exciting stuff 😃 Yet there's a very important, I'd say capital, piece of functionality missing. The primary reason for using server-side post-aggregations is not laziness (at least not in my case) but performance: it can be killing for your application to receive tons of data on the wire and then crunch it for a while to finally spit out just a few numbers. All pipeline aggregations should have a
In the meantime I found the
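Possibly what the comment above found: the `filter_path` response-filtering parameter can strip the intermediate aggregation results from the wire response, returning only the pipeline result. A sketch, reusing the illustrative names from earlier:

```json
GET /sales/_search?size=0&filter_path=aggregations.max_monthly_sales
{
  "aggs": {
    "sales_per_month": {
      "date_histogram": { "field": "date", "interval": "month" },
      "aggs": { "sales": { "sum": { "field": "price" } } }
    },
    "max_monthly_sales": {
      "max_bucket": { "buckets_path": "sales_per_month>sales" }
    }
  }
}
```

The monthly buckets are still computed server-side, but only the `max_monthly_sales` result crosses the wire.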
Read your blog on pipeline aggregations (https://www.elastic.co/blog/out-of-this-world-aggregations). Really nice, thank you. I would be interested in a few more pipeline aggregations (or rather transformations):
@roytmana I like the idea of (1), flattening aggs into columns. (2) and (3) sound like they could be achieved very easily with https://www.elastic.co/guide/en/elasticsearch/reference/master/search-aggregations-pipeline-bucket-script-aggregation.html
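For reference, a sketch of what `bucket_script` looks like: computing a derived per-bucket value from two other metrics in the same bucket. Names are illustrative, and the script syntax shown is the later Painless form (earlier releases used Groovy):

```json
{
  "size": 0,
  "aggs": {
    "sales_per_month": {
      "date_histogram": { "field": "date", "interval": "month" },
      "aggs": {
        "total_sales": { "sum": { "field": "price" } },
        "t_shirts": {
          "filter": { "term": { "type": "t-shirt" } },
          "aggs": { "sales": { "sum": { "field": "price" } } }
        },
        "t_shirt_percentage": {
          "bucket_script": {
            "buckets_path": {
              "tShirtSales": "t_shirts>sales",
              "totalSales": "total_sales"
            },
            "script": "params.tShirtSales / params.totalSales * 100"
          }
        }
      }
    }
  }
}
```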
OK @clintongormley, I will play with the script pipeline when the beta is out. If you decide to go ahead with (1) I would be happy to provide some use cases.
@roytmana just chatted to @colings86 and apparently (2) isn't supported by the bucket_script agg yet. But we should definitely add support.
Hmmmm, actually rereading (2) I'm not entirely sure I understood it correctly. The examples you provide are quite different, e.g.:
Am I missing something? The bit that I said was unsupported by bucket_script was the ability to access two separate histograms
Let me try to elaborate a bit on the open/closed case. I do not think it could be two metrics in one bucket; they are two different fields to bucket on. Here is the requirement: I want to calculate the number of cases and the cost of cases opened and closed in each fiscal year, and show them side by side. I have two fields, OpenFY and ClosedFY, which are pre-calculated. I want to show a chart with two data series, one for opened and one for closed (counts and cost). Opened and closed are two independent fields (it is even possible that no case was closed in a given year, so there would be no bucket for that FY in closed). I want to agg on the first and on the second, and then merge the results by FY so each bucket gets the open and closed metrics together. I currently do this in post-processing, but I think result-tree manipulation supported directly in ES would be really useful!

One more question I have is about the nested and reverse_nested (same for parent) aggregations. They introduce an extra level in the result tree which I am not sure is necessary: they only change the calculation scope, and should not have to alter the result-tree depth. This makes them a headache to deal with in dynamic, metadata-driven systems, where users do not care how the data is laid out; they just pick how to aggregate and what to calculate, and I may have to cross nested boundaries back and forth to accommodate that. Right now, in post-processing, I have to transform my results by removing these extra nodes created by nested/reverse_nested (a royal headache in an entirely dynamic system) before passing them to the UI level. I was wondering if it would introduce any problem (name clashes?) if nested/reverse_nested did not introduce a separate node and all its sub-aggs emitted their results into the agg owning the nested one.

I want to add that nested/reverse_nested introducing extra levels in the result tree is not a trivial matter.
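To illustrate the extra level being described, a sketch of a response fragment where a `nested` agg sits between a terms bucket and its metric (names illustrative): consumers have to traverse through the `items` node even though it only changed the calculation scope.

```json
"by_order": {                 // terms agg on the parent documents
  "buckets": [
    {
      "key": "order-1",
      "doc_count": 3,         // parent doc count
      "items": {              // nested agg: the extra level in question
        "doc_count": 7,       // count of nested docs, not parent docs
        "avg_price": { "value": 12.5 }
      }
    }
  ]
}
```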
Lag or Timeshift Aggregation: sort of a generalization of the serial differencing agg, which only provides the lag functionality, allowing you to perform operations on values in different buckets (from the same or different bucket aggregations). Use case: cohort retention analysis, where I want to see what percentage of users come back the day after their first day. I could do this by bucketing by day and by filtering on both
@tmandry hmm, I can see this being useful. Would you need/want a newly created field to be appended to each bucket, like:

```json
"buckets": [
   {
      "key_as_string": "2014-07-29T17:00:00.000Z",
      "key": 1406653200000,
      "doc_count": 7,
      "login_today": {      // <-- original, derived from something like an `avg` metric
         "avg": 1
      },
      "login_yesterday": {  // <-- derived and shifted via a `timeshift` agg
         "avg": 1
      }
   },
```

Or would it be sufficient if the

Thinking about it, the advantage of actually appending a new bucket is that you can use something like
@polyfractal For my use case, the
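For reference, the serial differencing agg mentioned above exposes the lag as a parameter; a sketch of its shape (field and agg names illustrative):

```json
{
  "size": 0,
  "aggs": {
    "daily": {
      "date_histogram": { "field": "timestamp", "interval": "day" },
      "aggs": {
        "users": { "sum": { "field": "user_count" } },
        "week_over_week": {
          "serial_diff": { "buckets_path": "users", "lag": 7 }
        }
      }
    }
  }
}
```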
Been working with pipelines more extensively on a demo project. A few observations about what is difficult:
A query string with aggregation parameters works fine with the JEST client, but with the TCP client, is it always mandatory to build an AggregationBuilder to execute an aggregation? Why is a JSON aggregation query not supported over TCP? Is there a specific reason for this?
A "Moving Standard Deviation" pipeline aggregation would be useful. If we can calculate that on the server we could also create a "Relative Standard Deviation" aggregation which would use a "Moving Average" aggregation and the "Moving Standard Deviation" aggregation. This would be useful to calculate the +/- for various metrics. For instance, with a Web server I may want to calculate volatility and I could use "Relative Standard Deviation" to see +/- how many client requests I have over time or +/- the sum of bytes served per window, etc. Possibly this could be used with the predictive aggregations to let me get an idea of how much capacity I'll need during various seasons, times of day, etc. |
I agree a "Moving Standard Deviation" pipeline aggregation would be useful. I want to do the statistical control for a time series count data. I can get the moving average of the daily count, but in order to compute the control limit I need a moving standard deviation of the count. |
I don't see how to practically calculate, let's say, average site-visit duration. "avg_page_view_time_avg_per_visit" calculates the correct result, great! But it would be even better if this kind of structure could be configured to return just the final response, without the intermediate steps. For example, in a relational DB this would be done with two selects: the inner one would compute the average time per visit, and the outer one the average over visits, returning only one row with the final result. You don't want your DB to return all the possible temporary results. Something similar would be nice to have in ES!
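A sketch of how that two-level average is expressed, with an `avg_bucket` sibling agg over a per-visit terms agg (field and agg names illustrative). The intermediate per-visit buckets are still returned, which is exactly the complaint above, though `filter_path` (see earlier) can trim them from the wire response:

```json
{
  "size": 0,
  "aggs": {
    "per_visit": {
      "terms": { "field": "visit_id" },
      "aggs": {
        "avg_page_view_time": { "avg": { "field": "page_view_time" } }
      }
    },
    "avg_visit_duration": {
      "avg_bucket": { "buckets_path": "per_visit>avg_page_view_time" }
    }
  }
}
```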
@colings86 @polyfractal can this issue be closed now, or do you want to keep the unimplemented list around?
@clintongormley yes, I think we can close this issue, as we have the core functionality this issue was created to address. New aggregations can be requested and added in separate issues/PRs; that way it will be easier to discuss them.
Is there any plan to do the "Agg for building a sliding_histogram"?
I am happy to contribute to this work; any consolidated doc/example would help.
@hienchu my original intention for a sliding_histogram is a bit different, I think. I had intended it to be a histogram with an interval and a window, such that the output would be buckets whose bounds span the window period, and the change in the bucket bounds from one bucket to the next is the interval. For example, you might have an interval of 1 hour and a window of one day. In this case the output would be buckets covering 00:00 to 00:00 the next day, then 01:00 to 01:00 the next day, then 02:00 to 02:00 the next day, and so on. Does that fit into what you are thinking here? It might be a good idea to raise a new ticket for this so we can iterate on the idea there.
There are many instances where it is useful to perform computations on the output of aggregations to calculate new aggregations. This meta issue aims to summarize the functionality we would like to add to the aggregations framework to allow different types of computation to be performed during the reduce phase of aggregations.
This set of new aggregations is the highest priority, given their utility in a wide range of scenarios:
At the moment, the remainder of the list is largely explorative, to see which ideas/functionality make sense and have community interest. Feel free to suggest your own ideas/aggregations/algos!
- `stats_bucket`/`extended_stats_bucket` pipeline aggs (#13128): aggregation to calculate `stats` and `extended_stats` values of the buckets in a given aggregation
- Aggregation to calculate the number of buckets in a given aggregation (#11008)
- Aggregation to calculate the cardinality of a metric in a given aggregation (#11009)
- `percentiles_bucket` pipeline aggregation (#13186): agg to calculate percentiles
- Agg for selecting the `nth` bucket, and/or selecting a range + truncating
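Several of the items above later shipped; for example, the `stats_bucket` sibling agg from #13128 looks like this (a sketch with illustrative names):

```json
{
  "size": 0,
  "aggs": {
    "sales_per_month": {
      "date_histogram": { "field": "date", "interval": "month" },
      "aggs": { "sales": { "sum": { "field": "price" } } }
    },
    "monthly_sales_stats": {
      "stats_bucket": { "buckets_path": "sales_per_month>sales" }
    }
  }
}
```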