I'd like to propose an aggregation that "selects" a metric from a document according to an ordering criterion on a second field. For example, you may want the most recent latency value within a date_histogram bucket: in this case, the "metric" is the latency field, and the ordering criterion is timestamp DESC, size: 1.
This is a fairly common use-case which is difficult to accomplish today. top_hits can give you the information, but it fetches an entire document and is not compatible with pipeline aggregations. It is also fairly expensive if many values/documents are being fetched. You can sometimes get the required information with clever use of other aggs (like a max agg, or scripting) to pull out the document you're looking for, but these approaches are fragile and hacky.
The WeightedAvg agg added support for multiple ValuesSources, so a "metric selector" should not be too difficult to implement.
All naming is tentative, open to better suggestions! :)
Note how the sort values are ordered descending per-bucket, and it returns a single metric value for each sort value. There may be 1000 documents in a bucket, but unlike other aggregations this actually returns n individual values from the documents themselves. If there are ties, there would be multiple objects with the same sort.
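To make the intended semantics concrete, here is a minimal sketch in plain Python of what the selection would do inside a single bucket. It is not the prototype's code: the function name `select_metrics`, the `{"sort": ..., "metric": ...}` result shape, and the sample documents are all assumptions made for illustration.

```python
from typing import Any, Dict, List


def select_metrics(
    docs: List[Dict[str, Any]],
    metric: str,
    sort: str,
    order: str = "desc",
    size: int = 1,
) -> List[Dict[str, Any]]:
    """Return up to `size` sort/metric pairs taken from individual documents."""
    # Only documents that carry both fields can contribute a pair.
    candidates = [d for d in docs if metric in d and sort in d]
    # sorted() is stable, so documents with equal sort values keep their input order.
    ranked = sorted(candidates, key=lambda d: d[sort], reverse=(order == "desc"))
    # Hypothetical result shape: one object per selected document.
    return [{"sort": d[sort], "metric": d[metric]} for d in ranked[:size]]


# Hypothetical documents that fell into one date_histogram bucket.
bucket = [
    {"timestamp": 100, "latency": 12.0},
    {"timestamp": 300, "latency": 7.5},
    {"timestamp": 200, "latency": 9.0},
    {"timestamp": 300, "latency": 8.0},
]

latest = select_metrics(bucket, metric="latency", sort="timestamp", order="desc", size=2)
```

With `size=2` the two documents that tie on `timestamp` both appear, each as its own object with the same sort value, which matches the tie behaviour described above.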
Misc
I have a crude prototype which demonstrates the feasibility.
We will need some kind of limit on size to prevent abuse. It should be fairly easy to track in a breaker, so that might be sufficient. I would feel better if there was a hard/soft limit though :) Like top_hits, this should be used to fetch a handful of values, not an entire index.
We should support sorting on non-numeric fields too (keyword, etc).
I'm less sure we need to support non-numeric metrics. I think starting with numerics-only is fine.
We can probably optimize the no-parent scenario with a BKD lookup similar to how min/max work today. Not necessary for the first iteration.
As long as we only support asc/desc (e.g. the min or max values of a field), we shouldn't run into top-n accuracy issues like the terms agg can have. Each shard will always send its n min/max values and the coordinator will assemble a global min/max list. It might be that all top n values have the same sort key and others are omitted, but this is not incorrect since we are displaying individual results and not grouping.
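The accuracy argument can be sketched in a few lines of Python. This is an illustration only, not Elasticsearch code: `shard_top_n` and `coordinator_merge` are hypothetical names, and each document is reduced to a `(sort, metric)` tuple. The point is that any document in the global top n must also be in its own shard's top n, so merging the per-shard lists loses nothing.

```python
import heapq
from typing import List, Tuple

Pair = Tuple[float, float]  # (sort value, metric value)


def shard_top_n(pairs: List[Pair], n: int, descending: bool = True) -> List[Pair]:
    """Each shard keeps only its n best pairs by sort value."""
    pick = heapq.nlargest if descending else heapq.nsmallest
    return pick(n, pairs, key=lambda pair: pair[0])


def coordinator_merge(shard_results: List[List[Pair]], n: int, descending: bool = True) -> List[Pair]:
    """The coordinator re-selects the n best pairs from the shard results."""
    merged = [pair for result in shard_results for pair in result]
    return shard_top_n(merged, n, descending)


# Hypothetical data spread over two shards.
shard_a = [(5, 1.0), (9, 2.0), (1, 3.0)]
shard_b = [(7, 4.0), (8, 5.0), (2, 6.0)]

n = 2
merged_top = coordinator_merge([shard_top_n(shard_a, n), shard_top_n(shard_b, n)], n)
global_top = shard_top_n(shard_a + shard_b, n)
```

Here `merged_top` and `global_top` agree. When several documents share the cut-off sort value, which of them is kept can vary, but the returned sort values are still the true top n, which is the "not incorrect" case described above.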
Potential naming idea: top_metric or similar, to parallel top_hits. That both signals that it has similar functionality and avoids confusion with bucket_selector, which is quite different.
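Request Syntax
metric: The field to extract the metric value from.
sort: The field to sort metric by.
order: How the sort field should be ordered, ascending or descending.
size: How many <sort, metric> tuples should be returned. Default: 1.
multi_value_mode: How multi-valued metric fields should be collapsed into a single value. Default: avg.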
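As a rough illustration of how these parameters might fit together, the sketch below builds a request body as a Python dict. Everything about the shape is an assumption: the aggregation type name `top_metric` is only the tentative naming idea from this issue, `build_request`, `per_hour` and `latest` are made-up names, and the field names come from the latency/timestamp example.

```python
import json
from typing import Any, Dict


def build_request(
    metric: str,
    sort: str,
    order: str = "desc",
    size: int = 1,
    multi_value_mode: str = "avg",
) -> Dict[str, Any]:
    """Build a hypothetical search body nesting the proposed agg in a date_histogram."""
    return {
        "size": 0,  # no search hits needed, only aggregation results
        "aggs": {
            "per_hour": {
                "date_histogram": {"field": "timestamp", "fixed_interval": "1h"},
                "aggs": {
                    "latest": {
                        # "top_metric" is a tentative name, not an existing aggregation.
                        "top_metric": {
                            "metric": metric,
                            "sort": sort,
                            "order": order,
                            "size": size,
                            "multi_value_mode": multi_value_mode,
                        }
                    }
                },
            }
        },
    }


body = build_request(metric="latency", sort="timestamp")
print(json.dumps(body, indent=2))
```

The defaults mirror the ones listed above (size 1, multi_value_mode avg); order defaults to desc here only because that is what the "most recent latency" example uses.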
/cc @costin @colings86