Metric Selector aggregation #48069

polyfractal · 2019-10-15T15:22:49Z

I'd like to propose an aggregation that "selects" a metric from a document according to some kind of ordering criteria on a second field. For example, you may want the most recent latency value within a date_histogram bucket: in this case, the "metric" is the latency field, and the ordering criteria is timestamp DESC, size: 1.

This is a fairly common use-case which is difficult to accomplish today. top_hits can give you the information, but it fetches an entire document and is not compatible with pipeline aggregations. It is also fairly expensive if many values/documents are being fetched. You can sometimes get the required information with clever use of other aggs (like a max agg, or scripting) to pull out the document you're looking for, but these approaches are fragile and hacky.
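For comparison, the top_hits workaround mentioned above might look roughly like the request body below, written as a Python dict so it can be inspected. The field names `date` and `latency` are borrowed from the example in this issue, and the helper name `top_hits_workaround` is hypothetical. Even with source filtering, the response contains hits rather than plain numeric values, which is the part that does not play well with pipeline aggregations.

```python
def top_hits_workaround(metric_field="latency", sort_field="date"):
    """Build a search body that approximates 'most recent metric per hour'
    with top_hits. Illustrative only; field names are assumptions."""
    return {
        "size": 0,  # we only care about the aggregation results
        "aggs": {
            "timeline": {
                "date_histogram": {
                    "field": sort_field,
                    "calendar_interval": "hour",
                },
                "aggs": {
                    "most_recent": {
                        "top_hits": {
                            "size": 1,
                            # newest document first within each bucket
                            "sort": [{sort_field: {"order": "desc"}}],
                            # trim the fetched document to the one field we want
                            "_source": {"includes": [metric_field]},
                        }
                    }
                },
            }
        },
    }
```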

The WeightedAvg agg added support for multiple ValuesSources, so a "metric selector" should not be too difficult to implement.

All naming is tentative, open to better suggestions! :)

Request Syntax

GET _search
{
  "aggs": {
    "timeline": {
      "date_histogram": {
        "field": "date",
        "calendar_interval": "hour"
      },
      "aggs": {
        "most_recent": {
          "metric_selector": {
            "metric": {
              "field": "latency",
              // "script": ...
              // "format": ...,
              // "value_type": ...
            },
            "sort": {
              "field": "date",
              // "script": ...
              // "format": ...,
              // "value_type": ...
            },
            "order": "asc | desc",
            "size": 1,
            "multi_value_mode": "min | max | sum | avg"
          }
        }
      }
    }
  }
}

Parameter	Description	Required
`metric`	The metric field to extract from a document	Required
`sort`	The field used to sort and select the `metric`	Required
`order`	Sort direction for the `sort` field: `asc` (ascending) or `desc` (descending)	Required
`size`	The number of `<sort, metric>` tuples to return	Optional, default: `1`
`multi_value_mode`	How multi-valued `metric` fields are collapsed into a single value: `min`, `max`, `sum` or `avg`	Optional, default: `avg`
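The intended semantics of these parameters can be sketched in a few lines of Python. This is an illustration of the proposal only, not the prototype: the function name `metric_selector`, the in-memory list of dicts standing in for the documents of one bucket, and the choice to skip documents that lack either field are all assumptions.

```python
from statistics import mean

# Hypothetical reducers for the proposed multi_value_mode parameter.
_COLLAPSE = {"min": min, "max": max, "sum": sum, "avg": mean}


def metric_selector(docs, metric, sort, order, size=1, multi_value_mode="avg"):
    """Return up to `size` {"sort", "value"} pairs from `docs`,
    ordered by the `sort` field in the requested direction."""
    if order not in ("asc", "desc"):
        raise ValueError("order must be 'asc' or 'desc'")
    collapse = _COLLAPSE[multi_value_mode]

    pairs = []
    for doc in docs:
        # Assumption: documents missing either field are ignored.
        if metric not in doc or sort not in doc:
            continue
        value = doc[metric]
        if isinstance(value, (list, tuple)):
            if not value:
                continue  # treat an empty multi-value as missing
            value = collapse(value)  # collapse multi-valued metrics
        pairs.append({"sort": doc[sort], "value": value})

    # Ties keep their original relative order (Python's sort is stable).
    pairs.sort(key=lambda p: p["sort"], reverse=(order == "desc"))
    return pairs[:size]
```

Calling `metric_selector(docs, "latency", "date", "desc")` on the documents of one date_histogram bucket would then correspond to the "most recent latency" case from the introduction.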

Response

{
  "aggregations" : {
    "timeline" : {
      "buckets" : [
        {
          "key_as_string" : "2019-01-01T05:00:00.000Z",
          "key" : 1546318800000,
          "doc_count" : 3,
          "most_recent" : [
            {
              "sort": 1546322340000,
              "sort_as_string": "2019-01-01T05:59:00.000Z",
              "value": 123
            },
            {
              "sort": 1546320600000,
              "sort_as_string": "2019-01-01T05:30:00.000Z",
              "value": 19
            }
          ]
        },
        {
          "key_as_string" : "2019-01-01T06:00:00.000Z",
          "key" : 1546322400000,
          "doc_count" : 2,
          "most_recent" : [
            {
              "sort": 1546323000000,
              "sort_as_string": "2019-01-01T06:10:00.000Z",
              "value": 9999
            },
            {
              "sort": 1546322700000,
              "sort_as_string": "2019-01-01T06:05:00.000Z",
              "value": 2233
            }
          ]
        }
      ]
    }
  }
}

Note how the sort values are ordered descending within each bucket, and how each sort value is paired with a single metric value. There may be 1000 documents in a bucket, but unlike other aggregations this one returns n individual values taken from the documents themselves. If several documents tie on the sort value, the response would contain multiple objects with the same sort.

Misc

I have a crude prototype which demonstrates the feasibility.
We will need some kind of limit on size to prevent abuse. It should be fairly easy to track in a breaker, so that might be sufficient. I would feel better if there was a hard/soft limit though :) Like top_hits, this should be used to fetch a handful of values, not an entire index.
We should support sorting on non-numeric fields too (keyword, etc).
I'm less sure we need to support non-numeric metrics. I think starting with numerics-only is fine.
We can probably optimize the no-parent scenario with a BKD lookup, similar to how min/max work today. Not necessary for the first iteration.
As long as we only support asc/desc (e.g. the min or max values of a field), we shouldn't run into top-n accuracy issues like the terms agg can have. Each shard will always send its n min/max values and the coordinator will assemble a global min/max list. It might be that all top n values have the same sort key and others are omitted, but this is not incorrect since we are displaying individual results and not grouping.
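The shard/coordinator argument in the last point can be illustrated with a small Python sketch. The function names and the `(sort, value)` tuple layout are hypothetical and not taken from the prototype; the point is only that merging per-shard top-n lists gives the same answer as a global top-n, because any globally selected pair must also rank within the top n on its own shard.

```python
import heapq


def shard_top_n(pairs, n, order):
    """Each shard keeps only its n best (sort, value) pairs."""
    pick = heapq.nsmallest if order == "asc" else heapq.nlargest
    return pick(n, pairs, key=lambda p: p[0])


def coordinator_merge(shard_results, n, order):
    """Combine the per-shard lists and reduce them to a global top n."""
    merged = [pair for shard in shard_results for pair in shard]
    return shard_top_n(merged, n, order)


# Three shards holding (sort, value) pairs in arbitrary order.
shards = [[(1, 10), (5, 50), (9, 90)], [(2, 20), (8, 80)], [(7, 70)]]
partial = [shard_top_n(shard, 2, "desc") for shard in shards]
print(coordinator_merge(partial, 2, "desc"))
```

When more than n pairs share the best sort key, which of them survive depends on arrival order, but every returned pair is still a genuine top-n entry, matching the note above about ties.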

/cc @costin @colings86

elasticmachine · 2019-10-15T15:22:50Z

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

polyfractal · 2020-01-13T17:24:54Z

Potential naming idea: top_metric or similar, to parallel top_hits. That both signals that it has similar functionality and avoids confusion with bucket_selector, which is quite different.

polyfractal added >feature :Analytics/Aggregations Aggregations labels Oct 15, 2019

not-napoleon mentioned this issue Nov 26, 2019

Refactor ValuesSource and related classes #42949

Closed

85 tasks

nik9000 mentioned this issue Jan 10, 2020

New bucket aggregation - first/last of a date field. #50864

Closed

nik9000 mentioned this issue Jan 17, 2020

Implement top_metrics agg #51155

Merged

nik9000 closed this as completed in #51155 Feb 14, 2020

codebrain mentioned this issue Apr 1, 2020

7.7.0 meta ticket (Part 2) elastic/elasticsearch-net#4533

Closed

codebrain mentioned this issue Apr 14, 2020

7.7.0 meta ticket elastic/elasticsearch-net#4525

Closed

38 tasks
