[data.search.aggs] Support for multi-terms aggregation #77632

Closed
wylieconlon opened this issue Sep 16, 2020 · 10 comments
Assignees: flash1293
Labels: Feature:Aggregations, impact:low, loe:small, NeededFor:VisEditors

wylieconlon (Contributor) commented Sep 16, 2020

Edit: This has now been implemented in Elasticsearch as a new aggregation called multi_terms.

For example, I want to be able to look at the top 10 CPU usage values across the combination of container, host name, and datacenter. Because I only want the top 10 results, it's not possible to build this query without using scripted terms.

The ideal result of this query is the following table, which is not something that you can currently build with esaggs across three fields:

| datacenter | host name | container id | cpu (sorted 🔽) |
| --- | --- | --- | --- |
| us-east-1 | f80a10 | b91e34 | 55.20 |
| us-east-2 | 6dd558 | 493fb5 | 42.00 |
| us-west-2 | ba4d4 | 9675e5 | 40.73 |
| us-west-1 | 60413e | 0574ef | 39.92 |
| us-east-1 | f80a10 | b91e34 | 36.21 |
| us-east-2 | 6dd558 | 47b9c81 | 34.58 |
| us-west-1 | ba4d4 | 9675e5 | 24.01 |
| us-east-1 | f80a10 | b91e34 | 22.40 |
| us-east-2 | 6dd558 | 493fb5 | 15.91 |
| us-west-1 | 27380b | f44031 | 12.33 |

The terms documentation describes the tradeoffs with using scripted terms.

User input limitations

  • The user should be warned that this will use scripting and won't be as performant
  • When a single field is selected, this should be a standard terms aggregation
  • Scripted fields can't be used in the multi-field options because we don't have a safe way to concatenate scripts
  • Fields that aren't aggregatable, like full text fields, could technically be supported, but should not be for performance reasons

Implementation proposal

There are two possible implementations on the aggconfig level:

a) Don't deprecate the field parameter, but create a new fields parameter on the terms aggregation, with a new editor component. Introduce a new write function on the aggconfig to write the correct field or script as needed in the configuration.

b) Introduce the concept of multi-select on all fields in aggconfig, and change from field: string to field: string | string[]. Scripted fields are already supported for all aggregations, but are most useful for the Terms agg.

The other thing that needs to change is the tabify logic. Currently each aggconfig is converted into a single column, but with this proposal I would want to convert a single aggconfig into multiple columns.
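As a rough illustration of option (a), a hypothetical serialized agg config could accept a fields array next to the existing field parameter. The parameter names and shape below are assumptions for discussion, not a final API:

{
  "id": "1",
  "enabled": true,
  "type": "terms",
  "params": {
    "fields": ["geo.src", "geo.dest"], // hypothetical multi-field parameter
    "orderBy": "2",
    "order": "desc",
    "size": 5
  }
}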

Example requests

In this example, I want to see the top 5 pairs of geo.src and geo.dest based on the maximum transferred bytes between those pairs. This type of query is useful for finding outliers.

I can express this query as ES SQL:

POST _sql?format=csv
{
  "query": """
  SELECT "geo.src", "geo.dest", MAX(bytes) FROM "kibana_sample_data_logs"
  GROUP BY "geo.src", "geo.dest" ORDER BY MAX(bytes) DESC LIMIT 5
  """
}

Which returns the following result:

| geo.src | geo.dest | MAX(bytes) |
| --- | --- | --- |
| IR | MM | 19986.0 |
| IN | PY | 19956.0 |
| CN | TH | 19929.0 |
| MM | US | 19919.0 |
| IN | BD | 19904.0 |

I can also express it as DSL:

POST kibana_sample_data_logs/_search
{
  "size": 0,
  "aggs": {
    "top": {
      "terms": {
        "script": {
          "lang": "painless",
          "source": "doc['geo.src'].value + '---' + doc['geo.dest'].value"
        },
        "order": {
          "bytes": "desc"
        }, 
        "size": 5
      },
      "aggs": {
        "bytes": {
          "max": {
            "field": "bytes"
          }
        }
      }
    }
  }
}

The DSL query returns:

{
  "took" : 86,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "top" : {
      "doc_count_error_upper_bound" : -1,
      "sum_other_doc_count" : 13990,
      "buckets" : [
        {
          "key" : "IR---MM",
          "doc_count" : 1,
          "bytes" : {
            "value" : 19986.0
          }
        },
        {
          "key" : "IN---PY",
          "doc_count" : 1,
          "bytes" : {
            "value" : 19956.0
          }
        },
        {
          "key" : "CN---TH",
          "doc_count" : 19,
          "bytes" : {
            "value" : 19929.0
          }
        },
        {
          "key" : "MM---US",
          "doc_count" : 6,
          "bytes" : {
            "value" : 19919.0
          }
        },
        {
          "key" : "IN---BD",
          "doc_count" : 57,
          "bytes" : {
            "value" : 19904.0
          }
        }
      ]
    }
  }
}

Notice that in the DSL query, I had to join the terms with a separator string ('---') inside the script. I would expect such a separator to have a default value, but it might also be user-configurable.
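Since the edit at the top of this issue notes that Elasticsearch now ships a multi_terms aggregation, the same query can be sketched without scripting or separators. This is a sketch based on the documented multi_terms syntax, not a tested request:

POST kibana_sample_data_logs/_search
{
  "size": 0,
  "aggs": {
    "top": {
      "multi_terms": {
        "terms": [
          { "field": "geo.src" },
          { "field": "geo.dest" }
        ],
        "order": {
          "bytes": "desc"
        },
        "size": 5
      },
      "aggs": {
        "bytes": {
          "max": {
            "field": "bytes"
          }
        }
      }
    }
  }
}

Each bucket in the response then carries the terms as an array key (e.g. "key": ["IR", "MM"]) plus a joined "key_as_string", so no delimiter handling is needed on the client side.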

wylieconlon added the Feature:Aggregations and Team:AppArch labels on Sep 16, 2020
elasticmachine (Contributor)

Pinging @elastic/kibana-app-arch (Team:AppArch)

imotov (Contributor) commented Sep 16, 2020

The tricky part here is selecting the separator: we need to make sure it cannot occur in the concatenated fields, otherwise you get conflicts while concatenating them. In the past we have discussed supporting this use case directly in Elasticsearch (elastic/elasticsearch#23818). As far as I remember, there were a few obstacles, like supporting multi-value fields. Perhaps we can revisit it again. @polyfractal WDYT?

polyfractal (Contributor) commented Sep 17, 2020

Igor and I chatted about this offline yesterday. A few notes for posterity:

  • To implement this correctly (e.g. without user-provided delimiters) we need to modify the internal classes to operate on sorted sets rather than single values. This would be a relatively huge amount of work :) We have internal specializations for numerics, strings, ordinals, etc., and all of them expect single values
  • Alternatively, this could all be syntactic sugar for some painless scripts which we use/generate internally. We would still have to find a way to handle delimiters correctly
  • We could also "downconvert" everything and throw it into the map string specialization. This is essentially the same as the script method, except it bypasses the need to actually use painless. I suspect this would be easier to build and maintain too. Same performance profile as scripting though (slower, more memory, numerics look like strings, etc.). Same delimiter issue
  • Multivalued fields are... weird.
    • We could simply concat multi-valued fields together and use that as the field's contribution to the final tuple. Most doc values are sorted (although some are also de-duped), so it would be a deterministic process... but it might be confusing for the user to see all the multi-values together
    • We could alternatively generate the full matrix of combinations. That could create a lot of buckets and is also not the most pleasant coding task :)

In the "concat as string" cases, we also have to worry about how different datatypes are handled now that everything looks like a string. E.g. (string) "123" in one shard, but (long) 123 in another, and (double) 123.0 in a third. Do we treat these as the same? Add type suffixes (123d)?

mitar commented Dec 2, 2020

In the "concat as string" cases, we also have to worry about how different datatypes are handled now that everything looks like a string.

Why not simply encode them as a JSON array, or some other compact/deterministic encoding? The example above would then be ["IR", "MM"]?

wylieconlon (Contributor, Author)

@imotov it looks like support for this was added to Elasticsearch in elastic/elasticsearch#67597, documented here, which is exciting!

Let's shift this issue to focus on how the integration with Kibana aggConfigs would work. I think the work needed is:

  • Introduce the fields parameter:
    • Array, not single value. Same validation as field
    • UI for fields in the vis_default_editor
  • Tabify support:
    • Each multi-term aggregation will generate a separate column for each field, which has never been done in tabify
    • Lens requires predictable IDs, so we should agree on a pattern for the ID generation of multiple columns. The current pattern of col-0-0 won't work, but maybe we can add a suffix like col-0-0[field.name] or col-0-0-0 to indicate the multi-term columns (see the sketch after this list).
      • Another option is to have all the columns share the same ID, but I think this will break downstream usage by Lens
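As a hypothetical illustration of the suffix idea above (the column IDs and names are invented for this sketch, not an agreed format), tabify could emit columns like:

[
  { "id": "col-0-0-0", "name": "geo.src" },  // first field of the multi-terms agg
  { "id": "col-0-0-1", "name": "geo.dest" }, // second field of the same agg
  { "id": "col-1-1", "name": "Max bytes" }   // metric column, unchanged
]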

cc @flash1293 @ppisljar

wylieconlon changed the title from "[data.search.aggs] Terms agg should allow multiple fields to be concatenated together using scripting" to "[data.search.aggs] Support for multi-terms aggregation" on Mar 22, 2021
flash1293 (Contributor)

> Each multi-term aggregation will generate a separate column for each field, which has never been done in tabify

I'm not sure whether we want to have separate columns or a single column, somehow joining the terms of all fields. I guess it depends on how we are planning to show this feature to the user. Do we want to have a field multi select in Lens? If yes, I think a single column is more consistent (at least it's more in line with my own mental model of how the Lens UI works). If we want to merge multiple dimensions into a single multi field terms agg behind the scenes, multiple columns make sense to me.

In both cases we would need to answer a bunch of questions about how it will integrate with other things (how does formatting work, how does filtering work, how do we communicate the difference from nested "top values"?).

wylieconlon (Contributor, Author) commented Mar 22, 2021

@flash1293 I've made some assumptions here that might not be clear. Do you share these assumptions?

  • Merging multiple dimensions together is a harder UX problem; it simplifies the UX to have a multi-field selector first.
  • To display the chart, we need a data structure that contains:
    • The source aggregation type, multi_terms
    • For each field, the field name & field formatter
  • For rendering in a bar chart or pie chart, we will need to concatenate all the values together in the legend & tooltips
  • For rendering in a table, we have more space and therefore should render each field as a separate column

In my previous comment I assumed that tabify needs to create multiple columns, one for each field, with a predictable ID. By writing out my assumptions here I think that's not a requirement, but we would need a more-complex cell structure to compensate. Two examples:

  1. If each multi_term field gets its own column:

| Multi-term field1 (String) | Multi-term field2 (IP address) | Count |
| --- | --- | --- |
| Raw string value | Raw numeric value | 100 |

  2. If each multi_term field is combined into a single column:

| Multi-term (fields: [field1, field2], formatters: [String, IP address]) | Count |
| --- | --- |
| { keys: [Raw string value, Raw numeric value] } | 100 |

Now that I've written it out, I think option 2 is totally fine. This would simplify the logic needed in tabify, but it increases the complexity of the Lens rendering code.

flash1293 (Contributor) commented Mar 22, 2021

@wylieconlon

I think we have a similar situation with ranges: they are more complex than a single number, but there might already be a formatter defined for just a single number that we want to reuse (a kind of higher-order formatter). I think we can do the same thing here.

The cell contains:

{
  field1: "term1",
  field2: "term2",
  field3: "term3"
}

(array would be possible as well)

The formatter params look like this:

{
  id: "multi_field",
  params: {
    subFormatters: {
      field1: {
        id: "string"
      },
      field2: {
        id: "url", // or whatever
        params: { /* ... */ }
      },
      field3: ...
    },
    concatSymbol: "|"
  }
}

The Lens rendering code has to work with a non-primitive type (something that is already true today because of ranges), and can just pass the object to the higher-order formatter, which will take care of the rest. Filtering can work in the same way: we just pass the cell value object as-is, and the filter building logic turns it into 3 filters (one per field) and shows the multi-field modal.
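For example (a sketch under the assumption that each field maps to a plain term filter; this is not the actual filter-builder output), clicking the cell above could expand into something like:

{
  "bool": {
    "filter": [
      { "term": { "field1": "term1" } },
      { "term": { "field2": "term2" } },
      { "term": { "field3": "term3" } }
    ]
  }
}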

I mostly agree with your assumptions; the only thing I'm not sure about is the table rendering. If we have three columns instead of one, sorting via the column action popover becomes confusing, as does hiding the column and other column-based settings. We would need special logic for that in all the places where we keep state about a "column". IMHO the advantages are not worth the additional UI complexity (it introduces lots of edge cases, like: what if the user adds three fields, changes the width of each of them, and then deletes one of the fields?).

ppisljar (Member) commented Mar 23, 2021

Looking at the ES documentation, what @flash1293 is suggesting makes sense to me.

exalate-issue-sync bot added the impact:low and loe:small labels on Jun 21, 2021
flash1293 self-assigned this on Nov 1, 2021
flash1293 (Contributor)
Fixed by #116928
