[data.search.aggs] Support for multi-terms aggregation #77632

Closed
wylieconlon opened this issue Sep 16, 2020 · 10 comments
Assignees: flash1293
Labels: Feature:Aggregations, impact:low, loe:small, NeededFor:VisEditors

wylieconlon (Contributor) commented Sep 16, 2020

Edit: This has now been implemented in Elasticsearch as a new aggregation called multi_terms.

For example, I want to be able to look at the top 10 CPU usage values across the combination of container, host name, and datacenter. Because I only want the top 10 results, it's not possible to build this query without using scripted terms.

The ideal result of this query is the following table, which is not something that you can currently build with esaggs across three fields:

| datacenter | host name | container id | cpu (sorted 🔽) |
| --- | --- | --- | --- |
| us-east-1 | f80a10 | b91e34 | 55.20 |
| us-east-2 | 6dd558 | 493fb5 | 42.00 |
| us-west-2 | ba4d4 | 9675e5 | 40.73 |
| us-west-1 | 60413e | 0574ef | 39.92 |
| us-east-1 | f80a10 | b91e34 | 36.21 |
| us-east-2 | 6dd558 | 47b9c81 | 34.58 |
| us-west-1 | ba4d4 | 9675e5 | 24.01 |
| us-east-1 | f80a10 | b91e34 | 22.40 |
| us-east-2 | 6dd558 | 493fb5 | 15.91 |
| us-west-1 | 27380b | f44031 | 12.33 |

The terms documentation describes the tradeoffs with using scripted terms.

User input limitations

  • The user should be warned that this will use scripting and won't be as performant
  • When a single field is selected, this should be a standard terms aggregation
  • Scripted fields can't be used in the multi-field options because we don't have a safe way to concatenate scripts
  • Fields that aren't aggregatable, like full text fields, could technically be supported, but should not be for performance reasons

Implementation proposal

There are two possible implementations on the aggconfig level:

a) Don't deprecate the field parameter, but create a new fields parameter on the terms aggregation, with a new editor component. Introduce a new write function on the aggconfig to write the correct field or script as needed in the configuration.

b) Introduce the concept of multi-select on all fields in aggconfig, and change from field: string to field: string | string[]. Scripted fields are already supported for all aggregations, but are most useful for the Terms agg.

The other thing that needs to change is the tabify logic. Currently each aggconfig is converted into a single column, but with this proposal I would want to convert a single aggconfig into multiple columns.
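As a rough illustration of option (a), a hypothetical serialized agg config could accept a fields array next to the existing field parameter. The parameter names and shape below are assumptions for discussion, not a final API:

{
  "id": "1",
  "enabled": true,
  "type": "terms",
  "params": {
    "fields": ["geo.src", "geo.dest"], // hypothetical multi-field parameter
    "orderBy": "2",
    "order": "desc",
    "size": 5
  }
}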

Example requests

In this example, I want to see the top 5 pairs of geo.src and geo.dest based on the maximum transferred bytes between those pairs. This type of query is useful for finding outliers.

I can express this query as ES SQL:

POST _sql?format=csv
{
  "query": """
  SELECT "geo.src", "geo.dest", MAX(bytes) FROM "kibana_sample_data_logs"
  GROUP BY "geo.src", "geo.dest" ORDER BY MAX(bytes) DESC LIMIT 5
  """
}

Which returns the following result:

| geo.src | geo.dest | MAX(bytes) |
| --- | --- | --- |
| IR | MM | 19986.0 |
| IN | PY | 19956.0 |
| CN | TH | 19929.0 |
| MM | US | 19919.0 |
| IN | BD | 19904.0 |

I can also express it as DSL:

POST kibana_sample_data_logs/_search
{
  "size": 0,
  "aggs": {
    "top": {
      "terms": {
        "script": {
          "lang": "painless",
          "source": "doc['geo.src'].value + '---' + doc['geo.dest'].value"
        },
        "order": {
          "bytes": "desc"
        }, 
        "size": 5
      },
      "aggs": {
        "bytes": {
          "max": {
            "field": "bytes"
          }
        }
      }
    }
  }
}

The DSL query returns:

{
  "took" : 86,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "top" : {
      "doc_count_error_upper_bound" : -1,
      "sum_other_doc_count" : 13990,
      "buckets" : [
        {
          "key" : "IR---MM",
          "doc_count" : 1,
          "bytes" : {
            "value" : 19986.0
          }
        },
        {
          "key" : "IN---PY",
          "doc_count" : 1,
          "bytes" : {
            "value" : 19956.0
          }
        },
        {
          "key" : "CN---TH",
          "doc_count" : 19,
          "bytes" : {
            "value" : 19929.0
          }
        },
        {
          "key" : "MM---US",
          "doc_count" : 6,
          "bytes" : {
            "value" : 19919.0
          }
        },
        {
          "key" : "IN---BD",
          "doc_count" : 57,
          "bytes" : {
            "value" : 19904.0
          }
        }
      ]
    }
  }
}

Notice that in the DSL query, I had to join the terms with a separator string ('---') inside the script. I would expect such a separator to have a default value, but it might also be user-configurable.
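Since the edit at the top of this issue notes that Elasticsearch now ships a multi_terms aggregation, the same query can be sketched without scripting or separators. This is a sketch based on the documented multi_terms syntax, not a tested request:

POST kibana_sample_data_logs/_search
{
  "size": 0,
  "aggs": {
    "top": {
      "multi_terms": {
        "terms": [
          { "field": "geo.src" },
          { "field": "geo.dest" }
        ],
        "order": {
          "bytes": "desc"
        },
        "size": 5
      },
      "aggs": {
        "bytes": {
          "max": {
            "field": "bytes"
          }
        }
      }
    }
  }
}

Each bucket in the response then carries the terms as an array key (e.g. "key": ["IR", "MM"]) plus a joined "key_as_string", so no delimiter handling is needed on the client side.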

wylieconlon added the Feature:Aggregations and Team:AppArch labels on Sep 16, 2020
elasticmachine (Contributor)

Pinging @elastic/kibana-app-arch (Team:AppArch)

imotov (Contributor) commented Sep 16, 2020

The tricky part here is selecting the separator: we need to make sure it cannot occur in the concatenated fields, otherwise you get conflicts while concatenating them. In the past we have discussed supporting this use case directly in Elasticsearch (elastic/elasticsearch#23818). As far as I remember, there were a few obstacles, like supporting multi-value fields. Perhaps we can revisit it again. @polyfractal WDYT?

polyfractal (Contributor) commented Sep 17, 2020

Igor and I chatted about this offline yesterday. A few notes for posterity:

  • To implement this correctly (e.g. without user-provided delimiters) we need to modify the internal classes to operate on sorted sets rather than single values. This would be a relatively huge amount of work :) We have internal specializations for numerics, strings, ordinals, etc., and all of them expect single values
  • Alternatively, this could all be syntactic sugar for some painless scripts which we use/generate internally. We would still have to find a way to handle delimiters correctly
  • We could also "downconvert" everything and throw it into the map string specialization. This is essentially the same as the script method, except it bypasses the need to actually use painless. I suspect this would be easier to build and maintain too. Same performance profile as scripting though (slower, more memory, numerics look like strings, etc.). Same delimiter issue
  • Multivalued fields are... weird.
    • We could simply concat multi-valued fields together and use that as the field's contribution to the final tuple. Most doc values are sorted (although some are also de-duped), so it would be a deterministic process... but it might be confusing for the user to see all the multi-values together
    • We could alternatively generate the full matrix of combinations. That could create a lot of buckets and is also not the most pleasant coding task :)

In the "concat as string" cases, we also have to worry about how different datatypes are handled now that everything looks like a string. E.g. (string) "123" in one shard, but (long) 123 in another, and (double) 123.0 in a third. Do we treat these as the same? Add type suffixes (123d)?

mitar commented Dec 2, 2020

In the "concat as string" cases, we also have to worry about how different datatypes are handled now that everything looks like a string.

Why not simply encode them as a JSON array, or some other compact/deterministic encoding? The example above would then be ["IR", "MM"]?

wylieconlon (Contributor, Author)

@imotov it looks like support for this was added to Elasticsearch in elastic/elasticsearch#67597, documented here, which is exciting!

Let's shift this issue to focus on how the integration with Kibana aggConfigs would work. I think the work needed is:

  • Introduce the fields parameter:
    • Array, not single value. Same validation as field
    • UI for fields in the vis_default_editor
  • Tabify support:
    • Each multi-term aggregation will generate a separate column for each field, which has never been done in tabify
    • Lens requires predictable IDs, so we should agree on a pattern for the ID generation of multiple columns. The current pattern of col-0-0 won't work, but maybe we can add a suffix like col-0-0[field.name] or col-0-0-0 to indicate the multi-term columns (see the sketch after this list).
      • Another option is to have all the columns share the same ID, but I think this will break downstream usage by Lens
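As a hypothetical illustration of the suffix idea above (the column IDs and names are invented for this sketch, not an agreed format), tabify could emit columns like:

[
  { "id": "col-0-0-0", "name": "geo.src" },  // first field of the multi-terms agg
  { "id": "col-0-0-1", "name": "geo.dest" }, // second field of the same agg
  { "id": "col-1-1", "name": "Max bytes" }   // metric column, unchanged
]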

cc @flash1293 @ppisljar

wylieconlon changed the title from "[data.search.aggs] Terms agg should allow multiple fields to be concatenated together using scripting" to "[data.search.aggs] Support for multi-terms aggregation" on Mar 22, 2021
flash1293 (Contributor)

> Each multi-term aggregation will generate a separate column for each field, which has never been done in tabify

I'm not sure whether we want to have separate columns or a single column, somehow joining the terms of all fields. I guess it depends on how we are planning to show this feature to the user. Do we want to have a field multi select in Lens? If yes, I think a single column is more consistent (at least it's more in line with my own mental model of how the Lens UI works). If we want to merge multiple dimensions into a single multi field terms agg behind the scenes, multiple columns make sense to me.

In both cases we would need to answer a bunch of questions about how it will integrate with other things (how does formatting work, how does filtering work, how do we communicate the difference from nested "top values"?).

wylieconlon (Contributor, Author) commented Mar 22, 2021

@flash1293 I've made some assumptions here that might not be clear. Do you share these assumptions?

  • Merging multiple dimensions together is a harder UX problem; it simplifies the UX to have a multi-field selector first.
  • To display the chart, we need a data structure that contains:
    • The source aggregation type, multi_terms
    • For each field, the field name & field formatter
  • For rendering in a bar chart or pie chart, we will need to concatenate all the values together in the legend & tooltips
  • For rendering in a table, we have more space and therefore should render each field as a separate column

In my previous comment I assumed that tabify needs to create multiple columns, one for each field, with a predictable ID. By writing out my assumptions here I think that's not a requirement, but we would need a more-complex cell structure to compensate. Two examples:

  1. If each multi_term field gets its own column:

| Multi-term field1 (String) | Multi-term field2 (IP address) | Count |
| --- | --- | --- |
| Raw string value | Raw numeric value | 100 |

  2. If each multi_term field is combined into a single column:

| Multi-term (fields: [field1, field2], formatters: [String, IP address]) | Count |
| --- | --- |
| { keys: [Raw string value, Raw numeric value] } | 100 |

Now that I've written it out, I think option 2 is totally fine. This would simplify the logic needed in tabify, but it increases the complexity of the Lens rendering code.

flash1293 (Contributor) commented Mar 22, 2021

@wylieconlon

I think we have a similar situation with ranges: they are more complex than a single number, but there might already be a formatter defined for just a single number that we want to reuse (a kind of higher-order formatter). I think we can do the same thing here.

The cell contains:

{
  field1: "term1",
  field2: "term2",
  field3: "term3"
}

(array would be possible as well)

The formatter params look like this:

{
  id: "multi_field",
  params: {
    subFormatters: {
      field1: {
        id: "string"
      },
      field2: {
        id: "url", // or whatever
        params: { /* ... */ }
      },
      field3: ...
    },
    concatSymbol: "|"
  }
}

The Lens rendering code has to work with a non-primitive type (something that is already true today because of ranges), and can just pass the object to the higher-order formatter, which will take care of the rest. Filtering can work in the same way: we just pass the cell value object as-is, and the filter building logic turns it into 3 filters (one per field) and shows the multi-field modal.
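For example (a sketch under the assumption that each field maps to a plain term filter; this is not the actual filter-builder output), clicking the cell above could expand into something like:

{
  "bool": {
    "filter": [
      { "term": { "field1": "term1" } },
      { "term": { "field2": "term2" } },
      { "term": { "field3": "term3" } }
    ]
  }
}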

I mostly agree with your assumptions; the only thing I'm not sure about is the table rendering. If we have three columns instead of one, sorting via the column action popover becomes confusing, as does hiding the column and other column-based settings. We would need special logic for that in all the places where we keep state about a "column". IMHO the advantages are not worth the additional UI complexity (it introduces lots of edge cases, like: what if the user adds three fields, changes the width of each of them, and then deletes one of the fields?).

ppisljar (Member) commented Mar 23, 2021

Looking at the ES documentation, what @flash1293 is suggesting makes sense to me.

exalate-issue-sync bot added the impact:low and loe:small labels on Jun 21, 2021
flash1293 self-assigned this on Nov 1, 2021
flash1293 (Contributor)
Fixed by #116928
