Add multi terms aggregation feature #1629

wstejka · 2021-11-30T13:42:25Z

Is your feature request related to a problem? Please describe.

I'd like to have an aggregation feature that lets me sort by document count or by a metric aggregation on a composite key and get the top N results.
This feature was already implemented in version 7.15 of Elasticsearch.
Here is the link to the documentation:

https://www.elastic.co/guide/en/elasticsearch//reference/master/search-aggregations-bucket-multi-terms-aggregation.html

Describe the solution you'd like
It would be perfect to see this feature as part of the official solution (as a service). Currently, we would have to run ES 7.15 as a container and maintain it ourselves.

Describe alternatives you've considered
As an alternative, we've considered a custom implementation of this mechanism on the consumer side (our service), but it seems inefficient. Moreover, it would duplicate an already existing feature, so it makes no sense to do it that way.

code-xhyun · 2021-12-27T10:48:09Z

When will this feature get added?

dblock · 2021-12-27T16:38:41Z

When will this feature get added?

AFAIK nobody is working on this, please contribute / raise your hand if you are.

penghuo · 2022-01-19T17:56:19Z

SQL/PPL also requires this feature; see opensearch-project/sql#124.

rafael-gumiero · 2022-03-22T16:45:39Z

Any updates on this feature?

penghuo · 2022-03-22T17:02:34Z

Working on a PR now. Here is a quick demo of the feature; I will post the first version soon.

Request

GET localhost:9200/test_00001/_search
{
  "size": 0, 
  "aggs": {
    "hot": {
      "multi_terms": {
        "terms": [{
          "field": "region" 
        },{
          "field": "host" 
        }],
        "order": {"max-cpu": "desc"}
      },
      "aggs": {
        "max-cpu": { "max": { "field": "cpu" } }
      }      
    }
  }
}

Response

{
  "took": 118,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 8,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "multi-terms": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": [
            "dub",
            "h1"
          ],
          "key_as_string": "dub|h1",
          "doc_count": 2,
          "max-cpu": {
            "value": 90.0
          }
        },
        {
          "key": [
            "dub",
            "h2"
          ],
          "key_as_string": "dub|h2",
          "doc_count": 2,
          "max-cpu": {
            "value": 70.0
          }
        },
        {
          "key": [
            "iad",
            "h2"
          ],
          "key_as_string": "iad|h2",
          "doc_count": 2,
          "max-cpu": {
            "value": 50.0
          }
        },
        {
          "key": [
            "iad",
            "h1"
          ],
          "key_as_string": "iad|h1",
          "doc_count": 2,
          "max-cpu": {
            "value": 15.0
          }
        }
      ]
    }
  }
}
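To make the semantics of the demo concrete, here is a minimal client-side sketch of what the aggregation computes: group documents by the composite (region, host) key, track the document count and the maximum cpu per bucket, and return the top N buckets ordered by that maximum. The class and method names and the eight sample documents are invented for illustration; only the maxima and document counts are chosen to match the buckets in the response above. This is not how OpenSearch implements the aggregation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class MultiTermsSketch {
  // Hypothetical document shape; field names mirror the demo request.
  static final class Doc {
    final String region;
    final String host;
    final double cpu;

    Doc(String region, String host, double cpu) {
      this.region = region;
      this.host = host;
      this.cpu = cpu;
    }
  }

  // One bucket per distinct (region, host) pair.
  static final class Bucket {
    final List<String> key;
    long docCount;
    double maxCpu = Double.NEGATIVE_INFINITY;

    Bucket(List<String> key) {
      this.key = key;
    }
  }

  // Group by the composite key, track doc count and max cpu, keep the top `size` by max cpu.
  static List<Bucket> topByMaxCpu(List<Doc> docs, int size) {
    Map<List<String>, Bucket> buckets = new LinkedHashMap<>();
    for (Doc d : docs) {
      Bucket b = buckets.computeIfAbsent(Arrays.asList(d.region, d.host), Bucket::new);
      b.docCount++;
      b.maxCpu = Math.max(b.maxCpu, d.cpu);
    }
    List<Bucket> sorted = new ArrayList<>(buckets.values());
    sorted.sort(Comparator.comparingDouble((Bucket b) -> b.maxCpu).reversed());
    return sorted.subList(0, Math.min(size, sorted.size()));
  }

  // Invented sample data: two documents per (region, host) pair.
  static List<Doc> sampleDocs() {
    return Arrays.asList(
        new Doc("dub", "h1", 90), new Doc("dub", "h1", 40),
        new Doc("dub", "h2", 70), new Doc("dub", "h2", 20),
        new Doc("iad", "h2", 50), new Doc("iad", "h2", 10),
        new Doc("iad", "h1", 15), new Doc("iad", "h1", 5));
  }

  public static void main(String[] args) {
    for (Bucket b : topByMaxCpu(sampleDocs(), 10)) {
      System.out.println(
          String.join("|", b.key) + " doc_count=" + b.docCount + " max-cpu=" + b.maxCpu);
    }
  }
}
```

Doing this on the client means pulling every matching document over the network, which is why a server-side aggregation is preferable.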

msfidelis · 2022-03-22T18:24:40Z

🥳 🥳 🥳

qcoumes · 2022-03-23T17:20:23Z

Nice, can't wait for this feature

penghuo · 2022-04-27T18:26:04Z

1. Performance Improvement

As described in #2687, multi_terms aggregation is 20x slower than terms aggregation. After profiling, we found that encode is the major contributor.

Why is encode the major contributor?
- Serializing a List<Object> as a BytesRef takes a lot of CPU time, which makes it the main candidate for optimization.
Why not decode?
- Decode is executed once per bucket key, whereas encode is executed once per document. For example, with 1M docs but only 10 bucket keys, encode executes 1M times while decode executes only 10 times.
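The asymmetry between the two call counts can be sketched with a toy aggregation. Everything here is a stand-in: `encode` and `decode` use a plain string instead of a BytesRef, and the class and method names are hypothetical rather than taken from the aggregator.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class EncodeDecodeCount {
  long encodeCalls;
  long decodeCalls;

  // Stand-in for serializing a composite key; the real code writes a BytesRef.
  String encode(List<Object> key) {
    encodeCalls++;
    return key.get(0) + "|" + key.get(1);
  }

  // Stand-in for turning an encoded key back into its values.
  List<Object> decode(String encoded) {
    decodeCalls++;
    String[] parts = encoded.split("\\|");
    return Arrays.<Object>asList(Integer.valueOf(parts[0]), Integer.valueOf(parts[1]));
  }

  // Collection encodes once per document; building the result decodes once per bucket.
  Map<List<Object>, Long> aggregate(int docCount, int distinctKeys) {
    Map<String, Long> counts = new HashMap<>();
    for (int doc = 0; doc < docCount; doc++) {
      List<Object> key = Arrays.<Object>asList(doc % distinctKeys, 0);
      counts.merge(encode(key), 1L, Long::sum);
    }
    Map<List<Object>, Long> result = new HashMap<>();
    for (Map.Entry<String, Long> e : counts.entrySet()) {
      result.put(decode(e.getKey()), e.getValue());
    }
    return result;
  }

  public static void main(String[] args) {
    EncodeDecodeCount counter = new EncodeDecodeCount();
    counter.aggregate(1_000_000, 10);
    System.out.println(
        "encode calls: " + counter.encodeCalls + ", decode calls: " + counter.decodeCalls);
  }
}
```

With 1M documents and 10 distinct keys, the counters end at 1,000,000 encode calls and 10 decode calls, so any per-call saving in encode is multiplied by the document count.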

2. Experiments

Use hashCode() to generate the bucket key instead of serializing the entire List<Object>.
Use globalOrdinal for string terms.

2.1 Benchmark the average time of hashCode vs encode

Tested with integer values, hashCode() is roughly 15x faster than encode() in the results below.

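import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.Random;
import java.util.concurrent.TimeUnit;
import org.apache.lucene.util.BytesRef;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;
import org.opensearch.ExceptionsHelper;
import org.opensearch.common.io.stream.BytesStreamOutput; // package as of the 2.x codebase at the time
import org.opensearch.common.io.stream.StreamOutput; // may differ in later versions

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
public class HashBenchmark {
  @Param({"1000", "10000", "100000"}) int iterations;
  final Random random = new Random();

  // Serialize a two-integer key per iteration, as the aggregator does per document.
  @Benchmark
  public void encode(Blackhole bh) {
    for (int i = 0; i < iterations; i++) {
      bh.consume(encode(Arrays.asList(random.nextInt(32), random.nextInt(32))));
    }
  }

  // Compute only List.hashCode() for the same kind of key.
  @Benchmark
  public void hash(Blackhole bh) {
    for (int i = 0; i < iterations; i++) {
      bh.consume(Arrays.asList(random.nextInt(32), random.nextInt(32)).hashCode());
    }
  }

  private static BytesRef encode(List<Object> values) {
    try (BytesStreamOutput output = new BytesStreamOutput()) {
      output.writeCollection(values, StreamOutput::writeGenericValue);
      return output.bytes().toBytesRef();
    } catch (IOException e) {
      throw ExceptionsHelper.convertToRuntime(e);
    }
  }
}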

Benchmark             (iterations)  Mode  Cnt   Score   Error  Units
HashBenchmark.encode          1000  avgt    5   0.304 ± 0.006  ms/op
HashBenchmark.encode         10000  avgt    5   3.020 ± 0.071  ms/op
HashBenchmark.encode        100000  avgt    5  32.215 ± 4.714  ms/op
HashBenchmark.hash            1000  avgt    5   0.020 ± 0.001  ms/op
HashBenchmark.hash           10000  avgt    5   0.196 ± 0.019  ms/op
HashBenchmark.hash          100000  avgt    5   1.931 ± 0.083  ms/op

2.2 Test the performance of globalOrdinal and hashCode

I did a POC to verify the idea and test the performance. In general, it is about 4x faster than the existing multi_terms aggregation implementation.

# 1. multi_terms aggregation of keyword and integer field
time curl --request GET   --url 'http://localhost:9200/logs-201998/_search?pretty'   --header 'content-type: application/json'   --data '{"size":0,"aggs":{"hot":{"multi_terms":{"terms":[{"field":"clientip"},{"field":"status"}],"order":{"avg-size":"desc"}},"aggs":{"avg-size":{"avg":{"field":"size"}}}}}}'

real	0m1.859s

# 2. multi_terms aggregation of integer and integer field
time curl --request GET   --url 'http://localhost:9200/logs-201998/_search?pretty'   --header 'content-type: application/json'   --data '{"size":0,"aggs":{"hot":{"multi_terms":{"terms":[{"field":"status"},{"field":"status"}]}}}}'

real	0m0.653s

anirudha · 2022-05-20T18:34:17Z

We need to see how to showcase visualizations and UI changes in Visualize.

reta · 2022-05-20T19:43:22Z

@penghuo my apologies if I am missing something, but I suspect the idea of using hashCode() relies on it returning a unique value for each key. That is not true in general, and in particular not for Java's String implementation. A simple example:

jshell> "Ca".hashCode()
$5 ==> 2174

jshell> "DB".hashCode()
$6 ==> 2174

There are some in-depth details here [1], but we should not rely on hash code uniqueness [2].

[1] https://dzone.com/articles/what-is-wrong-with-hashcode-in-javalangstring
[2] https://github.com/penghuo/OpenSearch/blob/640b4a8bcbbf5a70fd990c711d6d3f97fb5da73a/server/src/main/java/org/opensearch/search/aggregations/bucket/terms/MultiTermsAggregator.java#L221
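The same concern carries over to composite keys, because List.hashCode() folds element hashes together as 31 * h + e. A small sketch follows; the class and method names are hypothetical, and the key values are picked only to show a collision.

```java
import java.util.Arrays;
import java.util.List;

final class CompositeKeyCollision {
  // True when two different composite keys produce the same List.hashCode().
  static boolean collides(List<Object> a, List<Object> b) {
    return !a.equals(b) && a.hashCode() == b.hashCode();
  }

  public static void main(String[] args) {
    // "Ca" and "DB" share a String hash, so keys that differ only in that term collide.
    System.out.println(
        collides(Arrays.<Object>asList("Ca", 200), Arrays.<Object>asList("DB", 200)));
    // Integer-only keys can collide as well: 31 * (31 + 0) + 31 == 31 * (31 + 1) + 0.
    System.out.println(collides(Arrays.<Object>asList(0, 31), Arrays.<Object>asList(1, 0)));
  }
}
```

Both checks print true, so a bucket map keyed only by the hash would merge documents that belong to different buckets.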

penghuo · 2022-06-06T16:01:42Z

Good point. Instead of using String.hashCode() for strings, I think we can use globalOrdinal for string values, but I have not verified that yet.
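A rough sketch of why ordinals avoid the collision problem: if every distinct string term maps to its own integer ordinal, a (string, int) pair can be packed into a single long with no two pairs sharing a key. This is only an illustration under assumptions: the sorted-set ranking below stands in for real global ordinals, which come from the index, and `buildOrdinals` and `packKey` are hypothetical names, not OpenSearch APIs.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

final class OrdinalKeySketch {
  // Assign each distinct term its rank in sorted order, a stand-in for global ordinals.
  static Map<String, Integer> buildOrdinals(Iterable<String> terms) {
    TreeSet<String> sorted = new TreeSet<>();
    for (String t : terms) {
      sorted.add(t);
    }
    Map<String, Integer> ordinals = new HashMap<>();
    for (String t : sorted) {
      ordinals.put(t, ordinals.size());
    }
    return ordinals;
  }

  // Pack the ordinal into the high 32 bits and the int value into the low 32 bits.
  static long packKey(int ordinal, int value) {
    return ((long) ordinal << 32) | (value & 0xFFFFFFFFL);
  }

  public static void main(String[] args) {
    Map<String, Integer> ords = buildOrdinals(Arrays.asList("DB", "Ca"));
    long k1 = packKey(ords.get("Ca"), 200);
    long k2 = packKey(ords.get("DB"), 200);
    // The two terms collide under String.hashCode() but get distinct packed keys.
    System.out.println(k1 != k2);
  }
}
```

The packing only covers two 32-bit components; keys with more terms or wider values would need a different layout.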

saratvemulapalli · 2022-06-28T15:45:31Z

@penghuo this issue is tagged 2.1.0.
We are in code freeze for 2.1. I'll move this issue to 2.2.0; let me know if you think otherwise.

joshuali925 · 2022-07-01T17:13:21Z

The two PRs that mention this issue were backported to 2.x before 2.1 was cut. The backport commits 32cfe53 and c7656f1 are already in the 2.1 branch.

Rishikesh1159 · 2022-07-01T17:49:00Z

Yes, as @joshuali925 mentioned, this feature is present in the 2.1 branch. We will close this issue after 2.1 is released.

wstejka added the enhancement and untriaged labels Nov 30, 2021

anasalkouz added Search:Aggregations and removed untriaged labels Nov 30, 2021

peterzhuamazon mentioned this issue Feb 17, 2022

[Enhancement] Allow user to define OPENSEARCH_SD_NOTIFY without hardcoding it to rpm/deb build #2073

Closed

anirudha assigned penghuo Mar 8, 2022

penghuo mentioned this issue Mar 31, 2022

Adding multi_term aggregator support #2687

Merged

5 tasks

penghuo mentioned this issue Apr 25, 2022

Correct the skip version, multi_terms aggregation is supported on 2.1 #3072

Merged

5 tasks

elfisher added the roadmap label May 4, 2022

brijos mentioned this issue May 18, 2022

[FEATURE] - Support for Multi Term Aggregation opensearch-project/documentation-website#582

Closed

rishabhmaurya mentioned this issue May 18, 2022

[Documentation] Document usecases for multi-term aggregation in bucket level monitor opensearch-project/alerting#453

Open

elfisher added the v2.1.0 label May 20, 2022

hdhalter mentioned this issue Jun 6, 2022

Multi-terms aggregation feature opensearch-project/documentation-website#643

Closed

saratvemulapalli added v2.2.0 and removed v2.1.0 labels Jun 28, 2022

Rishikesh1159 removed the v2.2.0 label Jul 1, 2022

Rishikesh1159 added the v2.1.0 label Jul 1, 2022

anirudha mentioned this issue Jul 7, 2022

Add 2.1.0 release notes opensearch-project/opensearch-build#2302

Merged

saratvemulapalli closed this as completed Jul 19, 2022

penghuo mentioned this issue Jul 20, 2022

TimeSeries optimizations in OpenSearch #3734

Open

rishabhmaurya mentioned this issue May 5, 2023

[BUG] Add support for string key format in bucket level aggregations messages opensearch-project/alerting#798

Closed
