Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add multi terms aggregation feature #1629

Closed
wstejka opened this issue Nov 30, 2021 · 14 comments
Closed

Add multi terms aggregation feature #1629

wstejka opened this issue Nov 30, 2021 · 14 comments
Assignees
Labels
enhancement Enhancement or improvement to existing feature or request Search:Aggregations v2.1.0 Issues and PRs related to version 2.1.0

Comments

@wstejka
Copy link

wstejka commented Nov 30, 2021

Is your feature request related to a problem? Please describe.

I'd like to have an aggregation feature that lets me sort by a number of a document or a metric aggregation on a composite key and get top N results.
This feature was already implemented in the 7.15 version of Elastic Search.
Here you go link to the documentation:

https://www.elastic.co/guide/en/elasticsearch//reference/master/search-aggregations-bucket-multi-terms-aggregation.html

Describe the solution you'd like
It would be perfect to see this feature as a part of the official solution (as a service). Currently, we would have to instantiate (ES 7.15) and maintain it as a container ourselves.

Describe alternatives you've considered
As an alternative, we've considered some custom implementation of this mechanism on the consumer side (our service) but it seems ineffective. Moreover, it duplicates already existing features and has no sense to do it in that way.

@wstejka wstejka added enhancement Enhancement or improvement to existing feature or request untriaged labels Nov 30, 2021
@code-xhyun
Copy link

When will this feature get added?

@dblock
Copy link
Member

dblock commented Dec 27, 2021

When will this feature get added?

AFAIK nobody is working on this, please contribute / raise your hand if you are.

@penghuo
Copy link
Contributor

penghuo commented Jan 19, 2022

SQL/PPL require this feature also. opensearch-project/sql#124.

@rafael-gumiero
Copy link

Any updates on this feature?

@penghuo
Copy link
Contributor

penghuo commented Mar 22, 2022

Working on PR now. Give a quick demo of the feature now and will post first version soon.

  • Request
GET localhost:9200/test_00001/_search
{
  "size": 0, 
  "aggs": {
    "hot": {
      "multi-terms": {
        "terms": [{
          "field": "region" 
        },{
          "field": "host" 
        }],
        "order": {"max-cpu": "desc"}
      },
      "aggs": {
        "max-cpu": { "max": { "field": "cpu" } }
      }      
    }
  }
}
  • Response
{
  "took": 118,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 8,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "multi-terms": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": [
            "dub",
            "h1"
          ],
          "key_as_string": "dub|h1",
          "doc_count": 2,
          "max-cpu": {
            "value": 90.0
          }
        },
        {
          "key": [
            "dub",
            "h2"
          ],
          "key_as_string": "dub|h2",
          "doc_count": 2,
          "max-cpu": {
            "value": 70.0
          }
        },
        {
          "key": [
            "iad",
            "h2"
          ],
          "key_as_string": "iad|h2",
          "doc_count": 2,
          "max-cpu": {
            "value": 50.0
          }
        },
        {
          "key": [
            "iad",
            "h1"
          ],
          "key_as_string": "iad|h1",
          "doc_count": 2,
          "max-cpu": {
            "value": 15.0
          }
        }
      ]
    }
  }
}

@msfidelis
Copy link

🥳 🥳 🥳

@qcoumes
Copy link

qcoumes commented Mar 23, 2022

Nice, can't wait for this feature

@penghuo
Copy link
Contributor

penghuo commented Apr 27, 2022

1.Performance Improvement

As described in #2687, multi_terms aggregation is 20x slower than terms aggregation. After profiling, we found that encode is major contributor.
Screen Shot 2022-04-21 at 8 50 40 PM

  • Why encode is major contributor?
    • Serialize List<Object> as BytesRef take a lot of CPU time. Which is the potential optimization point.
  • Why not decode?
    • Decode is execute only on each bucket key, encode is executed on each document. for example, there are 1M docs, but only 10 bucket key. encode will execute 1M times, but decode only execute 10 times.

2.Experiments

  • Use hasCode() to generate bucket key, instead of serialize entire List<Object>
  • Use globalOrdinal for string term.

2.1.Benchmark the average time of hashCode vs Encode.

Test with integer value, hashCode() is around 30x faster then encode().

  @Benchmark
  public void encode(Blackhole bh) {
    for (int i = 0; i < iterations; i++) {
      bh.consume(encode(Arrays.asList(random.nextInt(32), random.nextInt(32))));
    }
  }

  @Benchmark
  public void hash(Blackhole bh) {
    for (int i = 0; i < iterations; i++) {
      bh.consume(Arrays.asList(random.nextInt(32), random.nextInt(32)).hashCode());
    }
  }

  private static BytesRef encode(List<Object> values) {
    try (BytesStreamOutput output = new BytesStreamOutput()) {
      output.writeCollection(values, StreamOutput::writeGenericValue);
      return output.bytes().toBytesRef();
    } catch (IOException e) {
      throw ExceptionsHelper.convertToRuntime(e);
    }
  }
Benchmark             (iterations)  Mode  Cnt   Score   Error  Units
HashBenchmark.encode          1000  avgt    5   0.304 ± 0.006  ms/op
HashBenchmark.encode         10000  avgt    5   3.020 ± 0.071  ms/op
HashBenchmark.encode        100000  avgt    5  32.215 ± 4.714  ms/op
HashBenchmark.hash            1000  avgt    5   0.020 ± 0.001  ms/op
HashBenchmark.hash           10000  avgt    5   0.196 ± 0.019  ms/op
HashBenchmark.hash          100000  avgt    5   1.931 ± 0.083  ms/op

2.2 Test the performance of globalOrdinal and hashCode

Do a POC to verify the idea and test the performance. In general, we are 4x faster then existing multi_terms aggregation implementation.

# 1. multi_terms aggregation of keyword and integer field
time curl --request GET   --url http://localhost:9200/logs-201998/_search?pretty   --header 'content-type: application/json'   --data '{"size":0,"aggs":{"hot":{"multi_terms":{"terms":[{"field":"clientip"},{"field":"status"}],"order":{"avg-size":"desc"}},"aggs":{"avg-size":{"avg":{"field":"size"}}}}}}'

real	0m1.859s

# 1. multi_terms aggregation of integer and integer field
time curl --request GET   --url http://localhost:9200/logs-201998/_search?pretty   --header 'content-type: application/json'   --data '{"size":0,"aggs":{"hot":{"multi_terms":{"terms":[{"field":"status"},{"field":"status"}]}}}}'

real	0m0.653s

@anirudha
Copy link

we need to see how to - showcase visualizations and UI changes in visualize

@reta
Copy link
Collaborator

reta commented May 20, 2022

@penghuo my apologies if I am missing something, but I suspect the idea to use hashCode() relies on the fact that it returns an unique value for each key. This is particularly not true for Java's String impelementation, simple example:

jshell> "Ca".hashCode()
$5 ==> 2174

jshell> "DB".hashCode()
$6 ==> 2174

There are some in depth details in here [1] but we should not rely on hash code uniqueness [2].

[1] https://dzone.com/articles/what-is-wrong-with-hashcode-in-javalangstring
[2] https://github.com/penghuo/OpenSearch/blob/640b4a8bcbbf5a70fd990c711d6d3f97fb5da73a/server/src/main/java/org/opensearch/search/aggregations/bucket/terms/MultiTermsAggregator.java#L221

@penghuo
Copy link
Contributor

penghuo commented Jun 6, 2022

@penghuo my apologies if I am missing something, but I suspect the idea to use hashCode() relies on the fact that it returns an unique value for each key. This is particularly not true for Java's String impelementation, simple example:

jshell> "Ca".hashCode()
$5 ==> 2174

jshell> "DB".hashCode()
$6 ==> 2174

There are some in depth details in here [1] but we should not rely on hash code uniqueness [2].

[1] https://dzone.com/articles/what-is-wrong-with-hashcode-in-javalangstring [2] https://github.com/penghuo/OpenSearch/blob/640b4a8bcbbf5a70fd990c711d6d3f97fb5da73a/server/src/main/java/org/opensearch/search/aggregations/bucket/terms/MultiTermsAggregator.java#L221

Good point. Instead of using String.hashcode for string, I think using globalOrdinal for string value. But not verified yet.

@saratvemulapalli
Copy link
Member

@penghuo this issue is tagged 2.1.0.
We are code freeze for 2.1. I'll move this issue to 2.2.0, let me know if you think otherwise.

@saratvemulapalli saratvemulapalli added v2.2.0 and removed v2.1.0 Issues and PRs related to version 2.1.0 labels Jun 28, 2022
@joshuali925
Copy link
Member

The two PRs mentioned this issue were backported to 2.x before 2.1 was cut. The backport commits 32cfe53 and c7656f1 are already in 2.1 branch

@Rishikesh1159
Copy link
Member

Rishikesh1159 commented Jul 1, 2022

Yes as @joshuali925 mentioned this feature is present in feature 2.1 branch. We will close this issue after 2.1 is released

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Enhancement or improvement to existing feature or request Search:Aggregations v2.1.0 Issues and PRs related to version 2.1.0
Projects
None yet
Development

No branches or pull requests