
Understand/Improve the performance of date histograms #9310

Closed
jainankitk opened this issue Aug 14, 2023 · 27 comments · Fixed by #11083
Labels: enhancement, Performance, Search:Aggregations, v2.12.0

Comments

@jainankitk
Collaborator

jainankitk commented Aug 14, 2023

I have been looking at understanding and improving the performance of date histograms in OpenSearch. For this purpose, I set up a single-node cluster and ingested the nyc_taxis dataset.

While running a single histogram aggregation request in a loop, I collected the following CPU flame graph:

[CPU flame graph screenshot, 2023-08-14]

Trying to understand how we can reduce the scoreRange cost for this request.

jainankitk added the enhancement and untriaged labels on Aug 14, 2023
@jainankitk
Collaborator Author

jainankitk commented Aug 17, 2023

I did some more experiments running the histogram query on a one-node setup and tracking the took times:

$ curl -s -X GET "localhost:9200/nyc_taxis/_search?pretty" -H 'Content-Type: application/json' -d'{"size": 0,"query": {"range": {"dropoff_datetime": {"gte": "2015-01-01 00:00:00","lt": "2016-01-01 00:00:00"}}},"aggs": {"dropoffs_over_time": {"date_histogram": {"field": "dropoff_datetime","calendar_interval": "month"}}}}' | head
{
  "took" : 8822,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {

Initially, I rewrote this as a cardinality filter query, and the took time for one month was significantly lower:

$ curl -s -X GET "localhost:9200/nyc_taxis/_search?pretty" -H 'Content-Type: application/json' -d'{"size": 0,"track_total_hits": "true","query": {"range": {"dropoff_datetime": {"gte": "2015-12-01 00:00:00","lt": "2015-12-31 00:00:00"}}}}' | head
{
  "took" : 162,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {

Even though the took time for one month is much lower (and the overall query latency can be significantly reduced), I was not too keen on this approach as it has very limited applicability (only cardinality aggregations without aggregation composition).

@jainankitk
Collaborator Author

jainankitk commented Aug 17, 2023

While going through the Lucene code, I noticed that the critical path is the date histogram collector, which is very iterative in nature. However, it is very efficient at computing results for many levels of composition in a single iteration. Hence, I came up with another way of rewriting the query, applicable to more histogram aggregation queries with the following properties:

  • Leading scorer is date range filter
  • Top level aggregation is date range

The key idea is to slice the date range itself into multiple queries using a rewrite. The response transformation is quite simple: a union over disjoint sets. Sharing some results below (doing the rewrite on the client instead of the server; a sketch of the client-side slicing and merge follows the 12-slice results):

Partitioning into 2 slices:

$ curl -s -X GET "localhost:9200/nyc_taxis/_msearch?pretty" -H 'Content-Type: application/json' -d'
> {}
> {"size": 0,"query": {"range": {"dropoff_datetime": {"gte": "2015-01-01 00:00:00","lt": "2015-06-30 00:00:00"}}},"aggs": {"dropoffs_over_time": {"date_histogram": {"field": "dropoff_datetime","calendar_interval": "month"}}}}
> {}
> {"size": 0,"query": {"range": {"dropoff_datetime": {"gte": "2015-07-01 00:00:00","lt": "2015-12-31 00:00:00"}}},"aggs": {"dropoffs_over_time": {"date_histogram": {"field": "dropoff_datetime","calendar_interval": "month"}}}}
> ' | head
{
  "took" : 4225,
  "responses" : [
    {
      "took" : 4224,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,

Partitioning into 4 slices:

$ curl -s -X GET "localhost:9200/nyc_taxis/_msearch?pretty" -H 'Content-Type: application/json' -d'
> {}
> {"size": 0,"query": {"range": {"dropoff_datetime": {"gte": "2015-01-01 00:00:00","lt": "2015-03-31 00:00:00"}}},"aggs": {"dropoffs_over_time": {"date_histogram": {"field": "dropoff_datetime","calendar_interval": "month"}}}}
> {}
> {"size": 0,"query": {"range": {"dropoff_datetime": {"gte": "2015-04-01 00:00:00","lt": "2015-06-30 00:00:00"}}},"aggs": {"dropoffs_over_time": {"date_histogram": {"field": "dropoff_datetime","calendar_interval": "month"}}}}
> {}
> {"size": 0,"query": {"range": {"dropoff_datetime": {"gte": "2015-07-01 00:00:00","lt": "2015-09-30 00:00:00"}}},"aggs": {"dropoffs_over_time": {"date_histogram": {"field": "dropoff_datetime","calendar_interval": "month"}}}}
> {}
> {"size": 0,"query": {"range": {"dropoff_datetime": {"gte": "2015-10-01 00:00:00","lt": "2015-12-31 00:00:00"}}},"aggs": {"dropoffs_over_time": {"date_histogram": {"field": "dropoff_datetime","calendar_interval": "month"}}}}
> '  | head
{
  "took" : 3588,
  "responses" : [
    {
      "took" : 3583,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,

Partitioning into 12 slices:

$ curl -s -X GET "localhost:9200/nyc_taxis/_msearch?pretty" -H 'Content-Type: application/json' -d'
> {}
> {"size": 0,"query": {"range": {"dropoff_datetime": {"gte": "2015-01-01 00:00:00","lt": "2015-01-31 00:00:00"}}},"aggs": {"dropoffs_over_time": {"date_histogram": {"field": "dropoff_datetime","calendar_interval": "month"}}}}
> {}
> {"size": 0,"query": {"range": {"dropoff_datetime": {"gte": "2015-02-01 00:00:00","lt": "2015-02-28 00:00:00"}}},"aggs": {"dropoffs_over_time": {"date_histogram": {"field": "dropoff_datetime","calendar_interval": "month"}}}}
> {}
> {"size": 0,"query": {"range": {"dropoff_datetime": {"gte": "2015-03-01 00:00:00","lt": "2015-03-31 00:00:00"}}},"aggs": {"dropoffs_over_time": {"date_histogram": {"field": "dropoff_datetime","calendar_interval": "month"}}}}
> {}
> {"size": 0,"query": {"range": {"dropoff_datetime": {"gte": "2015-04-01 00:00:00","lt": "2015-04-30 00:00:00"}}},"aggs": {"dropoffs_over_time": {"date_histogram": {"field": "dropoff_datetime","calendar_interval": "month"}}}}
> {}
> {"size": 0,"query": {"range": {"dropoff_datetime": {"gte": "2015-05-01 00:00:00","lt": "2015-05-31 00:00:00"}}},"aggs": {"dropoffs_over_time": {"date_histogram": {"field": "dropoff_datetime","calendar_interval": "month"}}}}
> {}
> {"size": 0,"query": {"range": {"dropoff_datetime": {"gte": "2015-06-01 00:00:00","lt": "2015-06-30 00:00:00"}}},"aggs": {"dropoffs_over_time": {"date_histogram": {"field": "dropoff_datetime","calendar_interval": "month"}}}}
> {}
> {"size": 0,"query": {"range": {"dropoff_datetime": {"gte": "2015-07-01 00:00:00","lt": "2015-07-31 00:00:00"}}},"aggs": {"dropoffs_over_time": {"date_histogram": {"field": "dropoff_datetime","calendar_interval": "month"}}}}
> {}
> {"size": 0,"query": {"range": {"dropoff_datetime": {"gte": "2015-08-01 00:00:00","lt": "2015-08-31 00:00:00"}}},"aggs": {"dropoffs_over_time": {"date_histogram": {"field": "dropoff_datetime","calendar_interval": "month"}}}}
> {}
> {"size": 0,"query": {"range": {"dropoff_datetime": {"gte": "2015-09-01 00:00:00","lt": "2015-09-30 00:00:00"}}},"aggs": {"dropoffs_over_time": {"date_histogram": {"field": "dropoff_datetime","calendar_interval": "month"}}}}
> {}
> {"size": 0,"query": {"range": {"dropoff_datetime": {"gte": "2015-10-01 00:00:00","lt": "2015-10-31 00:00:00"}}},"aggs": {"dropoffs_over_time": {"date_histogram": {"field": "dropoff_datetime","calendar_interval": "month"}}}}
> {}
> {"size": 0,"query": {"range": {"dropoff_datetime": {"gte": "2015-11-01 00:00:00","lt": "2015-11-30 00:00:00"}}},"aggs": {"dropoffs_over_time": {"date_histogram": {"field": "dropoff_datetime","calendar_interval": "month"}}}}
> {}
> {"size": 0,"query": {"range": {"dropoff_datetime": {"gte": "2015-12-01 00:00:00","lt": "2015-12-31 00:00:00"}}},"aggs": {"dropoffs_over_time": {"date_histogram": {"field": "dropoff_datetime","calendar_interval": "month"}}}}
> ' | head
{
  "took" : 3291,
  "responses" : [
    {
      "took" : 1813,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,

@jainankitk
Collaborator Author

jainankitk commented Aug 17, 2023

As we can observe from the took times above, most of the reduction comes from going from 1 to 2 slices (~50%); increasing the number of slices further does not yield as much of a jump. The compute limitation of a single node might also be coming into play.

If we combine the slicing approach with the cardinality-specific rewrite, the latency reduction is more than 80%. However, that approach does not give any flexibility on the number of slices and is limited to specific cardinality queries like the one above. A sketch of turning the per-slice hit counts into buckets follows the output below. Sample invocation:

$ curl -s -X GET "localhost:9200/nyc_taxis/_msearch?pretty" -H 'Content-Type: application/json' -d'
> {}
> {"size": 0,"track_total_hits": "true","query": {"range": {"dropoff_datetime": {"gte": "2015-01-01 00:00:00","lt": "2015-01-31 00:00:00"}}}}
> {}
> {"size": 0,"track_total_hits": "true","query": {"range": {"dropoff_datetime": {"gte": "2015-02-01 00:00:00","lt": "2015-02-28 00:00:00"}}}}
> {}
> {"size": 0,"track_total_hits": "true","query": {"range": {"dropoff_datetime": {"gte": "2015-03-01 00:00:00","lt": "2015-03-31 00:00:00"}}}}
> {}
> {"size": 0,"track_total_hits": "true","query": {"range": {"dropoff_datetime": {"gte": "2015-04-01 00:00:00","lt": "2015-04-30 00:00:00"}}}}
> {}
> {"size": 0,"track_total_hits": "true","query": {"range": {"dropoff_datetime": {"gte": "2015-05-01 00:00:00","lt": "2015-05-31 00:00:00"}}}}
> {}
> {"size": 0,"track_total_hits": "true","query": {"range": {"dropoff_datetime": {"gte": "2015-06-01 00:00:00","lt": "2015-06-30 00:00:00"}}}}
> {}
> {"size": 0,"track_total_hits": "true","query": {"range": {"dropoff_datetime": {"gte": "2015-07-01 00:00:00","lt": "2015-07-31 00:00:00"}}}}
> {}
> {"size": 0,"track_total_hits": "true","query": {"range": {"dropoff_datetime": {"gte": "2015-08-01 00:00:00","lt": "2015-08-31 00:00:00"}}}}
> {}
> {"size": 0,"track_total_hits": "true","query": {"range": {"dropoff_datetime": {"gte": "2015-09-01 00:00:00","lt": "2015-09-30 00:00:00"}}}}
> {}
> {"size": 0,"track_total_hits": "true","query": {"range": {"dropoff_datetime": {"gte": "2015-10-01 00:00:00","lt": "2015-10-31 00:00:00"}}}}
> {}
> {"size": 0,"track_total_hits": "true","query": {"range": {"dropoff_datetime": {"gte": "2015-11-01 00:00:00","lt": "2015-11-30 00:00:00"}}}}
> {}
> {"size": 0,"track_total_hits": "true","query": {"range": {"dropoff_datetime": {"gte": "2015-12-01 00:00:00","lt": "2015-12-31 00:00:00"}}}}
> ' | head
{
  "took" : 993,
  "responses" : [
    {
      "took" : 551,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,

kkhatua added the Performance label on Aug 17, 2023
@jainankitk
Collaborator Author

jainankitk commented Aug 17, 2023

Applying the same slicing concept to a composite date histogram aggregation, we can see a similar improvement (26 s to 14 s); a sketch of merging the stats sub-aggregations across slices follows the output:

$ curl -s -X GET "localhost:9200/nyc_taxis/_search?pretty" -H 'Content-Type: application/json' -d'{"size": 0,"query": {"range": {"dropoff_datetime": {"gte": "2015-01-01 00:00:00","lt": "2016-01-01 00:00:00"}}},"aggs": {"dropoffs_over_time": {"date_histogram": {"field": "dropoff_datetime","calendar_interval": "month"},"aggs":{"total_amount":{"stats":{"field":"total_amount"}},"tip_amount":{"stats":{"field":"tip_amount"}},"trip_distance":{"stats":{"field":"trip_distance"}}}}}}' | head
{
  "took" : 26604,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
$ curl -s -X GET "localhost:9200/nyc_taxis/_msearch?pretty" -H 'Content-Type: application/json' -d'
> {}
> {"size": 0,"query": {"range": {"dropoff_datetime": {"gte": "2015-01-01 00:00:00","lt": "2015-07-01 00:00:00"}}},"aggs": {"dropoffs_over_time": {"date_histogram": {"field": "dropoff_datetime","calendar_interval": "month"},"aggs":{"total_amount":{"stats":{"field":"total_amount"}},"tip_amount":{"stats":{"field":"tip_amount"}},"trip_distance":{"stats":{"field":"trip_distance"}}}}}}
> {}
> {"size": 0,"query": {"range": {"dropoff_datetime": {"gte": "2015-07-01 00:00:00","lt": "2016-01-01 00:00:00"}}},"aggs": {"dropoffs_over_time": {"date_histogram": {"field": "dropoff_datetime","calendar_interval": "month"},"aggs":{"total_amount":{"stats":{"field":"total_amount"}},"tip_amount":{"stats":{"field":"tip_amount"}},"trip_distance":{"stats":{"field":"trip_distance"}}}}}}
> ' | head
{
  "took" : 14806,
  "responses" : [
    {
      "took" : 14806,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,

@backslasht
Contributor

Thanks @jainankitk for the details.

While this approach of split (into multiple smaller range queries) and merge (into one result) looks promising, I think the net CPU utilization will remain the same. In other words, in a busy system where all search threads are (nearly) occupied, the performance benefit may not be seen.

Also, is there any overhead caused by the number of splits? For example, consider a date aggregation query where the results are aggregated at a month level for a year, and say the data is split across 20 shards on 20 different nodes. In that case, instead of 20 network calls (between the nodes), this would make 20*12 network calls, and there may be some overhead with respect to serializing/deserializing the requests/responses.

Have you tried this with a setup containing a larger number of nodes?

@msfroh
Collaborator

msfroh commented Aug 22, 2023

@jainankitk How do you think this would compare to using concurrent search across two slices? Unless I'm mistaken, the drop in latency comes from running the two slices in parallel in the _msearch versus running them on a single thread per shard.

Using concurrent search, we would mostly end up doing the same thing, especially if segments hold disjoint contiguous time ranges (which isn't guaranteed, but can be made more likely with #9241).

@jainankitk
Collaborator Author

@jainankitk How do you think this would compare to using concurrent search across two slices?

The impact will be similar.

Using concurrent search, we would mostly end up doing the same thing, especially if segments hold disjoint contiguous time ranges (which isn't guaranteed, but can be made more likely with #9241).

While the benefits are similar, this approach is more generic (for date histogram aggregations) as it is agnostic of the number of segments. Also, the rewrite using track_total_hits (for cardinality date histogram aggregations) makes the request scorer-bound instead of collector-bound, which is a significant improvement.

@jainankitk
Collaborator Author

While this approach of split (into multiple smaller range queries) and merge (into one result) looks promising, I think the net CPU utilization will remain the same. In other words, in a busy system where all search threads are (nearly) occupied, the performance benefit may not be seen.

@backslasht - Completely agree with you on this. Although I don't have any data on it, IMO date histogram aggregations are mostly generated by UI or dashboard components for visualizing data. They are also quite expensive (~7 s), which makes them too slow or impractical for low-latency applications. Hence, the request concurrency should not be high.

Also, is there any overhead caused by the number of splits? For example, consider a date aggregation query where the results are aggregated at a month level for a year, and say the data is split across 20 shards on 20 different nodes. In that case, instead of 20 network calls (between the nodes), this would make 20*12 network calls, and there may be some overhead with respect to serializing/deserializing the requests/responses.

Assuming the throughput of such requests is not high, the overhead caused by the 20*12 network calls should not be huge. We can also evaluate doing the query splitting on the data node (instead of the coordinator) to avoid the network call overhead.

@gashutos
Contributor

gashutos commented Aug 23, 2023

@jainankitk how many shards do you have in the index? Running your query with 2 slices is equivalent to doubling the shard count, right?
I think this will give an even better improvement on top of concurrent segment search with a higher number of replicas, since we will be able to achieve more parallelization.

@jainankitk
Collaborator Author

@jainankitk how many shards do you have in the index?

I ran this experiment with single shard, single node.

Running your query with 2 slices is equivalent to doubling the shard count, right?

Yes, but doubling the shard count dynamically is not possible; it already is what it is. Given any shard count, we can always rewrite to slice the query as above.

I think this will give an even better improvement on top of concurrent segment search with a higher number of replicas, since we will be able to achieve more parallelization.

While I don't understand the correlation with a higher number of replicas, the rewriting is additive to concurrent segment search, given enough CPU is available.

@gashutos
Contributor

While I don't understand the correlation with a higher number of replicas, the rewriting is additive to concurrent segment search, given enough CPU is available.

Say we have 3 replicas for a primary shard; then we will have 4 shards in total across 4 nodes.
A search request for that shard will land on any of those shards, selected through ARS (adaptive replica selection). So 4 concurrent search requests will be served by 4 different nodes, and each individual replica will utilize its node's available CPU through concurrent segment search.
So if we rewrite the query to act like 4 different parallel queries, it is a latency benefit on top of concurrent search.
But the results won't be consistent here, since replicas might have different documents depending on their sync state with the primary.

@mikemccand

mikemccand commented Aug 28, 2023

While this approach of split (into multiple smaller range queries) and merge (into one result) looks promising, I think the net CPU utilization will remain the same. In other words, in a busy system where all search threads are (nearly) occupied, the performance benefit may not be seen.

@backslasht - Completely agree with you on this. Although I don't have any data on it, IMO date histogram aggregations are mostly generated by UI or dashboard components for visualizing data. They are also quite expensive (~7 s), which makes them too slow or impractical for low-latency applications. Hence, the request concurrency should not be high.

This is indeed a zero-sum game -- as you use more concurrency within a single query, latencies get much better when the cluster is not near red-line (100% CPU saturation), yet red-line QPS (capacity) is no better and maybe a bit worse due to the cost of thread switching (Java 19's new lightweight threads (Project Loom) may help here).

But note that this tradeoff is nearly always very worthwhile. Modern CPUs/clusters have tremendous concurrency that is horribly underutilized during "peace time" (cluster not near 100% CPU saturation). Tapping into this concurrency to make single queries faster is an excellent tradeoff -- you massively reduce latency for the few in-flight queries. Most applications very rarely run their clusters at red-line.

It's crazy to me how long it took ES/OS to simply make use of Lucene's intra-concurrency feature -- it's been available in Lucene for a looooong time, and CPUs finally became concurrent enough to make it a no-brainer. In our service (Amazon's customer facing product search) we've used this feature since the very start of the (lengthy!) migration to Lucene and it was vital to the success of that migration.

If latency / capacity at red-line is really a problem, OpenSearch could add some "smart" logic to taper the concurrency when the cluster is near capacity, because at that point the added intra-query concurrency can only hurt things. But I would test empirically to see if such smart logic is really necessary -- the thread context switching cost is likely low enough (and decreasing with time with Project Loom) that it's not worth doing that. Really OS should focus on reducing query latencies during peace time (low cluster utilization).

@mikemccand

But the results won't be consistent here, since replicas might have different documents depending on their sync state with the primary.

Note that this is true even without the added intra-query concurrency?

With document based replication there is no transactional guarantee (consistent point-in-time view of the Search index).

But segment based replication can offer this, if we choose to implement it -- a query can know, even across replicas, precisely which point-in-time view of the index it is searching. Replicas can also preserve (keep open, for a time) recent point-in-time views so that queries can be nearly 100% consistent.

Given that document based replication has forever NOT had such guarantees, we should not start adding them just yet with segment replication. But segment replication makes it quite a simple feature (transactional consistency) to add later on, if it really matters to users ...

@jainankitk
Collaborator Author

I tried manually rewriting the query as a filter aggregation, but did not notice any performance improvement:

|                                                 Min Throughput | date_histogram_calendar_interval |        0.13 |  ops/s |
|                                                Mean Throughput | date_histogram_calendar_interval |        0.13 |  ops/s |
|                                              Median Throughput | date_histogram_calendar_interval |        0.13 |  ops/s |
|                                                 Max Throughput | date_histogram_calendar_interval |        0.13 |  ops/s |
|                                        50th percentile latency | date_histogram_calendar_interval |     68206.1 |     ms |
|                                        90th percentile latency | date_histogram_calendar_interval |     90803.8 |     ms |
|                                       100th percentile latency | date_histogram_calendar_interval |     96544.5 |     ms |
|                                   50th percentile service time | date_histogram_calendar_interval |     7673.54 |     ms |
|                                   90th percentile service time | date_histogram_calendar_interval |     7741.49 |     ms |
|                                  100th percentile service time | date_histogram_calendar_interval |      7806.9 |     ms |
|                                                     error rate | date_histogram_calendar_interval |           0 |      % |
|                                                 Min Throughput |               date_histogram_agg |        0.73 |  ops/s |
|                                                Mean Throughput |               date_histogram_agg |        0.75 |  ops/s |
|                                              Median Throughput |               date_histogram_agg |        0.74 |  ops/s |
|                                                 Max Throughput |               date_histogram_agg |        0.78 |  ops/s |
|                                        50th percentile latency |               date_histogram_agg |     519.746 |     ms |
|                                        90th percentile latency |               date_histogram_agg |     522.896 |     ms |
|                                       100th percentile latency |               date_histogram_agg |     525.959 |     ms |
|                                   50th percentile service time |               date_histogram_agg |     518.406 |     ms |
|                                   90th percentile service time |               date_histogram_agg |     520.888 |     ms |
|                                  100th percentile service time |               date_histogram_agg |     524.325 |     ms |
|                                                     error rate |               date_histogram_agg |           0 |      % |
|                                                 Min Throughput |        date_histogram_agg_filter |        0.13 |  ops/s |
|                                                Mean Throughput |        date_histogram_agg_filter |        0.13 |  ops/s |
|                                              Median Throughput |        date_histogram_agg_filter |        0.13 |  ops/s |
|                                                 Max Throughput |        date_histogram_agg_filter |        0.13 |  ops/s |
|                                        50th percentile latency |        date_histogram_agg_filter |     67761.8 |     ms |
|                                        90th percentile latency |        date_histogram_agg_filter |     90681.7 |     ms |
|                                       100th percentile latency |        date_histogram_agg_filter |     96495.7 |     ms |
|                                   50th percentile service time |        date_histogram_agg_filter |     7769.12 |     ms |
|                                   90th percentile service time |        date_histogram_agg_filter |     7843.55 |     ms |
|                                  100th percentile service time |        date_histogram_agg_filter |     7888.28 |     ms |
|                                                     error rate |        date_histogram_agg_filter |           0 |      % |

@jainankitk
Collaborator Author

Okay, I missed adding the range filter in my filter aggregation query. Because of that, the flame graph had a MatchAll query on which the collector was being run:

    {   
      "name": "date_histogram_agg_filter",
      "operation-type": "search",
      "body": {
        "size": 0,
        "aggs": {
          "dropoffs_over_time": {
            "filters": {
              "filters": {
                "Jan" : {"range": {"dropoff_datetime": {"gte": "2015-01-01 00:00:00","lt": "2015-02-01 00:00:00"}}},
                "Feb" : { "range": {"dropoff_datetime": {"gte": "2015-02-01 00:00:00","lt": "2015-03-01 00:00:00"}}}
              }   
            }   
          }   
        }   
      }   
    }

[CPU flame graph screenshot, 2023-09-14: collector running over a MatchAll query]

@jainankitk
Collaborator Author

After adding the query range filter, the flame graph no longer shows the MatchAll query (the corrected request body is sketched after the screenshot below):

"query": {
          "range": {
            "dropoff_datetime": {
              "gte": "2015-01-01 00:00:00",
              "lt": "2015-03-01 00:00:00"
            }
          }
        }

[CPU flame graph screenshot, 2023-09-14: with the range filter applied]
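
Putting the two pieces together, the corrected benchmark operation body looks like this (shown as a Python dict purely for readability; the fields are exactly the ones from the two snippets above):

corrected_body = {
    "size": 0,
    # The previously missing top-level range filter, so the collector no longer
    # runs against a MatchAll query.
    "query": {"range": {"dropoff_datetime": {
        "gte": "2015-01-01 00:00:00", "lt": "2015-03-01 00:00:00"}}},
    "aggs": {"dropoffs_over_time": {"filters": {"filters": {
        "Jan": {"range": {"dropoff_datetime": {"gte": "2015-01-01 00:00:00", "lt": "2015-02-01 00:00:00"}}},
        "Feb": {"range": {"dropoff_datetime": {"gte": "2015-02-01 00:00:00", "lt": "2015-03-01 00:00:00"}}},
    }}}},
}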

@jainankitk
Collaborator Author

jainankitk commented Sep 18, 2023

From a small POC based on this, the performance improvement for the above specific query is from 6947 ms to 172 ms:

With the rewrite and filter aggregation improvements:

$ curl -X GET "localhost:9200/nyc_taxis/_search?pretty" -H 'Content-Type: application/json' -d'
> {
>   "size": 0,
>   "aggs" : {
>     "messages" : {
>       "filters" : {
>         "filters" : {
>           "1420070400000" : {"range": {"dropoff_datetime": {"gte": "2015-01-01 00:00:00","lt": "2015-02-01 00:00:00"}}},
>           "1422748800000" : {"range": {"dropoff_datetime": {"gte": "2015-02-01 00:00:00","lt": "2015-03-01 00:00:00"}}},
>           "1425168000000" : {"range": {"dropoff_datetime": {"gte": "2015-03-01 00:00:00","lt": "2015-04-01 00:00:00"}}},
>           "1427846400000" : {"range": {"dropoff_datetime": {"gte": "2015-04-01 00:00:00","lt": "2015-05-01 00:00:00"}}},
>           "1430438400000," : {"range": {"dropoff_datetime": {"gte": "2015-05-01 00:00:00","lt": "2015-06-01 00:00:00"}}},
>           "1433116800000" : {"range": {"dropoff_datetime": {"gte": "2015-06-01 00:00:00","lt": "2015-07-01 00:00:00"}}},
>           "1435708800000" : {"range": {"dropoff_datetime": {"gte": "2015-07-01 00:00:00","lt": "2015-08-01 00:00:00"}}},
>           "1438387200000" : {"range": {"dropoff_datetime": {"gte": "2015-08-01 00:00:00","lt": "2015-09-01 00:00:00"}}},
>           "1441065600000" : {"range": {"dropoff_datetime": {"gte": "2015-09-01 00:00:00","lt": "2015-10-01 00:00:00"}}},
>           "1443657600000" : {"range": {"dropoff_datetime": {"gte": "2015-10-01 00:00:00","lt": "2015-11-01 00:00:00"}}},
>           "1446336000000" : {"range": {"dropoff_datetime": {"gte": "2015-11-01 00:00:00","lt": "2015-12-01 00:00:00"}}},
>           "1448928000000" : {"range": {"dropoff_datetime": {"gte": "2015-12-01 00:00:00","lt": "2016-01-01 00:00:00"}}}
>         }
>       }
>     }
>   }
> }
> '
{
  "took" : 172,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "messages" : {
      "buckets" : {
        "1420070400000" : {
          "doc_count" : 14249568
        },
        "1422748800000" : {
          "doc_count" : 14024615
        },
        "1425168000000" : {
          "doc_count" : 15078229
        },
        "1427846400000" : {
          "doc_count" : 14734472
        },
        "1430438400000," : {
          "doc_count" : 14947963
        },
        "1433116800000" : {
          "doc_count" : 13962891
        },
        "1435708800000" : {
          "doc_count" : 13101209
        },
        "1438387200000" : {
          "doc_count" : 12667603
        },
        "1441065600000" : {
          "doc_count" : 12717785
        },
        "1443657600000" : {
          "doc_count" : 13941138
        },
        "1446336000000" : {
          "doc_count" : 12849174
        },
        "1448928000000" : {
          "doc_count" : 13068510
        }
      }
    }
  }
}

Original:

$ curl -s -X GET "localhost:9200/nyc_taxis/_search?pretty" -H 'Content-Type: application/json' -d'{"size": 0,"query": {"range": {"dropoff_datetime": {"gte": "2015-01-01 00:00:00","lt": "2016-01-01 00:00:00"}}},"aggs": {"dropoffs_over_time": {"date_histogram": {"field": "dropoff_datetime","calendar_interval": "month"}}}}'
{
  "took" : 6947,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "dropoffs_over_time" : {
      "buckets" : [
        {
          "key_as_string" : "2015-01-01 00:00:00",
          "key" : 1420070400000,
          "doc_count" : 14249568
        },
        {
          "key_as_string" : "2015-02-01 00:00:00",
          "key" : 1422748800000,
          "doc_count" : 14024615
        },
        {
          "key_as_string" : "2015-03-01 00:00:00",
          "key" : 1425168000000,
          "doc_count" : 15078229
        },
        {
          "key_as_string" : "2015-04-01 00:00:00",
          "key" : 1427846400000,
          "doc_count" : 14734472
        },
        {
          "key_as_string" : "2015-05-01 00:00:00",
          "key" : 1430438400000,
          "doc_count" : 14947963
        },
        {
          "key_as_string" : "2015-06-01 00:00:00",
          "key" : 1433116800000,
          "doc_count" : 13962891
        },
        {
          "key_as_string" : "2015-07-01 00:00:00",
          "key" : 1435708800000,
          "doc_count" : 13101209
        },
        {
          "key_as_string" : "2015-08-01 00:00:00",
          "key" : 1438387200000,
          "doc_count" : 12667603
        },
        {
          "key_as_string" : "2015-09-01 00:00:00",
          "key" : 1441065600000,
          "doc_count" : 12717785
        },
        {
          "key_as_string" : "2015-10-01 00:00:00",
          "key" : 1443657600000,
          "doc_count" : 13941138
        },
        {
          "key_as_string" : "2015-11-01 00:00:00",
          "key" : 1446336000000,
          "doc_count" : 12849174
        },
        {
          "key_as_string" : "2015-12-01 00:00:00",
          "key" : 1448928000000,
          "doc_count" : 13068510
        }
      ]
    }
  }
}

@jainankitk
Collaborator Author

Code changes for the POC: jainankitk@c3b447c

@backslasht
Contributor

From a small POC based on this, the performance improvement for the above specific query is from 6947 ms to 172 ms

Thanks @jainankitk! The numbers are pretty impressive.

@backslasht
Contributor

How are you thinking of implementing this? Are you considering building a generic query rewrite framework?

@jainankitk
Collaborator Author

Are you considering building a generic query rewrite framework?

I spent some time on the code, and it seems that a generic query rewrite framework across all query types might be tricky. However, we will add generic logic that can account for different types of aggregation rewrites.

How are you thinking of implementing this?

I am planning to do this rewrite at the shard level, as it already has the parsing logic and we need not duplicate that on the coordinator layer. I am planning to do the rewrite in two steps (a rough sketch of the resulting aggregation shapes follows the list):

  1. Rewrite the date histogram into a range aggregation first, if possible. This has two advantages over rewriting directly into a filter aggregation:
  • The performance of the default range aggregation is similar to that of the date histogram, as they use similar bucketing logic, and much better than that of the default filter aggregation. Hence, this rewrite can always be done without taking any performance hit.
  • Doing this as a two-step process benefits both date histogram and range aggregations.
  2. Rewrite the range aggregation into a filter aggregation if the segment does not have any deletions. Otherwise, run it as a date histogram aggregation.
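
A rough illustration of the two-step mapping (hypothetical Python for readability only; the actual rewrite would run at the shard level and derive the bucket boundaries from the interval and the shard's field bounds rather than the hard-coded 2015 months used here):

MONTHS = [f"2015-{m:02d}-01 00:00:00" for m in range(1, 13)] + ["2016-01-01 00:00:00"]

# Step 1: date_histogram -> range-style aggregation over the same monthly buckets
# (shown here with the date_range flavor).
range_agg = {"date_range": {
    "field": "dropoff_datetime",
    "ranges": [{"from": lo, "to": hi} for lo, hi in zip(MONTHS, MONTHS[1:])],
}}

# Step 2 (only when the segment has no deletions): range -> filters aggregation,
# one range filter per bucket; otherwise fall back to the date histogram collector.
filters_agg = {"filters": {"filters": {
    lo: {"range": {"dropoff_datetime": {"gte": lo, "lt": hi}}}
    for lo, hi in zip(MONTHS, MONTHS[1:])
}}}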

@jainankitk
Collaborator Author

Thanks @jainankitk! The numbers are pretty impressive

Thanks @backslasht. I also verified that even on shorter-duration date histograms, like 1 day or 15-20 days, the improvement is still about 4-5x.

@jainankitk
Collaborator Author

jainankitk commented Oct 25, 2023

After applying the optimizations to date histogram aggregations, I saw the below numbers for the nyc_taxis aggregation benchmark:

|                                                 Min Throughput |              distance_amount_agg |        2.02 |  ops/s |
|                                                Mean Throughput |              distance_amount_agg |        2.02 |  ops/s |
|                                              Median Throughput |              distance_amount_agg |        2.02 |  ops/s |
|                                                 Max Throughput |              distance_amount_agg |        2.04 |  ops/s |
|                                        50th percentile latency |              distance_amount_agg |     7.73538 |     ms |
|                                        90th percentile latency |              distance_amount_agg |     8.40921 |     ms |
|                                       100th percentile latency |              distance_amount_agg |     8.87938 |     ms |
|                                   50th percentile service time |              distance_amount_agg |     6.46754 |     ms |
|                                   90th percentile service time |              distance_amount_agg |     6.95528 |     ms |
|                                  100th percentile service time |              distance_amount_agg |     7.51381 |     ms |
|                                                     error rate |              distance_amount_agg |           0 |      % |
|                                                 Min Throughput | date_histogram_calendar_interval |        1.51 |  ops/s |
|                                                Mean Throughput | date_histogram_calendar_interval |        1.51 |  ops/s |
|                                              Median Throughput | date_histogram_calendar_interval |        1.51 |  ops/s |
|                                                 Max Throughput | date_histogram_calendar_interval |        1.53 |  ops/s |
|                                        50th percentile latency | date_histogram_calendar_interval |     16.8481 |     ms |
|                                        90th percentile latency | date_histogram_calendar_interval |     17.8246 |     ms |
|                                        99th percentile latency | date_histogram_calendar_interval |     24.3817 |     ms |
|                                       100th percentile latency | date_histogram_calendar_interval |     27.7471 |     ms |
|                                   50th percentile service time | date_histogram_calendar_interval |      15.439 |     ms |
|                                   90th percentile service time | date_histogram_calendar_interval |     16.2973 |     ms |
|                                   99th percentile service time | date_histogram_calendar_interval |     22.7529 |     ms |
|                                  100th percentile service time | date_histogram_calendar_interval |      26.729 |     ms |
|                                                     error rate | date_histogram_calendar_interval |           0 |      % |
|                                                 Min Throughput |    date_histogram_fixed_interval |        1.51 |  ops/s |
|                                                Mean Throughput |    date_histogram_fixed_interval |        1.52 |  ops/s |
|                                              Median Throughput |    date_histogram_fixed_interval |        1.51 |  ops/s |
|                                                 Max Throughput |    date_histogram_fixed_interval |        1.53 |  ops/s |
|                                        50th percentile latency |    date_histogram_fixed_interval |     14.2897 |     ms |
|                                        90th percentile latency |    date_histogram_fixed_interval |     14.9021 |     ms |
|                                        99th percentile latency |    date_histogram_fixed_interval |     20.7376 |     ms |
|                                       100th percentile latency |    date_histogram_fixed_interval |     55.6359 |     ms |
|                                   50th percentile service time |    date_histogram_fixed_interval |     12.7421 |     ms |
|                                   90th percentile service time |    date_histogram_fixed_interval |     13.4117 |     ms |
|                                   99th percentile service time |    date_histogram_fixed_interval |     19.2192 |     ms |
|                                  100th percentile service time |    date_histogram_fixed_interval |     53.8581 |     ms |
|                                                     error rate |    date_histogram_fixed_interval |           0 |      % |
|                                                 Min Throughput |               date_histogram_agg |        1.51 |  ops/s |
|                                                Mean Throughput |               date_histogram_agg |        1.51 |  ops/s |
|                                              Median Throughput |               date_histogram_agg |        1.51 |  ops/s |
|                                                 Max Throughput |               date_histogram_agg |        1.52 |  ops/s |
|                                        50th percentile latency |               date_histogram_agg |     27.4653 |     ms |
|                                        90th percentile latency |               date_histogram_agg |     29.1013 |     ms |
|                                        99th percentile latency |               date_histogram_agg |     34.5083 |     ms |
|                                       100th percentile latency |               date_histogram_agg |     44.9958 |     ms |
|                                   50th percentile service time |               date_histogram_agg |     26.0248 |     ms |
|                                   90th percentile service time |               date_histogram_agg |     27.9901 |     ms |
|                                   99th percentile service time |               date_histogram_agg |     33.4492 |     ms |
|                                  100th percentile service time |               date_histogram_agg |     43.4751 |     ms |
|                                                     error rate |               date_histogram_agg |           0 |      % |



---------------------------------
[INFO] SUCCESS (took 411 seconds)
---------------------------------

@jainankitk
Collaborator Author

Benchmark numbers without the optimization on the same setup:

|                                                 Min Throughput | date_histogram_calendar_interval |        0.11 |  ops/s |
|                                                Mean Throughput | date_histogram_calendar_interval |        0.11 |  ops/s |
|                                              Median Throughput | date_histogram_calendar_interval |        0.11 |  ops/s |
|                                                 Max Throughput | date_histogram_calendar_interval |        0.11 |  ops/s |
|                                        50th percentile latency | date_histogram_calendar_interval |      856470 |     ms |
|                                        90th percentile latency | date_histogram_calendar_interval | 1.19283e+06 |     ms |
|                                        99th percentile latency | date_histogram_calendar_interval | 1.26869e+06 |     ms |
|                                       100th percentile latency | date_histogram_calendar_interval | 1.27709e+06 |     ms |
|                                   50th percentile service time | date_histogram_calendar_interval |     9139.83 |     ms |
|                                   90th percentile service time | date_histogram_calendar_interval |     9238.84 |     ms |
|                                   99th percentile service time | date_histogram_calendar_interval |     9329.85 |     ms |
|                                  100th percentile service time | date_histogram_calendar_interval |     9352.46 |     ms |
|                                                     error rate | date_histogram_calendar_interval |           0 |      % |
|                                                 Min Throughput |    date_histogram_fixed_interval |        0.11 |  ops/s |
|                                                Mean Throughput |    date_histogram_fixed_interval |        0.11 |  ops/s |
|                                              Median Throughput |    date_histogram_fixed_interval |        0.11 |  ops/s |
|                                                 Max Throughput |    date_histogram_fixed_interval |        0.11 |  ops/s |
|                                        50th percentile latency |    date_histogram_fixed_interval |      821916 |     ms |
|                                        90th percentile latency |    date_histogram_fixed_interval | 1.14567e+06 |     ms |
|                                        99th percentile latency |    date_histogram_fixed_interval | 1.21831e+06 |     ms |
|                                       100th percentile latency |    date_histogram_fixed_interval |  1.2264e+06 |     ms |
|                                   50th percentile service time |    date_histogram_fixed_interval |     8820.67 |     ms |
|                                   90th percentile service time |    date_histogram_fixed_interval |     8895.23 |     ms |
|                                   99th percentile service time |    date_histogram_fixed_interval |     9008.92 |     ms |
|                                  100th percentile service time |    date_histogram_fixed_interval |      9014.7 |     ms |
|                                                     error rate |    date_histogram_fixed_interval |           0 |      % |
|                                                 Min Throughput |               date_histogram_agg |        1.49 |  ops/s |
|                                                Mean Throughput |               date_histogram_agg |         1.5 |  ops/s |
|                                              Median Throughput |               date_histogram_agg |         1.5 |  ops/s |
|                                                 Max Throughput |               date_histogram_agg |         1.5 |  ops/s |
|                                        50th percentile latency |               date_histogram_agg |     548.597 |     ms |
|                                        90th percentile latency |               date_histogram_agg |     558.008 |     ms |
|                                        99th percentile latency |               date_histogram_agg |     571.611 |     ms |
|                                       100th percentile latency |               date_histogram_agg |     572.667 |     ms |
|                                   50th percentile service time |               date_histogram_agg |     547.138 |     ms |
|                                   90th percentile service time |               date_histogram_agg |     557.289 |     ms |
|                                   99th percentile service time |               date_histogram_agg |     569.914 |     ms |
|                                  100th percentile service time |               date_histogram_agg |     571.978 |     ms |
|                                                     error rate |               date_histogram_agg |           0 |      % |


getsaurabh02 added the v2.12.0 label on Oct 30, 2023
@getsaurabh02
Member

Thanks @jainankitk, this is exciting!! The benchmark numbers for date histogram operations here are very encouraging.

While we are trying out the initial POC with the date histogram changes, does it make sense to convert this issue into a META issue and break down the improvements into specific tracks? That way we can encourage some incremental/parallel help from the community while we dive into the optimization.

@hdhalter

Hi @jainankitk, will documentation for this feature be required for 2.12? If so, can you please create the doc issue, let me know who will be creating the initial doc PR, and add this dev issue to the unified tracker project? Thanks!
