In hybrid query replace Java Stream calls with a faster alternative #705

martin-gaievski · 2024-04-24T00:02:21Z

As part of performance optimization for hybrid query, slow Java Stream API calls need to be replaced with faster alternatives.

For instance, compared to a Boolean query, a hybrid query can be up to 12 times slower, depending on the dataset, query, and index/cluster configuration. Below are the results of a benchmark I ran on the released 2.13 version using the noaa OSB workload; all times are in ms:

One sub-query that selects 11M documents

Bool: 98.1014
Hybrid: 973.683

One sub-query that selects 1.6K documents

Bool: 181.046
Hybrid: 90.1155

Three sub-queries that select 15M documents

Bool: 117.505
Hybrid: 1458.8

Based on profiling results, the largest share of CPU time (35 to 40%) is taken by the Stream.findFirst call in HybridQueryScorer.

That code is executed for each document returned by each sub-query, which explains the much longer execution time for queries that return larger subsets of the dataset.

That section of the code can be optimized by replacing the stream with a plain for loop and the list of Integer with a plain array of ints. After the optimization, the same code section takes 5 to 8% of overall execution time. Total time for a clean hybrid query has decreased 3 to 4 times for large subsets.
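To illustrate the kind of change described above, here is a minimal sketch and not the actual HybridQueryScorer code. The class name `FirstMatchLookup`, both method names, and the "first index whose score slot is still zero" condition are hypothetical stand-ins chosen for the example. It contrasts a per-call stream pipeline over a `List<Integer>` with a plain loop over an `int[]`.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical illustration; names and the matching condition are assumptions.
final class FirstMatchLookup {

    // Stream-based variant: every call builds a stream pipeline, unboxes
    // Integer values, and allocates an Optional. Cheap once, costly when
    // repeated for each document of each sub-query.
    static int firstUnusedStream(List<Integer> indices, float[] scores) {
        Optional<Integer> found = indices.stream()
                .filter(i -> scores[i] == 0.0f)
                .findFirst();
        return found.orElse(-1);
    }

    // Loop-based variant: same result, but iterates a primitive int array
    // with no boxing and no intermediate objects.
    static int firstUnusedLoop(int[] indices, float[] scores) {
        for (int index : indices) {
            if (scores[index] == 0.0f) {
                return index;
            }
        }
        return -1;
    }
}
```

Both methods return the same index for the same input; the loop variant simply avoids per-call allocations on the hot path.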

Below are detailed results for the same workload:

One sub-query that selects 11M documents

Bool: 84.7201
Hybrid: 256.799

One sub-query that selects 1.6K documents

Bool: 89.6258
Hybrid: 85.4563

Three sub-queries that select 15M documents

Bool: 90.1481
Hybrid: 326.331

The following bool queries were used in testing:

Query 1
        "size": 100,
        "query": {
          "bool": {
              "should": [
                  {
                      "term": {
                          "station.country_code": "JA"
                      }
                  },
                  {
                      "range": {
                          "TRANGE": {
                              "gte": 0,
                              "lte": 30
                          }
                      }
                  },
                  {
                    "range": {
                        "date": {
                            "gte": "2016-06-04",
                            "format":"yyyy-MM-dd"
                        }
                    }
                  }
              ]
          }
        }

Query 2
        "size": 100,
        "query": {
          "bool": {
              "should": [
                  {
                      "range": {
                          "TRANGE": {
                              "gte": -100,
                              "lte": -50
                          }
                      }
                  }
              ]
          }
        }

Query 3
        "size": 100,
        "query": {
          "bool": {
              "should": [
                  {
                      "range": {
                        "TRANGE": {
                          "gte": 1,
                          "lte": 35
                        }
                      }
                  }
              ]
          }
        }

The equivalent hybrid queries are:

Query 1
        "size": 100,
        "query": {
          "hybrid": {
            "queries": [
                {
                    "term": {
                        "station.country_code": "JA"
                    }
                },
                {
                    "range": {
                        "TRANGE": {
                            "gte": 0,
                            "lte": 30
                        }
                    }
                },
                {
                  "range": {
                      "date": {
                          "gte": "2016-06-04",
                          "format":"yyyy-MM-dd"
                      }
                  }
                }
            ]
          }
        }

Query 2
        "size": 100,
        "query": {
          "hybrid": {
            "queries": [
                {
                    "range": {
                        "TRANGE": {
                          "gte": -100,
                          "lte": -50
                        }
                    }
                }
            ]
          }
        }

Query 3
        "size": 100,
        "query": {
          "hybrid": {
            "queries": [
                {
                    "range": {
                      "TRANGE": {
                        "gte": 1,
                        "lte": 35
                      }
                    }
                }
            ]
          }
        }

Based on these benchmark results for 2.13, the following calls need to be reworked:

Stream.findFirst in HybridQueryScorer. Based on profiling results, that call takes 35-40% of CPU time.

Any other similar code should be changed to a faster alternative.


martin-gaievski · 2024-04-29T16:05:06Z

After the first PR with the optimization was merged, I ran one more round of benchmarks and got the following results, based on the same 3 queries as in the initial benchmark:

One sub-query that selects 11M documents

Bool: p50 88.2863 | p90 103.777
Hybrid: p50 299.427 | p90 319.797

One sub-query that selects 1.6K documents

Bool: p50 92.7222 | p90 98.6847
Hybrid: p50 94.5511 | p90 111.645

Three sub-queries that select 15M documents

Bool: p50 98.0301 | p90 108.948
Hybrid: p50 475.319 | p90 515.999

Most of the time (~28%) is taken by storing and looking up the sub-query index in a map that uses the query as a key. Depending on the exact sub-query, calculating its hash code can be slow (hybrid query works with any type of OpenSearch query). That is a problem on large datasets because this is done for each document by each sub-query.

We can avoid creating and using that query-to-index map by storing the sub-query index at the time we create the collection of DISI wrappers.

I've made the change and got the following results for the same 3 types of hybrid query. According to the benchmark results, it gives about a 20% performance boost. I ran it on 2.13 using the noaa OSB workload; all times are in ms:
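The idea can be sketched roughly as follows. This is a simplified illustration, not the merged implementation: `IndexedWrapper`, `SubQueryIndexSketch`, and all method names are hypothetical, and a plain `Object` stands in for an OpenSearch query. The first method resolves the sub-query index through a map keyed by query on every hit; the second reads an index that was assigned once when the wrapper was created.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; all names here are assumptions.
final class SubQueryIndexSketch {

    /** Stand-in for a per-sub-query iterator wrapper. */
    static final class IndexedWrapper {
        final Object query;      // stand-in for the sub-query
        final int subQueryIndex; // assigned once, when the wrapper is created
        final float score;

        IndexedWrapper(Object query, int subQueryIndex, float score) {
            this.query = query;
            this.subQueryIndex = subQueryIndex;
            this.score = score;
        }
    }

    static Map<Object, Integer> buildQueryToIndex(List<Object> queries) {
        Map<Object, Integer> queryToIndex = new HashMap<>();
        for (int i = 0; i < queries.size(); i++) {
            queryToIndex.put(queries.get(i), i);
        }
        return queryToIndex;
    }

    // Before: each hit triggers hashCode()/equals() on the query object,
    // which can be expensive for complex queries.
    static float[] scoresViaMap(List<IndexedWrapper> matching,
                                Map<Object, Integer> queryToIndex,
                                int numSubQueries) {
        float[] scores = new float[numSubQueries];
        for (IndexedWrapper wrapper : matching) {
            scores[queryToIndex.get(wrapper.query)] = wrapper.score;
        }
        return scores;
    }

    // After: the index is a plain field read, with no hashing at all.
    static float[] scoresViaStoredIndex(List<IndexedWrapper> matching,
                                        int numSubQueries) {
        float[] scores = new float[numSubQueries];
        for (IndexedWrapper wrapper : matching) {
            scores[wrapper.subQueryIndex] = wrapper.score;
        }
        return scores;
    }
}
```

Both methods fill the same per-sub-query score array; the second one removes the per-document hash computation from the hot path.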

One sub-query that selects 11M documents

Bool: p50 97.9306 | p90 116.299
Hybrid: p50 228.696 | p90 249.665

One sub-query that selects 1.6K documents

Bool: p50 87.3152 | p90 89.3061
Hybrid: p50 89.9654 | p90 92.349

Three sub-queries that select 15M documents

Bool: p50 97.9891 | p90 114.396
Hybrid: p50 353.631 | p90 377.527

The PR with the corresponding change has been merged to the main and 2.x branches: #711

martin-gaievski added performance Make it fast! hybrid search hybrid query performance optimization labels Apr 24, 2024

github-actions bot added the untriaged label Apr 24, 2024

This was referenced Apr 24, 2024

[META] Improve Hybrid query latency #704

Closed

Replaced stream.findFirst by for loop for hybrid query #706

Merged

Removed map of subquery to subquery index in favor of storing index as part of DISI wrapper to improve hybrid query latencies by 20% #711

Merged

martin-gaievski removed the untriaged label Apr 30, 2024

martin-gaievski added this to Vector Search RoadMap Apr 30, 2024

github-project-automation bot moved this to Backlog in Vector Search RoadMap Apr 30, 2024

martin-gaievski closed this as completed Apr 30, 2024

github-project-automation bot moved this from Backlog to ✅ Done in Vector Search RoadMap Apr 30, 2024
