Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replaced stream.findFirst by for loop for hybrid query #706

Conversation

martin-gaievski
Copy link
Member

@martin-gaievski martin-gaievski commented Apr 24, 2024

Description

Hybrid query is generally slower than other compound queries with similar child sub-queries/clauses. For instance if compared to Boolean it can be up to 12 times slower, depending on the dataset, query and index/cluster configuration. Check results of benchmark that I took for released 2.13 using noaa OSB workload, all time is in ms:

One sub-query that selects 11M documents

Bool: 98.1014
Hybrid: 973.683

One sub-query that selects 1.6K documents

Bool: 181.046
Hybrid: 90.1155

Three sub-query that select 15M documents

Bool: 117.505
Hybrid: 1458.8

Based on results of profiling most of the CPU time (35 to 40%) is taken by Stream.findFirst call in HybridQueryScorer.

That code is executed for each document returned by each of sub-query. That explains much longer execution time for queries that return larger sub-sets of a dataset.

That section of the code can be optimized to a plain for loop, plus the list of Integer is replaced by the plain array of ints. After optimization same code section takes 5 to 8% of overall execution time. Total time for clean hybrid query has been decreased 3-4 times for large sub-sets.

Below are detailed results for the same workload:

One sub-query that selects 11M documents

Bool: 84.7201
Hybrid: 256.799

One sub-query that selects 1.6K documents

Bool: 89.6258
Hybrid: 85.4563

Three sub-query that select 15M documents

Bool: 90.1481
Hybrid: 326.331

following were bool queries used in testing

Query 1
        "size": 100,
        "query": {
          "bool": {
              "should": [
                  {
                      "term": {
                          "station.country_code": "JA"
                      }
                  },
                  {
                      "range": {
                          "TRANGE": {
                              "gte": 0,
                              "lte": 30
                          }
                      }
                  },
                  {
                    "range": {
                        "date": {
                            "gte": "2016-06-04",
                            "format":"yyyy-MM-dd"
                        }
                    }
                  }
              ]
          }
        }

Query 2
        "size": 100,
        "query": {
          "bool": {
              "should": [
                  {
                      "range": {
                          "TRANGE": {
                              "gte": -100,
                              "lte": -50
                          }
                      }
                  }
              ]
          }
        }

Query 3
        "size": 100,
        "query": {
          "bool": {
              "should": [
                  {
                      "range": {
                        "TRANGE": {
                          "gte": 1,
                          "lte": 35
                        }
                      }
                  }
              ]
          }

equivalent hybrid queres are:

Query 1
        "size": 100,
        "query": {
          "hybrid": {
            "queries": [
                {
                    "term": {
                        "station.country_code": "JA"
                    }
                },
                {
                    "range": {
                        "TRANGE": {
                            "gte": 0,
                            "lte": 30
                        }
                    }
                },
                {
                  "range": {
                      "date": {
                          "gte": "2016-06-04",
                          "format":"yyyy-MM-dd"
                      }
                  }
                }
            ]
          }
        }

Query 2
        "size": 100,
        "query": {
          "hybrid": {
            "queries": [
                {
                    "range": {
                        "TRANGE": {
                          "gte": -100,
                          "lte": -50
                        }
                    }
                }
            ]
          }

Query 3
        "size": 100,
        "query": {
          "hybrid": {
            "queries": [
                {
                    "range": {
                      "TRANGE": {
                        "gte": 1,
                        "lte": 35
                      }
                    }
                }
            ]
          }

Issues Resolved

#705

Check List

  • All tests pass
  • Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@martin-gaievski martin-gaievski added backport 2.x Label will add auto workflow to backport PR to 2.x branch v2.14.0 hybrid query performance optimization labels Apr 24, 2024
@martin-gaievski martin-gaievski changed the title Change stream.findFirst to for loop Replace stream.findFirst by for loop Apr 24, 2024
@martin-gaievski martin-gaievski force-pushed the performance_optimization_hybrid_query_replace_streaming_with_for_loop branch from 0376e9b to 348c1a8 Compare April 24, 2024 01:03
Copy link

codecov bot commented Apr 24, 2024

Codecov Report

Attention: Patch coverage is 58.82353% with 7 lines in your changes are missing coverage. Please review.

Project coverage is 84.74%. Comparing base (cc6a6b2) to head (0376e9b).
Report is 12 commits behind head on main.

❗ Current head 0376e9b differs from pull request most recent head e34a7bc. Consider uploading reports for the commit e34a7bc to get more accurate results

Files Patch % Lines
...ensearch/neuralsearch/query/HybridQueryScorer.java 58.82% 4 Missing and 3 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main     #706      +/-   ##
============================================
+ Coverage     84.04%   84.74%   +0.69%     
- Complexity      744      774      +30     
============================================
  Files            59       59              
  Lines          2313     2412      +99     
  Branches        374      405      +31     
============================================
+ Hits           1944     2044     +100     
+ Misses          214      203      -11     
- Partials        155      165      +10     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@martin-gaievski martin-gaievski marked this pull request as ready for review April 24, 2024 01:19
@martin-gaievski martin-gaievski changed the title Replace stream.findFirst by for loop Replaced stream.findFirst by for loop for hybrid query Apr 24, 2024
CHANGELOG.md Outdated Show resolved Hide resolved
@navneet1v
Copy link
Collaborator

This is an interesting improvement for such a small change. Really liked the deep-dive here.

Few things I would like you to do here just for bookkeeping purpose:

  1. The numbers and details which are quoted in the description please update them in the Issue.
  2. [Next step] Can you look at other places where we might be using streams in the hybrid query clause which may be adding some extra latency?

@vibrantvarun
Copy link
Member

Great catch @martin-gaievski . LGTM.

@chishui
Copy link
Contributor

chishui commented Apr 24, 2024

This is a great improvement, curious to know if you have checked the profiling after this optimization to see if the CPU usage goes down and which one is the next culprit to CPU usage

Co-authored-by: Navneet Verma <[email protected]>
Signed-off-by: Martin Gaievski <[email protected]>
@martin-gaievski martin-gaievski merged commit b277b07 into opensearch-project:main Apr 24, 2024
73 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Apr 24, 2024
* Change stream.findFirst to for loop

Signed-off-by: Martin Gaievski <[email protected]>
Co-authored-by: Navneet Verma <[email protected]>
(cherry picked from commit b277b07)
@opensearch-trigger-bot
Copy link
Contributor

The backport to 2.13 failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-2.13 2.13
# Navigate to the new working tree
cd .worktrees/backport-2.13
# Create a new branch
git switch --create backport/backport-706-to-2.13
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 b277b07e89e25f1abaa9de3a326fda3556dc8a77
# Push it to GitHub
git push --set-upstream origin backport/backport-706-to-2.13
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-2.13

Then, create a pull request where the base branch is 2.13 and the compare/head branch is backport/backport-706-to-2.13.

martin-gaievski added a commit that referenced this pull request Apr 24, 2024
* Change stream.findFirst to for loop

Signed-off-by: Martin Gaievski <[email protected]>
Co-authored-by: Navneet Verma <[email protected]>
(cherry picked from commit b277b07)
martin-gaievski added a commit that referenced this pull request Apr 24, 2024
* Change stream.findFirst to for loop

Signed-off-by: Martin Gaievski <[email protected]>
Co-authored-by: Navneet Verma <[email protected]>
(cherry picked from commit b277b07)

Co-authored-by: Martin Gaievski <[email protected]>
martin-gaievski added a commit that referenced this pull request Apr 24, 2024
* Change stream.findFirst to for loop

Signed-off-by: Martin Gaievski <[email protected]>
Co-authored-by: Navneet Verma <[email protected]>
(cherry picked from commit b277b07)
@martin-gaievski
Copy link
Member Author

This is an interesting improvement for such a small change. Really liked the deep-dive here.

Few things I would like you to do here just for bookkeeping purpose:

  1. The numbers and details which are quoted in the description please update them in the Issue.
  2. [Next step] Can you look at other places where we might be using streams in the hybrid query clause which may be adding some extra latency?
  1. Done
  2. Sure, we need to scan the code for similar calls. I haven't noticed any other places in code with similar problem that significantly add to performance of the query. For now the next step is to exclude TopDocsCollector from the flow when there is a hybrid query, that section has second highest latency.

@martin-gaievski
Copy link
Member Author

This is a great improvement, curious to know if you have checked the profiling after this optimization to see if the CPU usage goes down and which one is the next culprit to CPU usage

Next slowest section after Stream.findFirst is extra doc collector. It's added by core for all the queries it consumes resources (per my results it's from 40 to 75% of CPU time) but for hybrid query those results are just ignored. Current feasible approach requires changes in both core and the plugin.
I need to check how CPU usage is after the fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants