Replaced stream.findFirst by for loop for hybrid query #706
Signed-off-by: Martin Gaievski <[email protected]>
0376e9b to 348c1a8
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@             Coverage Diff              @@
##               main     #706      +/-   ##
============================================
+ Coverage     84.04%   84.74%   +0.69%
- Complexity      744      774      +30
============================================
  Files            59       59
  Lines          2313     2412      +99
  Branches        374      405      +31
============================================
+ Hits           1944     2044     +100
+ Misses          214      203      -11
- Partials        155      165      +10
This is an interesting improvement for such a small change. I really liked the deep dive here. A few things I would like you to do here, just for bookkeeping purposes:
Great catch @martin-gaievski. LGTM.
This is a great improvement. I'm curious whether you have checked the profiling after this optimization to see if CPU usage goes down, and which section is the next biggest contributor to CPU usage?
Co-authored-by: Navneet Verma <[email protected]> Signed-off-by: Martin Gaievski <[email protected]>
* Change stream.findFirst to for loop Signed-off-by: Martin Gaievski <[email protected]> Co-authored-by: Navneet Verma <[email protected]> (cherry picked from commit b277b07)
The backport to
To backport manually, run these commands in your terminal:
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-2.13 2.13
# Navigate to the new working tree
cd .worktrees/backport-2.13
# Create a new branch
git switch --create backport/backport-706-to-2.13
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 b277b07e89e25f1abaa9de3a326fda3556dc8a77
# Push it to GitHub
git push --set-upstream origin backport/backport-706-to-2.13
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-2.13
Then, create a pull request where the
* Change stream.findFirst to for loop Signed-off-by: Martin Gaievski <[email protected]> Co-authored-by: Navneet Verma <[email protected]> (cherry picked from commit b277b07) Co-authored-by: Martin Gaievski <[email protected]>
The next slowest section after Stream.findFirst is the extra doc collector. Core adds it for all queries and it consumes resources (per my results, 40 to 75% of CPU time), but for hybrid query its results are simply ignored. The currently feasible approach requires changes in both core and the plugin.
Description
Hybrid query is generally slower than other compound queries with similar child sub-queries/clauses. For instance, compared to a Boolean query it can be up to 12 times slower, depending on the dataset, query, and index/cluster configuration. See the results of a benchmark that I ran on the released 2.13 using the noaa OSB workload; all times are in ms:
Based on profiling results, most of the CPU time (35 to 40%) is taken by the Stream.findFirst call in HybridQueryScorer.
That code is executed for each document returned by each sub-query, which explains the much longer execution time for queries that return larger subsets of the dataset.
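To make the cost pattern concrete, here is a minimal sketch of a per-document lookup written two ways: with a stream ending in findFirst over boxed Integer values, and with a plain loop over an int array. The class and method names (FirstMatchSketch, indexOfWithStream, indexOfWithLoop) are hypothetical and chosen for illustration only; this is not the actual HybridQueryScorer code.

```java
import java.util.List;
import java.util.Optional;
import java.util.stream.IntStream;

// Hypothetical illustration; not taken from the plugin source.
public class FirstMatchSketch {

    // Stream-based lookup: every call builds a pipeline and boxes each index.
    // Cheap once, expensive when repeated for every collected document.
    static int indexOfWithStream(List<Integer> values, int target) {
        Optional<Integer> match = IntStream.range(0, values.size())
            .boxed()
            .filter(i -> values.get(i) == target)
            .findFirst();
        return match.orElse(-1);
    }

    // Loop-based lookup over a primitive array: no pipeline setup, no boxing.
    static int indexOfWithLoop(int[] values, int target) {
        for (int i = 0; i < values.length; i++) {
            if (values[i] == target) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<Integer> boxed = List.of(7, 42, 13, 42);
        int[] primitive = {7, 42, 13, 42};

        System.out.println(indexOfWithStream(boxed, 42));   // prints 1
        System.out.println(indexOfWithLoop(primitive, 42)); // prints 1
        System.out.println(indexOfWithLoop(primitive, 99)); // prints -1
    }
}
```

Both methods return the same index; the difference is only in the work done per call, which matters when the call sits on a path executed once per matching document.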
That section of the code can be optimized to a plain for loop, and the list of Integer is replaced by a plain array of ints. After the optimization, the same code section takes 5 to 8% of overall execution time. Total time for a clean hybrid query has decreased by a factor of 3 to 4 for large subsets. Below are detailed results for the same workload:
The following bool queries were used in testing:
The equivalent hybrid queries are:
Issues Resolved
#705
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.