Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix runtime exceptions in hybrid query for case when sub-query scorer return TwoPhase iterator that is incompatible with DISI iterator #624

Conversation

martin-gaievski
Copy link
Member

@martin-gaievski martin-gaievski commented Mar 3, 2024

Description

Adding approximation two phase iterator to hybrid query. That helps in scenarios when sub-queries are complex and direct iteration over all scorers is expensive and may lead to instability, like runtime exceptions.

Example of such query is bool with filter some complex queries in should like dis_max.

"query": {
        "hybrid": {
            "queries": [
                {
                    "bool": {
                        "filter": [
                            {
                                "terms": {
                                    "field1": [
                                        "value1",
                                        "value2"
                                    ]
                                }
                            }
                        ],
                        "should": {
                            "dis_max": {
                                "queries": [
                                    {
                                        "multi_match": {
                                            "query": "{{query_text}}",
                                            "type": "phrase_prefix"
                                        }
                                    },
                                    {
                                        "multi_match": {
                                            "query": "{{query_text}}",
                                            "type": "most_fields"
                                        }
                                    }
                                ]
}}}}]}}

System uses ReqOptSumScorer that is based on approximation iterator for scores and two phase iterator for docs. In underlying query clause is complex enough and data set is large avoiding such iterations and using scorers directly may lead to runtime exceptions described in #621

Key changes in this PR:

  • added two phase iterator for scorer
  • add collector manager, inject it at preProcess of aggregationProcessor phase in search
  • make query phase search light, only validate for hybrid query special cases
  • change the way sub-queries are rewritten - instead of static Rewriteable.rewrite switch to QueryBuild.rewrite

Issues Resolved

#621

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@martin-gaievski martin-gaievski added Bug Fixes Changes to a system or product designed to handle a programming bug/glitch v2.13.0 labels Mar 3, 2024
@martin-gaievski martin-gaievski self-assigned this Mar 3, 2024
Signed-off-by: Martin Gaievski <[email protected]>
@martin-gaievski martin-gaievski force-pushed the hybrid_query_scorers_with_approximations branch from 46775d3 to d48c06c Compare March 3, 2024 06:27
Copy link

codecov bot commented Mar 3, 2024

Codecov Report

Attention: Patch coverage is 70.19231% with 93 lines in your changes are missing coverage. Please review.

Project coverage is 82.70%. Comparing base (b97dbe8) to head (65553e7).
Report is 1 commits behind head on main.

Files Patch % Lines
...ensearch/neuralsearch/query/HybridQueryScorer.java 63.91% 27 Missing and 8 partials ⚠️
...ralsearch/search/query/HybridCollectorManager.java 77.77% 8 Missing and 12 partials ⚠️
...ensearch/neuralsearch/query/HybridQueryWeight.java 61.76% 10 Missing and 3 partials ⚠️
...euralsearch/search/HybridTopScoreDocCollector.java 54.16% 6 Missing and 5 partials ⚠️
...arch/query/HybridScoreBlockBoundaryPropagator.java 78.78% 4 Missing and 3 partials ⚠️
...earch/search/query/HybridAggregationProcessor.java 70.83% 5 Missing and 2 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main     #624      +/-   ##
============================================
- Coverage     84.44%   82.70%   -1.74%     
- Complexity      604      650      +46     
============================================
  Files            48       51       +3     
  Lines          1826     2053     +227     
  Branches        276      329      +53     
============================================
+ Hits           1542     1698     +156     
- Misses          161      212      +51     
- Partials        123      143      +20     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@martin-gaievski martin-gaievski force-pushed the hybrid_query_scorers_with_approximations branch 3 times, most recently from 6cd6b77 to dd27905 Compare March 4, 2024 21:22
@martin-gaievski
Copy link
Member Author

@msfroh @reta - can you please take a look if there are some issues in how we've added approximation and two phase iterators?

@jmazanec15
Copy link
Member

@martin-gaievski Could you provide additional context on why rewrite needs to be changed?

@jmazanec15
Copy link
Member

System uses ReqOptSumScorer that is based on approximation iterator for scores and two phase iterator for docs.

Are we actually using ReqOptSumScorer in this PR? I didnt see it. Seems like we model the new scorer off of it.

@msfroh
Copy link

msfroh commented Mar 6, 2024

I suspect that this is very similar to opensearch-project/OpenSearch#8155.

I'm going to take a closer look at ReqOptSumScorer, as it was implicated in that issue as well. As @noCharger and I were digging into that issue, we found that normally DISI and TwoPhaseIterator are "adaptable" into each other (where a DISI as a TPI is basically an "exact approximation" and a TPI as a DISI advances until it finds a doc where matches() is true).

I suspect that ReqOptSumScorer may be returning a TwoPhaseIterator that somehow doesn't give proper DISI behavior.

Anyway, I will take a look at that after I review this PR.

@martin-gaievski
Copy link
Member Author

@martin-gaievski Could you provide additional context on why rewrite needs to be changed?

With existing code does the rewrite of sub-queries via Rewriteable.rewrite method, with new approach that is switched to AbstractQueryBuilder.doRewrite. For many queries that doRewrite method has query specific logic that is simply skipped with today's approach, for example Term query or Dismax query.

Makes sense. But is this required for 2 phase iterator for hybrid query or is it unrelated to that component of this change?

It's needed in kind of indirect form - because rewrite is not completed for some queries AND we get scorers for all sub-queries via scorerSupplier that leads to some scorers being incorrectly set to null, meaning results of those sub-queries are ignored. With today's approach we still have those correct results from scorer

@martin-gaievski martin-gaievski force-pushed the hybrid_query_scorers_with_approximations branch from 8e5527b to 8243511 Compare March 7, 2024 02:05
@martin-gaievski martin-gaievski changed the title Adding two phase iterator for hybrid query Fix runtime exceptions in hybrid query for case when sub-query scorer return TwoPhase iterator that is incompatible with DISI iterator Mar 7, 2024
@martin-gaievski martin-gaievski force-pushed the hybrid_query_scorers_with_approximations branch from 8243511 to 67df195 Compare March 7, 2024 02:12
* In most cases it will be wrapped in MultiCollectorManager.
*/
@RequiredArgsConstructor
public abstract class HybridCollectorManager implements CollectorManager<Collector, ReduceableSearchResult> {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this related to the TwoPhaseIterator fix? Or is it more about allowing hybrid queries to run when concurrent search is enabled?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its actually both. but @martin-gaievski can add more here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

correct, that's both. It's needed in scope of two phase iterator because of the query search context, with today's implementation we're creating new collector specific to hybrid query and that effectively drops some logic from QueryPhase.executeInternal related to construction of that context.

@martin-gaievski martin-gaievski requested a review from reta March 8, 2024 22:00
@reta
Copy link
Contributor

reta commented Mar 8, 2024

Thanks @martin-gaievski , the part related to HybridCollectorManager looks good to me, thank you for making the changes!

Copy link
Collaborator

@navneet1v navneet1v left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One comment. Overall code is good.

@martin-gaievski martin-gaievski added the backport 2.x Label will add auto workflow to backport PR to 2.x branch label Mar 11, 2024
@martin-gaievski martin-gaievski merged commit c9cdcc1 into opensearch-project:main Mar 11, 2024
62 of 64 checks passed
@opensearch-trigger-bot
Copy link
Contributor

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-2.x 2.x
# Navigate to the new working tree
cd .worktrees/backport-2.x
# Create a new branch
git switch --create backport/backport-624-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 c9cdcc148cd176becfc1456c9f27ab90aa4bfcf5
# Push it to GitHub
git push --set-upstream origin backport/backport-624-to-2.x
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-624-to-2.x.

martin-gaievski added a commit to martin-gaievski/neural-search that referenced this pull request Mar 11, 2024
… return TwoPhase iterator that is incompatible with DISI iterator (opensearch-project#624)

* Adding two phase iterator

Signed-off-by: Martin Gaievski <[email protected]>
(cherry picked from commit c9cdcc1)
martin-gaievski added a commit that referenced this pull request Mar 11, 2024
… return TwoPhase iterator that is incompatible with DISI iterator (#624) (#628)

* Adding two phase iterator

Signed-off-by: Martin Gaievski <[email protected]>
(cherry picked from commit c9cdcc1)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x Label will add auto workflow to backport PR to 2.x branch Bug Fixes Changes to a system or product designed to handle a programming bug/glitch v2.13.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants