Added rescorer in hybrid query #917

Merged

@martin-gaievski martin-gaievski commented Sep 30, 2024

Description

Enable rescore functionality in hybrid query. Currently the rescore clause is ignored and the plain results of the hybrid query are returned. An example of a query with a rescore clause is included further down in this description.

The logic is similar to the one for traditional queries: the rescore query is applied to the search hits returned by the sub-queries, and the final order of documents is adjusted based on the match/no-match scores.

To implement this, we use a custom method to call the rescore. In QueryPhaseSearch we always return rescore == false, and then do the actual check (rescore or not) in the hybrid collector manager. We check for the rescore clause at the shard level, before individual sub-query results are merged into a single object. That minimizes the number of changes to the existing score normalization logic.
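The shard-level step described above can be pictured with a small sketch. It is not the code from this PR: the `Hit` class, the `rescorePerSubQuery` method, and the use of a `UnaryOperator` as the rescorer are all hypothetical names chosen for illustration. The only idea taken from the description is that rescoring is applied to each sub-query's hits on the shard before those hits are merged.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical sketch; none of these names come from the neural-search code base.
final class ShardLevelRescoreSketch {

    /** Minimal stand-in for one scored hit returned by a sub-query. */
    static final class Hit {
        final int docId;
        final float score;

        Hit(int docId, float score) {
            this.docId = docId;
            this.score = score;
        }
    }

    /**
     * Applies the rescorer to each sub-query's hits on a single shard,
     * before the per-sub-query lists are merged into one result object.
     * A null rescorer models a request without a rescore clause.
     */
    static List<List<Hit>> rescorePerSubQuery(List<List<Hit>> subQueryHits, UnaryOperator<List<Hit>> rescorer) {
        if (rescorer == null) {
            return subQueryHits; // no rescore clause: results pass through unchanged
        }
        List<List<Hit>> rescored = new ArrayList<>(subQueryHits.size());
        for (List<Hit> hits : subQueryHits) {
            rescored.add(rescorer.apply(hits));
        }
        return rescored;
    }
}
```

Keeping the per-sub-query lists separate until after this step is what lets the later normalization logic stay unchanged in this sketch.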

The following diagram shows the flow introduced with this PR:

[Image: Rescore - flow diagram new]

For comparison, the following diagram shows the flow of today's implementation:

[Image: Rescore - flow diagram existing]

Example of rescore usage:

Query with rescore

{
    "query": {
        "hybrid": {
            "queries": [
                {
                    "knn": {
                        "vector": {
                            "vector": [
                                4.2,
                                5.0,
                                8.5
                            ],
                            "k": 10
                        }
                    }
                },
                {
                    "range": {
                        "field1": {
                            "gte": 1,
                            "lte": 60
                        }
                    }
                }
            ]
        }
    },
    "rescore": [
        {
            "query": {
                "rescore_query": {
                    "range": {
                        "field1": {
                            "gte": 1,
                            "lte": 20
                        }
                    }
                },
                "score_mode": "total",
                "query_weight": 0.9,
                "rescore_query_weight": 1.2
            },
            "window_size": 10
        }
    ]
}
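To make the rescore parameters in the query above concrete, here is a minimal sketch of how `score_mode: total` combines scores with `query_weight` and `rescore_query_weight`. The class and method names are hypothetical. The handling of hits that do not match the rescore query (only the weighted original score is kept) reflects how the query rescorer is generally understood to behave and should be treated as an assumption, not as a quote of the OpenSearch source.

```java
// Hypothetical sketch of the "total" score mode; names are illustrative only.
final class RescoreTotalSketch {

    /**
     * Combines the first-pass score with the rescore query score.
     *
     * @param originalScore       score produced by the original (sub-)query
     * @param matchesRescoreQuery whether the hit matched the rescore query
     * @param rescoreScore        score produced by the rescore query for this hit
     * @param queryWeight         "query_weight" from the rescore clause
     * @param rescoreQueryWeight  "rescore_query_weight" from the rescore clause
     */
    static float rescoreTotal(
        float originalScore,
        boolean matchesRescoreQuery,
        float rescoreScore,
        float queryWeight,
        float rescoreQueryWeight
    ) {
        float weightedOriginal = originalScore * queryWeight;
        if (!matchesRescoreQuery) {
            // Assumption: non-matching hits keep only the weighted original score.
            return weightedOriginal;
        }
        // "total": add the weighted original score and the weighted rescore score.
        return weightedOriginal + rescoreScore * rescoreQueryWeight;
    }
}
```

With the weights from the example (0.9 and 1.2), a hit whose `field1` falls in [1, 20] gains the weighted rescore score on top of its weighted original score, while a hit outside that range does not, which is why the documents in range move ahead.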

Response with rescore as implemented in this PR, where documents with field1 in the range [1..20] are boosted:

{
    "took": 4,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 3,
            "relation": "eq"
        },
        "max_score": 0.6695955,
        "hits": [
            {
                "_index": "index-test",
                "_id": "BhGESpIB67ctKZAvyp6_",
                "_score": 0.6695955,
                "_source": {
                    "field1": 2,
                    "vector": [
                        0.4,
                        0.5,
                        0.2
                    ],
                    "title": "basic"
                }
            },
            {
                "_index": "index-test",
                "_id": "BxGESpIB67ctKZAv057_",
                "_score": 0.6695361,
                "_source": {
                    "field1": 10,
                    "vector": [
                        0.2,
                        0.2,
                        0.3
                    ],
                    "title": "java"
                }
            },
            {
                "_index": "index-test",
                "_id": "CBGESpIB67ctKZAv3p5K",
                "_score": 0.3199462,
                "_source": {
                    "field1": 50,
                    "vector": [
                        4.2,
                        5.5,
                        8.9
                    ]
                }
            }
        ]
    }
}

The same query before this PR, or without rescore, gives the following response. Note that the document with "field1" == 50 has the highest score, which is its normalized score with no rescoring applied.

{
    "took": 6,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 3,
            "relation": "eq"
        },
        "max_score": 0.7885865,
        "hits": [
            {
                "_index": "index-test",
                "_id": "CBGESpIB67ctKZAv3p5K",
                "_score": 0.7885865,
                "_source": {
                    "field1": 50,
                    "vector": [
                        4.2,
                        5.5,
                        8.9
                    ]
                }
            },
            {
                "_index": "index-test",
                "_id": "BhGESpIB67ctKZAvyp6_",
                "_score": 0.2954152,
                "_source": {
                    "field1": 2,
                    "vector": [
                        0.4,
                        0.5,
                        0.2
                    ],
                    "title": "basic"
                }
            },
            {
                "_index": "index-test",
                "_id": "BxGESpIB67ctKZAv057_",
                "_score": 0.29524556,
                "_source": {
                    "field1": 10,
                    "vector": [
                        0.2,
                        0.2,
                        0.3
                    ],
                    "title": "java"
                }
            }
        ]
    }
}

A few important notes regarding the new logic:

  • Scores are scaled according to the rules of the normalization/combination techniques, typically to the range [0, 1.0]. This is because the normalization process is still executed after the rescore logic is applied.

  • Rescore doesn't work with sorting; this is in parity with the behavior of traditional queries.
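The first note can be illustrated with a small sketch of min-max scaling applied to already rescored scores. This assumes a min-max style normalization technique and uses hypothetical names; the actual normalization in the plugin depends on the configured technique and may treat edge cases differently.

```java
// Hypothetical sketch of min-max normalization applied after rescoring.
final class MinMaxNormalizationSketch {

    /** Scales scores into [0, 1]; if all scores are equal, each maps to 1.0. */
    static float[] normalize(float[] scores) {
        if (scores.length == 0) {
            return new float[0];
        }
        float min = scores[0];
        float max = scores[0];
        for (float s : scores) {
            min = Math.min(min, s);
            max = Math.max(max, s);
        }
        float[] normalized = new float[scores.length];
        for (int i = 0; i < scores.length; i++) {
            // Guard against division by zero when every score is identical.
            normalized[i] = (max == min) ? 1.0f : (scores[i] - min) / (max - min);
        }
        return normalized;
    }
}
```

Even if rescoring pushes a raw score above 1.0, the values seen in the response stay within the normalized range, because this step runs afterwards.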

Related Issues

#914

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@martin-gaievski martin-gaievski added the Enhancements label Sep 30, 2024
@@ -161,28 +164,84 @@ private List<ReduceableSearchResult> getSearchResults(final List<HybridSearchCol
List<ReduceableSearchResult> results = new ArrayList<>();
DocValueFormat[] docValueFormats = getSortValueFormats(sortAndFormats);
for (HybridSearchCollector collector : hybridSearchCollectors) {
TopDocsAndMaxScore topDocsAndMaxScore = getTopDocsAndAndMaxScore(collector, docValueFormats);
boolean isSortEnabled = docValueFormats != null;

This is a small refactoring, not directly related to the rescore change.

if (isSortEnabled) {
return getSortedTopDocsAndMaxScore(topDocs, hybridSearchCollector);
}
return getRescoredTopDocsAndMaxScore(topDocs, hybridSearchCollector);

@martin-gaievski martin-gaievski Sep 30, 2024

This one is more of a refactoring, not directly related to the rescore logic.

@navneet1v

  • Rescore doesn't work with sorting, this is in parity with same behavior for traditional query.

Does a bool query throw an error in this case?

@navneet1v navneet1v left a comment

Overall the code looks good to me. I see that we have used code similar to the rescore processor from OpenSearch core. Would love to see if we can create a GH issue to track how we can reuse the code from OS core here.

@martin-gaievski

  • Rescore doesn't work with sorting, this is in parity with same behavior for traditional query.

Does a bool query throw an error in this case?

Yes, a bool query returns a 400 response with the same error message. The following is an example response from a bool query with both rescore and sort clauses:

{
    "error": {
        "root_cause": [
            {
                "type": "illegal_argument_exception",
                "reason": "Cannot use [sort] option in conjunction with [rescore]."
            }
        ],
        "type": "search_phase_execution_exception",
        "reason": "all shards failed",
        "phase": "query",
        "grouped": true,
        "failed_shards": [
            {
                "shard": 0,
                "index": "index-test",
                "node": "jv4YW4ipTP2SLRQOraPkQw",
                "reason": {
                    "type": "illegal_argument_exception",
                    "reason": "Cannot use [sort] option in conjunction with [rescore]."
                }
            }
        ],
        "caused_by": {
            "type": "illegal_argument_exception",
            "reason": "Cannot use [sort] option in conjunction with [rescore].",
            "caused_by": {
                "type": "illegal_argument_exception",
                "reason": "Cannot use [sort] option in conjunction with [rescore]."
            }
        }
    },
    "status": 400
}
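The check behind this error can be sketched as a simple guard. The class and method names are hypothetical and this is not the OpenSearch source; only the exception type and message are taken from the response above.

```java
// Hypothetical sketch of a guard that rejects sort combined with rescore.
final class RescoreSortValidationSketch {

    /** Throws if a request carries both a sort clause and a rescore clause. */
    static void validate(boolean hasSort, boolean hasRescore) {
        if (hasSort && hasRescore) {
            throw new IllegalArgumentException("Cannot use [sort] option in conjunction with [rescore].");
        }
    }
}
```

A request with only one of the two clauses passes the guard; only the combination is rejected, which matches the behavior described for both bool and hybrid queries.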

@martin-gaievski

Overall the code looks good to me. I see that we have used code similar to the rescore processor from OpenSearch core. Would love to see if we can create a GH issue to track how we can reuse the code from OS core here.

sure, that makes sense. I'll create an issue for core and put the link in this PR

@vibrantvarun vibrantvarun left a comment

LGTM, Thanks Martin. Great job👏

@martin-gaievski

Created a new GH issue in core about adding more flexibility to the rescore processor: opensearch-project/OpenSearch#16183

@martin-gaievski martin-gaievski merged commit 9f4a49a into opensearch-project:main Oct 3, 2024
38 of 39 checks passed
@opensearch-trigger-bot

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-2.x 2.x
# Navigate to the new working tree
cd .worktrees/backport-2.x
# Create a new branch
git switch --create backport/backport-917-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 9f4a49a7e45211821d96181ce2a6842af18ce7ea
# Push it to GitHub
git push --set-upstream origin backport/backport-917-to-2.x
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-917-to-2.x.

martin-gaievski added a commit that referenced this pull request Oct 3, 2024
* Initial version for rescorer

Signed-off-by: Martin Gaievski <[email protected]>
(cherry picked from commit 9f4a49a)
martin-gaievski added a commit to martin-gaievski/neural-search that referenced this pull request Oct 4, 2024
* Initial version for rescorer

Signed-off-by: Martin Gaievski <[email protected]>
(cherry picked from commit 9f4a49a)
Signed-off-by: Martin Gaievski <[email protected]>
martin-gaievski added a commit that referenced this pull request Oct 4, 2024
* Initial version for rescorer

Signed-off-by: Martin Gaievski <[email protected]>
(cherry picked from commit 9f4a49a)
Signed-off-by: Martin Gaievski <[email protected]>
martin-gaievski added a commit that referenced this pull request Oct 4, 2024
* Added rescorer in hybrid query (#917)

* Initial version for rescorer

Signed-off-by: Martin Gaievski <[email protected]>
(cherry picked from commit 9f4a49a)
Signed-off-by: Martin Gaievski <[email protected]>
martin-gaievski added a commit that referenced this pull request Oct 4, 2024
* Added rescorer in hybrid query (#917)

* Initial version for rescorer

Signed-off-by: Martin Gaievski <[email protected]>
(cherry picked from commit 9f4a49a)
Signed-off-by: Martin Gaievski <[email protected]>
(cherry picked from commit fe50537)
martin-gaievski added a commit to martin-gaievski/neural-search that referenced this pull request Oct 4, 2024
opensearch-project#924)

* Added rescorer in hybrid query (opensearch-project#917)

* Initial version for rescorer

Signed-off-by: Martin Gaievski <[email protected]>
(cherry picked from commit 9f4a49a)
Signed-off-by: Martin Gaievski <[email protected]>
(cherry picked from commit fe50537)
martin-gaievski added a commit to Johnsonisaacn/neural-search that referenced this pull request Oct 11, 2024
* Initial version for rescorer

Signed-off-by: Martin Gaievski <[email protected]>
zhichao-aws pushed a commit to zhichao-aws/neural-search that referenced this pull request Jan 6, 2025
* Initial version for rescorer

Signed-off-by: Martin Gaievski <[email protected]>
@will-hwang will-hwang mentioned this pull request Jan 7, 2025
5 tasks
Labels
  • backport 2.x: Label will add auto workflow to backport PR to 2.x branch
  • Enhancements: Increases software capabilities beyond original client specifications
  • v2.18.0
3 participants