[Feature] Neural Sparse Query Two Phase Search pipeline #747

Merged
merged 15 commits into from
Jun 4, 2024

Conversation

@conggguan conggguan commented May 14, 2024

Description

This change implements #646.

  • Speeds up the neural sparse query by executing it in two phases.
  • Supports top-level, boolean, and boost compound queries; other compound queries fall back to the original logic and speed.
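As a rough illustration of the two-phase idea, here is a minimal sketch. It assumes that "two-phase" means scoring first with the highest-weight query tokens and deferring the remaining tokens to a later rescoring step, and that the `ratio` seen in the diff acts as a threshold relative to the largest token weight. The class `TwoPhaseTokenSplit` and its fields are hypothetical and are not part of the plugin.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not a class from the neural-search plugin.
final class TwoPhaseTokenSplit {
    final Map<String, Float> highWeightTokens = new HashMap<>();
    final Map<String, Float> lowWeightTokens = new HashMap<>();

    // Assumption: tokens with weight >= ratio * maxWeight are used in phase one,
    // and all remaining tokens are kept for the second (rescoring) phase.
    static TwoPhaseTokenSplit split(Map<String, Float> queryTokens, float ratio) {
        TwoPhaseTokenSplit result = new TwoPhaseTokenSplit();

        // Find the largest token weight in the query.
        float maxWeight = 0f;
        for (float weight : queryTokens.values()) {
            maxWeight = Math.max(maxWeight, weight);
        }

        // Partition the tokens around the threshold.
        float threshold = ratio * maxWeight;
        for (Map.Entry<String, Float> entry : queryTokens.entrySet()) {
            if (entry.getValue() >= threshold) {
                result.highWeightTokens.put(entry.getKey(), entry.getValue());
            } else {
                result.lowWeightTokens.put(entry.getKey(), entry.getValue());
            }
        }
        return result;
    }
}
```

With weights `{search: 2.0, engine: 1.0, the: 0.1}` and a ratio of 0.4, the threshold is 0.8, so `search` and `engine` would be scored in the first phase and `the` would be left for rescoring.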

Supported query types

  • NeuralSparseQuery
  • NeuralSparseQuery nested in BoostQuery
  • NeuralSparseQuery nested in BooleanQuery

Issues Resolved

Resolve #646

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed as per the DCO using --signoff

documentation-website issue

opensearch-project/documentation-website#7289

BWC PR

On the way...

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@github-actions github-actions bot added the Features Introduces a new unit of functionality that satisfies a requirement label May 14, 2024
@conggguan conggguan changed the title Search pipeline [Feature] Neural Sparse Query Two Phase Search pipeline May 14, 2024
Member

@martin-gaievski martin-gaievski left a comment

General comments for this PR:

  • Please consider a simpler and more optimal low-level design; the logic is a bit overcomplicated.
  • Formatting: please wrap the body of every if or else branch in braces, even if it is a one-liner.

    float baseBoost
) {
    baseBoost *= queryBuilder.boost();
    neuralSparseQueryBuilderFloatMap.put(queryBuilder.getCopyNeuralSparseQueryBuilderForTwoPhase(ratio), baseBoost);
Member

Another comment: it seems we create a copy of the NeuralSparseQueryBuilder every time we reach this point. In the worst case, how many times can we hit this place? Do we really need a new copy every time?

Contributor Author

Indeed, consolidating all query tokens to construct a single NeuralSparseQueryBuilder is a viable strategy. This approach could improve overall query efficiency, but it might also complicate the logic. Queries involving a large number of NeuralSparseQuery components are probably uncommon. Given that, do you think the more complicated but more efficient logic is essential?

Member

If we can keep correctness and improve performance, I would vote for that approach. It also aligns with one of the PR's goals, which is to improve query latency. Could you run a benchmark to see how much latency improvement we get if we reuse a single object?
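The consolidation idea raised in this thread can be sketched as follows. This is only an illustration under stated assumptions: all clauses target the same field, and scoring is a plain weighted sum over tokens, so that summing boosted weights per token yields the same total score as evaluating each clause separately. `TokenConsolidation` and `Clause` are hypothetical names, not plugin classes.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these types exist in the neural-search plugin.
final class TokenConsolidation {

    // Stand-in for one neural sparse clause: its token weights plus its effective boost.
    static final class Clause {
        final Map<String, Float> tokens;
        final float boost;

        Clause(Map<String, Float> tokens, float boost) {
            this.tokens = tokens;
            this.boost = boost;
        }
    }

    // Merge all clauses into one token map, scaling each weight by the clause boost
    // and summing the weights of tokens that appear in more than one clause.
    static Map<String, Float> consolidate(List<Clause> clauses) {
        Map<String, Float> merged = new HashMap<>();
        for (Clause clause : clauses) {
            for (Map.Entry<String, Float> entry : clause.tokens.entrySet()) {
                merged.merge(entry.getKey(), entry.getValue() * clause.boost, Float::sum);
            }
        }
        return merged;
    }
}
```

Whether this is faster than keeping one copy per clause is exactly what the requested benchmark would have to show; the sketch makes no claim about that.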

@conggguan conggguan force-pushed the search-pipeline branch 2 times, most recently from fd3d634 to 0a7bd74 Compare May 20, 2024 03:36

codecov bot commented May 20, 2024

Codecov Report

Attention: Patch coverage is 79.02098%, with 30 lines in your changes missing coverage. Please review.

Project coverage is 84.53%. Comparing base (7c54c86) to head (a93c8cd).
Report is 7 commits behind head on main.

Current head a93c8cd differs from pull request most recent head a53966c

Please upload reports for the commit a53966c to get more accurate results.

| Files | Patch % | Lines |
|---|---|---|
| ...h/neuralsearch/query/NeuralSparseQueryBuilder.java | 60.37% | 14 Missing and 7 partials ⚠️ |
| ...earch/processor/NeuralSparseTwoPhaseProcessor.java | 89.88% | 2 Missing and 7 partials ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##               main     #747      +/-   ##
============================================
- Coverage     85.02%   84.53%   -0.50%     
- Complexity      790      807      +17     
============================================
  Files            60       61       +1     
  Lines          2430     2534     +104     
  Branches        410      427      +17     
============================================
+ Hits           2066     2142      +76     
- Misses          202      220      +18     
- Partials        162      172      +10     


Signed-off-by: conggguan <[email protected]>
if (queryBuilder instanceof BoolQueryBuilder) {
    BoolQueryBuilder boolQueryBuilder = (BoolQueryBuilder) queryBuilder;
    float updatedBoost = baseBoost * boolQueryBuilder.boost();
    for (QueryBuilder subQuery : boolQueryBuilder.should()) {
Member

In the last PR we also supported BoostQuery, and used clause.isScoring() to decide whether to support a boolean clause. Is the behavior the same here?
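The boost propagation visible in the diff fragments of this PR (multiplying `baseBoost` by each builder's `boost()` while descending into `should()` clauses) can be sketched in isolation. The query model below is hypothetical and deliberately minimal; it does not use the OpenSearch `QueryBuilder` classes, handles only `should` clauses, and does not model BoostQuery or `clause.isScoring()`.

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical mini query model for illustration; not the OpenSearch API.
final class BoostPropagationSketch {

    interface Query {
        float boost();
    }

    static final class SparseQuery implements Query {
        private final float boost;
        SparseQuery(float boost) { this.boost = boost; }
        public float boost() { return boost; }
    }

    static final class BoolQuery implements Query {
        private final float boost;
        final List<Query> should = new ArrayList<>();
        BoolQuery(float boost) { this.boost = boost; }
        public float boost() { return boost; }
    }

    // Any query type the traversal does not understand.
    static final class OtherQuery implements Query {
        public float boost() { return 1.0f; }
    }

    // Collect every reachable sparse query together with its accumulated boost.
    // An identity map keeps two equal-looking sparse queries as separate entries.
    static Map<SparseQuery, Float> collect(Query root) {
        Map<SparseQuery, Float> out = new IdentityHashMap<>();
        collect(root, 1.0f, out);
        return out;
    }

    private static void collect(Query query, float baseBoost, Map<SparseQuery, Float> out) {
        float updatedBoost = baseBoost * query.boost();
        if (query instanceof SparseQuery) {
            out.put((SparseQuery) query, updatedBoost);
        } else if (query instanceof BoolQuery) {
            for (Query subQuery : ((BoolQuery) query).should) {
                collect(subQuery, updatedBoost, out);
            }
        }
        // Other query types are skipped, so they keep their original behavior.
    }
}
```

For a bool query with boost 2.0 that contains a sparse query with boost 1.5 and a nested bool query with boost 0.5 wrapping a sparse query with boost 4.0, the collected boosts are 3.0 and 4.0 respectively.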

@conggguan conggguan force-pushed the search-pipeline branch 3 times, most recently from 5fa1e74 to 7846f5d Compare May 27, 2024 10:33
Member

@zhichao-aws zhichao-aws left a comment

lgtm

if (queryTokensSupplier != null) {
    builder.append(queryTokensSupplier.get());
}
if (twoPhaseSharedQueryToken != null) {
Member

We don't need to do the null check here

Member

@martin-gaievski martin-gaievski left a comment

We need a BWC test for this change. Are we covered by an existing test, or do we need a new one?

@conggguan
Contributor Author

We need a BWC test for this change. Are we covered by an existing test, or do we need a new one?

I think we need a BWC test, and it would be better to perform it after this code is merged. Currently, I can't build a BWC test to invoke the code from this PR since it hasn't been merged yet. I will add a BWC test as soon as this PR is merged.

Is this a good solution?

conggguan added 2 commits June 2, 2024 12:39
…e search pipeline to neural sparse query builder.

Signed-off-by: conggguan <[email protected]>
@martin-gaievski
Member

We need a BWC test for this change. Are we covered by an existing test, or do we need a new one?

I think we need a BWC test, and it would be better to perform it after this code is merged. Currently, I can't build a BWC test to invoke the code from this PR since it hasn't been merged yet. I will add a BWC test as soon as this PR is merged.

Is this a good solution?

That works, although I'm not sure what issue you're facing, as a BWC test should be able to use code from an active PR. Please make sure the PR with the BWC test is merged in the same release, which for this change is 2.15.

@zhichao-aws zhichao-aws merged commit 2b21110 into opensearch-project:main Jun 4, 2024
71 checks passed
@zhichao-aws zhichao-aws added the backport 2.x Label will add auto workflow to backport PR to 2.x branch label Jun 4, 2024
opensearch-trigger-bot bot pushed a commit that referenced this pull request Jun 4, 2024
* Poc of pipeline

Signed-off-by: conggguan <[email protected]>

* Complete some settings for two phase pipeline.

Signed-off-by: conggguan <[email protected]>

* Change the implementation of two-phase from QueryBuilderVisitor to a custom process function.

Signed-off-by: conggguan <[email protected]>

* Add IT and fix some bugs in the state of multiple identical NeuralSparseQueryBuilder instances.

Signed-off-by: conggguan <[email protected]>

* Simplify some logic, and correct some format.

Signed-off-by: conggguan <[email protected]>

* Optimize some format.

Signed-off-by: conggguan <[email protected]>

* Add some test case.

Signed-off-by: conggguan <[email protected]>

* Optimize some logic for zhichao-aws's comments.

Signed-off-by: conggguan <[email protected]>

* Optimize a line without application.

Signed-off-by: conggguan <[email protected]>

* Add some comments, remove some redundant lines, fix some format.

Signed-off-by: conggguan <[email protected]>

* Remove a redundant null check, fix a if format.

Signed-off-by: conggguan <[email protected]>

* Fix a typo for a comment, camelcase format for some variable.

Signed-off-by: conggguan <[email protected]>

* Add some comments to illustrate the influence of the modify on 2-phase search pipeline to neural sparse query builder.

Signed-off-by: conggguan <[email protected]>

---------

Signed-off-by: conggguan <[email protected]>
Signed-off-by: conggguan <[email protected]>
(cherry picked from commit 2b21110)
zane-neo pushed a commit that referenced this pull request Jun 4, 2024

Co-authored-by: conggguan <[email protected]>
Labels
  • backport 2.x: Label will add auto workflow to backport PR to 2.x branch
  • Features: Introduces a new unit of functionality that satisfies a requirement
  • v2.15.0
Development

Successfully merging this pull request may close these issues.

[RFC] Enhancing Neural Sparse Query Speed with a Two-Phase Approach
4 participants