Use the Lucene Distance Calculation Function in Script Scoring for doing exact search #1699
…ing exact search Signed-off-by: Ryan Bogan <[email protected]>
int numZeroInInput = 0;
int numZeroInQuery = 0;
float cosine = 0.0f;
for (int i = 0; i < inputVector.length; i++) {
    if (inputVector[i] == 0) {
        numZeroInInput++;
    }

    if (queryVector[i] == 0) {
        numZeroInQuery++;
    }
}
float normalizedProduct = normQueryVector * normInputVector;
if (normalizedProduct == 0) {
if (numZeroInInput == inputVector.length || numZeroInQuery == queryVector.length) {
    return cosine;
}
try {
    cosine = VectorUtil.cosine(queryVector, inputVector);
} catch (IllegalArgumentException e) {
Doesn't Lucene have a cosine function that we can leverage directly?
We use the Lucene cosine function on line 159. The rest just returns 0 if either the input or the query vector is all zeros.
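The zero-vector handling described in this thread can be sketched roughly as follows. This is a standalone illustration, not the plugin's code: the class and method names are hypothetical, and the inner loop stands in for the call to Lucene's VectorUtil.cosine(float[], float[]) so that the sketch runs without Lucene on the classpath.

```java
// Sketch only: hypothetical names, plain Java, no Lucene dependency.
final class GuardedCosineSketch {
    private GuardedCosineSketch() {}

    /** Returns true when every component of the vector is exactly 0. */
    static boolean isZeroVector(float[] v) {
        for (float x : v) {
            if (x != 0.0f) {
                return false;
            }
        }
        return true;
    }

    /** Cosine similarity that returns 0 when either vector is all zeros. */
    static float cosine(float[] query, float[] input) {
        if (query.length != input.length) {
            throw new IllegalArgumentException("vector dimensions differ");
        }
        // Guard first: the cosine of a zero vector is undefined (0/0),
        // so return the minimum score instead of computing it.
        if (isZeroVector(query) || isZeroVector(input)) {
            return 0.0f;
        }
        // Stand-in for VectorUtil.cosine(query, input) in this sketch.
        double dot = 0.0;
        double normQuery = 0.0;
        double normInput = 0.0;
        for (int i = 0; i < query.length; i++) {
            dot += (double) query[i] * input[i];
            normQuery += (double) query[i] * query[i];
            normInput += (double) input[i] * input[i];
        }
        return (float) (dot / Math.sqrt(normQuery * normInput));
    }
}
```

With the guard in place, a zero query or input vector yields a score of 0 rather than NaN, which pushes such results to the end of the ranking.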
      normInputVector += inputVector[i] * inputVector[i];
  }
  float normalizedProduct = normQueryVector * normInputVector;
  if (normalizedProduct == 0) {
      logger.debug("Invalid vectors for cosine. Returning minimum score to put this result to end");
      return 0.0f;
  }
- return (float) (dotProduct / (Math.sqrt(normalizedProduct)));
+ return (float) (VectorUtil.dotProduct(queryVector, inputVector) / (Math.sqrt(normalizedProduct)));
If we compute cosine as dotProduct divided by the normalized product, this adds one more iteration over the vectors, because the original L108 already computes dotProduct inside the existing loop.
PS: I checked Lucene#DefaultVectorUtilSupport#cosine(float, float), and it performs the cosine normalization itself.
I used that below. I'll see if I can get it to work with the normVector present in this method.
I see there is return (float) (sum / Math.sqrt((double) norm1 * (double) norm2)); in Lucene#DefaultVectorUtilSupport#cosine(float, float), so we can use it directly in public static float cosinesimilOptimized without using dotProduct.
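To make the iteration-count concern concrete, here is a minimal sketch comparing a single-pass cosine, which ends with the same expression quoted above, against a variant that computes the dot product in a separate loop. The class and method names are hypothetical, and this is plain Java rather than Lucene's implementation. Neither method guards against zero vectors.

```java
// Sketch only: hypothetical names, written to illustrate the loop counts.
final class SinglePassCosineSketch {
    private SinglePassCosineSketch() {}

    private static void requireSameLength(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vector dimensions differ");
        }
    }

    /** One loop: the dot product and both squared norms are accumulated together. */
    static float singlePass(float[] a, float[] b) {
        requireSameLength(a, b);
        float sum = 0.0f;
        float norm1 = 0.0f;
        float norm2 = 0.0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
            norm1 += a[i] * a[i];
            norm2 += b[i] * b[i];
        }
        return (float) (sum / Math.sqrt((double) norm1 * (double) norm2));
    }

    /** Two loops: a separate dot-product pass, then a pass for the norms. */
    static float twoPass(float[] a, float[] b) {
        requireSameLength(a, b);
        float dot = 0.0f;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
        }
        float norm1 = 0.0f;
        float norm2 = 0.0f;
        for (int i = 0; i < a.length; i++) {
            norm1 += a[i] * a[i];
            norm2 += b[i] * b[i];
        }
        return (float) (dot / Math.sqrt((double) norm1 * (double) norm2));
    }
}
```

Both methods return the same value; the difference is only that the second walks the vectors twice, which is the extra iteration raised in this thread.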
Where are you seeing the DefaultVectorUtilSupport class? I've only been able to find VectorUtil so far, and that class doesn't have a cosine method that takes floats, only float[].
Where are you seeing the DefaultVectorUtilSupport class?
Looks like it's also used in KNNScoringSpace as well: https://github.com/opensearch-project/k-NN/blob/main/src/main/java/org/opensearch/knn/plugin/script/KNNScoringSpace.java#L106
SIMD would be better, per our older SIMD experiments. Also, given that Lucene lacks that implementation, I am fine with removing this optimized cosine code for now.
I can see multiple versions of cosinesimil and cosineSimilarity. Let's move towards one version, where we use Lucene functions to do the distance calculations, and remove all the others.
Some callers use the optimized version and some don't. Let's clean things up and move towards one implementation.
Sure, I'll incorporate that with this PR then
IMHO, those who are very serious about performance will normalize their data during preprocessing and use inner product directly. So I think it's okay to leave the cosine functionality unchanged for now and focus this optimization on dot product and l2.
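The preprocessing approach mentioned here rests on the fact that, for unit-length vectors, the inner product equals the cosine similarity. A minimal sketch, assuming hypothetical helper names and plain Java rather than any plugin or Lucene API:

```java
// Sketch only: hypothetical helpers showing normalize-once, then dot product.
final class NormalizedDotProductSketch {
    private NormalizedDotProductSketch() {}

    /** Returns a copy of the vector scaled to unit L2 length. */
    static float[] l2Normalize(float[] v) {
        double sumOfSquares = 0.0;
        for (float x : v) {
            sumOfSquares += (double) x * x;
        }
        double norm = Math.sqrt(sumOfSquares);
        if (norm == 0.0) {
            throw new IllegalArgumentException("cannot normalize a zero vector");
        }
        float[] out = new float[v.length];
        for (int i = 0; i < v.length; i++) {
            out[i] = (float) (v[i] / norm);
        }
        return out;
    }

    /** Plain inner product of two vectors with the same dimension. */
    static float dotProduct(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vector dimensions differ");
        }
        float sum = 0.0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }
}
```

Normalizing each vector once at ingestion (and the query once per search) means scoring needs only the dot product, with no per-document norm computation.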
dotProduct += queryVector[i] * inputVector[i];
normQueryVector += queryVector[i] * queryVector[i];
normInputVector += inputVector[i] * inputVector[i];
int numZeroInInput = 0;
I think these are unnecessary: https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/VectorUtil.java#L79. Can we just do this?
That method would still return true for a zero vector, right?
I believe cosine will be infinite if one vector is finite
As long as we validate that it's not zero vector in the above method, we should be able to remove the other check because of the assert finite
Signed-off-by: Ryan Bogan <[email protected]>
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #1699      +/-   ##
============================================
- Coverage     84.93%   84.92%   -0.01%
+ Complexity     1460     1459       -1
============================================
  Files           177      178       +1
  Lines          5860     5879      +19
  Branches        597      594       -3
============================================
+ Hits           4977     4993      +16
- Misses          632      635       +3
  Partials        251      251
Signed-off-by: Ryan Bogan <[email protected]>
Lucene VectorUtil CosineSimilOptimized microbenchmarks
No changes:
Using VectorUtil:
@ryanbogan So what is the conclusion? Also, I see you are using a function named transferVectors_withCapacity. Is that a typo, where you didn't change the name of the function while running the benchmarks?
I just ran the base version of the microbenchmarks, which should run everything, right?
@ryanbogan No, cosinesimil is not covered by those. Let's just leave cosinesimilOptimized untouched for now. Users who are really concerned about performance should normalize vectors during ingestion and then use dotProduct.
This reverts commit f872d83. Signed-off-by: Ryan Bogan <[email protected]>
dcb4f47 to f5b76cf
LGTM
…ing exact search (#1699)
* Use the Lucene Distance Calculation Function in Script Scoring for doing exact search
* Add Changelog entry
* Fix failing test
* fix test
* Fix test bug and remove unnecessary validation
* Remove cosineSimilOptimized
* Revert "Remove cosineSimilOptimized" (this reverts commit f872d83)
Signed-off-by: Ryan Bogan <[email protected]>
(cherry picked from commit 7a88f40)
…ing exact search (#1699) (#1717)
* Use the Lucene Distance Calculation Function in Script Scoring for doing exact search
* Add Changelog entry
* Fix failing test
* fix test
* Fix test bug and remove unnecessary validation
* Remove cosineSimilOptimized
* Revert "Remove cosineSimilOptimized" (this reverts commit f872d83)
Signed-off-by: Ryan Bogan <[email protected]>
(cherry picked from commit 7a88f40)
Co-authored-by: Ryan Bogan <[email protected]>
Description
Continuation of #1287
This PR changes the implementation of our script scoring calculations to utilize Lucene's VectorUtil class.
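For context, here is a minimal sketch of the kind of hand-written loops that such a change replaces. In Lucene, the corresponding calls are VectorUtil.squareDistance(float[], float[]) and VectorUtil.dotProduct(float[], float[]). The class and method names below are hypothetical and are not taken from the plugin; plain Java is used so the sketch runs without Lucene on the classpath.

```java
// Sketch only: hypothetical names for the scalar loops that a
// VectorUtil-based implementation would delegate to Lucene instead.
final class ScriptScoringDistanceSketch {
    private ScriptScoringDistanceSketch() {}

    private static void requireSameLength(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vector dimensions differ");
        }
    }

    /** Squared Euclidean (L2) distance; counterpart of VectorUtil.squareDistance. */
    static float l2Squared(float[] query, float[] input) {
        requireSameLength(query, input);
        float sum = 0.0f;
        for (int i = 0; i < query.length; i++) {
            float diff = query[i] - input[i];
            sum += diff * diff;
        }
        return sum;
    }

    /** Inner product; counterpart of VectorUtil.dotProduct. */
    static float innerProduct(float[] query, float[] input) {
        requireSameLength(query, input);
        float sum = 0.0f;
        for (int i = 0; i < query.length; i++) {
            sum += query[i] * input[i];
        }
        return sum;
    }
}
```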
Issues Resolved
#1032
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.