-
Notifications
You must be signed in to change notification settings - Fork 56
Support k-NN similarity functions in painless scripting #281
Support k-NN similarity functions in painless scripting #281
Conversation
Painless scripting uses IndexFieldDataBuilder to load field data values. Hence, override fielddataBuilder to return instance of KNNVectorIndexFieldData.Builder. Subsequently, this builder is responsible for returning ScriptDocValues which will be used as argument in our Whitlisting knn custome methods.
Added new methods which will be whitelisted for users to call in source field for painless scripting.
Add Extension class and return list of whitelisted method. Update gradle to add dependencies to painless.
Following methods are whitelisted: @vamshin @jmazanec15 we can discuss here if we want to rename the method name for better reach |
Codecov Report
@@ Coverage Diff @@
## master #281 +/- ##
============================================
+ Coverage 79.27% 80.38% +1.11%
- Complexity 359 388 +29
============================================
Files 58 62 +4
Lines 1404 1458 +54
Branches 126 127 +1
============================================
+ Hits 1113 1172 +59
+ Misses 243 239 -4
+ Partials 48 47 -1
Continue to review full report at Codecov.
|
test annotation is forbidden
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments. Still need to dig into the test cases more. One note, once #222 is resolved, we will need to add a section on this.
src/main/java/com/amazon/opendistroforelasticsearch/knn/index/KNNVectorFieldMapper.java
Show resolved
Hide resolved
@@ -50,7 +90,7 @@ public static float l2Squared(float[] queryVector, float[] inputVector) { | |||
* @return cosine score | |||
*/ | |||
public static float cosinesimilOptimized(float[] queryVector, float[] inputVector, float normQueryVector) { | |||
float dotProduct = 0.0f; | |||
float dotProduct = 0.0f; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this indented?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACk. Will apply reformat code in next commit.
src/main/java/com/amazon/opendistroforelasticsearch/knn/plugin/script/KNNScoringUtil.java
Show resolved
Hide resolved
* @param normQueryVector normalized query vector value. | ||
* @return cosine score | ||
*/ | ||
public static float cosineSimilarityOptimized( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this interface is vague: cosineSimilarityOptimized does not give a user any information about how this function differs from cosineSimilarity. If cosineSimilarity can be optimized, why even provide cosineSimilarity?
The Cosine similarity is A . B / (||A|| x ||B||). The difference between the two functions is that cosineSimilarityOptimized has the user pass in both the query vector and the magnitude of the query vector. Assuming the query vector is A, ||A|| does not change throughout the query. So time is saved by computing ||A|| separately and then passing it into the function. Also, I think normQueryVector may not be the appropriate term. Normalization refers to the process of making a vector have a magnitude of one.
My proposal is to switch it to the following interface:
cosineSimilarity(List <Number> queryVector, KNNVectorScriptDocValues docValues, Number queryVectorMagnitude)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack
src/main/java/com/amazon/opendistroforelasticsearch/knn/plugin/script/KNNScoringUtil.java
Outdated
Show resolved
Hide resolved
src/main/java/com/amazon/opendistroforelasticsearch/knn/index/KNNVectorScriptDocValues.java
Outdated
Show resolved
Hide resolved
src/main/java/com/amazon/opendistroforelasticsearch/knn/index/KNNVectorScriptDocValues.java
Show resolved
Hide resolved
...test/java/com/amazon/opendistroforelasticsearch/knn/index/KNNVectorDVLeafFieldDataTests.java
Outdated
Show resolved
Hide resolved
src/test/java/com/amazon/opendistroforelasticsearch/knn/index/KNNVectorIndexFieldDataTests.java
Outdated
Show resolved
Hide resolved
...test/java/com/amazon/opendistroforelasticsearch/knn/index/KNNVectorScriptDocValuesTests.java
Outdated
Show resolved
Hide resolved
1. Made KNNScriptDocValues thread safe 2. reformat code 3. rename cosinesimiloptimized.
Add painless scripting score to verify whether following methods are available for users. 1. l2Squared 2. cosineSimilarity(queryVector,doc[field]) 3. cosineSimilarity(queryVector,doc[field],normalizedVector)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think test cases look pretty good. I added a few additional comments in implementation.
src/main/java/com/amazon/opendistroforelasticsearch/knn/index/KNNVectorFieldMapper.java
Show resolved
Hide resolved
|
||
public synchronized float[] getValue() throws IOException { | ||
if (!docExists) { | ||
throw new IllegalArgumentException("no value found for the corresponding doc ID"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why throw IllegalArgumentException? It seems like it maybe should be an IllegalStateException.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack
src/main/java/com/amazon/opendistroforelasticsearch/knn/index/KNNVectorScriptDocValues.java
Show resolved
Hide resolved
src/main/java/com/amazon/opendistroforelasticsearch/knn/index/KNNVectorDVLeafFieldData.java
Show resolved
Hide resolved
float[] knnDocVector; | ||
try { | ||
knnDocVector = docValues.getValue(); | ||
} catch (Exception e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we catch the specific exception here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
getValue can throw RuntimeException & IOException, i see three option here
- Let docValues throw exception as it is and catch all individual exception here and return Float.Min like now,
- Update getValue to throw only RuntimeException, and catch only RuntimeException here
- Just update catch (Exception e) to catch ( RuntimeException | IOException e )
i don't see any difference in all three since at the end outcome is same, but if you have strong opinion on one vs other, i can change it . if you see any other option i can make change it as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, Option 1 us good with me
src/main/java/com/amazon/opendistroforelasticsearch/knn/plugin/script/KNNScoringUtil.java
Show resolved
Hide resolved
src/main/java/com/amazon/opendistroforelasticsearch/knn/plugin/script/KNNScoringUtil.java
Show resolved
Hide resolved
src/main/java/com/amazon/opendistroforelasticsearch/knn/plugin/script/KNNScoringUtil.java
Outdated
Show resolved
Hide resolved
src/test/java/com/amazon/opendistroforelasticsearch/knn/KNNRestTestCase.java
Show resolved
Hide resolved
1. updated exception type. 2. Added test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All looks good to me!
src/main/java/com/amazon/opendistroforelasticsearch/knn/index/KNNVectorScriptDocValues.java
Show resolved
Hide resolved
float[] knnDocVector; | ||
try { | ||
knnDocVector = docValues.getValue(); | ||
} catch (Exception e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, Option 1 us good with me
|
||
@Override | ||
public void setNextDocId(int docId) throws IOException { | ||
synchronized (this) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need synchronized
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added thread safe to prevent any race condition based on feedback, since docExists setting is happening in one place and reading is happening on another place. If there is not threat with race condition then definitely we can remove this. At the end, it comes to do we need to make this instance thread safe or not? What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Do you have pointers to any other base classes extending ScriptDocValues<>
doing synchronization? My understanding is each search request should have its own instance and documents are iterated sequentially so synchronization should not be required. If you see any base class doing this we can definitely consider.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed synchronization.
|
||
@Override | ||
public int size() { | ||
synchronized (this) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need synchronized
here? Avoid synchronization in the places not needed as it could hamper performance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added so that docExists is synchronized. Since it is just one instruction, i think i can remove this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes please. I have not seen other places doing this. We can remove this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed
|
||
private Map<String, Float[]> getL2TestData() { | ||
Map<String, Float[]> data = new HashMap<>(); | ||
data.put("1", new Float[]{6.0f, 6.0f}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about we have a document without the vector and then assert that particular doc comes at the end of the result? Same for both CosineTestData
as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added test data.
1. Requires dimension must be consistent between query vector and input vector 2. Remove thread safe for KNNScriptDocValues 3. Throw exception if document doesn't contains vector field.
To avoid exception, user's can use size method to validate.
Include error message with additional details on how to fix the problem.
Add field name as part of error message
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Use scriptdoc to extract values from segment.
Use KNNScriptDocValues to retrieve float[] from the document instead of explicity retrieval.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks for adding painless support. This will help users to have more customization on the scoring.
*Issue #213
Implement IndexFieldDataBuilder to allow painless scripting to indenitfy KNNScriptDocValues
Add dependency to PainlessExtension to add custom scoring methods to be exposed for usage by users inside painless scripting.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.