-
Notifications
You must be signed in to change notification settings - Fork 24.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update the signature of vector script functions. #48604
Update the signature of vector script functions. #48604
Conversation
Pinging @elastic/es-search (:Search/Search) |
3510636
to
7a05972
Compare
7a05972
to
6330119
Compare
6330119
to
c6f3f6d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, one small ask in the tests
e = expectThrows(IllegalArgumentException.class, l2norm2::l2norm); | ||
assertThat(e.getMessage(), containsString("query vector has a different number of dimensions [2] than the document vectors [5]")); | ||
|
||
assertWarnings(ScoreScriptUtils.DEPRECATION_MESSAGE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we split these tests into separate test functions? Each should be asserting a warning is being emitted, instead of this one check at the end that only guarantees one of them had a deprecation warning?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this PR, I made sure to assertWarnings
after every function invocation. I plan to refactor this test in a follow-up PR (as it's a reasonably big change).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I opened a follow-up PR here: #48662
@@ -74,20 +105,21 @@ public void validateDocVector(BytesRef vector) { | |||
throw new IllegalArgumentException("The query vector has a different number of dimensions [" + | |||
queryVector.length + "] than the document vectors [" + vectorLength + "]."); | |||
} | |||
return vector; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not related to this PR, but I think we can do the vectorLength validation on lines 103-108 just for the first doc, and then set a flag validated
= true, and skip this validations for further docs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this something that needs to be checked for each doc? Do we gaurantee at indexing time all vector fields have the same length?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rjernst yes, we guarantee at indexing time that all vector fields have the same length. Specifically, there is a mapping parameter dims
, and all vector fields must match that number of dimensions.
Earlier I ran benchmarks to measure the cost of length validation, and it was very small (couldn't detect it in the results). So I'm not sure it's worth adding an extra guard + check here for performance. But if the scripting framework had a way to perform a one-time validation, it could make this code a bit cleaner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jtibshirani Thanks Julie. I assume this PR targets 7.6. For 8.0 we want to remove the outdated versions of vector script functions and keep only functions with signatures function(org.elasticsearch.script.ScoreScript, List, String)
?
That's right. The way I typically do deprecations is to merge the deprecation into master, backport to 7.x, then follow-up on master with the removal. |
Thanks @rjernst and @mayya-sharipova for the reviews. |
Follow up to #48604. This PR removes the deprecated vector function signatures of the form `cosineSimilarity(query, doc['field'])`.
Previously the functions accepted a doc values reference, whereas they now
accept the name of the vector field. Here's an example of how a vector function
was called before and after the change.
This seems more intuitive, since we don't allow direct access to vector doc
values and the the meaning of
doc['field']
is unclear.The PR makes the following changes (broken into distinct commits):
function(params.query_vector, 'field')
and deprecates the old ones. Because Painless doesn't allow twomethods with the same name and number of arguments, we allow a generic
Object
to be passed in to the function and decide on the behavior through an
instanceof
check.constructor instead of the instance method. This allows us to avoid retrieving
the vector doc values on every function invocation, which gives a tiny speed-up
in benchmarks.
Note that this PR adds new signatures for the sparse vector functions too, even
though sparse vectors are deprecated. It seemed simplest to understand (for both
us and users) to keep everything symmetric between dense and sparse vectors.