[Question] How is cosine similarity distance calculated when using approximate search? #325
Comments
Hi @yana2301, yes, you are correct: the documentation is wrong. I will update it. nmslib, for its cosine similarity space, returns 1 - cosine similarity. Here, in the k-NN plugin, we compute the Elasticsearch score: https://github.com/opendistro-for-elasticsearch/k-NN/blob/main/src/main/java/com/amazon/opendistroforelasticsearch/knn/index/KNNWeight.java#L113 So what's actually being computed is 1 / (1 + (1 - cosine similarity)). I will correct the documentation to include this.
@jmazanec15 thanks for the response! But still, 1/(1 + (1 - 0.99307153144)) != 0.45368767
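The inequality above can be checked numerically. The following is a minimal Python sketch, not plugin code: the helper names `cosine_similarity` and `score_from_distance` are made up for illustration, and the vectors and the observed score are the ones reported in this issue.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def score_from_distance(distance):
    """Distance-to-score conversion quoted from the docs: 1 / (1 + distance)."""
    return 1.0 / (1.0 + distance)

query = [2.3, 3.4, 1.2]
doc = [1.5, 2.5, 1.2]
observed_score = 0.45368767  # score reported in this issue

cos = cosine_similarity(query, doc)
# Corrected formula from the thread: the distance is 1 - cosine similarity.
corrected_score = score_from_distance(1.0 - cos)

print(f"cosine similarity: {cos:.6f}")
print(f"corrected score:   {corrected_score:.6f}")
print(f"observed score:    {observed_score:.6f}")
```

With these inputs the corrected formula gives a score close to 1, so it does not explain the observed value either.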
Here's the schema, index data and request that I use to get such results
Yes, correct. But ordering is maintained, and that's what we want with the Elasticsearch score. We can't make the Elasticsearch score exactly equal to cosineSimilarity because Elasticsearch scores cannot be negative. However, for our scoring script, we take
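The ordering claim can be illustrated with a short Python sketch. It assumes the formula discussed in this thread, 1 / (1 + (1 - cosine similarity)); the function name `score_from_cosine` and the sample cosine values are hypothetical.

```python
def score_from_cosine(cos_sim):
    """Score discussed in this thread: 1 / (1 + (1 - cosine similarity))."""
    return 1.0 / (1.0 + (1.0 - cos_sim))

# Sample cosine similarities from least similar (-1) to most similar (1).
cosines = [-1.0, -0.5, 0.0, 0.5, 0.99, 1.0]
scores = [score_from_cosine(c) for c in cosines]

for c, s in zip(cosines, scores):
    print(f"cos = {c:5.2f} -> score = {s:.4f}")

# A higher cosine similarity always yields a higher score, and no score is negative.
print("order preserved:", scores == sorted(scores))
print("all positive:   ", all(s > 0 for s in scores))
```

Over the full cosine range [-1, 1] the score stays within [1/3, 1], so ranking by score matches ranking by cosine similarity while the score never goes negative.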
@jmazanec15 I've found the reason for this metric calculation: #288.
Oh, apologies, I did not read the question correctly. Yes, that fix has been backported to Amazon Elasticsearch Service and ODFE 1.11 (#293). To make sure your cluster has the fix, make sure you have taken the most recent software upgrade. Then you will need to reindex your indices to rebuild the graphs using cosine similarity. Please let us know if there are any issues with this.
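As an arithmetic observation that is not stated anywhere in this thread: the score reported in this issue, 0.45368767, matches 1 / (1 + d) when d is the Euclidean (L2) distance between the two vectors from the question. That would be consistent with the graph having been built without the cosine similarity space before the fix, but attributing it to L2 specifically is an assumption. A minimal Python check:

```python
import math

query = [2.3, 3.4, 1.2]
doc = [1.5, 2.5, 1.2]
observed_score = 0.45368767  # score reported in this issue

# Euclidean (L2) distance between the two vectors.
l2_distance = math.sqrt(sum((x - y) ** 2 for x, y in zip(query, doc)))

# Same distance-to-score conversion as quoted from the docs: 1 / (1 + distance).
l2_score = 1.0 / (1.0 + l2_distance)

print(f"L2 distance:    {l2_distance:.6f}")
print(f"score from L2:  {l2_score:.8f}")
print(f"observed score: {observed_score:.8f}")
```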
Great, thanks for the clarification!
Hi, could you please provide more details on how cosine similarity distance is calculated when an approximate search is used?
The documentation states the following:
"From the k-NN perspective, a lower score equates to a closer and better result. This is the opposite of how Elasticsearch scores results, where a greater score equates to a better result. To convert distances to Elasticsearch scores, we take 1 / (1 + distance)"
However, cosine similarity is the cosine of the angle between two vectors, so a higher value means the vectors are more similar. So:
angle = 0 cosine similarity = 1 - vectors are very similar
angle = 90 cosine similarity = 0 - vectors are not very similar
So if ES score = 1/(1 + cosine similarity), it will have a lower value for similar vectors and a higher value for less similar vectors.
Also, I tried a simple dataset with approximate search, and I'm getting the following results:
query vector = [ 2.3,3.4,1.2]
vector in dataset = [1.5, 2.5, 1.2]
cosine similarity =
(2.3*1.5 + 3.4*2.5 + 1.2*1.2) / (sqrt(2.3*2.3 + 3.4*3.4 + 1.2*1.2) * sqrt(1.5*1.5 + 2.5*2.5 + 1.2*1.2)) = 0.99307153144
ES score according to formula = 1/(1+0.99307153144) = 0.50173813845
actual ES score = 0.45368767
Could you please provide an exact formula for how ES score is calculated?
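The arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the calculation in this question, not of what the plugin does; the variable names are chosen for illustration.

```python
import math

query = [2.3, 3.4, 1.2]
doc = [1.5, 2.5, 1.2]

# Cosine similarity: dot product divided by the product of the vector norms.
dot = sum(q * d for q, d in zip(query, doc))
norm_query = math.sqrt(sum(q * q for q in query))
norm_doc = math.sqrt(sum(d * d for d in doc))
cos_sim = dot / (norm_query * norm_doc)

# Score if the documented formula 1 / (1 + distance) is applied
# with cosine similarity plugged in directly as the "distance".
documented_score = 1.0 / (1.0 + cos_sim)

print(f"cosine similarity:      {cos_sim:.8f}")
print(f"1 / (1 + cos_sim):      {documented_score:.8f}")
print("actual ES score (seen): 0.45368767")
```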