
[Question] How is the cosine similarity distance calculated when approximate search is used? #325

Closed
yana2301 opened this issue Feb 22, 2021 · 7 comments

Comments

@yana2301

yana2301 commented Feb 22, 2021

Hi, could you please provide more details on how the cosine similarity distance is calculated when approximate search is used?

The documentation states the following:

"From the k-NN perspective, a lower score equates to a closer and better result. This is the opposite of how Elasticsearch scores results, where a greater score equates to a better result. To convert distances to Elasticsearch scores, we take 1 / (1 + distance)"

However, cosine similarity is the cosine of the angle between two vectors, and a higher value means the vectors are more similar. So:

angle = 0, cosine similarity = 1 - vectors are very similar
angle = 90, cosine similarity = 0 - vectors are not similar

So if the ES score were 1/(1 + cosine similarity), it would be lower for similar vectors and higher for less similar vectors.
Also, I tried a simple dataset with approximate search and I'm getting the following results:

query vector = [ 2.3,3.4,1.2]
vector in dataset = [1.5, 2.5, 1.2]

cosine similarity = (2.3*1.5 + 3.4*2.5 + 1.2*1.2) / (sqrt(2.3*2.3 + 3.4*3.4 + 1.2*1.2) * sqrt(1.5*1.5 + 2.5*2.5 + 1.2*1.2)) = 0.99307153144

ES score according to the documented formula = 1/(1 + 0.99307153144) = 0.50173813845
actual ES score = 0.45368767
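
For reference, here is that arithmetic as a quick Python check (just my reproduction of the numbers above, not the plugin's actual code path):

```python
import math

query = [2.3, 3.4, 1.2]
doc = [1.5, 2.5, 1.2]

dot = sum(q * d for q, d in zip(query, doc))
norm_query = math.sqrt(sum(q * q for q in query))
norm_doc = math.sqrt(sum(d * d for d in doc))

cosine_similarity = dot / (norm_query * norm_doc)
print(cosine_similarity)            # ~0.99307
print(1 / (1 + cosine_similarity))  # ~0.50174, per the documented 1 / (1 + distance)
```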

Could you please provide the exact formula for how the ES score is calculated?

@jmazanec15
Member

Hi @yana2301, yes, you are correct - the documentation is wrong. I will update it.

So nmslib, for its cosine similarity space, returns 1 - cosine similarity: https://github.com/nmslib/nmslib/blob/master/similarity_search/src/method/hnsw.cc#L80.

Here, in the k-NN plugin, we compute the Elasticsearch score: https://github.com/opendistro-for-elasticsearch/k-NN/blob/main/src/main/java/com/amazon/opendistroforelasticsearch/knn/index/KNNWeight.java#L113

So what's actually being computed is 1/(1 + (1 - cosine similarity)). Does that make sense?

I will correct the documentation to include this.
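
As a rough sketch of that pipeline (simplified Python, not the actual Java in KNNWeight):

```python
def nmslib_cosine_distance(cosine_similarity):
    # nmslib's cosine similarity space returns a distance of 1 - cosine similarity
    return 1 - cosine_similarity

def elasticsearch_score(distance):
    # the k-NN plugin then maps that distance to an Elasticsearch score
    return 1 / (1 + distance)

# with the cosine similarity from your example vectors
print(elasticsearch_score(nmslib_cosine_distance(0.99307153144)))  # ~0.9931
```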

@yana2301
Author

@jmazanec15 thanks for the response! But still, 1/(1 + (1 - 0.99307153144)) != 0.45368767

@yana2301
Author

Here are the schema, index data, and request that I use to get these results:

schema.txt
request.txt
bulk.txt

@jmazanec15
Member

Yes, correct. But the ordering is maintained, and that's what we want with the Elasticsearch score.

We can't actually make the Elasticsearch score exactly cosineSimilarity because Elasticsearch scores cannot be negative. However, for our scoring script, we take 1 + cosineSimilarity: https://github.com/opendistro-for-elasticsearch/k-NN/blob/main/src/main/java/com/amazon/opendistroforelasticsearch/knn/plugin/script/KNNScoringSpace.java#L102. We could do this for approximate k-NN search and it would be clearer, but the ordering would be the same as with 1/(1 + (1 - cosineSimilarity)), so we have not made that change.
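
As a rough illustration (plain Python, not plugin code), both mappings increase monotonically with the cosine similarity, so they rank documents identically:

```python
cosine_similarities = [-0.5, 0.0, 0.7, 0.9931, 1.0]

script_scores = [1 + c for c in cosine_similarities]              # scoring-script style, never negative
approx_scores = [1 / (1 + (1 - c)) for c in cosine_similarities]  # approximate k-NN style

# both mappings grow with cosine similarity, so the rankings are identical
ranking_by_script = sorted(range(len(cosine_similarities)), key=lambda i: script_scores[i], reverse=True)
ranking_by_approx = sorted(range(len(cosine_similarities)), key=lambda i: approx_scores[i], reverse=True)
assert ranking_by_script == ranking_by_approx
```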

@yana2301
Author

@jmazanec15 I've found the reason for this score calculation - #288.
I'm using ODFE 1.11, and I see that this has been fixed in 1.13. Since I'm using the Amazon Elasticsearch Service, I cannot migrate to version 1.13 quickly.
Could you please tell me whether there are any workarounds to make approximate search return the score according to the formula 1/(1 + (1 - cosineSimilarity)) in version 1.11?
The reason I need this is that we also specify min_score in the queries so that documents with low scores aren't returned. Our thresholds and vectors are built using the cosine similarity metric, and I'm afraid we would have to rebuild the whole model if the actual formula uses L2.
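
For context, once the score really follows 1/(1 + (1 - cosineSimilarity)), I would convert our cosine thresholds roughly like this (a hypothetical helper on our side, not something the plugin provides):

```python
def cosine_threshold_to_min_score(cosine_threshold):
    # hypothetical conversion, assuming the ES score is 1 / (1 + (1 - cosineSimilarity))
    return 1 / (1 + (1 - cosine_threshold))

print(cosine_threshold_to_min_score(0.9))  # ~0.909, the value we would pass as min_score
```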

@jmazanec15
Member

Oh apologies, I did not read the question correctly.

Yes, that fix has been backported to the Amazon Elasticsearch Service and ODFE 1.11 - #293.

To confirm your cluster has the fix, make sure you have taken the most recent software update.

Then, you will need to reindex your indices to rebuild the graphs using cosine similarity. Please let us know if there is any issue with this.

@yana2301
Author

Great, thanks for the clarification!
