
[Question] How is the cosine similarity distance calculated when approximate search is used? #325

Closed
yana2301 opened this issue Feb 22, 2021 · 7 comments

Comments

@yana2301

yana2301 commented Feb 22, 2021

Hi, could you please provide more details on how the cosine similarity distance is calculated when approximate search is used?

The documentation states the following:

"From the k-NN perspective, a lower score equates to a closer and better result. This is the opposite of how Elasticsearch scores results, where a greater score equates to a better result. To convert distances to Elasticsearch scores, we take 1 / (1 + distance)"

However, cosine similarity is the cosine of the angle between two vectors, and a higher value means the vectors are more similar. So:

angle = 0, cosine similarity = 1 - vectors are very similar
angle = 90, cosine similarity = 0 - vectors are not similar

So if the ES score were 1/(1 + cosine similarity), it would be lower for similar vectors and higher for less similar vectors.
Also, I tried a simple dataset with approximate search and I'm getting the following results:

query vector = [ 2.3,3.4,1.2]
vector in dataset = [1.5, 2.5, 1.2]

cosine similarity = (2.3*1.5 + 3.4*2.5 + 1.2*1.2) / (sqrt(2.3*2.3 + 3.4*3.4 + 1.2*1.2) * sqrt(1.5*1.5 + 2.5*2.5 + 1.2*1.2)) = 0.99307153144

ES score according to the documented formula = 1/(1 + 0.99307153144) = 0.50173813845
actual ES score = 0.45368767
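
For reference, here is that arithmetic as a quick Python check (just my reproduction of the numbers above, not the plugin's actual code path):

```python
import math

query = [2.3, 3.4, 1.2]
doc = [1.5, 2.5, 1.2]

dot = sum(q * d for q, d in zip(query, doc))
norm_query = math.sqrt(sum(q * q for q in query))
norm_doc = math.sqrt(sum(d * d for d in doc))

cosine_similarity = dot / (norm_query * norm_doc)
print(cosine_similarity)            # ~0.99307
print(1 / (1 + cosine_similarity))  # ~0.50174, per the documented 1 / (1 + distance)
```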

Could you please provide the exact formula for how the ES score is calculated?

@jmazanec15
Member

Hi @yana2301, yes, you are correct - the documentation is wrong. I will update it.

So nmslib, for its cosine similarity space, returns 1 - cosine similarity: https://github.com/nmslib/nmslib/blob/master/similarity_search/src/method/hnsw.cc#L80.

Here, in the k-NN plugin, we compute the Elasticsearch score: https://github.com/opendistro-for-elasticsearch/k-NN/blob/main/src/main/java/com/amazon/opendistroforelasticsearch/knn/index/KNNWeight.java#L113

So what's actually being computed is 1/(1 + (1 - cosine similarity)). Does that make sense?

I will correct the documentation to include this.
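
As a rough sketch of that pipeline (simplified Python, not the actual Java in KNNWeight):

```python
def nmslib_cosine_distance(cosine_similarity):
    # nmslib's cosine similarity space returns a distance of 1 - cosine similarity
    return 1 - cosine_similarity

def elasticsearch_score(distance):
    # the k-NN plugin then maps that distance to an Elasticsearch score
    return 1 / (1 + distance)

# with the cosine similarity from your example vectors
print(elasticsearch_score(nmslib_cosine_distance(0.99307153144)))  # ~0.9931
```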

@yana2301
Author

@jmazanec15 thanks for the response! But still, 1/(1 + (1 - 0.99307153144)) != 0.45368767

@yana2301
Author

Here are the schema, index data, and request that I use to get these results:

schema.txt
request.txt
bulk.txt

@jmazanec15
Member

Yes, correct. But the ordering is maintained, and that's what we want with the Elasticsearch score.

We can't actually make the Elasticsearch score exactly cosineSimilarity because Elasticsearch scores cannot be negative. However, for our scoring script, we take 1 + cosineSimilarity: https://github.com/opendistro-for-elasticsearch/k-NN/blob/main/src/main/java/com/amazon/opendistroforelasticsearch/knn/plugin/script/KNNScoringSpace.java#L102. We could do this for approximate k-NN search and it would be clearer, but the ordering would be the same as with 1/(1 + (1 - cosineSimilarity)), so we have not made that change.
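
As a rough illustration (plain Python, not plugin code), both mappings increase monotonically with the cosine similarity, so they rank documents identically:

```python
cosine_similarities = [-0.5, 0.0, 0.7, 0.9931, 1.0]

script_scores = [1 + c for c in cosine_similarities]              # scoring-script style, never negative
approx_scores = [1 / (1 + (1 - c)) for c in cosine_similarities]  # approximate k-NN style

# both mappings grow with cosine similarity, so the rankings are identical
ranking_by_script = sorted(range(len(cosine_similarities)), key=lambda i: script_scores[i], reverse=True)
ranking_by_approx = sorted(range(len(cosine_similarities)), key=lambda i: approx_scores[i], reverse=True)
assert ranking_by_script == ranking_by_approx
```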

@yana2301
Author

@jmazanec15 I've found the reason for this score calculation - #288.
I'm using ODFE 1.11, and I see that this has been fixed in 1.13. Since I'm using the Amazon Elasticsearch Service, I cannot migrate to version 1.13 quickly.
Could you please tell me whether there are any workarounds to make approximate search return the score according to the formula 1/(1 + (1 - cosineSimilarity)) in version 1.11?
The reason I need this is that we also specify min_score in the queries so that documents with low scores aren't returned. Our thresholds and vectors are built using the cosine similarity metric, and I'm afraid we would have to rebuild the whole model if the actual formula uses L2.
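
For context, once the score really follows 1/(1 + (1 - cosineSimilarity)), I would convert our cosine thresholds roughly like this (a hypothetical helper on our side, not something the plugin provides):

```python
def cosine_threshold_to_min_score(cosine_threshold):
    # hypothetical conversion, assuming the ES score is 1 / (1 + (1 - cosineSimilarity))
    return 1 / (1 + (1 - cosine_threshold))

print(cosine_threshold_to_min_score(0.9))  # ~0.909, the value we would pass as min_score
```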

@jmazanec15
Member

Oh apologies, I did not read the question correctly.

Yes, that fix has been backported to the Amazon Elasticsearch Service and ODFE 1.11 - #293.

To confirm your cluster has the fix, make sure you have taken the most recent software update.

Then, you will need to reindex your indices to rebuild the graphs using cosine similarity. Please let us know if there is any issue with this.

@yana2301
Author

Great, thanks for the clarification!
