Hi!

I'm using SPLADE together with the sentence-transformers/multi-qa-mpnet-base-cos-v1 SentenceTransformer to create hybrid embeddings for use in Pinecone's sparse-dense indexes.
The sparse-dense indexes can only use dotproduct similarity, which is why I chose a dense model trained with cosine similarity. This means I get back dense embeddings with an L2 norm of 1 and dot-product similarities in the range [-1, 1], which I can easily rescale to the unit interval. Based on my somewhat limited understanding, this seems like a relatively sound approach to getting scores that our users can understand as a % similarity (assuming queries stay in distribution).
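For concreteness, a minimal sketch of the rescaling described above (the query/document strings and variable names are just placeholders):

from sentence_transformers import SentenceTransformer

# The cos-v1 model yields unit-norm embeddings, so dot product == cosine in [-1, 1].
model = SentenceTransformer("sentence-transformers/multi-qa-mpnet-base-cos-v1")
query_vec, doc_vec = model.encode(
    ["how do sparse-dense indexes work?", "Pinecone sparse-dense indexes use dotproduct similarity."],
    normalize_embeddings=True,  # safeguard; this model already normalizes its output
)

dot = float((query_vec * doc_vec).sum())  # cosine similarity, in [-1, 1]
pct_similarity = (dot + 1.0) / 2.0        # linear rescale to [0, 1]
print(f"{pct_similarity:.1%}")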
After transitioning to sparse-dense vectors, I noticed that SPLADE does not produce normalized embeddings, which means this approach no longer works. I thought about normalizing the SPLADE embeddings, but I'm not sure how this would affect performance.
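To be concrete about what normalizing would mean here, a minimal sketch of L2-normalizing the sparse values in Pinecone's {indices, values} format (toy numbers and my own variable names, not production code):

import math

# Toy SPLADE output in Pinecone's sparse format: vocabulary indices plus non-zero weights.
sparse = {"indices": [1012, 2054, 16012], "values": [0.7, 1.9, 0.3]}

# Divide every non-zero weight by the vector's L2 norm so the sparse vector has
# unit length and its dot products behave like cosine similarities.
norm = math.sqrt(sum(v * v for v in sparse["values"]))
if norm > 0.0:
    sparse["values"] = [v / norm for v in sparse["values"]]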
On a separate note, I'm weighting the sparse and dense components with Pinecone's convex combination:
# alpha in range [0, 1]
embedding.sparse.values = [
    value * (1 - alpha) for value in embedding.sparse.values
]
embedding.dense = [value * alpha for value in embedding.dense]

I am struggling to reason about how all of this interacts and what effect it has on ranking. See Pinecone's docs for how the score is calculated and for more details about their convex combination logic.

Any help understanding this stuff would be hugely appreciated 🙌

Cheers!
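For reference, a rough sketch of how the alpha weighting in the snippet above carries through to the final score. Pinecone's documentation describes the sparse-dense score as the sum of the sparse and dense dot products, and the dot product is linear, so weighting the vectors on one side (say, the query) is the same as weighting the two score components. The helper below is illustrative, not Pinecone's implementation:

def hybrid_score(q_dense, d_dense, q_sparse, d_sparse, alpha):
    # q_sparse / d_sparse are {index: weight} dicts; q_dense / d_dense are equal-length lists.
    dense_dot = sum(q * d for q, d in zip(q_dense, d_dense))
    sparse_dot = sum(w * d_sparse.get(i, 0.0) for i, w in q_sparse.items())
    # Scaling the query's dense part by alpha and its sparse part by (1 - alpha)
    # before taking the dot products yields exactly this weighted sum:
    return alpha * dense_dot + (1 - alpha) * sparse_dot

One caveat worth noting: if the same scaling is applied to both the stored document vectors and the query vectors, the effective weights become alpha^2 and (1 - alpha)^2, which shifts the sparse/dense balance compared to weighting only one side.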
Although it's usually recommended to use the same similarity metric as was used during training, if you look at SPLADE's transformers wrapper you'll see that it deliberately supports cosine similarity.
Hi @adri1wald,

If you try to normalize SPLADE embeddings after training, this won't work (as pointed out by @mu4farooqi).
We indeed support cosine similarity -- but this is more a legacy of our initial experiments with dense models. I remember trying some normalization schemes for SPLADE at some point (as part of training), and the results were not so good.