Normalizing SPLADE embeddings - a bad idea? #34

Closed
adri1wald opened this issue Mar 30, 2023 · 3 comments


adri1wald commented Mar 30, 2023

Hi!

I'm using SPLADE together with the sentence-transformers/multi-qa-mpnet-base-cos-v1 SentenceTransformer to create hybrid embeddings for use in Pinecone's sparse-dense indexes.

The sparse-dense indexes only support dot product similarity, which is why I chose a dense model trained with cosine similarity. This means I get back dense embeddings with an L2 norm of 1 and dot product similarity in the range [-1, 1], which I can easily rescale to the unit interval. Based on my somewhat limited understanding, this seems like a relatively sound approach to getting scores that our users can understand as % similarity (assuming in-distribution inputs).
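For concreteness, the rescaling I have in mind looks like this (a minimal sketch; the function name is mine):

# For unit-norm embeddings, the dot product equals cosine similarity
# and lies in [-1, 1], so an affine map takes it onto [0, 1].
def percent_similarity(query_vec, doc_vec):
    score = sum(q * d for q, d in zip(query_vec, doc_vec))  # in [-1, 1]
    return (score + 1.0) / 2.0                              # in [0, 1]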

After transitioning to sparse-dense vectors, I noticed that SPLADE does not produce normalized embeddings, which means this approach no longer works. I thought about normalizing the SPLADE embeddings, but I'm not sure how this would affect performance.
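To be explicit, by normalizing I mean rescaling the non-zero values to unit L2 norm, along these lines (a minimal sketch; the values are made up):

import math

# Hypothetical SPLADE output: the non-zero values of the sparse vector.
sparse_values = [1.2, 0.4, 3.1]
norm = math.sqrt(sum(v * v for v in sparse_values))  # L2 norm
sparse_values = [v / norm for v in sparse_values]    # now unit-norm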

On a separate note, I'm using Pinecone's convex combination

# alpha in [0, 1]: alpha = 0 keeps only the sparse score,
# alpha = 1 keeps only the dense score.
embedding.sparse.values = [
    value * (1 - alpha) for value in embedding.sparse.values
]
embedding.dense = [value * alpha for value in embedding.dense]

I am struggling to reason about how all of this interacts and what effect it has on ranking. See here for info on how Pinecone's score is calculated and here for more details about their convex combination logic.
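For what it's worth, my current mental model is the sketch below, assuming the convex combination is applied to the query side only and that the index sums the sparse and dense dot products (the dict-based sparse representation is mine):

def hybrid_score(query, doc, alpha):
    # Sparse dot product over shared indices (dicts mapping index -> value).
    sparse = sum(v * doc["sparse"].get(i, 0.0)
                 for i, v in query["sparse"].items())
    # Dense dot product.
    dense = sum(a * b for a, b in zip(query["dense"], doc["dense"]))
    # Scaling the query as above makes the combined score equivalent to:
    return (1 - alpha) * sparse + alpha * dense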

Any help understanding this stuff would be hugely appreciated 🙌

Cheers!

@mu4farooqi

Although it's usually recommended to use the same similarity metric as was used in training, if you look at SPLADE's transformers wrapper, you'll see it deliberately supports cosine similarity.

@thibault-formal
Contributor

Hi @adri1wald,
If you try to normalize SPLADE embeddings after training, this won't work (as pointed out by @mu4farooqi).

We do indeed support cosine similarity, but this is more a legacy of our initial experiments with dense models. I remember trying some normalization schemes for SPLADE at some point (as part of training), and the results were not so good.

Hope it helps!

@thibault-formal
Contributor

Closing the issue, feel free to re-open!
