[feat] Add truncation support #2573

Merged
merged 7 commits into from
Apr 8, 2024

Conversation

kddubey
Contributor

@kddubey kddubey commented Apr 3, 2024

Hello,

This PR follows up on the discussion in #2564.

The implementation adds an optional SentenceTransformer instance attribute, output_dim, so that:

  1. the dimension of model.encode(texts) agrees with model.get_sentence_embedding_dimension()
  2. the user doesn't need to change every model.encode(texts) call to achieve truncation
  3. no changes need to be made to third-party code that internally calls model.encode(texts) (e.g., the EmbeddingSimilarityEvaluator) to achieve truncation.

In case a user wants to change the truncation dimension on the fly, they can use these new utilities:

  • model.truncate_sentence_embeddings: a context manager that sets and then resets the model's truncation dimension, output_dim. This preserves the three benefits above.
  • util.truncate_embeddings: a simple truncation function that works whether the input embeddings are 1-D or 2-D. It may be useful because Ellipsis (...) slicing is slightly niche knowledge.
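For illustration, the Ellipsis-slicing trick mentioned above can be sketched in plain NumPy. This is a minimal standalone sketch of the idea, not the PR's actual implementation:

```python
import numpy as np

def truncate_embeddings(embeddings: np.ndarray, truncate_dim: int) -> np.ndarray:
    # `...` (Ellipsis) stands in for any number of leading dimensions,
    # so the same slice works for a single 1-D embedding and a 2-D batch.
    return embeddings[..., :truncate_dim]

single = np.random.rand(768)      # one embedding
batch = np.random.rand(5, 768)    # a batch of embeddings

print(truncate_embeddings(single, 256).shape)  # (256,)
print(truncate_embeddings(batch, 256).shape)   # (5, 256)
```

Without the Ellipsis, handling both shapes would need an explicit dimensionality check.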

How has this code been tested?

New unit test:

pytest tests/test_sentence_transformer.py -k test_encode_truncate -x

It adds ~30 sec. to the testing workflow.

Collaborator

@tomaarsen tomaarsen left a comment

Great work! This is a really strong PR. I've made some comments, though I believe only 1 warrants a discussion (regarding truncate_dim).

Review threads (resolved):
  • examples/training/matryoshka/README.md
  • sentence_transformers/SentenceTransformer.py (multiple threads)
Collaborator

@tomaarsen tomaarsen left a comment

I think this is looking solid now. I'll experiment with these changes locally myself to verify that it all works as expected, and perhaps I'll see if it also works correctly with some of our third party applications (langchain, llamaindex, etc.). I'll also double-check that the new methods are correctly included in the built docs.

@tomaarsen
Collaborator

tomaarsen commented Apr 4, 2024

Another note, it might make sense to add support for truncate_dim here as well. Then users can write:

evaluators = []
for truncate_dim in [64, 128, 256, 512, 768]:
    evaluators.append(evaluation.EmbeddingSimilarityEvaluator(
        stsb_dev["sentence1"],
        stsb_dev["sentence2"],
        stsb_dev["score"],
        name=f"sts-dev-{truncate_dim}",
        truncate_dim=truncate_dim,
    ))
evaluator = evaluation.SequentialEvaluator(evaluators)

to get evaluations during training at different matryoshka dimensions. There might be other evaluators where it could be applied, but EmbeddingSimilarityEvaluator is most commonly used I believe. What do you think?

  • Tom Aarsen

@kddubey
Contributor Author

kddubey commented Apr 4, 2024

Interesting, first time I've seen SequentialEvaluator. Given that, a truncate_dim parameter would be necessary. I'll add it to the EmbeddingSimilarityEvaluator and to other evaluators where I can see model.encode getting called.

Update: 9606521 implements truncation for EmbeddingSimilarityEvaluator

Quick question: in the __call__ signature of evaluators (example), the model isn't type-annotated. Is this b/c we shouldn't assume it's a SentenceTransformer, or b/c importing SentenceTransformer here causes an ImportError I'm not seeing, or it just hasn't been type-annotated yet? If it's the last reason, should I add the type annotation now?

@tomaarsen
Collaborator

It just hasn't been type-annotated, though I wouldn't be surprised if adding the import results in a circular import loop. In that case I tend to use

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from sentence_transformers import SentenceTransformer

...

    def __call__(self, model: "SentenceTransformer", ...

Feel free to add the type annotations, but you don't have to :) I can always do them later.

  • Tom Aarsen

@kddubey
Contributor Author

kddubey commented Apr 5, 2024

Feel free to add the type annotations, but you don't have to :) I can always do them later.

Sounds good. I tested that there isn't a circular import. But I'll keep the changes in this PR limited to truncation.

@kddubey
Contributor Author

kddubey commented Apr 5, 2024

I see that precision/quantization hasn't yet been added to other evaluators. Maybe in a future PR, support for precision and truncate_dim can be added. Lmk if you'd instead prefer that this PR adds support for truncate_dim in all evaluators.

@tomaarsen
Collaborator

I see that precision/quantization hasn't yet been added to other evaluators. Maybe in a future PR, support for precision and truncate_dim can be added. Lmk if you'd instead prefer that this PR adds support for truncate_dim in all evaluators.

That is true, also because quantized embeddings (e.g. binary or int8) can't always be compared in the same way as regular float32 embeddings. E.g. for comparing binary embeddings, you normally do Hamming distance rather than cosine similarity. That complicates things somewhat for the evaluators.
Matryoshka-style truncation is a bit simpler, luckily.
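To illustrate the point about binary embeddings, here is a minimal sketch of Hamming distance between two binary (0/1) vectors. The function name and the tiny example vectors are hypothetical, not from sentence-transformers:

```python
import numpy as np

def hamming_distance(a: np.ndarray, b: np.ndarray) -> int:
    # Hamming distance counts the positions where the two binary
    # vectors differ; cosine similarity is not meaningful here.
    return int(np.count_nonzero(a != b))

a = np.array([1, 0, 1, 1, 0], dtype=np.uint8)
b = np.array([1, 1, 1, 0, 0], dtype=np.uint8)
print(hamming_distance(a, b))  # -> 2 (they differ at indices 1 and 3)
```

By contrast, Matryoshka-style truncated embeddings remain float vectors, so the usual cosine-similarity-based evaluation still applies unchanged.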

I think this PR should be kept as-is, and then we can extend the truncate_dim to other evaluators in future PRs.

  • Tom Aarsen

@tomaarsen tomaarsen merged commit 4357018 into UKPLab:master Apr 8, 2024
9 checks passed
@tomaarsen
Collaborator

Big thanks for this excellent community PR!

@kddubey
Contributor Author

kddubey commented Apr 8, 2024

E.g. for comparing binary embeddings, you normally do Hamming distance rather than cosine similarity.

Ohh interesting, good to know

Big thanks for this excellent community PR!

You're welcome! And thank you for maintaining sentence-transformers :-)

@kddubey kddubey deleted the truncate-output-dims branch April 8, 2024 18:32
@kddubey
Contributor Author

kddubey commented Apr 8, 2024

I think this PR should be kept as-is, and then we can extend the truncate_dim to other evaluators in future PRs.

I'll work on adding truncate_dim to the other evaluators, and I'll add the SentenceTransformer type annotations in a separate PR.
