[feat] Add truncation support #2573

Merged
merged 7 commits into from
Apr 8, 2024

Conversation

kddubey
Contributor

@kddubey kddubey commented Apr 3, 2024

Hello,

This PR follows up on the discussion in #2564.

The implementation adds an optional SentenceTransformer instance attribute, output_dim, so that:

  1. the dimension of model.encode(texts) agrees with model.get_sentence_embedding_dimension()
  2. the user doesn't need to change every model.encode(texts) call to achieve truncation
  3. no changes need to be made to third-party code that internally calls model.encode(texts) (e.g., the EmbeddingSimilarityEvaluator) to achieve truncation.

In case a user wants to change the truncation dimension on the fly, they can use these new utilities:

  • model.truncate_sentence_embeddings: a context manager that sets and then resets the model's truncation dimension, output_dim. This preserves the three benefits above.
  • util.truncate_embeddings: a simple truncation function that works whether the input embeddings are 1-D or 2-D. It may be useful because Ellipsis (...) slicing is slightly niche knowledge.
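For illustration, the Ellipsis-slicing trick mentioned above can be sketched in plain NumPy. This is a minimal standalone sketch of the idea, not the PR's actual implementation:

```python
import numpy as np

def truncate_embeddings(embeddings: np.ndarray, truncate_dim: int) -> np.ndarray:
    # `...` (Ellipsis) stands in for any number of leading dimensions,
    # so the same slice works for a single 1-D embedding and a 2-D batch.
    return embeddings[..., :truncate_dim]

single = np.random.rand(768)      # one embedding
batch = np.random.rand(5, 768)    # a batch of embeddings

print(truncate_embeddings(single, 256).shape)  # (256,)
print(truncate_embeddings(batch, 256).shape)   # (5, 256)
```

Without the Ellipsis, handling both shapes would need an explicit dimensionality check.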

How has this code been tested?

New unit test:

pytest tests/test_sentence_transformer.py -k test_encode_truncate -x

It adds ~30 sec. to the testing workflow.

Collaborator

@tomaarsen tomaarsen left a comment

Great work! This is a really strong PR. I've made some comments, though I believe only 1 warrants a discussion (regarding truncate_dim).

Review threads (resolved):
  • examples/training/matryoshka/README.md
  • sentence_transformers/SentenceTransformer.py (multiple threads)
Collaborator

@tomaarsen tomaarsen left a comment

I think this is looking solid now. I'll experiment with these changes locally myself to verify that it all works as expected, and perhaps I'll see if it also works correctly with some of our third party applications (langchain, llamaindex, etc.). I'll also double-check that the new methods are correctly included in the built docs.

@tomaarsen
Collaborator

tomaarsen commented Apr 4, 2024

Another note, it might make sense to add support for truncate_dim here as well. Then users can write:

evaluators = []
for truncate_dim in [64, 128, 256, 512, 768]:
    evaluators.append(evaluation.EmbeddingSimilarityEvaluator(
        stsb_dev["sentence1"],
        stsb_dev["sentence2"],
        stsb_dev["score"],
        name=f"sts-dev-{truncate_dim}",
        truncate_dim=truncate_dim,
    ))
evaluator = evaluation.SequentialEvaluator(evaluators)

to get evaluations during training at different matryoshka dimensions. There might be other evaluators where it could be applied, but EmbeddingSimilarityEvaluator is most commonly used I believe. What do you think?

  • Tom Aarsen

@kddubey
Contributor Author

kddubey commented Apr 4, 2024

Interesting, first time I've seen SequentialEvaluator. Given that, a truncate_dim parameter would be necessary. I'll add it to the EmbeddingSimilarityEvaluator and to other evaluators where I can see model.encode getting called.

Update: 9606521 implements truncation for EmbeddingSimilarityEvaluator

Quick question: in the __call__ signature of evaluators (example), the model isn't type-annotated. Is this b/c we shouldn't assume it's a SentenceTransformer, or b/c importing SentenceTransformer here causes an ImportError I'm not seeing, or it just hasn't been type-annotated yet? If it's the last reason, should I add the type annotation now?

@tomaarsen
Collaborator

It just hasn't been type-annotated, though I wouldn't be surprised if adding the import results in a circular import loop. In that case I tend to use

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from sentence_transformers import SentenceTransformer

...

    def __call__(self, model: "SentenceTransformer", ...

Feel free to add the type annotations, but you don't have to :) I can always do them later.

  • Tom Aarsen

@kddubey
Contributor Author

kddubey commented Apr 5, 2024

Feel free to add the type annotations, but you don't have to :) I can always do them later.

Sounds good. I tested that there isn't a circular import. But I'll keep the changes in this PR limited to truncation.

@kddubey
Contributor Author

kddubey commented Apr 5, 2024

I see that precision/quantization hasn't yet been added to other evaluators. Maybe in a future PR, support for precision and truncate_dim can be added. Lmk if you'd instead prefer that this PR adds support for truncate_dim in all evaluators.

@tomaarsen
Collaborator

I see that precision/quantization hasn't yet been added to other evaluators. Maybe in a future PR, support for precision and truncate_dim can be added. Lmk if you'd instead prefer that this PR adds support for truncate_dim in all evaluators.

That is true, also because quantized embeddings (e.g. binary or int8) can't always be compared in the same way as regular float32 embeddings. E.g. for comparing binary embeddings, you normally do Hamming distance rather than cosine similarity. That complicates things somewhat for the evaluators.
Matryoshka-style truncation is a bit simpler, luckily.
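To illustrate the point about binary embeddings, here is a minimal sketch of Hamming distance between two binary (0/1) vectors. The function name and the tiny example vectors are hypothetical, not from sentence-transformers:

```python
import numpy as np

def hamming_distance(a: np.ndarray, b: np.ndarray) -> int:
    # Hamming distance counts the positions where the two binary
    # vectors differ; cosine similarity is not meaningful here.
    return int(np.count_nonzero(a != b))

a = np.array([1, 0, 1, 1, 0], dtype=np.uint8)
b = np.array([1, 1, 1, 0, 0], dtype=np.uint8)
print(hamming_distance(a, b))  # -> 2 (they differ at indices 1 and 3)
```

By contrast, Matryoshka-style truncated embeddings remain float vectors, so the usual cosine-similarity-based evaluation still applies unchanged.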

I think this PR should be kept as-is, and then we can extend the truncate_dim to other evaluators in future PRs.

  • Tom Aarsen

@tomaarsen tomaarsen merged commit 4357018 into UKPLab:master Apr 8, 2024
9 checks passed
@tomaarsen
Collaborator

Big thanks for this excellent community PR!

@kddubey
Contributor Author

kddubey commented Apr 8, 2024

E.g. for comparing binary embeddings, you normally do Hamming distance rather than cosine similarity.

Ohh interesting, good to know

Big thanks for this excellent community PR!

You're welcome! And thank you for maintaining sentence-transformers :-)

@kddubey kddubey deleted the truncate-output-dims branch April 8, 2024 18:32
@kddubey
Contributor Author

kddubey commented Apr 8, 2024

I think this PR should be kept as-is, and then we can extend the truncate_dim to other evaluators in future PRs.

I'll work on adding truncate_dim to the other evaluators, and I'll add the SentenceTransformer type annotations in a separate PR.
