[feat] Add truncation support #2573
Conversation
Great work! This is a really strong PR. I've made some comments, though I believe only 1 warrants a discussion (regarding `truncate_dim`).
I think this is looking solid now. I'll experiment with these changes locally myself to verify that it all works as expected, and perhaps I'll see if it also works correctly with some of our third party applications (langchain, llamaindex, etc.). I'll also double-check that the new methods are correctly included in the built docs.
Another note, it might make sense to add support for

```python
evaluators = []
for truncate_dim in [64, 128, 256, 512, 768]:
    evaluators.append(evaluation.EmbeddingSimilarityEvaluator(
        stsb_dev["sentence1"],
        stsb_dev["sentence2"],
        stsb_dev["score"],
        name=f"sts-dev-{truncate_dim}",
        truncate_dim=truncate_dim,
    ))
evaluator = evaluation.SequentialEvaluator(evaluators)
```

to get evaluations during training at different matryoshka dimensions. There might be other evaluators where it could be applied, but …
Interesting, first time I've seen … Update: 9606521 implements truncation for … Quick question: in the …
It just hasn't been type annotated, though I wouldn't be surprised if it results in a circular import loop. Then I tend to use

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from sentence_transformers import SentenceTransformer

...

def __call__(self, model: "SentenceTransformer", ...):
```

Feel free to add the type annotations, but you don't have to :) I can always do them later.
Sounds good. I tested that there isn't a circular import. But I'll keep the changes in this PR limited to truncation.
I see that …
That is true, also because quantized embeddings (e.g. binary or int8) can't always be compared in the same way as regular float32 embeddings. E.g. for comparing binary embeddings, you normally do Hamming distance rather than cosine similarity. That complicates things somewhat for the evaluators. I think this PR should be kept as-is, and then we can extend the truncate_dim to other evaluators in future PRs.
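To illustrate that point with a toy sketch (plain NumPy, not code from this PR or from sentence-transformers; real quantized embeddings are usually bit-packed rather than stored as 0/1 vectors):

```python
import numpy as np

# Two toy "binary embeddings" as 0/1 vectors (a simplified representation).
a = np.array([1, 0, 1, 1, 0, 0, 1, 0])
b = np.array([1, 1, 1, 0, 0, 0, 1, 0])

# Hamming distance: count of differing bits -- the natural metric for
# binary codes.
hamming = int(np.sum(a != b))

# Cosine similarity treats the same bits as float coordinates, which is a
# less meaningful way to compare binary codes.
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

An evaluator that supported quantized embeddings would have to switch metrics like this per precision, which is part of why keeping the evaluators out of scope here is reasonable.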
Big thanks for this excellent community PR!
Ohh interesting, good to know
You're welcome! And thank you for maintaining sentence-transformers :-)
I'll work on adding …
Hello,
This PR follows up on the discussion in #2564.
The implementation adds an optional `SentenceTransformer` instance attribute, `output_dim`, so that:

- `model.encode(texts)` agrees with `model.get_sentence_embedding_dimension()`
- nothing extra is needed at each `model.encode(texts)` call to achieve truncation
- code that internally calls `model.encode(texts)` (e.g., the `EmbeddingSimilarityEvaluator`) also achieves truncation.

In case a user wants to change the truncation dimension on the fly, they can use these new utilities:
- `model.truncate_sentence_embeddings`: context manager that sets and resets the truncation dimension, `output_dim`, of the `model`. This has the 3 benefits above.
- `util.truncate_embeddings`: just a simple truncation function that works even if the input `embeddings` is 1-D or 2-D. May be useful because `...`-slicing is slightly niche knowledge.

How has this code been tested?
New unit test:
It adds ~30 sec. to the testing workflow.
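On the `...`-slicing point above, a minimal sketch (plain NumPy, not the PR's actual implementation) of why Ellipsis slicing handles 1-D and 2-D inputs uniformly:

```python
import numpy as np

def truncate_embeddings(embeddings: np.ndarray, dim: int) -> np.ndarray:
    # `...` (Ellipsis) expands to "all leading axes", so the same slice
    # works whether `embeddings` is a single vector or a batch.
    return embeddings[..., :dim]

single = np.zeros(768)       # 1-D: one embedding
batch = np.zeros((32, 768))  # 2-D: a batch of embeddings

truncate_embeddings(single, 128).shape  # (128,)
truncate_embeddings(batch, 128).shape   # (32, 128)
```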