Release v2.6.0 - Embedding Quantization, GISTEmbedLoss · UKPLab/sentence-transformers

This release brings embedding quantization, a way to heavily speed up retrieval and other tasks, and a powerful new loss function: GISTEmbedLoss.

Install this version with

pip install sentence-transformers==2.6.0

Embedding Quantization

Embeddings can be challenging to scale up, which leads to expensive solutions and high latencies. Quantization is a new approach to counter this problem: it reduces the size of each individual value in the embedding. Experiments on quantization have shown that we can maintain a large amount of performance while significantly speeding up computation and saving on memory, storage, and costs.

To be specific, binary quantization may retain 96% of the retrieval performance while speeding up retrieval by 25x and reducing memory and disk usage by 32x. Do not underestimate this approach! Read more about Embedding Quantization in our extensive blog post.

Binary and Scalar Quantization

Two forms of quantization exist at this time: binary and scalar (int8). These quantize embedding values from float32 into binary and int8, respectively. For binary quantization, you can use the following snippet:

from sentence_transformers import SentenceTransformer
from sentence_transformers.quantization import quantize_embeddings

# 1. Load an embedding model
model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

# 2a. Encode some text using "binary" quantization
binary_embeddings = model.encode(
    ["I am driving to the lake.", "It is a beautiful day."],
    precision="binary",
)

# 2b. or, encode some text without quantization & apply quantization afterwards
embeddings = model.encode(["I am driving to the lake.", "It is a beautiful day."])
binary_embeddings = quantize_embeddings(embeddings, precision="binary")
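To make the two schemes more concrete, here is a minimal plain-Python sketch of what binary and scalar (int8) quantization do to a single embedding. It is an illustration only, not the library's implementation: the helper names `binary_quantize`, `hamming_distance`, and `scalar_quantize` are hypothetical, the threshold at 0 and the 256-bucket mapping are assumptions about how such schemes typically work, and the per-dimension ranges would normally come from a calibration set of embeddings.

```python
# Illustrative only: plain-Python versions of the two quantization schemes.
# The helper names below are hypothetical and are not part of sentence-transformers.

def binary_quantize(embedding):
    """Threshold each value at 0 and pack every 8 bits into one byte (0-255)."""
    bits = [1 if value > 0 else 0 for value in embedding]
    # Pad with zeros so the length is a multiple of 8.
    bits += [0] * (-len(bits) % 8)
    packed = []
    for start in range(0, len(bits), 8):
        byte = 0
        for bit in bits[start:start + 8]:
            byte = (byte << 1) | bit
        packed.append(byte)
    return packed


def hamming_distance(packed_a, packed_b):
    """Count differing bits between two packed binary embeddings."""
    return sum(bin(a ^ b).count("1") for a, b in zip(packed_a, packed_b))


def scalar_quantize(embedding, minimums, maximums):
    """Map each float to one of 256 buckets using per-dimension ranges (assumed given)."""
    quantized = []
    for value, low, high in zip(embedding, minimums, maximums):
        step = (high - low) / 255 or 1.0  # fall back to 1.0 for constant dimensions
        bucket = round((value - low) / step)
        bucket = max(0, min(255, bucket))  # clip values outside the calibration range
        quantized.append(bucket - 128)  # shift into the int8 range [-128, 127]
    return quantized


embedding = [0.12, -0.40, 0.33, 0.05, -0.91, 0.27, -0.08, 0.64]
other = [0.10, 0.20, 0.30, -0.05, -0.50, 0.10, -0.30, 0.70]

packed = binary_quantize(embedding)
packed_other = binary_quantize(other)
distance = hamming_distance(packed, packed_other)

# float32 uses 4 bytes per dimension; binary uses 1 bit per dimension.
float32_bytes = len(embedding) * 4
binary_bytes = len(packed)
compression = float32_bytes // binary_bytes

int8_values = scalar_quantize(embedding, [-1.0] * 8, [1.0] * 8)
```

The 32x figure falls out of the arithmetic: each float32 value takes 32 bits, while its binary counterpart takes 1 bit. Comparing packed embeddings with a Hamming distance needs only XOR and bit counting, which is cheap compared with floating-point similarity.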


GISTEmbedLoss

GISTEmbedLoss, as introduced in Solatorio (2024), is a guided variant of the more standard in-batch negatives (MultipleNegativesRankingLoss) loss. Both loss functions are provided with a list of (anchor, positive) pairs, but while MultipleNegativesRankingLoss uses anchor_i and positive_i as positive pair and all positive_j with i != j as negative pairs, GISTEmbedLoss uses a second model to guide the in-batch negative sample selection.

This can be very useful, because it is plausible that anchor_i and positive_j are actually quite semantically similar. In this case, GISTEmbedLoss would not consider them a negative pair, while MultipleNegativesRankingLoss would. When finetuning MPNet-base on the AllNLI dataset, these are the Spearman correlations based on cosine similarity on the STS Benchmark dev set (higher is better):

The blue line is MultipleNegativesRankingLoss, whereas the grey line is GISTEmbedLoss with the small all-MiniLM-L6-v2 as the guide model. Note that all-MiniLM-L6-v2 by itself does not reach 88 Spearman correlation on this dataset, so this is really the effect of two models (mpnet-base and all-MiniLM-L6-v2) reaching a performance that they could not reach separately.
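The guided selection described above can be sketched in plain Python. This is a simplified illustration, not the library's implementation: `guided_negative_mask` is a hypothetical helper, and the rule it applies (drop a candidate negative when the guide model scores it higher than the labeled positive for that anchor) is an assumption about how the guidance works.

```python
# Illustrative only: a plain-Python sketch of guided in-batch negative selection.
# `guided_negative_mask` is a hypothetical helper, not the library implementation.

def guided_negative_mask(guide_similarities):
    """Return mask[i][j] = True where positive_j may serve as a negative for anchor_i.

    guide_similarities[i][j] is the guide model's similarity between
    anchor_i and positive_j; the diagonal holds the labeled (anchor, positive) pairs.
    """
    size = len(guide_similarities)
    mask = []
    for i in range(size):
        threshold = guide_similarities[i][i]  # guide score of the labeled pair
        row = []
        for j in range(size):
            is_labeled_pair = i == j
            # Assumed rule: the guide finds positive_j closer than the labeled positive.
            looks_like_positive = guide_similarities[i][j] > threshold
            row.append(not is_labeled_pair and not looks_like_positive)
        mask.append(row)
    return mask


guide_similarities = [
    [0.80, 0.85, 0.10],
    [0.20, 0.90, 0.30],
    [0.05, 0.15, 0.70],
]
mask = guided_negative_mask(guide_similarities)
```

In this toy batch, the guide scores (anchor_0, positive_1) above the labeled pair (anchor_0, positive_0), so that pair is excluded from the negatives. With plain in-batch negatives, every off-diagonal entry would be treated as a negative.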

Soft `save_to_hub` Deprecation

Most codebases that allow for pushing models to the Hugging Face Hub adopt a push_to_hub method instead of a save_to_hub method, and now Sentence Transformers will follow that convention. The push_to_hub method will now be the recommended approach, although save_to_hub will continue to exist for the time being: it will simply call push_to_hub internally.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")

...

# Train the model
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    evaluator=dev_evaluator,
    epochs=num_epochs,
    evaluation_steps=1000,
    warmup_steps=warmup_steps,
)

# Push the model to Hugging Face
model.push_to_hub("tomaarsen/mpnet-base-nli-stsb")
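For readers unfamiliar with soft deprecations, the delegation pattern described above looks roughly like the following sketch. `HubModel` is a hypothetical stand-in class and does not reproduce the SentenceTransformer code; in particular, emitting a `DeprecationWarning` is an assumption about how such a deprecation is commonly signalled.

```python
import warnings


class HubModel:
    """Hypothetical stand-in for a model class; not the SentenceTransformer code."""

    def push_to_hub(self, repo_id):
        # A real implementation would upload files; this sketch only returns a label.
        return f"pushed to {repo_id}"

    def save_to_hub(self, repo_id):
        # Old name kept for backwards compatibility: warn (assumed), then delegate.
        warnings.warn(
            "save_to_hub is deprecated; use push_to_hub instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.push_to_hub(repo_id)


hub_model = HubModel()
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = hub_model.save_to_hub("user/example-model")
```

Existing callers of the old method keep working, while new code can move to the new name at its own pace.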

All changes

Add GISTEmbedLoss by @avsolatorio in #2535
[feat] Add 'get_config_dict' method to GISTEmbedLoss for better model cards by @tomaarsen in #2543
Enable saving modules as pytorch_model.bin by @CKeibel in #2542
[deprecation] Deprecate save_to_hub in favor of push_to_hub; add safe_serialization support to push_to_hub by @tomaarsen in #2544
Fix SentenceTransformer encode documentation return type default (numpy vectors) by @CKeibel in #2546
[docs] Update return docstring of encode_multi_process by @tomaarsen in #2548
[feat] Add binary & scalar embedding quantization support to Sentence Transformers by @tomaarsen in #2549

New Contributors

@avsolatorio made their first contribution in #2535
@CKeibel made their first contribution in #2542

Full Changelog: v2.5.1...v2.6.0
