[`feat`] Integrate NanoBeIR datasets; use `model.similarity` by default in evaluators #2966

ArthurCamara · 2024-09-27T14:54:05Z

As discussed in #2848 (comment), this PR adds a new Evaluator based on the NanoBEIR collection of datasets.

It creates one InformationRetrievalEvaluator for each dataset, and aggregates the results accordingly.

Example:

import logging

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import NanoBEIREvaluator

# Set up the logger used to report the results at the end of the example
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Load a model
model = SentenceTransformer('all-mpnet-base-v2')

# Evaluate on a subset of the NanoBEIR datasets, with one query prompt per dataset
datasets = ["QuoraRetrieval", "MSMARCO"]
query_prompts = {
    "QuoraRetrieval": "Instruct: Given a question, retrieve questions that are semantically equivalent to the given question\nQuery: ",
    "MSMARCO": "Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: "
}

evaluator = NanoBEIREvaluator(
    dataset_names=datasets,
    name="NanoBEIR",
    query_prompts=query_prompts,
)

results = evaluator(model)
'''
NanoBEeIR Evaluation of the model on ['QuoraRetrieval', 'MSMARCO'] dataset:
Evaluating NanoBeIRNanoQuoraRetrieval
Evaluating NanoBeIRNanoMSMARCO

Average Queries: 50.0
Average Corpus: 5044.5

Aggregated for Score Function: cosine
Accuracy@1: 39.00%
Accuracy@3: 57.00%
Accuracy@5: 66.00%
Accuracy@10: 77.00%
Precision@1: 39.00%
Recall@1: 34.03%
Precision@3: 20.67%
Recall@3: 54.07%
Precision@5: 15.00%
Recall@5: 64.27%
Precision@10: 8.90%
Recall@10: 75.97%
MRR@10: 0.5004
NDCG@10: 0.5513
Aggregated for Score Function: dot
Accuracy@1: 39.00%
Accuracy@3: 57.00%
Accuracy@5: 66.00%
Accuracy@10: 77.00%
Precision@1: 39.00%
Recall@1: 34.03%
Precision@3: 20.67%
Recall@3: 54.07%
Precision@5: 15.00%
Recall@5: 64.27%
Precision@10: 8.90%
Recall@10: 75.97%
MRR@10: 0.5004
NDCG@10: 0.5513
'''
logger.info(evaluator.primary_metric)
# => "cosine_ndcg@10"
logger.info(results["mean"][evaluator.primary_metric])
# => 0.5512516989358924

(Note that this depends on #2951)
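The per-dataset aggregation described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the code in this PR: the helper name `aggregate_mean`, the metric key names, and the scores are all made up, and the aggregation is assumed to be an unweighted mean over datasets.

```python
from statistics import mean


def aggregate_mean(per_dataset: dict[str, dict[str, float]]) -> dict[str, float]:
    """Average each metric over the datasets that report it (hypothetical helper)."""
    metric_names = sorted({name for scores in per_dataset.values() for name in scores})
    return {
        name: mean(scores[name] for scores in per_dataset.values() if name in scores)
        for name in metric_names
    }


# Made-up scores, standing in for the output of one evaluator per dataset
per_dataset = {
    "QuoraRetrieval": {"cosine_ndcg@10": 0.75, "cosine_mrr@10": 0.5},
    "MSMARCO": {"cosine_ndcg@10": 0.25, "cosine_mrr@10": 0.25},
}

aggregated = aggregate_mean(per_dataset)
print(aggregated)
```

An unweighted mean treats every dataset equally regardless of its number of queries; a query-weighted mean would be a different design choice.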


tomaarsen · 2024-10-17T11:50:20Z

Although the Be portion obviously stands for Benchmark, I think the abbreviated "BEIR" is usually fully capitalized, so I'd like to propagate that in this PR as well.

tomaarsen · 2024-10-17T14:15:21Z

I'm experimenting with having all outputs in the final dict, rather than a nested dict. This way, people can use any value from the evaluator to guide, e.g., their early stopping. It should also match the SequentialEvaluator performance, even though the results from NanoBEIR are now a bit hectic (i.e., one massive dict).

I hope it's okay if I push into this PR!
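To illustrate the difference between a nested and a flat result dict, here is a minimal sketch. The helper `flatten_results`, the `_` separator, and every key name and number below are assumptions made for illustration; the exact keys produced by the evaluator in this PR may differ.

```python
def flatten_results(nested: dict[str, dict[str, float]], sep: str = "_") -> dict[str, float]:
    """Turn {group: {metric: value}} into {group<sep>metric: value} (hypothetical helper)."""
    flat: dict[str, float] = {}
    for group, metrics in nested.items():
        for metric, value in metrics.items():
            flat[f"{group}{sep}{metric}"] = value
    return flat


# Made-up nested results: one entry per dataset plus an aggregate
nested = {
    "NanoQuoraRetrieval": {"cosine_ndcg@10": 0.75},
    "NanoMSMARCO": {"cosine_ndcg@10": 0.25},
    "NanoBEIR_mean": {"cosine_ndcg@10": 0.5},
}

flat = flatten_results(nested)

# With a flat dict, any single value can be picked by one string key,
# e.g. as the metric that decides early stopping or the best checkpoint.
metric_for_best_model = "NanoBEIR_mean_cosine_ndcg@10"
print(flat[metric_for_best_model])
```

The trade-off is the one mentioned above: every value is addressable by a single key, at the cost of one large dict.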


tomaarsen · 2024-10-28T10:54:01Z

I've used this PR to address various other issues that I've had with evaluators:

Pull Request overview

- Use model similarity function by default in the evaluators
- Fix 'tokens' typo -> 'dimension' in model card
- Group multiple evaluators with the same output keys together.
- Fix edge case where datasets without languages are excluded in model card
- Truncate really really long texts in model card
- Make default similarity_fn_name "cosine" rather than None

Tom Aarsen
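The idea of falling back to the model's similarity function, with "cosine" as the final default, can be sketched as follows. This is a toy illustration, not the library's implementation: `resolve_similarity`, `SIMILARITY_FNS`, and the order of precedence (evaluator setting, then model setting, then "cosine") are assumptions.

```python
import math
from typing import Callable, Optional, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a: Vector, b: Vector) -> float:
    """Dot product divided by the product of the vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


SIMILARITY_FNS: dict[str, Callable[[Vector, Vector], float]] = {"cosine": cosine, "dot": dot}


def resolve_similarity(evaluator_fn_name: Optional[str], model_fn_name: Optional[str]) -> str:
    """Assumed precedence: explicit evaluator setting, then the model's, then 'cosine'."""
    name = evaluator_fn_name or model_fn_name or "cosine"
    if name not in SIMILARITY_FNS:
        raise ValueError(f"Unknown similarity function: {name!r}")
    return name


# The evaluator did not specify a function, so the model's choice is used
chosen = resolve_similarity(None, "dot")
score = SIMILARITY_FNS[chosen]((0.6, 0.8), (1.0, 0.0))
print(chosen, score)
```

For unit-length embeddings the two functions coincide, which would be consistent with the identical cosine and dot figures in the example output earlier in this thread, assuming that model normalizes its embeddings.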


ArthurCamara · 2024-10-29T08:43:23Z

You are the best, @tomaarsen

ArthurCamara and others added 15 commits September 23, 2024 07:55

Added the possibility of masking the prompts if the tokenizer is left-padded.

7dc7990

Simplify code

8d7b88b

Remove unrelated changes

c92e334

Add separate query and corpus prompts for IREvaluator

6419121

Add query and corpus prompt_name

c0ae3f6

Merge branch 'UKPLab:master' into Integrate-NanoBEIR-datasets

84063e8

Added NanoBEIREvaluator

f27c918

Rename, example and better logging

e35d454

Fix for all datasets

fec088e

Merge branch 'UKPLab:master' into Integrate-NanoBEIR-datasets

4869ea5

Remove unrelated changes

4a82531

Remove unrelated changes

8944de0

Remove unrelated changes

c018084

Remove unrelated changes

657d1a5

Remove wrong function call to InformationRetrievalEvaluator

8460cfc

tomaarsen added 2 commits October 17, 2024 13:53

Merge branch 'master' into pr-2966

f8b4b4e

Fix issue introduced in merge

2cfd817

tomaarsen added 2 commits October 17, 2024 17:39

Flatten output dict, remove 'name' as we already know the dataset names

daf25c1

tomaarsen changed the title ~~[feat] Integrate NanoBeIR datasets~~ [feat] Integrate NanoBeIR datasets; use model.similarity by default in evaluators Oct 28, 2024

tomaarsen added 3 commits October 28, 2024 16:08

Update tests due to similarity_fn_name defaulting to "cosine" now

327bb66

Specify all similarity_fn_names to be backwards compat. with old expected performance

01da22d

Fix loading the similarity fn from a config

43dd97d

And update 'str' type to Literals

tomaarsen merged commit 210ea8b into UKPLab:master Oct 29, 2024
11 checks passed
