
Add a scoring function in model configuration #2490

Closed
ir2718 wants to merge 20 commits

Conversation

@ir2718 (Contributor) commented Feb 19, 2024

PR overview

The PR adds a score_function parameter in the model configuration upon saving the model (#2441) and updates the rest of the code accordingly.

Details

  • Upon loading the model, the user can provide a string value to choose the score function to be used. If none is provided, the score function is chosen based on the best-performing score function during training.

  • An important design choice here is that the best-performing score function is determined solely by the chosen evaluator class. This only makes sense for some evaluators, e.g. LabelAccuracyEvaluator assumes the model outputs a single vector, so there is nothing to compare it against. I've implemented it only for the following evaluators:

    • BinaryClassificationEvaluator
    • EmbeddingSimilarityEvaluator
    • InformationRetrievalEvaluator
  • Another important thing to mention is that some evaluators have predefined score function values:

    • ParaphraseMiningEvaluator
    • RerankingEvaluator
    • TranslationEvaluator

Before adding this, I wanted to ask: should I add evaluation for different score functions for these as well, or keep them as they are?

  • Upon loading a SentenceTransformer model, the score function is determined by the score_function_name parameter in the model configuration file.
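
As a rough usage sketch of how this could look from the user's side (the parameter name, string values, and paths below are illustrative assumptions based on the description above, not the final API):

```python
from sentence_transformers import SentenceTransformer

# Hypothetical: pass a string to pick the score function explicitly when loading;
# if it is omitted, the proposal falls back to the score_function_name stored in
# the model configuration (or, during training, to the best-performing one).
model = SentenceTransformer("path/to/saved-model", score_function="cos_sim")

# Hypothetical: the chosen name is written to the model configuration on save,
# so a plain reload picks it up again without specifying anything.
model.save("path/to/saved-model")
reloaded = SentenceTransformer("path/to/saved-model")
```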

I would also like to add that this definitely needs more testing before merging.

@tomaarsen (Collaborator)

Hello!

Thanks a bunch for this PR! It's a bit big, so it's taken a bit of time to have a good look at it. I like the direction, and I think it's clever to try and infer the best scoring function from the evaluator if no scoring function was explicitly provided. I'm not sure it's the smartest approach though, as it might be unexpected to see multiple similarly trained models e.g. have different scoring functions.

I'm also considering using the scoring function defined in the model to inform the choice of scoring function in the evaluators, but I'm not sure if that's ideal.

Beyond that, I think it would be interesting for the SimilarityFunction class to store the two kinds of scoring functions: "normal" and "pairwise". On the SentenceTransformer, we could maybe expose those functions as well? Perhaps a similarity and a pairwise_similarity method or something.

As for the changes in the evaluators: some of them are slightly problematic I think. In short: I want all code that works currently to keep working. This means that we can't easily e.g. update the name of a keyword argument like replacing similarity_fct with score_function in the RerankingEvaluator.

This is definitely a tricky PR!

  • Tom Aarsen

@ir2718 (Contributor, Author) commented Feb 21, 2024

it might be unexpected to see multiple similarly trained models e.g. have different scoring functions.

To be honest, I didn't think about this. I have a feeling that cosine similarity is the standard score function for most tasks. So I would suggest setting that as the default, and then if users want the scoring function set to the best-performing one, they can explicitly set it to None?

I'm also considering using the scoring function defined in the model to inform the choice of scoring function in the evaluators, but I'm not sure if that's ideal.

I think this is better than the current solution. If you think this is a good design choice, I can implement it.

Beyond that, I think it would be interesting for the SimilarityFunction class to store the two kinds of scoring functions: "normal" and "pairwise".

I don't know why anyone would use different scoring functions for normal and pairwise similarity calculation. Could you give me an example of why this would be an improvement?

On the SentenceTransformer, we could maybe expose those functions as well? Perhaps a similarity and a pairwise_similarity method

Hard agree with this one, I'll add it.

This means that we can't easily e.g. update the name of a keyword argument like replacing similarity_fct with score_function in the RerankingEvaluator.

I'm aware I didn't actually implement it for the Reranking evaluator, as I wasn't sure if it makes sense and wanted to hear your input. Can you explain a bit more what you mean by this just to make sure I'm on the right track?

@tomaarsen (Collaborator) commented Feb 21, 2024

To be honest, I didn't think about this. I have a feeling that cosine similarity is the standard score function for most tasks. So I would suggest setting that as the default, and then if users want the scoring function set to the best-performing one, they can explicitly set it to None?

That's a really cool idea indeed. Seems good. And Cosine is a good default I think.

I think this is better than the current solution. If you think this is a good design choice, I can implement it.

I think it would eventually be best, but perhaps we can keep the changes in the evaluators small for now.

I don't know why anyone would use different scoring functions for normal and pairwise similarity calculation. Could you give me an example of why this would be an improvement?

It would never be different scoring functions, but I mean that sometimes you'd want to use pairwise_cos_sim and sometimes cos_sim. For context, when comparing 2 tensors with 20 embeddings each, pairwise_cos_sim returns a Tensor with shape (20,), i.e. a similarity for each aligned pair, while cos_sim returns a Tensor with shape (20, 20), i.e. a similarity for every possible pair.
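
For illustration, here is that shape difference with the existing util functions (the tensor sizes are arbitrary):

```python
import torch
from sentence_transformers.util import cos_sim, pairwise_cos_sim

a = torch.randn(20, 384)  # 20 embeddings of dimension 384
b = torch.randn(20, 384)  # another 20 embeddings

print(pairwise_cos_sim(a, b).shape)  # torch.Size([20])     -> one score per aligned pair
print(cos_sim(a, b).shape)           # torch.Size([20, 20]) -> a score for every possible pair
```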

I'm aware I didn't actually implement it for the Reranking evaluator, as I wasn't sure if it makes sense and wanted to hear your input. Can you explain a bit more what you mean by this just to make sure I'm on the right track?

This is very tricky, but I'd like to keep all changes backwards compatible, i.e. code that used to work still works. If someone wrote some training script that said RerankingEvaluator(..., similarity_fct=my_custom_similarity_fct) then that training script will fail unexpectedly if we release this Pull Request into a new version of Sentence Transformers. Does that make sense?
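
For concreteness, a common pattern for this kind of rename (a sketch only, not what this PR does) is to keep accepting the old keyword and route it to the new one with a deprecation warning:

```python
import warnings

from sentence_transformers.util import cos_sim


class RerankingEvaluatorSketch:
    """Sketch of a backwards-compatible signature change; not the actual evaluator."""

    def __init__(self, samples, score_function=None, similarity_fct=None):
        if similarity_fct is not None:
            # Old scripts that pass `similarity_fct` keep working, with a nudge.
            warnings.warn(
                "`similarity_fct` is deprecated; use `score_function` instead.",
                DeprecationWarning,
            )
            score_function = score_function or similarity_fct
        # Preserve the old default when neither keyword is given.
        self.score_function = score_function if score_function is not None else cos_sim
        self.samples = samples
```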

These changes are very tricky to navigate, and I think it might be best to minimize any changes in the evaluators and primarily change SentenceTransformers.py, util.py & the new SimilarityFunctions.py.

In particular, I think these changes are good (at a glance):

  1. BinaryClassificationEvaluator.py
  2. EmbeddingSimilarityEvaluator.py
  3. ParaphraseMiningEvaluator.py
  4. SequentialEvaluator.py

And I think these have some issues:

  1. InformationRetrievalEvaluator.py: If people provide custom dictionaries then that might fail.
  2. RerankingEvaluator.py: The change in the signature can cause old code to fail.
  3. TripletEvaluator.py: The change in the signature can cause old code to fail.

So perhaps we can update SimilarityFunction to have to_similarity_fn and to_pairwise_similarity_fn static methods? And SentenceTransformer can get similarity and pairwise_similarity methods that use these functions. Additionally, the score function name can be stored in the model config.
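
A rough sketch of that shape (the member names, string values, and lookups below are assumptions following the suggestion, not the library's implementation):

```python
from enum import Enum

from sentence_transformers import util


class SimilarityFunctionSketch(Enum):
    """Sketch of the proposed SimilarityFunction; details are illustrative."""

    COSINE = "cosine"
    DOT_PRODUCT = "dot"

    @staticmethod
    def to_similarity_fn(name: str):
        # "Normal" scoring: returns a (len(a), len(b)) similarity matrix.
        return {"cosine": util.cos_sim, "dot": util.dot_score}[name]

    @staticmethod
    def to_pairwise_similarity_fn(name: str):
        # Pairwise scoring: returns one score per aligned pair, shape (len(a),).
        return {"cosine": util.pairwise_cos_sim, "dot": util.pairwise_dot_score}[name]
```

The proposed similarity and pairwise_similarity methods on SentenceTransformer could then simply look up the score function name stored in the model config through these helpers.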

  • Tom Aarsen

@ir2718 (Contributor, Author) commented Feb 21, 2024

Thanks for the clarification, everything makes sense now.

A Contributor commented:

General comment for all files

I would switch to using f-strings instead of the format method. I also noticed that the format method is used across the library, but this seems to be due to a lack of refactoring after f-strings became available. Maybe @tomaarsen can provide more context.
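
For example, the kind of change being suggested looks like this (an illustrative snippet, not taken from the PR diff):

```python
score_function_name = "cos_sim"  # illustrative value

# str.format style, as currently used in parts of the library
message = "Score function: {}".format(score_function_name)

# equivalent f-string
message = f"Score function: {score_function_name}"
```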

A Collaborator replied:

I'm fine with either option for the purposes of this PR - it seems smart to move towards f-string formatting in a separate PR. I think I will wait until after #2426 and #2449 before removing the format uses, as otherwise those PRs would get a lot of merge conflicts.

ir2718 marked this pull request as a draft on February 26, 2024, 04:32
@tomaarsen (Collaborator)

Hello!

Is it okay if I take over the development here @ir2718? I have a bit of time this week & I'd like to push this up my priority list 😄

  • Tom Aarsen

@ir2718 (Contributor, Author) commented Feb 27, 2024

Hey @tomaarsen,

Sorry for the late response, I've been busy lately. I'm pretty sure I've covered most of the cases and I don't mind you taking over. There are two caveats I'd like to mention:

  • I've added support for using None as a similarity function in RerankingEvaluator and ParaphraseMiningEvaluator, as previously only one or two similarity functions were supported
    • I think these definitely need the most testing
  • I noticed that the previous definitions of pairwise and normal functions were inverted, but I kept them for backward compatibility

Feel free to ping me in case you need my help.

@tomaarsen (Collaborator)

I noticed that the previous definitions of pairwise and normal functions were inverted, but I kept them for backward compatibility

Oh, odd. I'll look into that as well. That sounds like a headache :S
