
Add a scoring function in model configuration #2490

Closed
ir2718 wants to merge 20 commits

Conversation

@ir2718 (Contributor) commented Feb 19, 2024

PR overview

The PR adds a score_function parameter in the model configuration upon saving the model (#2441) and updates the rest of the code accordingly.

Details

  • Upon loading the model, the user can provide a string value to choose the score function to be used. If none is provided, the score function is chosen based on the best-performing score function during training.

  • An important design choice here is that the best-performing score function is determined solely by the chosen evaluator class. This only makes sense for some evaluators, e.g. LabelAccuracyEvaluator assumes the model outputs a single vector, so there is nothing to compare it against. I've implemented it only for the following evaluators:

    • BinaryClassificationEvaluator
    • EmbeddingSimilarityEvaluator
    • InformationRetrievalEvaluator
  • Another important thing to mention is that some evaluators have predefined score function values:

    • ParaphraseMiningEvaluator
    • RerankingEvaluator
    • TranslationEvaluator

Before adding this, I wanted to ask: should I add evaluation for different score functions for these as well, or keep them as they are?

  • Upon loading a SentenceTransformer model, the score function is determined by the score_function_name parameter in the model configuration file.
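
As a rough usage sketch of how this could look from the user's side (the parameter name, string values, and paths below are illustrative assumptions based on the description above, not the final API):

```python
from sentence_transformers import SentenceTransformer

# Hypothetical: pass a string to pick the score function explicitly when loading;
# if it is omitted, the proposal falls back to the score_function_name stored in
# the model configuration (or, during training, to the best-performing one).
model = SentenceTransformer("path/to/saved-model", score_function="cos_sim")

# Hypothetical: the chosen name is written to the model configuration on save,
# so a plain reload picks it up again without specifying anything.
model.save("path/to/saved-model")
reloaded = SentenceTransformer("path/to/saved-model")
```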

I would also like to add that this definitely needs more testing before merging.

@tomaarsen (Collaborator)

Hello!

Thanks a bunch for this PR! It's a bit big, so it's taken a bit of time to have a good look at it. I like the direction, and I think it's clever to try and infer the best scoring function from the evaluator if no scoring function was explicitly provided. I'm not sure it's the smartest approach though, as it might be unexpected to see multiple similarly trained models e.g. have different scoring functions.

I'm also considering using the scoring function defined in the model to inform the choice of scoring function in the evaluators, but I'm not sure if that's ideal.

Beyond that, I think it would be interesting for the SimilarityFunction class to store the two kinds of scoring functions: "normal" and "pairwise". On the SentenceTransformer, we could maybe expose those functions as well? Perhaps a similarity and a pairwise_similarity method or something.

As for the changes in the evaluators: some of them are slightly problematic I think. In short: I want all code that works currently to keep working. This means that we can't easily e.g. update the name of a keyword argument like replacing similarity_fct with score_function in the RerankingEvaluator.

This is definitely a tricky PR!

  • Tom Aarsen

@ir2718 (Contributor, Author) commented Feb 21, 2024

it might be unexpected to see multiple similarly trained models e.g. have different scoring functions.

To be honest, I didn't think about this. I have a feeling that cosine similarity is the standard score function for most tasks. So I would suggest setting that as the default, and then if users want the scoring function set to the best-performing one, they can explicitly set it to None?

I'm also considering using the scoring function defined in the model to inform the choice of scoring function in the evaluators, but I'm not sure if that's ideal.

I think this is better than the current solution. If you think this is a good design choice, I can implement it.

Beyond that, I think it would be interesting for the SimilarityFunction class to store the two kinds of scoring functions: "normal" and "pairwise".

I don't know why anyone would use different scoring functions for normal and pairwise similarity calculation. Could you give me an example of why this would be an improvement?

On the SentenceTransformer, we could maybe expose those functions as well? Perhaps a similarity and a pairwise_similarity method

Hard agree with this one, I'll add it.

This means that we can't easily e.g. update the name of a keyword argument like replacing similarity_fct with score_function in the RerankingEvaluator.

I'm aware I didn't actually implement it for the Reranking evaluator, as I wasn't sure if it makes sense and wanted to hear your input. Can you explain a bit more what you mean by this just to make sure I'm on the right track?

@tomaarsen (Collaborator) commented Feb 21, 2024

To be honest, I didn't think about this. I have a feeling that cosine similarity is the standard score function for most tasks. So I would suggest setting that as the default, and then if users want the scoring function set to the best-performing one, they can explicitly set it to None?

That's a really cool idea indeed. Seems good. And Cosine is a good default I think.

I think this is better than the current solution. If you think this is a good design choice, I can implement it.

I think it would eventually be best, but perhaps we can keep the changes in the evaluators small for now.

I don't know why anyone would use different scoring functions for normal and pairwise similarity calculation. Could you give me an example of why this would be an improvement?

It would never be different scoring functions, but I mean that sometimes you'd want to use pairwise_cos_sim and sometimes cos_sim. For context, when comparing 2 tensors with 20 embeddings each, pairwise_cos_sim returns a Tensor with shape (20,), i.e. a similarity for each aligned pair, while cos_sim returns a Tensor with shape (20, 20), i.e. a similarity for every possible pair.
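
For illustration, here is that shape difference with the existing util functions (the tensor sizes are arbitrary):

```python
import torch
from sentence_transformers.util import cos_sim, pairwise_cos_sim

a = torch.randn(20, 384)  # 20 embeddings of dimension 384
b = torch.randn(20, 384)  # another 20 embeddings

print(pairwise_cos_sim(a, b).shape)  # torch.Size([20])     -> one score per aligned pair
print(cos_sim(a, b).shape)           # torch.Size([20, 20]) -> a score for every possible pair
```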

I'm aware I didn't actually implement it for the Reranking evaluator, as I wasn't sure if it makes sense and wanted to hear your input. Can you explain a bit more what you mean by this just to make sure I'm on the right track?

This is very tricky, but I'd like to keep all changes backwards compatible, i.e. code that used to work still works. If someone wrote some training script that said RerankingEvaluator(..., similarity_fct=my_custom_similarity_fct) then that training script will fail unexpectedly if we release this Pull Request into a new version of Sentence Transformers. Does that make sense?
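
For concreteness, a common pattern for this kind of rename (a sketch only, not what this PR does) is to keep accepting the old keyword and route it to the new one with a deprecation warning:

```python
import warnings

from sentence_transformers.util import cos_sim


class RerankingEvaluatorSketch:
    """Sketch of a backwards-compatible signature change; not the actual evaluator."""

    def __init__(self, samples, score_function=None, similarity_fct=None):
        if similarity_fct is not None:
            # Old scripts that pass `similarity_fct` keep working, with a nudge.
            warnings.warn(
                "`similarity_fct` is deprecated; use `score_function` instead.",
                DeprecationWarning,
            )
            score_function = score_function or similarity_fct
        # Preserve the old default when neither keyword is given.
        self.score_function = score_function if score_function is not None else cos_sim
        self.samples = samples
```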

These changes are very tricky to navigate, and I think it might be best to minimize any changes in the evaluators and primarily change SentenceTransformers.py, util.py & the new SimilarityFunctions.py.

In particular, I think these changes are good (at a glance):

  1. BinaryClassificationEvaluator.py
  2. EmbeddingSimilarityEvaluator.py
  3. ParaphraseMiningEvaluator.py
  4. SequentialEvaluator.py

And I think these have some issues:

  1. InformationRetrievalEvaluator.py: If people provide custom dictionaries then that might fail.
  2. RerankingEvaluator.py: The change in the signature can cause old code to fail.
  3. TripletEvaluator.py: The change in the signature can cause old code to fail.

So perhaps we can update SimilarityFunction to have to_similarity_fn and to_pairwise_similarity_fn static methods? And SentenceTransformer can get similarity and pairwise_similarity methods that use these functions. Additionally, the score function name can be stored in the model config.
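
A rough sketch of that shape (the member names, string values, and lookups below are assumptions following the suggestion, not the library's implementation):

```python
from enum import Enum

from sentence_transformers import util


class SimilarityFunctionSketch(Enum):
    """Sketch of the proposed SimilarityFunction; details are illustrative."""

    COSINE = "cosine"
    DOT_PRODUCT = "dot"

    @staticmethod
    def to_similarity_fn(name: str):
        # "Normal" scoring: returns a (len(a), len(b)) similarity matrix.
        return {"cosine": util.cos_sim, "dot": util.dot_score}[name]

    @staticmethod
    def to_pairwise_similarity_fn(name: str):
        # Pairwise scoring: returns one score per aligned pair, shape (len(a),).
        return {"cosine": util.pairwise_cos_sim, "dot": util.pairwise_dot_score}[name]
```

The proposed similarity and pairwise_similarity methods on SentenceTransformer could then simply look up the score function name stored in the model config through these helpers.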

  • Tom Aarsen

@ir2718 (Contributor, Author) commented Feb 21, 2024

Thanks for the clarification, everything makes sense now.

A Contributor commented:

General comment for all files

I would switch to using f-strings instead of the format method. I also noticed that the format method is used across the library, but this seems to be due to a lack of refactoring after f-strings became available. Maybe @tomaarsen can provide more context.
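
For example, the kind of change being suggested looks like this (an illustrative snippet, not taken from the PR diff):

```python
score_function_name = "cos_sim"  # illustrative value

# str.format style, as currently used in parts of the library
message = "Score function: {}".format(score_function_name)

# equivalent f-string
message = f"Score function: {score_function_name}"
```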

A Collaborator replied:

I'm fine with either option for the purposes of this PR - it seems smart to move towards f-string formatting in a separate PR. I think I will wait until after #2426 and #2449 before removing the format uses, as otherwise those PRs would get a lot of merge conflicts.

ir2718 marked this pull request as a draft on February 26, 2024, 04:32
@tomaarsen (Collaborator)

Hello!

Is it okay if I take over the development here @ir2718? I have a bit of time this week & I'd like to push this up my priority list 😄

  • Tom Aarsen

@ir2718 (Contributor, Author) commented Feb 27, 2024

Hey @tomaarsen,

Sorry for the late response, I've been busy lately. I'm pretty sure I've covered most of the cases and I don't mind you taking over. There are two caveats I'd like to mention:

  • I've added support for using None as a similarity function in RerankingEvaluator and ParaphraseMiningEvaluator, as previously only one or two similarity functions were supported
    • I think these definitely need the most testing
  • I noticed that the previous definitions of pairwise and normal functions were inverted, but I kept them for backward compatibility

Feel free to ping me in case you need my help.

@tomaarsen (Collaborator)

I noticed that the previous definitions of pairwise and normal functions were inverted, but I kept them for backward compatibility

Oh, odd. I'll look into that as well. That sounds like a headache :S
