Add GISTEmbedLoss #2535
Conversation
Signed-off-by: Aivin V. Solatorio <[email protected]>
Also call the guide model with no_grad for better memory efficiency.
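As a rough sketch of that change (variable and model names here are illustrative, not the PR's exact code), the guide model's forward pass can run under `torch.no_grad()` so no activations are kept for backpropagation, since the guide is never updated:

```python
import torch
from sentence_transformers import SentenceTransformer

# Illustrative sketch only: the guide model only scores pairs and is never
# trained, so its embeddings can be computed without a computation graph.
guide_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
guide_features = guide_model.tokenize(["a small example sentence"])

with torch.no_grad():
    guide_embeddings = guide_model(guide_features)["sentence_embedding"]
```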
Hello! Apologies for simply pushing some commits; I should have asked beforehand.
I think this is just about ready to be merged. I'm happy for you to give this another once-over, if you'd like. I'm quite excited to see this work implemented 🎉 Some experiments that I ran (note: I used the new WIP refactor, which is why I have W&B logs):
Hello @tomaarsen, thank you so much for doing all this work! I didn't want to change much, but I totally missed the part about handling the different tokenization. Good catch there! In my original implementation, I passed the raw text as part of the features so I could use it with the guide model directly. Indeed, I think your approach of decoding and re-tokenizing strikes a reasonable balance given the size of the change needed. I'd be very happy if this gets merged! 🤗
That's indeed a more convenient approach, though I'm fairly satisfied with the one that we've got here, too. I'll have another look at this tomorrow & prepare to merge it!
Thanks a bunch for this! I'm excited to see whether this will be picked up by the community; I hope so.
@avsolatorio @tomaarsen, could you please help me understand whether it makes sense to use the model we are training as the guide model for the loss function? For example, I am fine-tuning distiluse-base-multilingual-cased-v2 like this:
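Something along these lines; the exact snippet is not preserved in this thread, so this is only a sketch with illustrative data:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample

# Sketch only; the original snippet is not shown here.
model = SentenceTransformer("distiluse-base-multilingual-cased-v2")

# Illustrative (anchor, positive) pairs in the spirit of the examples below.
train_examples = [
    InputExample(texts=["buy", "buy blablabla2"]),
    InputExample(texts=["sell", "sell blablabla2"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
```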
and then I am using the GISTEmbed loss like this:
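Again only as a sketch, continuing the snippet above and assuming the `guide` argument accepts the very model being trained:

```python
from sentence_transformers import losses

# Sketch: pass the model being trained as its own guide, so the guide's
# similarity judgements change as training progresses.
train_loss = losses.GISTEmbedLoss(model=model, guide=model)
```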
I've tested it, and it seems to work much better than if I use a separate distiluse-base-multilingual-cased-v2 instance (i.e. not the one we are training) as the guide. What are your thoughts on this? Could this be the right approach? Thanks
That's a very interesting approach! I haven't tried that myself. Out of curiosity, based on what metric does it work much better than using an "unchanging" version of the base model? I'm also curious if @avsolatorio experimented with this.
@tomaarsen, I have not actually tried this, but this approach reminds me of something you mentioned to me last time 🤔: https://sbert.net/docs/package_reference/sentence_transformer/losses.html#onlinecontrastiveloss . Since the guide model updates along with the model being trained, I think there could be some overfitting happening, as you mentioned. So the loss may look better, but the actual performance may not. I am keen on understanding how the performance was measured, @litvinovsky. That could give us a hint about what is happening here. 🙂
@avsolatorio @tomaarsen thank you for your response, that's very helpful. How I came up with the idea that it increases performance comes down to two things.
Because of the batching mechanism and the negative-sample selection using the guide model, it skips this negative pair,
which I would like not to be skipped. This happens because the guide scores the negative (buy, buy blablabla1) as closer than the positive one (buy, sell blablabla1), so the pair gets skipped when I use a separate model as the guide. If I use the model being trained as the guide, it helps resolve this issue, and the negative example above gets applied properly. Why does it help? Probably because the weights have already been updated and (buy, sell blablabla1) has become closer than (buy, buy blablabla1). I am not sure how this approach affects standard benchmarks, though.
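To make the skipping concrete, here is a rough sketch of the guide-based filtering idea (tensor names and data are assumed, not taken from the actual implementation): an in-batch candidate is treated as a likely false negative and ignored when the guide scores it as more similar to the anchor than the anchor's own positive.

```python
import torch
import torch.nn.functional as F

# Rough sketch with random unit-normalized "guide" embeddings (assumed names).
batch, dim = 4, 8
guide_anchor = F.normalize(torch.randn(batch, dim), dim=-1)
guide_positive = F.normalize(torch.randn(batch, dim), dim=-1)

guide_ap = (guide_anchor * guide_positive).sum(dim=-1)  # (batch,) anchor-positive similarity
guide_an = guide_anchor @ guide_positive.T              # (batch, batch) anchor vs. all in-batch candidates

# Candidates the guide considers closer than the true positive are excluded
# from the contrastive denominator (i.e. "skipped" as probable false negatives).
false_negative_mask = guide_an > guide_ap.unsqueeze(1)
```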
This is fascinating! One idea I had was to use a separate model for the first epoch or N steps, create a checkpoint of the model being trained, then use that checkpoint as the guide model for the next epoch, and repeat. Unfortunately, I didn't have the time to test this. But your explanation seems to suggest that this could work! And your point that the model learning from the data itself could end up outperforming a static guide model sounds quite sensible to me! 😀
This is the implementation of the GISTEmbed loss described in https://arxiv.org/abs/2402.16829. The implementation supports both anchor-positive pair inputs and triplets.
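As a minimal usage sketch (model names, data, and hyperparameters below are illustrative only, not taken from the PR), the loss takes the model being trained plus a separate guide model:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Illustrative sketch, not the PR's test code.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
guide = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

# Anchor-positive pairs; triplets with an explicit negative are also supported.
train_examples = [
    InputExample(texts=["What is the capital of France?", "Paris is the capital of France."]),
    InputExample(texts=["How do planes fly?", "Wings generate lift as air flows over them."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

train_loss = losses.GISTEmbedLoss(model=model, guide=guide)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```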