Knowledge distillation training performance #58
Unanswered
abedkhooli asked this question in Q&A
Replies: 1 comment 17 replies
-
Hello. Indeed, the loss is quite small; this is because the scores are normalized (and the teacher's scores should be as well), following the results of JaColBERTv2.5. This makes me think the default should be no normalization, with normalization left only as a (documented) option, since it might be confusing (and sub-optimal) for people who are unaware of it.
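For context, here is a minimal sketch of the kind of per-query score normalization being discussed. The function name and the exact min-max scheme are illustrative assumptions, not necessarily PyLate's internal implementation:

```python
import torch

def min_max_normalize(scores: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Scale each row (one query's candidate scores) into the [0, 1] range."""
    mins = scores.min(dim=-1, keepdim=True).values
    maxs = scores.max(dim=-1, keepdim=True).values
    return (scores - mins) / (maxs - mins + eps)

# Illustrative teacher scores for one query over four candidate documents.
teacher_scores = torch.tensor([[12.3, 9.8, 7.1, 3.4]])
print(min_max_normalize(teacher_scores))  # roughly [[1.00, 0.72, 0.42, 0.00]]
```

When both the student and teacher scores are squashed into [0, 1] like this, the absolute loss values are naturally small, which is why a loss in the 0.04–0.05 range is not alarming by itself.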
-
PyLate looks easy and promising - great tool.
I am trying to train a model using the knowledge distillation example with the same MS MARCO training data, but with Arabic (translated) versions of the queries and documents.
The initial loss (losses.Distillation) starts in the 0.05x range and ends in the 0.04x range after one epoch - is this typical?
I am using 250k queries with their corresponding scores. Base model: aubmindlab/bert-base-arabertv02
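For reference, a minimal sketch of the knowledge-distillation setup being described, modeled on the PyLate KD example. The dataset path shown is the English MS MARCO distillation set from that example and would be swapped for the translated Arabic data; the training arguments and collator invocation are assumptions and may need adjusting:

```python
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)

from pylate import losses, models, utils

# Placeholder data: replace with the Arabic-translated queries/documents
# while keeping the original teacher scores.
train = load_dataset("lightonai/ms-marco-en-bge", "train", split="train")
queries = load_dataset("lightonai/ms-marco-en-bge", "queries", split="train")
documents = load_dataset("lightonai/ms-marco-en-bge", "documents", split="train")

# Attach query/document texts to each (query_id, document_ids, scores) row.
train.set_transform(
    utils.KDProcessing(queries=queries, documents=documents).transform
)

model = models.ColBERT(model_name_or_path="aubmindlab/bert-base-arabertv02")

args = SentenceTransformerTrainingArguments(
    output_dir="output/arabert-colbert-kd",  # illustrative output path
    num_train_epochs=1,
    per_device_train_batch_size=16,
    fp16=True,
    logging_steps=100,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train,
    loss=losses.Distillation(model=model),
    data_collator=utils.ColBERTCollator(model.tokenize),
)
trainer.train()
```

With this setup, the reported loss comes from losses.Distillation over normalized scores, which is why its absolute value stays small throughout training.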