Knowledge distillation training performance #58
Unanswered
abedkhooli asked this question in Q&A
Replies: 1 comment 17 replies
-
Hello. Indeed, the loss is quite small; this is because the scores are normalized (and the teacher's scores should be as well), following the results of JaColBERTv2.5. This makes me think the default should be no normalization, with normalization left only as a (documented) option, since it might be confusing (and sub-optimal) for people who are unaware of it.
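For context, here is a minimal sketch of the kind of per-query score normalization being discussed. The function name and the exact min-max scheme are illustrative assumptions, not necessarily PyLate's internal implementation:

```python
import torch

def min_max_normalize(scores: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Scale each row (one query's candidate scores) into the [0, 1] range."""
    mins = scores.min(dim=-1, keepdim=True).values
    maxs = scores.max(dim=-1, keepdim=True).values
    return (scores - mins) / (maxs - mins + eps)

# Illustrative teacher scores for one query over four candidate documents.
teacher_scores = torch.tensor([[12.3, 9.8, 7.1, 3.4]])
print(min_max_normalize(teacher_scores))  # roughly [[1.00, 0.72, 0.42, 0.00]]
```

When both the student and teacher scores are squashed into [0, 1] like this, the absolute loss values are naturally small, which is why a loss in the 0.04–0.05 range is not alarming by itself.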
-
PyLate looks easy and promising - great tool.
I am trying to train a model using the knowledge distillation example with the same MS MARCO training data, but with Arabic (translated) versions of the queries and documents.
The initial loss (losses.Distillation) starts in the 0.05x range and ends in the 0.04x range after one epoch - is this typical?
I am using 250k queries with their corresponding scores. Base model: aubmindlab/bert-base-arabertv02
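For reference, a minimal sketch of the knowledge-distillation setup being described, modeled on the PyLate KD example. The dataset path shown is the English MS MARCO distillation set from that example and would be swapped for the translated Arabic data; the training arguments and collator invocation are assumptions and may need adjusting:

```python
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)

from pylate import losses, models, utils

# Placeholder data: replace with the Arabic-translated queries/documents
# while keeping the original teacher scores.
train = load_dataset("lightonai/ms-marco-en-bge", "train", split="train")
queries = load_dataset("lightonai/ms-marco-en-bge", "queries", split="train")
documents = load_dataset("lightonai/ms-marco-en-bge", "documents", split="train")

# Attach query/document texts to each (query_id, document_ids, scores) row.
train.set_transform(
    utils.KDProcessing(queries=queries, documents=documents).transform
)

model = models.ColBERT(model_name_or_path="aubmindlab/bert-base-arabertv02")

args = SentenceTransformerTrainingArguments(
    output_dir="output/arabert-colbert-kd",  # illustrative output path
    num_train_epochs=1,
    per_device_train_batch_size=16,
    fp16=True,
    logging_steps=100,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train,
    loss=losses.Distillation(model=model),
    data_collator=utils.ColBERTCollator(model.tokenize),
)
trainer.train()
```

With this setup, the reported loss comes from losses.Distillation over normalized scores, which is why its absolute value stays small throughout training.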