Varying weights to losses depending on the length of the sequence #291
@shuishida i think the current way is the correct behavior, as typically we think token centric. but perhaps i could offer a hyperparameter for it
I guess my suggestion is to change the reduction from indexing the loss with the mask before taking the mean, to zeroing out the masked positions and averaging per sequence (a sketch of the two reductions follows below), since the number of elements in the loss can vary depending on the number of unmasked tokens in the outputs. But if this change affects many evaluations, or if there's a benefit to having somewhat varied loss magnitudes, I also understand if you want to keep it as it is.
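For concreteness, here is a minimal sketch of the two reductions being contrasted, assuming a float loss tensor of shape `(batch, seq_len)` and a boolean `mask` of the same shape; the function names are illustrative and not taken from the library:

```python
import torch

def token_centric_reduce(loss, mask):
    # current behavior: index with the mask, then mean over all surviving
    # elements, so every unmasked token in the batch gets equal weight
    return loss[mask].mean()

def sequence_centric_reduce(loss, mask):
    # suggested alternative: zero out the masked positions, average within each
    # sequence, then average across sequences, so each sequence gets equal
    # weight regardless of how much of it is masked
    loss = loss.masked_fill(~mask, 0.)
    per_seq = loss.sum(dim = -1) / mask.sum(dim = -1).clamp(min = 1)
    return per_seq.mean()
```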
@shuishida just to be sure we are on the same page, are you aware that the loss by default is not reduced? https://github.com/lucidrains/x-transformers/blob/main/x_transformers/continuous.py#L172
Yes, I am aware. So as an example, let's say I have losses for 2 sequences, each with sequence length 3.

If I don't apply any masking, then after the mean reduce both sequences contribute equally to the loss.

However, if we mask out some positions of the second sequence, then indexing the loss with the mask returns a flattened vector rather than the 2D tensor we had before, and the number of elements decreases to the number of positive mask entries. If we apply the mean to that, the two sequences no longer contribute equally: the fully unmasked first sequence now carries most of the weight, and its contribution changes purely because of the masking applied to the second sequence.

However, I think it would be more natural if the first sequence's loss isn't affected by the masking happening in the second sequence. We can achieve this if we zero out the masked positions and take the mean per sequence before averaging over the batch, as in the sketch below.
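A hypothetical numeric illustration of that point (the values here are made up for this sketch, not the ones from the original comment):

```python
import torch

# losses for 2 sequences of length 3
loss = torch.tensor([[1., 2., 3.],
                     [4., 5., 6.]])

# no masking: every sequence contributes equally
print(loss.mean())            # tensor(3.5000)

# mask out the last two positions of the second sequence
mask = torch.tensor([[True, True, True],
                     [True, False, False]])

# current behavior: loss[mask] is a flattened vector with 4 elements,
# so the first sequence now carries 3/4 of the weight and the second only 1/4
print(loss[mask])             # tensor([1., 2., 3., 4.])
print(loss[mask].mean())      # tensor(2.5000)

# per-sequence alternative: zero the masked entries, average within each
# sequence, then average across sequences, so masking in sequence 2 does not
# change sequence 1's contribution
per_seq = (loss * mask).sum(-1) / mask.sum(-1)
print(per_seq)                # tensor([2., 4.])
print(per_seq.mean())         # tensor(3.)
```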
This is the point I wanted to make, but it's a minor problem.
@shuishida yes i see, but i can also see the argument against it. when in doubt, i'll just make it a hyperparameter, give me 5 minutes
@shuishida hey Shu, the new flag should give you the behavior you're after (a hypothetical sketch of such a toggle follows)
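The actual flag name isn't quoted in this thread, so the following is only a hypothetical sketch of the kind of toggle described, with a made-up parameter name rather than the real x-transformers API:

```python
import torch

def masked_loss(loss, mask, equal_weight_per_sequence = False):
    # equal_weight_per_sequence is a made-up name standing in for the real flag
    if not equal_weight_per_sequence:
        # token-centric (default): every unmasked token weighted equally
        return loss[mask].mean()
    # sequence-centric: each sequence weighted equally regardless of masking
    per_seq = (loss * mask).sum(dim = -1) / mask.sum(dim = -1).clamp(min = 1)
    return per_seq.mean()
```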
Fantastic, thank you!
@shuishida happy training
Sorry, I think there's still some misunderstanding :P The title of the issue, "Varying weights to losses depending on the length of the sequence", is describing the symptom of the issue, as opposed to being a feature request. I see that the flag you've added introduces a new feature of weighting different sequences, which is kind of cool, but this wasn't what I was suggesting. I was pointing out that, depending on how you mask the sequences, there is a side effect where the batch elements inadvertently get reweighted (example illustrated in #291 (comment)). Anyway, I don't think it matters too much, but I just wanted to keep it on record in case someone comes across this issue in the future and gets confused :P
Anyways I don't want to take up any more of your time. Thank you so much for your amazing work!
@shuishida ohh I see, I don't think I agree then, as that would give batches with a high variance of sequence lengths less weight than the ones with low variance (less masking). regardless, thanks for bringing it up
Because the mean reduce is applied after filtering with a mask, it seems that if the input sequences vary in length (and therefore the sizes of the masks differ), then batches where short sequences dominate will be weighted higher than batches with longer sequences. Although it shouldn't be a large effect, I wonder if it would be better to set the masked loss values to zero and then apply the mean reduce, for consistency of loss weighting?
https://github.com/lucidrains/x-transformers/blob/144d9ba84955139347e798ab025457b2d7adc314/x_transformers/continuous.py#L225C1-L225C30