Dear authors,

In the paper, it is stated that the final attention-supervision loss is the average of the cross-entropy losses of the attention weights over the attention heads. However, in HateXplain/Models/bertModels.py (line 57 in 01d7422), it does not appear to be an average: the per-head losses are summed and there is no division by the number of heads.

I am concerned about this detail because of the $\lambda$ hyperparameter. If one implements the loss as an average (as the paper says), $\lambda$ is effectively divided by the number of heads (e.g., 12), which may affect the reproducibility of the hyperparameters reported in the paper.

Did I get it right? I would appreciate any clarification on this matter.
Thank you very much! 😊
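To make the difference concrete, here is a minimal sketch of the two variants (summed vs. averaged per-head attention cross-entropy), written in PyTorch. The function name, tensor shapes, and arguments are illustrative assumptions, not the repository's actual code.

```python
import torch

def attention_supervision_loss(attn, target, lam, average_heads=True):
    # attn:   (num_heads, seq_len) attention distribution of each head (rows sum to 1)
    # target: (seq_len,) ground-truth rationale distribution (sums to 1)
    # lam:    the lambda hyperparameter weighting the attention-supervision term
    eps = 1e-12
    # Cross-entropy of each head's attention against the target rationale
    per_head = -(target.unsqueeze(0) * torch.log(attn + eps)).sum(dim=-1)  # (num_heads,)
    # Paper's description: average over heads; code (as I read it): sum over heads
    loss = per_head.mean() if average_heads else per_head.sum()
    return lam * loss
```

With 12 heads, the summed variant is 12 times larger than the averaged one, so a $\lambda$ tuned against the summed loss would need to be multiplied by 12 to give the same effective weight under the averaged formulation (and vice versa), which is exactly the reproducibility concern above.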