Question about the abc parametrization across the two papers #34
-
Hi Thomas, thanks for your interest. The two descriptions are equivalent. The former paper (TP4) notes the following symmetry of the abc-parametrization: for any $\theta$, replacing $a_l \to a_l + \theta$, $b_l \to b_l - \theta$, $c \to c - 2\theta$ leaves the training trajectory of the network unchanged. Concretely, the initialization scale of the effective weight $W^l = n^{-a_l} w^l$ is $n^{-(a_l + b_l)}$ and the effective SGD learning rate on $W^l$ is $\eta\, n^{-(c + 2 a_l)}$; both are invariant under this transformation. The reason our presentations differ is that the former paper (TP4) is aimed at theorists, who tend to think about gradient descent with no "ad hoc" modification to the gradient (such as per-layer learning rates), so there we use a single learning rate for the entire network, at the expense of having to scale the weight multipliers (the $n^{-a_l}$ factors). More details are given in the latter paper (TP5).
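To make the equivalence concrete, here is a minimal NumPy sketch (not from either paper; the toy one-layer model, squared loss, width, and step count are all illustrative assumptions) that checks numerically that two abc-parametrizations related by the symmetry produce identical effective weights $W = n^{-a} w$ at every SGD step:

```python
import numpy as np

def train(a, b, c, theta=0.0, n=64, steps=5, eta=0.1, seed=0):
    """SGD on a toy abc-parametrized linear layer W = n^{-a'} w, with
    w initialized ~ N(0, n^{-2 b'}) and learning rate eta * n^{-c'},
    where (a', b', c') = (a + theta, b - theta, c - 2*theta).
    Returns the effective weight matrix W before each step."""
    a, b, c = a + theta, b - theta, c - 2 * theta
    rng = np.random.default_rng(seed)
    u = rng.standard_normal((n, n))     # shared Gaussian so the two runs are comparable
    w = n ** (-b) * u                   # trainable parameter
    x = rng.standard_normal(n)          # fixed toy input
    y = rng.standard_normal(n)          # fixed toy target
    lr = eta * n ** (-c)
    history = []
    for _ in range(steps):
        W = n ** (-a) * w               # effective weight actually applied to x
        out = W @ x
        grad_out = 2 * (out - y)                     # d(squared loss)/d(out)
        grad_w = n ** (-a) * np.outer(grad_out, x)   # chain rule through the n^{-a} multiplier
        w = w - lr * grad_w
        history.append(W)
    return history

# The same parametrization written two ways, related by the symmetry with theta = 0.5:
run0 = train(a=0.0, b=0.5, c=0.0, theta=0.0)
run1 = train(a=0.0, b=0.5, c=0.0, theta=0.5)
assert all(np.allclose(W0, W1) for W0, W1 in zip(run0, run1))
print("effective weights agree at every SGD step")
```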
-
Hi. First, thanks for your work.
Reading your two papers (https://arxiv.org/pdf/2011.14522.pdf and https://arxiv.org/pdf/2203.03466.pdf), I've noticed some differences that I don't understand. In the first paper, you define the abc-parametrization (page 2): the weights are parametrized as $W^l = n^{-a_l} w^l$, the trainable parameters are initialized as $w^l_{\alpha\beta} \sim \mathcal{N}(0, n^{-2 b_l})$, and the SGD learning rate is $\eta\, n^{-c}$.
On page 9, you specify that for the µP parametrization $c = 0$, so the learning rate does not scale with $n$.
But in your second paper, on page 5, when specifying the µP parametrization, the learning rates are not the same for the different layers, and some are proportional to $n$.
I don't understand the reason for these differences. Thanks for your time.