Add t0 scripts #50
Conversation
No warmup, as you have a constant learning rate.
If you can actually use the validation data from T0, then I'd say this is better.
For that, I think either a) or c) is best. WDYT?
We probably need to use this; it's already implemented as an API: https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/c5b88fb92d4417f77d729c95ce95e3a740b47065/megatron/arguments.py#L822-L840. I'll update the T0 branch to have that feature.
train/t0/tr11f-6B3-ml-t0.slurm
@@ -80,7 +79,6 @@ OPTIMIZER_ARGS=" \
    --adam-eps 1e-8 \
    --lr 1e-3 \
    --lr-decay-style constant \
    --lr-warmup-samples $LR_WARMUP_SAMPLES \
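After this change, the LR-related portion of OPTIMIZER_ARGS would presumably read as follows (a sketch based only on the hunk above; the closing quote and any flags outside the hunk are assumed):

OPTIMIZER_ARGS=" \
    --adam-eps 1e-8 \
    --lr 1e-3 \
    --lr-decay-style constant \
    "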
We used Adafactor, so technically I don't know which parameters matter (typically we used a decay argument, and I don't know how that translates to the Adam optimizer).
Co-authored-by: Thomas Wang <[email protected]>
Looked mostly at the 6B3 config. Seems alright. Thanks!
    "

export CMD=" \
    `pwd`/finetune_t0_non_causal_decoder.py \
Suggested change:
-    `pwd`/finetune_t0_non_causal_decoder.py \
+    `pwd`/finetune_t0_causal_decoder.py \
Right now all the scripts use is_causal=True, so we should rename this in the Meg-DS PR.
Added an arg here; let's merge that PR first before we merge this one.
Co-authored-by: Thomas Wang <[email protected]>
Already merged via the other PR.
Notes:
RE: Learning Rate
T0 & FLAN use Adafactor, which automatically adjusts the step size:

"Finally, while the learning rate in Adam denotes a target absolute step size, we follow the intuition that relative change in the parameters is more relevant, so we propose scaling the size of the updates relative to the scale of the parameters themselves."

Due to this scaling, Adafactor may be more resistant to higher learning rates, and the step size adjusts automatically, so scheduling may be less needed (i.e., if you have weight decay with Adafactor, the step size will automatically decay because the parameters decay). For now I'm keeping a constant, conservative LR of 1e-5, but we may want to instead go higher and add warmup + scheduling. Thoughts?
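To make the two options concrete, here is a rough sketch in the script's OPTIMIZER_ARGS style. The values, the cosine decay style, and --lr-decay-samples are illustrative assumptions rather than settings from this PR; only --lr, --lr-decay-style, and --lr-warmup-samples appear in the script above.

# Option A (roughly what is proposed above): constant, conservative LR, no warmup.
OPTIMIZER_ARGS=" \
    --adam-eps 1e-8 \
    --lr 1e-5 \
    --lr-decay-style constant \
    "

# Option B (hypothetical alternative): higher peak LR with warmup + decay.
# LR_WARMUP_SAMPLES and the --lr-decay-samples value are placeholders.
LR_WARMUP_SAMPLES=100000
OPTIMIZER_ARGS=" \
    --adam-eps 1e-8 \
    --lr 1e-4 \
    --lr-decay-style cosine \
    --lr-warmup-samples $LR_WARMUP_SAMPLES \
    --lr-decay-samples 1000000 \
    "

Since the fine-tuning script passes Adam-style arguments rather than Adafactor, the parameter-relative step-size behaviour quoted above does not apply here, which is one argument for option B if the LR is raised.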