
Add t0 scripts #50

Closed
wants to merge 14 commits

Conversation

@Muennighoff (Collaborator) commented Jul 4, 2022

Notes:

RE: Learning Rate
T0 & FLAN use Adafactor, which automatically adjusts the step size:

"Finally, while the learning rate in Adam denotes a target absolute step size, we follow the intuition that relative change in the parameters is more relevant, so we propose scaling the size of the updates relative to the scale of the parameters themselves."

Due to this scaling, Adafactor may be more resistant to higher learning rates, and the step size adjusts automatically, so scheduling may be less needed (i.e. if you have weight decay with Adafactor, the step size will automatically decay because the parameters decay). For now I'm keeping a constant, conservative LR of 1e-5, but we may want to instead go higher and add warmup + scheduling. Thoughts?
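For reference, here is a minimal sketch of the two modes discussed above, using the Hugging Face transformers Adafactor implementation purely for illustration (the slurm scripts in this PR drive Megatron-DeepSpeed, whose optimizer setup may differ):

```python
import torch
from transformers.optimization import Adafactor

model = torch.nn.Linear(16, 16)  # stand-in for the real model

# Mode discussed above: keep a constant, conservative external LR (1e-5);
# updates are still rescaled relative to parameter scale via scale_parameter=True.
constant_lr_opt = Adafactor(
    model.parameters(),
    lr=1e-5,
    relative_step=False,   # do not let Adafactor derive the step size itself
    scale_parameter=True,  # scale updates relative to the parameters' own scale
    warmup_init=False,
)

# Alternative raised above: let Adafactor choose a relative step size,
# which implicitly includes a warmup-like ramp at the start of training.
relative_step_opt = Adafactor(
    model.parameters(),
    lr=None,
    relative_step=True,
    scale_parameter=True,
    warmup_init=True,
)
```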

@thomasw21 (Member) commented

T0 leaves some HPs unspecified, like warmup and weight decay; let's discuss them here?

No warmup, as you have a constant learning rate.
No weight decay (will double-check that one).

Currently, it would use 5% of the training set for validation.

If you can actually use the validation data from T0 then I'd say this is better.

@Muennighoff (Collaborator, Author) commented

If you can actually use the validation data from T0 then I'd say this is better.

For that, either:
a) Add a new arg like args.data_path that calls build_train_valid_test_datasets again with a 100% valid split
b) Concat the train & valid sets, make one indexed dataset & set the ratio such that they are separated again
c) Use args.valid_weighted_split_paths & build_dataset_group, which doesn't work yet for MTF

I think a) or c) is best (a conceptual sketch of option a) follows below) - Wdyt?
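A minimal conceptual sketch of what option a) amounts to, assuming a Megatron-style comma-separated split string; split_boundaries is a hypothetical helper written for illustration, not a function from this repo:

```python
def split_boundaries(split_string: str, num_docs: int) -> dict:
    """Turn a 'train,valid,test' weight string into document index ranges."""
    weights = [float(w) for w in split_string.split(",")]
    total = sum(weights)
    bounds, start = [], 0
    for w in weights:
        end = min(start + int(round(num_docs * w / total)), num_docs)
        bounds.append((start, end))
        start = end
    return dict(zip(["train", "valid", "test"], bounds))

# Current behaviour described above: 5% of the training set is held out for validation.
print(split_boundaries("95,5,0", 1000))  # {'train': (0, 950), 'valid': (950, 1000), 'test': (1000, 1000)}

# Option a): build the T0 validation data separately with a 100% valid split.
print(split_boundaries("0,100,0", 200))  # {'train': (0, 0), 'valid': (0, 200), 'test': (200, 200)}
```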

@thomasw21 (Member) commented

We probably need to use this; it's already implemented as an API: https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/c5b88fb92d4417f77d729c95ce95e3a740b47065/megatron/arguments.py#L822-L840. I'll update the T0 branch to have that feature.

@@ -80,7 +79,6 @@ OPTIMIZER_ARGS=" \
--adam-eps 1e-8 \
--lr 1e-3 \
--lr-decay-style constant \
--lr-warmup-samples $LR_WARMUP_SAMPLES \
Member commented:

We used Adafactor ... so technically I don't know which parameters matter (typically we used a decay argument, and I don't know how that translates to the Adam optimizer).

train/t0/tr11f-6B3-ml-t0.slurm — several outdated review threads, all resolved
@thomasw21 (Member) left a comment

Looked mostly at 6B3 config. Seems alright. Thanks!

train/tr13-t0/t0_test.slurm — outdated review thread, resolved
"

export CMD=" \
`pwd`/finetune_t0_non_causal_decoder.py \
Member commented:

Suggested change:
-    `pwd`/finetune_t0_non_causal_decoder.py \
+    `pwd`/finetune_t0_causal_decoder.py \

Right now all the scripts use is_causal=True, so we should rename this in the Meg-DS PR.

Collaborator (Author) replied:

Added an arg here; let's merge that PR first before we merge this one.

@Muennighoff (Collaborator, Author) commented

Already merged via another PR.

@Muennighoff closed this on Nov 3, 2022