Add t0 scripts #50
Conversation
No warmup, as you have a constant learning rate.
If you can actually use the validation data from T0, then I'd say this is better.
For that, I think either a) or c) is best. WDYT?
We probably need to use this; it's already implemented as an API: https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/c5b88fb92d4417f77d729c95ce95e3a740b47065/megatron/arguments.py#L822-L840. I'll update the T0 branch to have that feature.
train/t0/tr11f-6B3-ml-t0.slurm
@@ -80,7 +79,6 @@ OPTIMIZER_ARGS=" \
    --adam-eps 1e-8 \
    --lr 1e-3 \
    --lr-decay-style constant \
    --lr-warmup-samples $LR_WARMUP_SAMPLES \
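After this change, the LR-related portion of OPTIMIZER_ARGS would presumably read as follows (a sketch based only on the hunk above; the closing quote and any flags outside the hunk are assumed):

OPTIMIZER_ARGS=" \
    --adam-eps 1e-8 \
    --lr 1e-3 \
    --lr-decay-style constant \
    "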
We used Adafactor, so technically I don't know which parameters matter (typically we used a decay argument, and I don't know how that translates to the Adam optimizer).
Co-authored-by: Thomas Wang <[email protected]>
Looked mostly at the 6B3 config. Seems alright. Thanks!
    "

export CMD=" \
    `pwd`/finetune_t0_non_causal_decoder.py \
Suggested change:
-    `pwd`/finetune_t0_non_causal_decoder.py \
+    `pwd`/finetune_t0_causal_decoder.py \
Right now all the scripts use is_causal=True, so we should rename this in the Meg-DS PR.
Added an arg here; let's merge that PR first before we merge this one.
Co-authored-by: Thomas Wang <[email protected]>
Already merged via the other PR.
Notes:
RE: Learning Rate
T0 & FLAN use Adafactor, which automatically adjusts the step size:

"Finally, while the learning rate in Adam denotes a target absolute step size, we follow the intuition that relative change in the parameters is more relevant, so we propose scaling the size of the updates relative to the scale of the parameters themselves."

Due to this scaling, Adafactor may be more resistant to higher learning rates, and the step size adjusts automatically, so scheduling may be less needed (i.e., if you have weight decay with Adafactor, the step size will automatically decay because the parameters decay). For now I'm keeping a constant, conservative LR of 1e-5, but we may want to instead go higher and add warmup + scheduling. Thoughts?
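To make the two options concrete, here is a rough sketch in the script's OPTIMIZER_ARGS style. The values, the cosine decay style, and --lr-decay-samples are illustrative assumptions rather than settings from this PR; only --lr, --lr-decay-style, and --lr-warmup-samples appear in the script above.

# Option A (roughly what is proposed above): constant, conservative LR, no warmup.
OPTIMIZER_ARGS=" \
    --adam-eps 1e-8 \
    --lr 1e-5 \
    --lr-decay-style constant \
    "

# Option B (hypothetical alternative): higher peak LR with warmup + decay.
# LR_WARMUP_SAMPLES and the --lr-decay-samples value are placeholders.
LR_WARMUP_SAMPLES=100000
OPTIMIZER_ARGS=" \
    --adam-eps 1e-8 \
    --lr 1e-4 \
    --lr-decay-style cosine \
    --lr-warmup-samples $LR_WARMUP_SAMPLES \
    --lr-decay-samples 1000000 \
    "

Since the fine-tuning script passes Adam-style arguments rather than Adafactor, the parameter-relative step-size behaviour quoted above does not apply here, which is one argument for option B if the LR is raised.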