MTF train script #295
Great job! Initial thoughts:
# Option 1 of data loading using --data-path
# For T0, data has to be provided in the form --data-path input-data target-data input-data2 target-data2 ...
if args.data_path:
    # TODO: Not yet compatible with dataset weights (Will break at prefixes, weights = analyze_data_prefix(args.data_path))
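The alternating input/target layout described in the comment above could be parsed as follows. This is a hedged sketch, not the PR's actual code; `split_input_target_pairs` is a hypothetical helper name.

```python
def split_input_target_pairs(data_path):
    """Split the --data-path list into (input, target) prefix pairs.

    Illustrates the alternating layout described above:
    input-data target-data input-data2 target-data2 ...
    """
    if len(data_path) % 2 != 0:
        raise ValueError(
            "--data-path must contain an even number of prefixes "
            "(alternating input and target)"
        )
    # Even positions are input prefixes, odd positions are target prefixes
    return list(zip(data_path[0::2], data_path[1::2]))
```

For example, `split_input_target_pairs(["in1", "tgt1", "in2", "tgt2"])` yields `[("in1", "tgt1"), ("in2", "tgt2")]`.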
I think the question is:
- Do we want separate indexed datasets for each T0 dataset?
  - Easy to remove unwanted indexed datasets (e.g. if the bias filtering reveals something)
- Or do we want to combine them all into one giant indexed dataset?
  - Probably faster to get started, since there is no need to worry about weights; small datasets with too few items to make a batch; BlendableDataset compatibility
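If the separate-datasets route is taken, the weighting mechanism mentioned above boils down to sampling a dataset index proportionally to its weight. A minimal sketch of that idea, assuming normalized or unnormalized non-negative weights (this is a simplified illustration, not the actual BlendableDataset code):

```python
import random

def sample_dataset_index(weights, rng=random):
    """Pick a dataset index with probability proportional to its weight.

    Simplified illustration of weighted blending across multiple
    indexed datasets; a real implementation would precompute the
    sample-to-dataset mapping rather than draw per sample.
    """
    total = sum(weights)
    r = rng.random() * total
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if r < acc:
            return i
    # Guard against floating-point accumulation falling just short
    return len(weights) - 1
```

With weights `[0.7, 0.3]`, roughly 70% of draws land on dataset 0.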
- data_path accepts multiple arguments (I haven't tested it, but it should work; it's just the weighting mechanism that we should have, IMO)
- That works as well. I don't really know how much we're going to be able to experiment with different filtered versions of T0.
Co-authored-by: Niklas Muennighoff <[email protected]>
- Remove unnecessary code from MTFDataset
- Create size API for MTF dataset
- Use new size API to build packed index much faster
# TODO @thomasw21 handle the case where a single sample cannot fit inside a row. We can
# - silently skip that value [currently implemented]
# - truncate to `seq_length`, and keep the right part
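The skip-with-a-warning behavior discussed below could look something like this. A hedged sketch only; `pack_samples` and its signature are hypothetical, and the real packing code operates on indexed-dataset documents rather than a plain length list.

```python
import logging

logger = logging.getLogger(__name__)

def pack_samples(sample_lengths, seq_length):
    """Greedily pack sample indices into rows of at most seq_length tokens.

    Samples longer than seq_length are skipped with a warning, matching
    the first option in the TODO above (the alternative would be to
    truncate to seq_length and keep the right part).
    """
    rows, current, used = [], [], 0
    for idx, length in enumerate(sample_lengths):
        if length > seq_length:
            logger.warning(
                "Sample %d (length %d) exceeds seq_length %d; skipping",
                idx, length, seq_length,
            )
            continue
        if used + length > seq_length:
            # Current row is full: start a new one
            rows.append(current)
            current, used = [], 0
        current.append(idx)
        used += length
    if current:
        rows.append(current)
    return rows
```

For instance, `pack_samples([3, 3, 3], 6)` packs the first two samples into one row and starts a second row for the third.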
I'm wondering if we actually should warn someone about it? Like add a warning or something. I don't know.
I think it should almost never happen for T0 & seq len 2048, right?
So having a warning would be good imo
Yeah ... let me add that.
Too late for those that wanted to review.
Co-authored-by: Lintang Sutawika <[email protected]>
Co-authored-by: Muennighoff <[email protected]>