MTF train script #295

Merged
thomasw21 merged 297 commits into main from thomas/mtf_train_script
Jul 5, 2022

Conversation

thomasw21
Member

No description provided.

@thomasw21 thomasw21 requested a review from TevenLeScao July 2, 2022 14:06
Collaborator

@Muennighoff Muennighoff left a comment

great job! initial thoughts

Resolved review threads:
finetune_t0_non_causal_decoder.py
megatron/data/data_samplers.py (outdated)
megatron/data/decoder_packed_mtf_dataset.py (outdated)
megatron/data/non_causal_mlm_dataset.py (outdated)
megatron/data/non_causal_mlm_dataset.py (outdated)
tests/test_model.py (outdated)
tests/test_model.py
tests/test_training.py (outdated)
tests/test_training.py (outdated)
tests/test_training.py

# Option 1 of data loading using --data-path
# For T0, data has to be provided in the form --data-path input-data target-data input-data2 target-data2 ...
if args.data_path:
    # TODO: Not yet compatible with dataset weights (Will break at prefixes, weights = analyze_data_prefix(args.data_path))
Collaborator

I think the question is:

  • Do we want separate indexed datasets for each T0 dataset?
    • Easy to remove unwanted indexed datasets (e.g. if the bias filtering reveals something)

  • Do we want to combine them all into one giant indexed dataset?
    • Probably faster to get started, as there is no need to worry about weights, about small datasets with too few items to make a batch, or about BlendableDataset compatibility

Member Author

  1. data_path accepts multiple arguments (I haven't tested it, but it should work; it's just the weighting mechanism that we're missing and should have IMO)
  2. This works as well. I don't really know how much we're going to be able to experiment with different filtered versions of T0.
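
For readers following this thread, here is a minimal sketch of the two pieces being discussed: grouping the flat `--data-path` list into (input, target) pairs, and deriving per-dataset sampling weights. The helper names `pair_data_paths` and `proportional_weights` are hypothetical and are not part of this PR, and size-proportional weighting is only one possible scheme, assumed here for illustration.

```python
from typing import List, Tuple


def pair_data_paths(data_path: List[str]) -> List[Tuple[str, str]]:
    """Group a flat [input, target, input2, target2, ...] list into pairs.

    Hypothetical helper: mirrors the `--data-path input-data target-data ...`
    convention described in the code comment, nothing more.
    """
    if len(data_path) % 2 != 0:
        raise ValueError(
            f"Expected input/target pairs, got an odd number of paths: {len(data_path)}"
        )
    return [(data_path[i], data_path[i + 1]) for i in range(0, len(data_path), 2)]


def proportional_weights(sizes: List[int]) -> List[float]:
    """Assumed weighting scheme: sample each dataset in proportion to its size."""
    total = sum(sizes)
    if total <= 0:
        raise ValueError("At least one dataset must be non-empty")
    return [size / total for size in sizes]


pairs = pair_data_paths(["a_inputs", "a_targets", "b_inputs", "b_targets"])
weights = proportional_weights([1, 3])
```

With separate indexed datasets, dropping one after filtering only means removing its pair and recomputing the weights; with a single combined dataset neither step is needed, which is the trade-off raised above.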

Resolved review thread: tests/test_model.py (outdated)
thomasw21 and others added 2 commits July 4, 2022 09:35
Co-authored-by: Niklas Muennighoff <[email protected]>
 - Remove unnecessary code from MTFDataset
 - Create size API for MTF dataset
 - Use new size API to build packed index much faster
Comment on lines +487 to +489
# TODO @thomasw21 handle the case where a single sample cannot fit inside a row. We can
# - silently skip that value [currently implemented]
# - truncate to `seq_length`, and keep the right part
Member Author

I'm wondering if we should actually warn the user about it, e.g. by emitting a warning. I don't know.

Collaborator

I think it should almost never happen for T0 & seq len 2048, right?
So having a warning would be good imo

Member Author

Yeah ... let me add that.
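
A rough sketch of the behaviour agreed on in this thread: pack samples into rows of at most `seq_length` tokens, skip any sample that cannot fit in a row, and warn about it instead of skipping silently. The function `pack_samples` and its greedy strategy are assumptions made for illustration; the real packing code lives in megatron/data/decoder_packed_mtf_dataset.py and may differ.

```python
import warnings
from typing import List


def pack_samples(sizes: List[int], seq_length: int) -> List[List[int]]:
    """Greedily pack sample indices into rows holding at most seq_length tokens.

    Hypothetical sketch: `sizes[i]` is the token count of sample i.
    Samples longer than seq_length are skipped, and a warning reports how many.
    """
    rows: List[List[int]] = []
    current: List[int] = []
    used = 0
    skipped = 0
    for idx, size in enumerate(sizes):
        if size > seq_length:
            # The sample cannot fit in any row: skip it, but keep count.
            skipped += 1
            continue
        if used + size > seq_length:
            # The current row is full: close it and start a new one.
            rows.append(current)
            current, used = [], 0
        current.append(idx)
        used += size
    if current:
        rows.append(current)
    if skipped:
        warnings.warn(
            f"Skipped {skipped} sample(s) longer than seq_length={seq_length}"
        )
    return rows
```

The other option listed in the TODO, truncating to `seq_length` and keeping the right part, would replace the `continue` branch with a clipped size rather than a skip.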

@thomasw21
Member Author

Too late for those who wanted to review.

@thomasw21 thomasw21 merged commit 3d5d151 into main Jul 5, 2022
@thomasw21 thomasw21 deleted the thomas/mtf_train_script branch July 5, 2022 14:03
younesbelkada pushed a commit to younesbelkada/Megatron-DeepSpeed that referenced this pull request Sep 28, 2022
Co-authored-by: Lintang Sutawika <[email protected]>
Co-authored-by: Muennighoff <[email protected]>
adammoody pushed a commit to adammoody/Megatron-DeepSpeed that referenced this pull request Dec 18, 2023
3 participants