
Adding language specific validation sets for Multilingual model training #97

Merged (19 commits), Nov 3, 2021

Conversation

hadyelsahar
Contributor

@hadyelsahar hadyelsahar commented Sep 14, 2021

Summary

The idea of this issue is to modify Megatron-DeepSpeed to track the progress of the validation loss on several validation (periodic evaluation) sets separately.

Currently, the validation loss is calculated on a single validation set that includes the same language combination as the training data (see the 13B-param model training on Tensorboard).

[screenshot: Tensorboard validation-loss curve for the 13B model]

After integration of this PR, users can add extra validation sets in the following form:

--periodic-eval-data-path \
VALID1-FR-KR 0.1 $DATA_FR 0.2 $DATA_KR, \
VALID2-JP-AR 0.2 $DATA_JP 0.3 $DATA_AR

Validation steps will be run automatically on each dataset independently, and the results will be displayed on Tensorboard as follows:

[screenshot: Tensorboard showing a separate validation-loss curve per dataset]

What was changed

In order not to change the current way one calls the training.py script, I opted to add an extra argument, --periodic-eval-data-path.

Users can define extra datasets (each in a way quite similar to --data-path) to be evaluated alongside training, by providing their data paths (or multiple paths with weights).
Note here that the --split argument does not apply to the --periodic-eval-data-path argument.
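For illustration only, here is a minimal sketch of how the comma-separated group syntax could be parsed into named groups of (weight, path) pairs; the helper name parse_periodic_eval_arg is hypothetical, not the PR's actual code:

    def parse_periodic_eval_arg(arg):
        """Parse "NAME w1 path1 w2 path2, NAME2 w path" into a dict."""
        groups = {}
        for group in arg.split(","):
            tokens = group.split()
            if not tokens:
                continue
            name, rest = tokens[0], tokens[1:]
            assert len(rest) % 2 == 0, f"expected weight/path pairs in group {name}"
            groups[name] = [(float(w), p) for w, p in zip(rest[0::2], rest[1::2])]
        return groups

    # parse_periodic_eval_arg("VALID1-FR-KR 0.1 /data/fr 0.2 /data/kr, VALID2-JP-AR 0.2 /data/jp 0.3 /data/ar")
    # -> {"VALID1-FR-KR": [(0.1, "/data/fr"), (0.2, "/data/kr")],
    #     "VALID2-JP-AR": [(0.2, "/data/jp"), (0.3, "/data/ar")]}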

Typical Examples for Multilingual Training

When a model is being trained on a preprocessed multilingual dataset, a user can preprocess three monolingual datasets (JP, KR, AR) and track their validation progress by passing the following arguments:

--data-path $DATA/multilingual
--periodic-eval-data-path \
VALID-JP 1.0 $DATA_JP, \
VALID-KR 1.0 $DATA_KR, \
VALID-AR 1.0 $DATA_AR \

Sometimes in multilingual training, some languages are downsampled and others are upsampled. If a user wonders how the model performs with respect to different proportions of languages, different combinations of the languages can be passed as external validation datasets.

--data-path 0.1 $DATA/EN 0.5 $DATA/JP 0.7 $DATA/KR 1.0 $DATA/AR \
--periodic-eval-data-path \
DATASET-BALANCED 1.0 $DATA_EN 1.0 $DATA_JP 1.0 $DATA_KR 1.0 $DATA_AR, \
DATASET-NO-EN 1.0 $DATA_JP 1.0 $DATA_KR 1.0 $DATA_AR
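As with --data-path, the relative weights within each group are normalized into sampling proportions. A minimal illustration of that normalization (not the repo's actual code):

    def normalize_weights(weights):
        """Turn relative weights into proportions that sum to 1."""
        total = sum(weights)
        return [w / total for w in weights]

    # The training mix above, 0.1 EN / 0.5 JP / 0.7 KR / 1.0 AR:
    # normalize_weights([0.1, 0.5, 0.7, 1.0]) -> [0.043, 0.217, 0.304, 0.435] (approx.)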

Connections with PR #143

PR #143 was developed to support the use case "we can't have multilingual training data and English-only validation data at the moment". That use case is fully supported in this PR by adding an English-only dataset as one of the datasets to be evaluated periodically. Moreover, one can extend this by adding several datasets to be evaluated periodically, not just a single English-only one.

Testing

  • Default training works 🆗
  • Integration with Tensorboard 🆗
  • Testing with real training data, multiple combinations:
    • 1 dataset, 1 combination 🆗
    • 1 dataset, 2 combinations 🆗
    • 2 datasets, 1 combination 🆗
    • 2 datasets, 2 combinations 🆗
    • 5 datasets, 5 combinations 🆗

Independent testing by @lintangsutawika (in progress)

Future Modifications (suggestions needed)

  • Adding optional periodic-eval-interval and periodic-eval-iters arguments for periodic-eval-data-path;
    if not provided, fall back to the regular --eval-interval / --eval-iters params (see the sketch below).
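A minimal sketch of that suggested fallback, using the proposed (not merged) argument names:

    def periodic_eval_schedule(args):
        # Use the periodic-eval overrides when given; otherwise fall back
        # to the regular eval settings.
        interval = getattr(args, "periodic_eval_interval", None) or args.eval_interval
        iters = getattr(args, "periodic_eval_iters", None) or args.eval_iters
        return interval, iters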

@sbmaruf
Collaborator

sbmaruf commented Oct 1, 2021

@hadyelsahar Teven asked me to write this one earlier. I'm not sure whether it solves or totally ignores part of the problem discussed in this issue. Please take a look:
#113

@hadyelsahar hadyelsahar marked this pull request as ready for review October 25, 2021 16:41
@hadyelsahar
Contributor Author

@sbmaruf your PR is relevant; however, it doesn't support multiple validation datasets, which we need in order to track progress on multiple languages independently during training, not just English. This PR is quite general (albeit a bit of a dirty hack as well).

For example, here we have 3 validation datasets: the standard one, Valid1, and Valid2.
[screenshot: Tensorboard validation curves for the three datasets]

I would be grateful if you could double-check the code.

@hadyelsahar
Contributor Author

@TevenLeScao @stas00 I would be grateful for some feedback on this PR if you have time.

@stas00
Contributor

stas00 commented Oct 25, 2021

I think you guys need to sync with the work from #143, as the two overlap.

But I will let @TevenLeScao comment on the specifics as he is the owner of that other PR.

@ibeltagy
Member

Tagging @TevenLeScao, who was going to start the multilingual training. It would be great to have this PR merged in time for the multilingual training.

@TevenLeScao
Collaborator

It hasn't started yet; I can integrate it.

@hadyelsahar
Contributor Author

hadyelsahar commented Oct 25, 2021

The benefit of this PR over #143 is that it allows multiple validation datasets to be passed, which does not seem to be supported in the other PR.
To recap, the other PR supports two use cases for data loading:

1) --data-path and --split pair
2) --(train|valid|test)-data-path (3 args), no --split; test is optional.

For syncing both PRs, I suggest either of these alternatives:

  • Keep extra-valid-data-path as an additional argument
  • Allow option 2 to support multiple validation sets.

@TevenLeScao let me know if this is doable. I have some capacity this week to push this forward; if you would like help, feel free to ping me on Slack.

@stas00
Contributor

stas00 commented Oct 25, 2021

The benefit of this PR over #143 is that it allows multiple validation datasets to be passed, which does not seem to be supported in the other PR. To recap, the other PR supports two use cases for data loading:

1) --data-path and --split pair
2) --(train|valid|test)-data-path (3 args), no --split; test is optional.

It doesn't at the moment; this was just my proposal, seconded by @sbmaruf.

Keep extra-valid-data-path as an additional argument

This again leads to a very confusing API, because the split behavior is inconsistent.

Allow option 2 to support multiple validation sets.

Yes, option 2 would support multiple datasets; that's exactly how --data-path is coded.

@hadyelsahar
Contributor Author

hadyelsahar commented Oct 25, 2021

Yes, option 2 would support multiple datasets; that's exactly how --data-path is coded.

To clarify, there's a bit of nuance here: --data-path allows multiple datasets to be combined into a single dataset.

    --data-path 0.1 ${DATASET_0} 0.25 ${DATASET_1} 0.2 ${DATASET_2}

What we want here is to allow more than one such combination, i.e. multiple valid sets:

    --valid-data-paths  \
       1.0 ${DATASET_1},                                   ## valid set 1
       0.3 ${DATASET_0} 0.3 ${DATASET_1} 0.3 ${DATASET_2}  ## valid set 2

The latter is not supported by the normal --data-path behavior; it needs the x-data-loaders / x-data-iterators to be turned into arrays of loaders / iterators.

Ideally, we would also like to give each of those combinations a name to be associated with its validation loss.
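Roughly, the needed change keeps a dict of named validation iterators instead of a single one and runs the usual evaluation on each. A simplified sketch with hypothetical names, not the PR's actual code:

    def evaluate_named_valid_sets(model, named_iterators, evaluate_fn):
        """named_iterators: e.g. {"valid set 1": it1, "valid set 2": it2}."""
        losses = {}
        for name, iterator in named_iterators.items():
            # Each loss gets logged to Tensorboard under its own tag,
            # keyed by the group name.
            losses[name] = evaluate_fn(model, iterator)
        return losses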

@stas00
Contributor

stas00 commented Oct 25, 2021

Thank you for clarifying that you were talking about an extended need, Hady.

Then what you said.

The only thing I'm advocating is that if we switch to a different format, then let's leave the original two args alone (--data-path and --split) and come up with a new set of args that are named differently and ideally hint at the format they accept. And let's make those two sets mutually exclusive to avoid confusing the user.

Especially if you may want to use the same new format for train as well? Or is it only ever going to be used for validation?

@hadyelsahar
Contributor Author

hadyelsahar commented Oct 25, 2021

Great, thanks. Indeed, +1 for keeping the standard --data-path behavior as is; this reduces confusion.

come up with a new set of args that are named differently

Since those datasets will only be used to monitor progress, I suggest a name like --monitor-progress-data-paths, --shadow-eval-data-paths, or --periodic-eval-data-path.

you may want to use the same new format for train as well? Or is it only ever going to be used for validation?
Yes: never for training, only for validation.

To conclude, if I get it right, we should:

  • keep the normal --data-path / --split behavior
  • add a new set of arguments with a different name, e.g. --monitor-progress-data-paths, for all datasets we wish to track during training.

This should also support the use case of the other PR ("we can't have multilingual training data and English-only validation data at the moment"), which can be solved by adding an English-only dataset to this new argument.

@hadyelsahar hadyelsahar changed the title from "WIP: Adding language specific validation steps" to "WIP: Adding language specific validation steps (periodic evaluation)" Oct 26, 2021
@hadyelsahar
Contributor Author

UPDATE

  • Based on Stas's suggestions, --periodic-eval is completely orthogonal to the (train|test|valid) data arguments; this reduces confusion for users who are used to calling Megatron the traditional way.

  • The main PR text is up to date and self-contained, so there is no need to follow the conversation.

  • Imo this PR covers the use case of "Add valid data (+TVN fixes)" #143, so I would opt for merging this one.

@TevenLeScao
Collaborator

Taking a look at this one; I also think it supersedes #143.

@sbmaruf
Collaborator

sbmaruf commented Oct 26, 2021

@hadyelsahar Are we also plotting how many samples are used for training from each of the data iterators (i.e. each language)? I think that is also necessary to interpret the loss values.

@TevenLeScao
Collaborator

Good point @sbmaruf, let's also add that.

@TevenLeScao
Collaborator

TevenLeScao commented Oct 26, 2021

API-wise, we're going to have the same issue as #143: if we do not allow the user to re-use the validation splits from --data-path in --periodic-eval, we must re-process every dataset on JZ. Since this PR adds new evaluations rather than replacing the original one, would you guys be OK with re-using the splits in this case?

For example:

--split 70 30 0
--data-path 0.5 DATASET_A 0.5 DATASET_B
--periodic-eval-path DATASET_A, \
                     DATASET_B \

means:

  • validation is done on the combination of the 30% valid splits of DATASET_A and DATASET_B
  • extra eval is done on
    1. the 30% valid split of DATASET_A, alone
    2. the 30% valid split of DATASET_B, alone

@sbmaruf @stas00

@stas00
Contributor

stas00 commented Oct 27, 2021

But one problem may occur: there is no way you can make sure that, from a single (*.bin, *.idx) file, the train/valid split has no overlap. Earlier it was possible. You may have to process the data again.

Are you referring to a user incorrectly inputting the splits so that they overlap? That can easily be enforced in the code.

Or are you referring to some internal workings, in which case could you please be a bit more specific?

If previously you had a split x,y,z, you now have exactly the same split, just defined in a different way, so it's trivial to convert between the two.
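A minimal illustration of that equivalence (not code from either PR):

    def split_to_ranges(x, y, z):
        """Map a percentage split x,y,z to cumulative start:stop fractions."""
        total = x + y + z
        a, b = x / total, (x + y) / total
        return {"train": (0.0, a), "valid": (a, b), "test": (b, 1.0)}

    # split_to_ranges(70, 30, 0)
    # -> {"train": (0.0, 0.7), "valid": (0.7, 1.0), "test": (1.0, 1.0)}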

@TevenLeScao
Collaborator

No need for RNG tracking, @sbmaruf; we just need to map the new split argument to the old one.

@sbmaruf
Collaborator

sbmaruf commented Oct 27, 2021

@stas00 @TevenLeScao
Sorry, I missed the new split idea. I was considering the previous --split implementation (train:dev:test %).
Having read your comment, the new split definition as start:length (or start:stop) makes sense now.
Now it should be trivial to check for overlap.
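For instance, with start:stop ranges the check reduces to an interval-intersection test (illustrative sketch only):

    def ranges_overlap(a, b):
        """True if two start:stop ranges over the same file intersect."""
        (a_start, a_stop), (b_start, b_stop) = a, b
        return max(a_start, b_start) < min(a_stop, b_stop)

    # ranges_overlap((0.0, 0.6), (0.6, 0.8)) -> False  (train 0:0.6 vs valid 0.6:0.8)
    # ranges_overlap((0.0, 0.7), (0.6, 0.8)) -> True   (overlapping splits)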

@stas00
Contributor

stas00 commented Oct 27, 2021

Awesome, thank you for validating that, @sbmaruf.

I guess the only remaining spec-wise decision is whether to stick with percentages or move to fractions for the split field. I think we should stick with percentages, since that is how the existing --split works.

@hadyelsahar
Contributor Author

hadyelsahar commented Oct 28, 2021

Alright, after a conversation with Teven we agreed to implement the following:

  • A new set of arguments, similar to Stas's proposal above, but 1) allowing multiple validation & eval dataset groups and 2) adding a dataset name in front of each group:
--train-weighted-split-path "DATA_ABC 0.6 0:0.6 A 0.3 0:0.8 B  0.1 0:0.8 C"  # train datagroup (only 1 is allowed)
--valid-weighted-split-paths DATA_ABC 0.6 0.6:0.8 A 0.3 0:1 B  0.1 0:1 C, \   #valid datagroup 1
                                 DATA_DE 0.6 0:1 D 0.3 0:1 E,       #valid datagroup 2   
--test-weighted-split-paths DATA_AFG 0.6 0.8:1 A 0.3 0:1 F  0.1 0:1 G, \      # eval datagroup 1
                                 DATA_AHI 0.6 0.8:1 A 0.3 0:1 H 0.3 0:1 I,          # eval datagroup 2 
  • Make sure that the two sets of arguments are orthogonal (i.e. not initialized at the same time)

I'll implement this today & tomorrow and ping Teven, who will run the multilingual training afterwards.
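For reference, a minimal sketch of parsing one such datagroup of the form "NAME weight start:stop path ..."; the helper is hypothetical, not the actual implementation:

    def parse_weighted_split_group(group):
        """Parse e.g. "DATA_ABC 0.6 0.6:0.8 A 0.3 0:1 B 0.1 0:1 C"."""
        tokens = group.split()
        name, rest = tokens[0], tokens[1:]
        assert len(rest) % 3 == 0, f"expected weight/range/path triplets in {name}"
        entries = []
        for weight, rng, path in zip(rest[0::3], rest[1::3], rest[2::3]):
            start, stop = (float(v) for v in rng.split(":"))
            entries.append((float(weight), (start, stop), path))
        return name, entries

    # parse_weighted_split_group("DATA_ABC 0.6 0.6:0.8 A 0.3 0:1 B 0.1 0:1 C")
    # -> ("DATA_ABC", [(0.6, (0.6, 0.8), "A"), (0.3, (0.0, 1.0), "B"),
    #                  (0.1, (0.0, 1.0), "C")])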

@stas00
Contributor

stas00 commented Oct 28, 2021

May I suggest some sort of separator between groups? This is very difficult to parse: DATA_ABC 0.6 0.6:0.8 A 0.3 0:1 B 0.1 0:1 C, especially once long paths are added.

maybe:

--valid-weighted-split-paths "DATA_ABC: 0.6 0.6:0.8 A, 0.3 0:1 B, 0.1 0:1 C" \   #valid datagroup 1
                             "DATA_DE:  0.6   0:1   D, 0.3 0:1 E"                #valid datagroup 2   

so quoted definitions for each subgroup and a specific structure of "GR: defA, defB, defC"

I'm fine with any other proposal, as long as it makes things a little bit easier for a human to parse; otherwise we are just asking for input errors.


Also, any reason why train doesn't follow the same syntax? Let's use the same syntax for all of them, even if the dataset group name is not used for train; it can just be named TRAIN. It'll be easier to parse too.

@hadyelsahar
Contributor Author

Any suggestions on how this part of the code should behave under multiple validation / test dataset groups?

        train_val_test_num_samples = [train_samples,
                                      eval_iters * args.global_batch_size,
                                      test_iters * args.global_batch_size]
        print_rank_0(' > datasets target sizes (minimum size):')
        print_rank_0('    train:      {}'.format(train_val_test_num_samples[0]))
        print_rank_0('    validation: {}'.format(train_val_test_num_samples[1]))
        print_rank_0('    test:       {}'.format(train_val_test_num_samples[2]))

It is responsible for printing these lines:

> building train, validation, and test datasets ...
 > datasets target sizes (minimum size):
    train:      80000
    validation: 800080
    test:       80

@sbmaruf
Collaborator

sbmaruf commented Oct 29, 2021

@hadyelsahar

--train-weighted-split-path "0.6 0:0.6 A"                              # train datagroup (only 1 is allowed)
--valid-weighted-split-paths DATA_ABC 0.6 0.6:0.8 A 0.3 0:1 B  0.1 0:1 C, \   #valid datagroup 1
                                 DATA_DE 0.6 0:1 D 0.3 0:1 E,       #valid datagroup 2   
--test-weighted-split-paths DATA_AFG 0.6 0.8:1 A 0.3 0:1 F  0.1 0:1 G, \      # eval datagroup 1
                                 DATA_AHI 0.6 0.8:1 A 0.3 0:1 H 0.3 0:1 I,          # eval datagroup 2 

Why does --train-weighted-split-path support only one iterator?
If I understand correctly, we might need different iterators for different languages.

@hadyelsahar
Contributor Author

@sbmaruf no, that is a typo in the comment, sorry: it is not a single data iterator but a single data group (a combination of different iterators). Fixed now.

@TevenLeScao
Collaborator

@hadyelsahar maybe print a list of the lengths of the different dataset mixtures?

@stas00
Contributor

stas00 commented Oct 29, 2021

Any suggestions on how this part of the code should behave under multiple validation / test dataset groups?

Dump the info for each group and sub-group, perhaps in some YAML-like way, so it's easy to quickly see the structure.

Dump all that can be dumped, and then we can improve the compactness / readability once we see an example of such a dump.
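Something along these lines, for example; names and layout are illustrative only (the real code would go through print_rank_0):

    def print_dataset_target_sizes(train_samples, valid_groups, test_groups):
        """valid_groups / test_groups: {group_name: target_num_samples}."""
        print(" > datasets target sizes (minimum size):")
        print("   train: {}".format(train_samples))
        for kind, groups in (("validation", valid_groups), ("test", test_groups)):
            print("   {}:".format(kind))
            for name, num_samples in groups.items():
                print("     {}: {}".format(name, num_samples))

    # print_dataset_target_sizes(80000, {"DATA_ABC": 800, "DATA_DE": 800}, {"DATA_AFG": 80})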

@hadyelsahar
Contributor Author

Updates

  • I have added a new set of arguments, --(train|valid|test)-weighted-split-paths (called option 2 for data loading), that are completely orthogonal to option 1 for data loading using --data-path and --split
  • fixed the logging of the amount of training data per language for this new data loading option 2
  • removed the extra-valid-data-path argument that we added before
  • crude testing to make sure that the previous data loading method still works and the new data loading does the right thing

Todo:

  • deep testing, making sure that the correct data splits are being loaded
  • cleaner logging of data sizes, as suggested by Stas above

I will be quite occupied with other things in the upcoming days; I would be glad if somebody could take over these todos.

Cheers all ❤️

@hadyelsahar hadyelsahar changed the title from "Adding language specific validation sets (periodic evaluation)" to "Adding language specific validation sets for Multilingual model training" Nov 2, 2021
@TevenLeScao
Copy link
Collaborator

I fixed a few bugs with the test iterator (ensuring the test_weighted_split_xxx args are set to None in data-path mode, and fixing a typo that passed the list of iterators, instead of each iterator individually, to evaluate_and_print_results).

@TevenLeScao
Collaborator

I didn't understand why the tests I was building were failing, and now I realize that we also need to integrate this with prefix-lm. If I can do this fast, I will; otherwise we'll keep it on hold until after we finally launch an autoregressive training with this.

@TevenLeScao
Collaborator

Fixed and tested.

@TevenLeScao
Collaborator

So, after a really deep dive:

  • The splits are respected and are non-overlapping
  • The dataset weights don't always seem to be respected. It seems to me, from reading the code and inspecting the objects, that there are edge cases: the constructed dataset sizes are almost always a multiple of their epoch size, so there is variation around the input weights (for example, in the extreme case, if a dataset is bigger than the number of samples we want to train or validate on, the entire dataset will still be passed). I have contacted the Megatron team about this on Slack.

If that is the case, this feels more like a Megatron issue imo, so I vote to merge this PR and fix that issue independently.

@stas00

@stas00
Contributor

stas00 commented Nov 3, 2021

Based on your description, it sounds like the current PR is orthogonal to the issue with dataset weights, in which case, by all means, if you're happy with it, please merge it. Glad you found the issue before we started training with the overlap bug.

But let's make sure that the issue you have uncovered is solved before we start training, so please create a new issue to track its progress.

Additionally, if you could create yet another issue to track all the must-fix issues for the 13B-ml training, and add the issue from the paragraph above to it, that would be awesome!

Thank you, Teven.

@TevenLeScao
Collaborator

OK, I have confirmed with the Megatron team that this is actually not the case: the dataset weights get applied later, in a C helper. Let's merge.

@TevenLeScao TevenLeScao merged commit 846c087 into bigscience-workshop:main Nov 3, 2021
@thomasw21 thomasw21 mentioned this pull request Nov 3, 2021
Labels: enhancement, multilinguality