
Adding language specific validation sets for Multilingual model training #97

Merged (19 commits), Nov 3, 2021

Conversation

hadyelsahar
Contributor

@hadyelsahar hadyelsahar commented Sep 14, 2021

Summary

The idea of this issue is to modify Megatron-DeepSpeed to track the progress of the validation loss on several validation (periodic evaluation) sets separately.

Currently, the validation loss is calculated on a single validation set that includes the same language combination as the training data (see the 13B-param model training on Tensorboard).

[screenshot: Tensorboard validation-loss curve for the 13B model]

After integration of this PR, users can add extra validation sets in the following form:

--periodic-eval-data-path \
VALID1-FR-KR 0.1 $DATA_FR 0.2 $DATA_KR, \
VALID2-JP-AR 0.2 $DATA_JP 0.3 $DATA_AR

Validation steps will be run automatically on each dataset independently, and the results will be displayed on Tensorboard as follows:

[screenshot: Tensorboard showing a separate validation-loss curve per dataset]

What was changed

In order not to change the current way one calls the training.py script, I opted to add an extra argument, --periodic-eval-data-path.

Users can define extra datasets (each in a way quite similar to --data-path) to be evaluated alongside training, by providing their data paths (or multiple paths with weights).
Note here that the --split argument does not apply to the --periodic-eval-data-path argument.
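For illustration only, here is a minimal sketch of how the comma-separated group syntax could be parsed into named groups of (weight, path) pairs; the helper name parse_periodic_eval_arg is hypothetical, not the PR's actual code:

    def parse_periodic_eval_arg(arg):
        """Parse "NAME w1 path1 w2 path2, NAME2 w path" into a dict."""
        groups = {}
        for group in arg.split(","):
            tokens = group.split()
            if not tokens:
                continue
            name, rest = tokens[0], tokens[1:]
            assert len(rest) % 2 == 0, f"expected weight/path pairs in group {name}"
            groups[name] = [(float(w), p) for w, p in zip(rest[0::2], rest[1::2])]
        return groups

    # parse_periodic_eval_arg("VALID1-FR-KR 0.1 /data/fr 0.2 /data/kr, VALID2-JP-AR 0.2 /data/jp 0.3 /data/ar")
    # -> {"VALID1-FR-KR": [(0.1, "/data/fr"), (0.2, "/data/kr")],
    #     "VALID2-JP-AR": [(0.2, "/data/jp"), (0.3, "/data/ar")]}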

Typical Examples for Multilingual Training

When a model is being trained on a preprocessed multilingual dataset, a user can preprocess three monolingual datasets (JP, KR, AR) and track their validation progress by passing the following arguments:

--data-path $DATA/multilingual
--periodic-eval-data-path \
VALID-JP 1.0 $DATA_JP, \
VALID-KR 1.0 $DATA_KR, \
VALID-AR 1.0 $DATA_AR \

Sometimes in multilingual training, some languages are downsampled and others are upsampled. If a user wonders how the model performs with respect to different proportions of languages, different combinations of the languages can be passed as external validation datasets.

--data-path 0.1 $DATA/EN 0.5 $DATA/JP 0.7 $DATA/KR 1.0 $DATA/AR \
--periodic-eval-data-path \
DATASET-BALANCED 1.0 $DATA_EN 1.0 $DATA_JP 1.0 $DATA_KR 1.0 $DATA_AR, \
DATASET-NO-EN 1.0 $DATA_JP 1.0 $DATA_KR 1.0 $DATA_AR
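As with --data-path, the relative weights within each group are normalized into sampling proportions. A minimal illustration of that normalization (not the repo's actual code):

    def normalize_weights(weights):
        """Turn relative weights into proportions that sum to 1."""
        total = sum(weights)
        return [w / total for w in weights]

    # The training mix above, 0.1 EN / 0.5 JP / 0.7 KR / 1.0 AR:
    # normalize_weights([0.1, 0.5, 0.7, 1.0]) -> [0.043, 0.217, 0.304, 0.435] (approx.)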

Connections with PR #143

PR #143 was developed to support the use case "we can't have multilingual training data and English-only validation data at the moment". That use case is fully supported in this PR by adding an English-only dataset as one of the datasets to be evaluated periodically. Moreover, one can extend this by adding several datasets to be evaluated periodically, not just a single English-only one.

Testing

  • Default training works 🆗
  • Integration with Tensorboard 🆗
  • Testing with real training data, multiple combinations:
    • 1 dataset, 1 combination 🆗
    • 1 dataset, 2 combinations 🆗
    • 2 datasets, 1 combination 🆗
    • 2 datasets, 2 combinations 🆗
    • 5 datasets, 5 combinations 🆗

Independent testing by @lintangsutawika (in progress)

Future Modifications (suggestions needed)

  • Adding optional periodic-eval-interval and periodic-eval-iters arguments for periodic-eval-data-path;
    if not provided, fall back to the regular --eval-interval / --eval-iters params (see the sketch below).
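A minimal sketch of that suggested fallback, using the proposed (not merged) argument names:

    def periodic_eval_schedule(args):
        # Use the periodic-eval overrides when given; otherwise fall back
        # to the regular eval settings.
        interval = getattr(args, "periodic_eval_interval", None) or args.eval_interval
        iters = getattr(args, "periodic_eval_iters", None) or args.eval_iters
        return interval, iters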

@sbmaruf
Collaborator

sbmaruf commented Oct 1, 2021

@hadyelsahar Teven asked me to write this one earlier. I'm not sure whether it solves or totally ignores part of the problem discussed in this issue. Please take a look:
#113

@hadyelsahar hadyelsahar marked this pull request as ready for review October 25, 2021 16:41
@hadyelsahar
Contributor Author

@sbmaruf your PR is relevant; however, it doesn't support multiple validation datasets, which we need in order to track progress on multiple languages independently during training, not just English. This PR is quite general (albeit a bit of a dirty hack as well).

For example, here we have 3 validation datasets: the standard one, Valid1, and Valid2.
[screenshot: Tensorboard validation curves for the three datasets]

I would be grateful if you could double-check the code.

@hadyelsahar
Contributor Author

@TevenLeScao @stas00 I would be grateful for some feedback on this PR if you have time.

@stas00
Contributor

stas00 commented Oct 25, 2021

I think you guys need to sync with the work from #143, as the two overlap.

But I will let @TevenLeScao comment on the specifics as he is the owner of that other PR.

@ibeltagy
Member

Tagging @TevenLeScao, who was going to start the multilingual training. It would be great to have this PR merged in time for the multilingual training.

@TevenLeScao
Collaborator

It hasn't started yet; I can integrate it.

@hadyelsahar
Contributor Author

hadyelsahar commented Oct 25, 2021

The benefit of this PR over #143 is that it allows multiple validation datasets to be passed, which does not seem to be supported in the other PR.
To recap, the other PR supports two use cases for data loading:

1) --data-path and --split pair
2) --(train|valid|test)-data-path (3 args), no --split; test is optional.

For syncing both PRs, I suggest either of these alternatives:

  • Keep extra-valid-data-path as an additional argument
  • Allow option 2 to support multiple validation sets.

@TevenLeScao let me know if this is doable. I have some capacity this week to push this forward; if you would like help, feel free to ping me on Slack.

@stas00
Contributor

stas00 commented Oct 25, 2021

The benefit of this PR over #143 is that it allows multiple validation datasets to be passed, which does not seem to be supported in the other PR. To recap, the other PR supports two use cases for data loading:

1) --data-path and --split pair
2) --(train|valid|test)-data-path (3 args), no --split; test is optional.

It doesn't at the moment; this was just my proposal, seconded by @sbmaruf.

Keep extra-valid-data-path as an additional argument

This again leads to a very confusing API, because the split behavior is inconsistent.

Allow option 2 to support multiple validation sets.

Yes, option 2 would support multiple datasets; that's exactly how --data-path is coded.

@hadyelsahar
Contributor Author

hadyelsahar commented Oct 25, 2021

Yes, option 2 would support multiple datasets; that's exactly how --data-path is coded.

To clarify, there's a bit of nuance here: --data-path allows multiple datasets to be combined into a single dataset.

    --data-path 0.1 ${DATASET_0} 0.25 ${DATASET_1} 0.2 ${DATASET_2}

What we want here is to allow more than one such combination, i.e. multiple valid sets:

    --valid-data-paths  \
       1.0 ${DATASET_1},                                   ## valid set 1
       0.3 ${DATASET_0} 0.3 ${DATASET_1} 0.3 ${DATASET_2}  ## valid set 2

The latter is not supported by the normal --data-path behavior; it needs the x-data-loaders / x-data-iterators to be turned into arrays of loaders / iterators.

Ideally, we would also like to give each of those combinations a name to be associated with its validation loss.
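Roughly, the needed change keeps a dict of named validation iterators instead of a single one and runs the usual evaluation on each. A simplified sketch with hypothetical names, not the PR's actual code:

    def evaluate_named_valid_sets(model, named_iterators, evaluate_fn):
        """named_iterators: e.g. {"valid set 1": it1, "valid set 2": it2}."""
        losses = {}
        for name, iterator in named_iterators.items():
            # Each loss gets logged to Tensorboard under its own tag,
            # keyed by the group name.
            losses[name] = evaluate_fn(model, iterator)
        return losses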

@stas00
Contributor

stas00 commented Oct 25, 2021

Thank you for clarifying that you were talking about an extended need, Hady.

Then what you said.

The only thing I'm advocating is that if we switch to a different format, then let's leave the original two args alone (--data-path and --split) and come up with a new set of args that are named differently and ideally hint at the format they accept. And let's make those two sets mutually exclusive to avoid confusing the user.

Especially if you may want to use the same new format for train as well? Or is it only ever going to be used for validation?

@hadyelsahar
Contributor Author

hadyelsahar commented Oct 25, 2021

Great, thanks. Indeed, +1 for keeping the standard --data-path behavior as is; this reduces confusion.

come up with a new set of args that are named differently

Since those datasets will only be used to monitor progress, I suggest a name like --monitor-progress-data-paths, --shadow-eval-data-paths, or --periodic-eval-data-path.

you may want to use the same new format for train as well? Or is it only ever going to be used for validation?
Yes: never for training, only for validation.

To conclude, if I get it right, we should:

  • keep the normal --data-path / --split behavior
  • add a new set of arguments with a different name, e.g. --monitor-progress-data-paths, for all datasets we wish to track during training.

This should also support the use case of the other PR ("we can't have multilingual training data and English-only validation data at the moment"), which can be solved by adding an English-only dataset to this new argument.

@hadyelsahar hadyelsahar changed the title from "WIP: Adding language specific validation steps" to "WIP: Adding language specific validation steps (periodic evaluation)" Oct 26, 2021
@hadyelsahar
Contributor Author

UPDATE

  • Based on Stas's suggestions, --periodic-eval is completely orthogonal to the (train|test|valid) data arguments; this reduces confusion for users who are used to calling Megatron the traditional way.

  • The main PR text is up to date and self-contained, so there is no need to follow the conversation.

  • Imo this PR covers the use case of "Add valid data (+TVN fixes)" #143, so I would opt for merging this one.

@TevenLeScao
Collaborator

Taking a look at this one; I also think it supersedes #143.

@sbmaruf
Collaborator

sbmaruf commented Oct 26, 2021

@hadyelsahar Are we also plotting how many samples are used for training from each of the data iterators (i.e. each language)? I think that is also necessary to interpret the loss values.

@TevenLeScao
Collaborator

Good point @sbmaruf, let's also add that.

@TevenLeScao
Collaborator

TevenLeScao commented Oct 26, 2021

API-wise, we're going to have the same issue as #143: if we do not allow the user to re-use the validation splits from --data-path in --periodic-eval, we must re-process every dataset on JZ. Since this PR adds new evaluations rather than replacing the original one, would you guys be OK with re-using the splits in this case?

For example:

--split 70 30 0
--data-path 0.5 DATASET_A 0.5 DATASET_B
--periodic-eval-path DATASET_A, \
                     DATASET_B \

means:

  • validation is done on the combination of the 30% valid splits of DATASET_A and DATASET_B
  • extra eval is done on
    1. the 30% valid split of DATASET_A, alone
    2. the 30% valid split of DATASET_B, alone

@sbmaruf @stas00

@stas00
Contributor

stas00 commented Oct 27, 2021

But one problem may occur: there is no way you can make sure that, from a single (*.bin, *.idx) file, the train/valid split has no overlap. Earlier it was possible. You may have to process the data again.

Are you referring to a user incorrectly inputting the splits so that they overlap? That can easily be enforced in the code.

Or are you referring to some internal workings, in which case could you please be a bit more specific?

If previously you had a split x,y,z, you now have exactly the same split, just defined in a different way, so it's trivial to convert between the two.
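A minimal illustration of that equivalence (not code from either PR):

    def split_to_ranges(x, y, z):
        """Map a percentage split x,y,z to cumulative start:stop fractions."""
        total = x + y + z
        a, b = x / total, (x + y) / total
        return {"train": (0.0, a), "valid": (a, b), "test": (b, 1.0)}

    # split_to_ranges(70, 30, 0)
    # -> {"train": (0.0, 0.7), "valid": (0.7, 1.0), "test": (1.0, 1.0)}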

@TevenLeScao
Collaborator

No need for RNG tracking, @sbmaruf; we just need to map the new split argument to the old one.

@sbmaruf
Collaborator

sbmaruf commented Oct 27, 2021

@stas00 @TevenLeScao
Sorry, I missed the new split idea. I was considering the previous --split implementation (train:dev:test %).
Having read your comment, the new split definition as start:length (or start:stop) makes sense now.
Now it should be trivial to check for overlap.
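For instance, with start:stop ranges the check reduces to an interval-intersection test (illustrative sketch only):

    def ranges_overlap(a, b):
        """True if two start:stop ranges over the same file intersect."""
        (a_start, a_stop), (b_start, b_stop) = a, b
        return max(a_start, b_start) < min(a_stop, b_stop)

    # ranges_overlap((0.0, 0.6), (0.6, 0.8)) -> False  (train 0:0.6 vs valid 0.6:0.8)
    # ranges_overlap((0.0, 0.7), (0.6, 0.8)) -> True   (overlapping splits)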

@stas00
Contributor

stas00 commented Oct 27, 2021

Awesome, thank you for validating that, @sbmaruf.

I guess the only remaining spec-wise decision is whether to stick with percentages or move to fractions for the split field. I think we should stick with percentages, since that is how the existing --split works.

@hadyelsahar
Contributor Author

hadyelsahar commented Oct 28, 2021

Alright, after a conversation with Teven we agreed to implement the following:

  • A new set of arguments, similar to Stas's proposal above, but 1) allowing multiple validation & eval dataset groups and 2) adding a dataset name in front of each group:
--train-weighted-split-path "DATA_ABC 0.6 0:0.6 A 0.3 0:0.8 B  0.1 0:0.8 C"  # train datagroup (only 1 is allowed)
--valid-weighted-split-paths DATA_ABC 0.6 0.6:0.8 A 0.3 0:1 B  0.1 0:1 C, \   #valid datagroup 1
                                 DATA_DE 0.6 0:1 D 0.3 0:1 E,       #valid datagroup 2   
--test-weighted-split-paths DATA_AFG 0.6 0.8:1 A 0.3 0:1 F  0.1 0:1 G, \      # eval datagroup 1
                                 DATA_AHI 0.6 0.8:1 A 0.3 0:1 H 0.3 0:1 I,          # eval datagroup 2 
  • Make sure that the two sets of arguments are orthogonal (i.e. not initialized at the same time)

I'll implement this today & tomorrow and ping Teven, who will run the multilingual training afterwards.
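For reference, a minimal sketch of parsing one such datagroup of the form "NAME weight start:stop path ..."; the helper is hypothetical, not the actual implementation:

    def parse_weighted_split_group(group):
        """Parse e.g. "DATA_ABC 0.6 0.6:0.8 A 0.3 0:1 B 0.1 0:1 C"."""
        tokens = group.split()
        name, rest = tokens[0], tokens[1:]
        assert len(rest) % 3 == 0, f"expected weight/range/path triplets in {name}"
        entries = []
        for weight, rng, path in zip(rest[0::3], rest[1::3], rest[2::3]):
            start, stop = (float(v) for v in rng.split(":"))
            entries.append((float(weight), (start, stop), path))
        return name, entries

    # parse_weighted_split_group("DATA_ABC 0.6 0.6:0.8 A 0.3 0:1 B 0.1 0:1 C")
    # -> ("DATA_ABC", [(0.6, (0.6, 0.8), "A"), (0.3, (0.0, 1.0), "B"),
    #                  (0.1, (0.0, 1.0), "C")])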

@stas00
Contributor

stas00 commented Oct 28, 2021

May I suggest some sort of separator between groups? This is very difficult to parse: DATA_ABC 0.6 0.6:0.8 A 0.3 0:1 B 0.1 0:1 C, especially once long paths are added.

maybe:

--valid-weighted-split-paths "DATA_ABC: 0.6 0.6:0.8 A, 0.3 0:1 B, 0.1 0:1 C" \   #valid datagroup 1
                             "DATA_DE:  0.6   0:1   D, 0.3 0:1 E"                #valid datagroup 2   

so quoted definitions for each subgroup and a specific structure of "GR: defA, defB, defC"

I'm fine with any other proposal, as long as it makes things a little bit easier for a human to parse; otherwise we are just asking for input errors.


Also, any reason why train doesn't follow the same syntax? Let's use the same syntax for all of them, even if the dataset group name is not used for train; it can just be named TRAIN. It'll be easier to parse too.

@hadyelsahar
Contributor Author

Any suggestions on how this part of the code should behave under multiple validation / test dataset groups?

        train_val_test_num_samples = [train_samples,
                                      eval_iters * args.global_batch_size,
                                      test_iters * args.global_batch_size]
        print_rank_0(' > datasets target sizes (minimum size):')
        print_rank_0('    train:      {}'.format(train_val_test_num_samples[0]))
        print_rank_0('    validation: {}'.format(train_val_test_num_samples[1]))
        print_rank_0('    test:       {}'.format(train_val_test_num_samples[2]))

It is responsible for printing these lines:

> building train, validation, and test datasets ...
 > datasets target sizes (minimum size):
    train:      80000
    validation: 800080
    test:       80

@sbmaruf
Collaborator

sbmaruf commented Oct 29, 2021

@hadyelsahar

--train-weighted-split-path "0.6 0:0.6 A"                              # train datagroup (only 1 is allowed)
--valid-weighted-split-paths DATA_ABC 0.6 0.6:0.8 A 0.3 0:1 B  0.1 0:1 C, \   #valid datagroup 1
                                 DATA_DE 0.6 0:1 D 0.3 0:1 E,       #valid datagroup 2   
--test-weighted-split-paths DATA_AFG 0.6 0.8:1 A 0.3 0:1 F  0.1 0:1 G, \      # eval datagroup 1
                                 DATA_AHI 0.6 0.8:1 A 0.3 0:1 H 0.3 0:1 I,          # eval datagroup 2 

Why does --train-weighted-split-path support only one iterator?
If I understand correctly, we might need different iterators for different languages.

@hadyelsahar
Contributor Author

@sbmaruf no, that is a typo in the comment, sorry: it is not a single data iterator but a single data group (a combination of different iterators). Fixed now.

@TevenLeScao
Collaborator

@hadyelsahar maybe print a list of the lengths of the different dataset mixtures?

@stas00
Contributor

stas00 commented Oct 29, 2021

Any suggestions on how this part of the code should behave under multiple validation / test dataset groups?

Dump the info for each group and sub-group, perhaps in some YAML-like way, so it's easy to quickly see the structure.

Dump all that can be dumped, and then we can improve the compactness / readability once we see an example of such a dump.
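Something along these lines, for example; names and layout are illustrative only (the real code would go through print_rank_0):

    def print_dataset_target_sizes(train_samples, valid_groups, test_groups):
        """valid_groups / test_groups: {group_name: target_num_samples}."""
        print(" > datasets target sizes (minimum size):")
        print("   train: {}".format(train_samples))
        for kind, groups in (("validation", valid_groups), ("test", test_groups)):
            print("   {}:".format(kind))
            for name, num_samples in groups.items():
                print("     {}: {}".format(name, num_samples))

    # print_dataset_target_sizes(80000, {"DATA_ABC": 800, "DATA_DE": 800}, {"DATA_AFG": 80})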

@hadyelsahar
Contributor Author

Updates

  • I have added a new set of arguments, --(train|valid|test)-weighted-split-paths (called option 2 for data loading), that are completely orthogonal to option 1 for data loading using --data-path and --split
  • fixed the logging of the amount of training data per language for this new data loading option 2
  • removed the extra-valid-data-path argument that we added before
  • crude testing to make sure that the previous data loading method still works and the new data loading does the right thing

Todo:

  • deep testing, making sure that the correct data splits are being loaded
  • cleaner logging of data sizes, as suggested by Stas above

I will be quite occupied with other things in the upcoming days; I would be glad if somebody could take over these todos.

Cheers all ❤️

@hadyelsahar hadyelsahar changed the title from "Adding language specific validation sets (periodic evaluation)" to "Adding language specific validation sets for Multilingual model training" Nov 2, 2021
@TevenLeScao
Copy link
Collaborator

I fixed a few bugs with the test iterator (ensuring the test_weighted_split_xxx args are set to None in data-path mode, and fixing a typo that passed the list of iterators, instead of each iterator individually, to evaluate_and_print_results).

@TevenLeScao
Collaborator

I didn't understand why the tests I was building were failing, and now I realize that we also need to integrate this with prefix-lm. If I can do this fast, I will; otherwise we'll keep it on hold until after we finally launch an autoregressive training with this.

@TevenLeScao
Collaborator

Fixed and tested.

@TevenLeScao
Collaborator

So, after a really deep dive:

  • The splits are respected and are non-overlapping
  • The dataset weights don't always seem to be respected. It seems to me, from reading the code and inspecting the objects, that there are edge cases: the constructed dataset sizes are almost always a multiple of their epoch size, so there is variation around the input weights (for example, in the extreme case, if a dataset is bigger than the number of samples we want to train or validate on, the entire dataset will still be passed). I have contacted the Megatron team about this on Slack.

If that is the case, this feels more like a Megatron issue imo, so I vote to merge this PR and fix that issue independently.

@stas00

@stas00
Contributor

stas00 commented Nov 3, 2021

Based on your description, it sounds like the current PR is orthogonal to the issue with dataset weights, in which case, by all means, if you're happy with it, please merge it. Glad you found the issue before we started training with the overlap bug.

But let's make sure that the issue you have uncovered is solved before we start training, so please create a new issue to track its progress.

Additionally, if you could create yet another issue to track all the must-fix issues for the 13B-ml training, and add the issue from the paragraph above to it, that would be awesome!

Thank you, Teven.

@TevenLeScao
Collaborator

OK, I have confirmed with the Megatron team that this is actually not the case: the dataset weights get applied later, in a C helper. Let's merge.

@TevenLeScao TevenLeScao merged commit 846c087 into bigscience-workshop:main Nov 3, 2021
@thomasw21 thomasw21 mentioned this pull request Nov 3, 2021
Labels: enhancement, multilinguality