Adding language specific validation sets for Multilingual model training #97
Conversation
@hadyelsahar Teven asked me to write this one earlier. Not sure if this solves or totally ignores part of the problem discussed in this issue. Please take a look.
@sbmaruf your PR is relevant; however, it doesn't support multiple validation datasets, and we need that to track progress on multiple languages independently during training, not just English. This PR is quite general (albeit a dirty hack as well). For example, here we have 3 validation datasets: the standard one, Valid1, and Valid2. Would be grateful if you could double-check the code.
@TevenLeScao @stas00 would be grateful to have some feedback on this PR if you have time.
I think you guys need to sync with the work from #143, as the two overlap. But I will let @TevenLeScao comment on the specifics as he is the owner of that other PR.
Tagging @TevenLeScao, who was going to start the multilingual training. It would be great to have this PR merged for the multilingual training.
It hasn't started yet, I can integrate.
The benefits of this PR over #143 are that:
For syncing both PRs, I suggest either of these alternatives:
@TevenLeScao let me know if this is doable. I have some capacity this week to push this forward; if you would like help, feel free to ping me on Slack.
It doesn't at the moment; this was just my proposal, seconded by @sbmaruf.
This again leads to a very confusing API, because the split behavior is inconsistent.
Yes, option 2 would support multiple datasets. That's exactly how
To clarify, there's a bit of nuance here:
What we want here is to allow more than one combination of datasets, with each combination forming its own validation set.
The latter is not supported by the normal `--data-path`/`--split` mechanism. Ideally, we would also like to give each of those combinations a name to be associated with a validation loss.
Thank you for clarifying that you were talking about an extended need, Hady. Then what you said. The only thing I'm advocating is that if we switch to a different format, then let's leave the original 2 args alone (`--data-path` and `--split`) and come up with a new set of args that are named differently and ideally mnemonically imply the format they accept. And have those 2 sets be mutually exclusive to avoid confusing the user. Especially if you may want to use the same new format for train as well? Or is it only ever going to be used for validation?
Great, thanks. Indeed, +1 for keeping the standard `--data-path` behavior as is; this reduces confusion.
Since those datasets will only be used to monitor progress, I suggest a name like
To conclude: If I get it right, we should:
This should support the use case of the other PR.
UPDATE
Taking a look at this one, I also think it supersedes #143.
@hadyelsahar Are we also plotting how many samples are being used for training from each of the data iterators (aka languages)? I think that is also necessary to interpret loss values.
Good point @sbmaruf, let's also add that.
API-wise, we're going to have the same issue as #143: if we do not allow the user to re-use the validation splits from the main data path. For example:
means:
Are you referring to a user incorrectly inputting the splits, so that they overlap? This can be easily enforced in the code. Or are you referring to some internal workings, in which case could you please be a bit more specific? If previously you had a
No need for RNG tracking @sbmaruf, just need to map the new split argument to the old one.
@stas00 @TevenLeScao
Awesome, thank you for validating that, @sbmaruf. I guess the only remaining thing spec-wise is to decide whether to stick to percentages or move to fractions for the split field. I think we should stick with percentages since it's the same as the other
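For reference, the standard Megatron `--split` argument uses comma-separated percentages for the train/validation/test split; a small illustrative example (the values are made up):

```bash
# Standard Megatron-style split: percentages of the dataset assigned to
# train / validation / test (illustrative values only).
--split 95,4,1
```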
Alright, so after a conversation with Teven we agreed to implement the following:
I'll implement those today and tomorrow and ping Teven, who will run the multilingual training afterwards.
May I just suggest some sort of separator between groups? This is very difficult to parse; maybe:
So, quoted definitions for each subgroup and a specific structure of "GR: defA, defB, defC". I'm fine with any other proposal, as long as it makes things a little bit easier for a human to parse; otherwise we are just asking for input errors. Also, any reason why train doesn't follow the same syntax? Let's use the same for all even if the dataname group is not used; it can just be named TRAIN. It'll be easier to parse too.
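As an illustration, the proposed quoted-group structure might look something like this; the group names, weights, and paths are hypothetical, not the final syntax:

```bash
# Hypothetical quoted-group syntax following the "GR: defA, defB, defC"
# structure proposed above: one quoted string per named validation group.
--periodic-eval-data-path \
    "VALID_EN: 1.0 data/en_text_document" \
    "VALID_MIX: 0.3 data/en_text_document, 0.7 data/fr_text_document"
```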
Any suggestions on how this part of the code should behave under multiple validation/test dataset groups? (Megatron-DeepSpeed/megatron/training.py, line 955 at commit 8dc8af5.)
It is responsible for printing this line
Why
@sbmaruf no, it is a typo in the comment, sorry. It is not a single data iterator but a single data group (a combination of different iterators). Fixed now.
@hadyelsahar maybe print a list of the lengths of the different dataset mixtures?
Dump the info for each group and sub-group, perhaps in some YAML-like way so it's easy to quickly see the structure. Dump all that can be dumped, and then we can improve the compactness/readability once we see an example of such a dump.
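A minimal sketch of what such a dump could look like, assuming a hypothetical `eval_groups` mapping of group names to named sub-datasets (this is not the actual PR code):

```python
def print_eval_group_info(eval_groups):
    """Print a YAML-like summary of each validation group and its
    sub-datasets. `eval_groups` is a hypothetical structure:
    {group_name: {dataset_name: dataset}}."""
    for group_name, datasets in eval_groups.items():
        print(f"{group_name}:")
        for dataset_name, dataset in datasets.items():
            # len(dataset) is the number of samples in the sub-dataset.
            print(f"  {dataset_name}: {len(dataset)} samples")
```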
Updates
Todo:
I will be quite occupied with other stuff in the upcoming days; would be glad if somebody could take over those todos. Cheers all ❤️
I fixed a few bugs with the test iterator (ensuring
I didn't understand why the tests I'm building were failing, and now I realize that we also need to integrate this with prefix-lm. If I can do this fast I will; otherwise we'll keep it on hold until after we finally launch an autoregressive training with this.
Fixed and tested.
So, after a really deep dive:
If that is the case, this feels like more of a Megatron issue IMO, so I vote to merge this and fix the issue independently.
Based on your description, it sounds like the current PR is orthogonal to the issue with dataset weights, in which case, by all means, if you're happy with it, please merge it. Glad you found the issue before we started training with the overlap bug. But let's make sure that this issue you have uncovered is solved before we start training, so please create a new issue to track the progress. Additionally, if you could create yet another new issue to track all the must-fix issues for the 13B-ml training and add the issue from the paragraph above to it, that would be very awesome! Thank you, Teven.
OK, I have confirmed with the Megatron team that this is actually not the case; the dataset weights get applied later in a C helper. Let's merge.
Summary
The idea of this PR is to modify Megatron-DeepSpeed to track the progress of the validation loss on several validation (periodic evaluation) sets separately.
Currently, the validation loss is calculated on a single validation set that includes the same language combination as the training data (see here: 13B-param model training on TensorBoard).
After this PR is integrated, users can add extra validation sets via the new `--periodic-eval-data-path` argument, in the form shown in the examples below.
Validation steps will be run automatically on each dataset independently, and the results will be displayed on TensorBoard.
What was changed
In order not to change the current way one calls the `training.py` script, I opted for adding an extra argument, `--periodic-eval-data-path`. Users can define extra datasets (each in a way quite similar to `--data-path`) to be evaluated alongside training by providing their data paths (or multiple paths with weights). Note here that the `--split` argument does apply to the `--periodic-eval-data-path` argument.
Typical Examples for Multilingual training
When a model is being trained on a preprocessed multilingual dataset `multilingual`, a user can preprocess 3 monolingual datasets (JP, KR, AR) and track their validation progress by passing them through this argument.
Sometimes in multilingual training, some languages are downsampled and some are upsampled. If a user wonders how the model performs with respect to different proportions of languages, different combinations of the languages can be passed as external validation datasets, as sketched below.
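A hypothetical invocation covering both use cases; the group names, weights, and paths are made up for illustration, and the exact syntax may differ from what the PR finally implements:

```bash
# Train on the combined multilingual dataset while periodically evaluating
# on three monolingual validation sets plus one rebalanced mixture.
# Everything under --periodic-eval-data-path is illustrative only.
python pretrain_gpt.py \
    --data-path multilingual \
    --periodic-eval-data-path \
        "VALID_JP: 1.0 data/jp_text_document" \
        "VALID_KR: 1.0 data/kr_text_document" \
        "VALID_AR: 1.0 data/ar_text_document" \
        "VALID_BALANCED: 0.33 data/jp_text_document, 0.33 data/kr_text_document, 0.34 data/ar_text_document"
    # ...plus the usual training arguments
```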
Connections with PR #143
PR #143 was developed to support the use case: "we can't have multilingual training data and English-only validation data at the moment." This use case is completely supported in this PR by adding an English-only dataset as one of the datasets to be evaluated periodically. Moreover, one can extend this by adding several datasets to be evaluated periodically, not just a single English-only one.
Testing
Independent testing by @lintangsutawika (in progress)
Future Modifications (suggestions needed)
Add `periodic-eval-iter` and `periodic-eval-iterations` options for the `periodic-eval-data-path` argument; if not used, fall back to the regular `--eval-interval` and `--eval-iters` params. A sketch of this fallback follows.
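A minimal sketch of the suggested fallback behavior, assuming hypothetical `periodic_eval_interval`/`periodic_eval_iters` attribute names on the parsed args (not the actual PR code):

```python
def resolve_periodic_eval_schedule(args):
    """Return (interval, iters) for the periodic eval sets, falling back
    to the regular --eval-interval / --eval-iters values when the
    periodic-eval-specific flags were not provided."""
    interval = getattr(args, "periodic_eval_interval", None) or args.eval_interval
    iters = getattr(args, "periodic_eval_iters", None) or args.eval_iters
    return interval, iters
```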