legacy finetune with t5 issues #12848
First, any reason why you're not using the latest scripts? The legacy scripts are no longer being maintained, and the up-to-date scripts have seen a great many improvements, so if it's not too hard I highly recommend switching to them. Most likely you want run_translation.py.
After this discussion is over, let's review where you found this information, because it is incorrect. The doc says which specific parameters you need to tweak, not all of them. Have you considered using the tuned-up-for-you config from the tests folder instead? Ah, and you have a typo in at least one of the key names as well - there is no stage3_param_persitance_threshold (the actual key is stage3_param_persistence_threshold). DeepSpeed is a bit troublesome here in that it doesn't validate keys and simply uses the default if you make a typo. It does dump the final config when the program starts, so you can always check whether your settings "made it". Your config is also "dated" - recent DeepSpeed moved to a newer config format, as you can see in the docs (albeit it's backward compatible).
Perhaps you were referring to: "Smaller values use less memory"
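As a point of reference only - not your actual file - here is a minimal sketch of the ZeRO-3 block being discussed, mainly to show the correctly spelled key and the newer offload format; all values are illustrative, and the stock tests/deepspeed/ds_config_zero3.json remains the better starting point:

```python
import json

# Illustrative ZeRO-3 section only - values are placeholders, not a recommendation.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        # correctly spelled key; smaller values keep fewer params resident per GPU
        "stage3_param_persistence_threshold": 1e5,
    }
}

with open("ds_config_zero3.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```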
Thanks for the pointers. I modified my ds_config.json with the following:
I also switched to run_translation.py on the master branch. Even with these changes, I am unable to use a batch size of 2 per GPU without hitting a GPU OOM. Any thoughts on optimizing this? My command line is:
I had no problem doing mostly the same with the current version of the examples on just 4x V100-16GB GPUs - I didn't change anything in the default ds config in the repo, and it took only ~6GB/gpu for training and ~10GB/gpu for eval.
You can probably do a much larger BS on this setup easily, and with 8 GPUs you definitely shouldn't have any problems. I highly recommend using the default ds config and not changing anything there unless you really need to.
I was able to use your command and train using the ro-en dataset and t5-3b. However, I am trying to use a custom model: "Rostlab/prot_t5_xl_uniref50". This is based on t5-3b, but without the denoising objective in t5. I looked at the model card and it also does not have the task-specific parameters in its config.json for translation/summarization. I think this means that I might need to change the Trainer, but I am not sure what is specifically needed. Before I started down the deepspeed path, I was using a training loop that I had created with model parallelization. The train step is below:
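(The exact step isn't reproduced here; the following is a hypothetical reconstruction of a naive parallelize()-based loop of the kind described - every name and value in it is an assumption.)

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Hypothetical sketch only - assumes naive layer-wise model parallelism via T5's
# parallelize(), which is roughly what DeepSpeed ZeRO replaces without model changes.
model_name = "Rostlab/prot_t5_xl_uniref50"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)
model.parallelize()  # spread the layers across all visible GPUs
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

def train_step(batch):
    """One optimizer step; `batch` is assumed to hold pre-tokenized tensors."""
    model.train()
    optimizer.zero_grad()
    first_device = model.encoder.first_device  # inputs go to the first shard
    outputs = model(
        input_ids=batch["input_ids"].to(first_device),
        attention_mask=batch["attention_mask"].to(first_device),
        labels=batch["labels"].to(first_device),
    )
    outputs.loss.backward()
    optimizer.step()
    return outputs.loss.item()
```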
Doing this, I was only able to train on 2 batches at a time. Is it possible to use the Trainer with this model, or do you have any pointers on transferring this to DeepSpeed?
You don't need to transfer anything to DeepSpeed. DeepSpeed ZeRO simply provides a much simpler way of doing model parallelism without needing to change the model - whatever model you use, it'll just work. DeepSpeed magically parallelizes whatever you throw at it (well, most of the time). So your goal is to use a t5-3b model with a slightly different task, and I don't see any reason why it won't just work out of the box. Perhaps you can follow this plan:
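Not the plan itself, but to illustrate the point that no model changes are needed: with the HF Trainer, the only DeepSpeed-specific piece is the config passed via the training arguments. A rough, hypothetical sketch (the model name and config path are the ones from this thread; the toy dataset is a stand-in):

```python
from transformers import (
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    T5ForConditionalGeneration,
    T5Tokenizer,
    default_data_collator,
)

model_name = "Rostlab/prot_t5_xl_uniref50"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Toy examples only, to keep the sketch self-contained.
texts = ["M K T A Y I A K Q R", "G S H M A D E E K L"]
enc = tokenizer(texts, padding=True, return_tensors="pt")
train_dataset = [
    {"input_ids": ids, "attention_mask": mask, "labels": ids}
    for ids, mask in zip(enc["input_ids"], enc["attention_mask"])
]

args = Seq2SeqTrainingArguments(
    output_dir="output_dir",
    per_device_train_batch_size=2,
    fp16=True,
    deepspeed="tests/deepspeed/ds_config_zero3.json",  # same file used elsewhere in this thread
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=default_data_collator,
    tokenizer=tokenizer,
)
trainer.train()  # launch via: deepspeed --num_gpus=8 this_script.py
```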
I am not sure what is going on. I stepped through the code and made sure that I was not missing anything by printing out the tokens/masks and checking several other points. The only thing that I can get to work with this model, dataset, and run_translation.py is a per_device_batch_size of 1. I am using tests/deepspeed/ds_config_zero3.json with the run_translation.py script. I have been able to use the original t5-3b model with the ro-en translation dataset and your configuration file with a per-device batch size of 8 just fine. Not sure where to go from here. Thanks!
A model is a model is a model - it doesn't matter which t5-3b derivative you use, it will take the exact same amount of memory. What matters is your code - it's possible that you do something that leaks memory or allocates more than the example program does. The next step is either to compare how your program is different, or to use the memory profiler and see where the bulk of memory is allocated. You can start with just enabling
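One readily available option - an assumption here, not necessarily what was being suggested - is torch.cuda's built-in memory statistics, snapshotted around the suspect region (model load, one train step) to see where the bulk of memory goes:

```python
import torch

def report_cuda_memory(tag: str, device: int = 0) -> None:
    """Print current and peak allocated memory for one GPU, then reset the peak."""
    current = torch.cuda.memory_allocated(device) / 2**30
    peak = torch.cuda.max_memory_allocated(device) / 2**30
    print(f"[{tag}] current: {current:.2f} GiB, peak: {peak:.2f} GiB")
    torch.cuda.reset_peak_memory_stats(device)

# Usage sketch (train_step is whatever your loop calls):
# report_cuda_memory("after model load")
# loss = train_step(batch)
# report_cuda_memory("after train step")
# print(torch.cuda.memory_summary(0))  # full per-category breakdown
```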
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hi @stas00
Splitting off from #8771 (comment)
There is a lot of great information in your post; thanks for being thorough!
I guess I don't understand what parameters I need to change within the DeepSpeed config file to properly offload into CPU memory. I have 473 GB of RAM available for offloading, which seems to be enough based on what you listed. I am also using the finetune script in the seq2seq legacy folder. The command is:
export BS=2; rm -rf output_dir; PYTHONPATH=../../src USE_TF=0 deepspeed --num_gpus=8 ./finetune_trainer.py --model_name_or_path "Rostlab/prot_t5_xl_uniref50" --output_dir output_dir --adam_eps 1e-06 --data_dir /mnt/data --do_eval --do_predict --do_train --evaluation_strategy=steps --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 512 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_eval_batch_size $BS --per_device_train_batch_size $BS --predict_with_generate --eval_steps 25000 --sortish_sampler --test_max_target_length 128 --val_max_target_length 128 --warmup_steps 5 --n_train 60 --n_val 10 --n_test 10 --deepspeed ../../../tests/deepspeed/ds_config_zero3.json --fp16
I had to modify the finetune script to use T5Tokenizer, as AutoTokenizer wouldn't work.
For ZeRO-3 optimization, I am using lower values for the stage3_* parameters, since the documentation indicated to use lower values to offload memory.