How to use finetuner.py to train t5-large model #17534
The approach you tried is very old and is no longer supported. Please switch to the modern tools and it should just work. Here are a few current examples. Straight DDP:
The same with deepspeed:
Make sure it works, adapt it to your data, and then switch to the large model size. Please let me know if this unblocked you, and please share the link where you found the old info so that we can update that thread with the new information. Thank you
Hi @stas00, the old info comes from here. I ran the following script to install the required packages:
I tried the straight DDP and deepspeed scripts; both give the following error, even though I added "--per_device_train_batch_size 2":
What's more, I want to run a language inference task with a t5 model. Do you have any recommendation for which example script I should use?
oops, my bad - I fixed the examples in my reply #17534 (comment)
Same script, you just tell it to eval instead of train. Here are a few ways for one gpu:
and you can adapt those to multi-gpu and/or deepspeed based on the first examples I shared. Basically I removed the training args and replaced them with eval-only args. The 2nd (last) example shows how to do it in half-precision, which may not work well (depending on the model), so start with the normal fp32 eval (i.e. without the half-precision flag). Of course, play with the values of the args to fit your environment.
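As a rough illustration (argument values are placeholders, not taken from this thread), the eval-only flags map onto Seq2SeqTrainingArguments roughly like this:

# Illustrative sketch only: eval-only settings expressed in Python;
# the names mirror the command-line flags of the example scripts.
from transformers import Seq2SeqTrainingArguments

eval_args = Seq2SeqTrainingArguments(
    output_dir="output_dir",        # placeholder path
    do_train=False,                 # training args removed
    do_eval=True,
    per_device_eval_batch_size=4,   # tune to your GPU memory
    predict_with_generate=True,     # run generate() during evaluation
    fp16_full_eval=False,           # start with fp32 eval; enable only if quality holds
)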
you don't download it directly -
Thanks for your detailed reply @stas00! I tried the t5-small model and it works, so I changed it to t5-11b, and I have 3 questions here. In my case, I could not use straight DDP, otherwise CUDA runs out of memory. When I use the deepspeed script
It said that
But I found some info and did add it. And can I add a parameter here, i.e. --sharded_ddp, to use sharded DDP instead of straight DDP? (I am not sure I totally understand the definitions of straight DDP and sharded DDP.) In my previous code, I pass some generation options to the t5 model
So how can I do the same thing here?
Normally you just kill them manually.
Upgrade your
You should pass an explicit argument to
That's another implementation of the ZeRO protocol. You don't need it.
Please run:
you will see the existing options there. If you want to customize the example script, these are the relevant lines: transformers/examples/pytorch/translation/run_translation.py, lines 589 to 593 at 26e5e12
All the generation arguments are listed here: transformers/src/transformers/generation_utils.py, lines 844 to 887 at 26e5e12
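For illustration, here is a minimal sketch of passing generation options directly to generate() (model name and option values are placeholders); the example script's flags such as --num_beams feed the same underlying arguments:

# Illustrative sketch: generation options passed straight to generate().
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")            # placeholder size
model = T5ForConditionalGeneration.from_pretrained("t5-small")

inputs = tokenizer("translate English to Romanian: Hello, world!", return_tensors="pt")
outputs = model.generate(
    **inputs,
    num_beams=4,          # beam search width
    max_length=64,        # cap on generated length
    early_stopping=True,  # stop once all beams have finished
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))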
Hi @stas00, thank you for your reply! I trained it with your updated scripts, but the job was stopped accidentally. So I tried to resume from checkpoints with this script (without --overwrite_output_dir, where output_dir_1 is the folder with the checkpoints)
But it said that
This usually means that you didn't have enough cpu memory to resume. Unfortunately it's a bug in deepspeed, where instead of loading the checkpoint directly to gpu it first loads it to cpu. I can offer you a hack that may help. Basically you need to stagger the checkpoint loading so that not all 4 processes try to load it to cpu memory at once.
something like this should work to stagger the checkpoint loading:
Adjust 20 to a smaller or longer wait in secs as needed. So here the following happens: process 0 sleeps for 0 secs, process 1 for 20 secs, process 2 for 40 secs, etc., so each process gets full use of the CPU memory alone. You can apply the patch manually or with:
assuming you saved my code as patch.txt (I attached it to this comment as well so you can just download it)
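For illustration only (this is not the attached patch, which isn't reproduced above), the staggering idea boils down to something like:

# Hypothetical sketch: each rank sleeps rank * wait seconds before the
# checkpoint load, so the processes don't all hold a CPU copy of the
# t5-11b weights at the same time.
import os
import time

def stagger_checkpoint_load(wait_per_rank_secs=20):
    # LOCAL_RANK is set by the torch.distributed / deepspeed launchers
    rank = int(os.environ.get("LOCAL_RANK", "0"))
    time.sleep(rank * wait_per_rank_secs)

# call it right before resuming, e.g.:
# stagger_checkpoint_load(20)
# trainer.train(resume_from_checkpoint=True)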
@stas00, Thank you! I have successfully trained the t5-11b. Now I want to do the inference in my setup code. Since it's hard to load t5-11b on one GPU, I use model.parallelize to do the inference part.
But the errors said:
I have found a solution that said to set
Super!
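For reference, model-parallel T5 inference with the parallelize() API can be sketched roughly as follows (hypothetical code, not the setup used in this thread; parallelize() was later deprecated in favor of device_map-based loading):

# Hypothetical sketch: spread the T5 layers over all visible GPUs with
# parallelize() and run generation; model name and prompt are placeholders.
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_name = "t5-11b"  # validate the flow with t5-small first
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

model.parallelize()    # even split across GPUs; a custom device_map dict also works
model.eval()

inputs = tokenizer("translate English to Romanian: The weather is nice.",
                   return_tensors="pt").to("cuda:0")  # inputs go on the first device
with torch.no_grad():
    out = model.generate(**inputs, num_beams=4, max_length=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))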
@stas00, Thanks a lot! In my case, we use pytorch-lightning and what I want to do is
But although I am just doing prediction, why will it still call the
In addition to that, it gave an error although I do have the ninja package.
I am just worried about whether it is reasonable to work like this.
Wrt the traceback you shared,
it should give you the path to the binary. Don't try to run deepspeed again until the above returns the path. If it returns nothing it means that your python env can't find it. Wrt PL-specific issues, please ask at PL Issues as I'm not a PL user.
There is another workaround that requires no ninja, and it's to prebuild deepspeed: https://huggingface.co/docs/transformers/main/main_classes/deepspeed#installation (a local install where you clone deepspeed and then build it).
The git apply patch.txt command throws an error. Am I missing something in how I'm applying it, or am I missing an argument?
Bad copy-n-paste? Just insert it manually - it's just a few lines of code and you can tell where to insert it by the context around it.
I haven't tried it, but I don't see any reason why it shouldn't work. OPT has been out for quite a few months now, so surely if it didn't work we would have heard by now and fixed it. Give it a try, and if you run into problems please start a new Issue. Thank you.
System Info
Who can help?
@stas00
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
Follow the steps here
git clone https://github.com/huggingface/transformers
cd transformers
git checkout 7e662e6
cd examples/seq2seq
wget https://cdn-datasets.huggingface.co/translation/wmt_en_ro.tar.gz
tar -xzvf wmt_en_ro.tar.gz
pip install -r requirements.txt
cd ../..
pip install .
cd examples/seq2seq
pip install fairscale deepspeed==0.3.10
#run script 1
export BS=16; rm -r output_dir; PYTHONPATH=../../src USE_TF=0 python -m torch.distributed.launch --nproc_per_node=2 ./finetune_trainer.py --model_name_or_path t5-large --output_dir output_dir --adam_eps 1e-06 --data_dir wmt_en_ro --do_eval --do_train --evaluation_strategy=steps --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_eval_batch_size $BS --per_device_train_batch_size $BS --predict_with_generate --eval_steps 25000 --sortish_sampler --task translation_en_to_ro --test_max_target_length 128 --val_max_target_length 128 --warmup_steps 500 --n_train 2000 --n_val 500
Error trace1
#run script 2
export BS=16; rm -r output_dir; PYTHONPATH=../../src USE_TF=0 python -m torch.distributed.launch --nproc_per_node=2 ./run_seq2seq.py --model_name_or_path t5-large --output_dir output_dir --adam_eps 1e-06 --dataset_name wmt16 --dataset_config "ro-en" --do_eval --do_train --evaluation_strategy=steps --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_eval_batch_size $BS --per_device_train_batch_size $BS --predict_with_generate --eval_steps 25000 --sortish_sampler --task translation_en_to_ro --test_max_target_length 128 --val_max_target_length 128 --warmup_steps 500 --n_train 2000 --n_val 500
#Error trace2
Expected behavior
I expect it to run the model with deepspeed or sharded techniques. Actually, I want to train the t5-11b model and change the dataset dir to my own dataset, but I cannot even reproduce what @stas00 shared before.