Issue using num_beams parameter for T5 / DeepSpeed #10149

Closed
PeterAJansen opened this issue Feb 12, 2021 · 4 comments
Comments

@PeterAJansen

Using a fine-tuned seq2seq model, I'd like to generate some number of different possible generations for a given input. A typical way of doing this is with beam search.
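
For concreteness, here is roughly what I'm after in plain Python - just a sketch, with t5-small standing in for the real checkpoint (the actual run uses allenai/unifiedqa-t5-11b through the trainer script):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# t5-small as a lightweight stand-in for the fine-tuned 11B checkpoint
tok = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

inputs = tok("question: What melts ice?", return_tensors="pt")
outputs = model.generate(
    **inputs,
    num_beams=8,             # beam search
    num_return_sequences=8,  # keep all 8 beams rather than only the best one
    max_length=64,
)
for seq in outputs:
    print(tok.decode(seq, skip_special_tokens=True))
```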

Using @stas00's amazing DeepSpeed additions so that T5-11B will fit on my GPUs, I'm calling the trainer (finetune_trainer.py) with only --do_predict (no train/eval) and, critically, the --num_beams parameter, but this throws an error.

I think the issue is likely one of the following:

  1. That this is an unexpected bug/error

  2. That this is normal/expected, and that beam search isn't supported during trainer prediction but is instead normally done with run_distributed_eval.py (as described in https://github.com/huggingface/transformers/blob/master/examples/seq2seq/README.md). But if I remember correctly, run_distributed_eval.py doesn't currently work with DeepSpeed (though I could be wrong?).

I am using a pull from around Feb 4th, so if things have changed in the past week, it's possible that's my issue, too.

Run Script

export BS=1; rm -rf $OUTPUTDIR; PYTHONPATH=../../src USE_TF=0 /usr/bin/time -v deepspeed --num_gpus=4 ./finetune_trainer.py --model_name_or_path allenai/unifiedqa-t5-11b --output_dir $OUTPUTDIR --adam_eps 1e-06 --data_dir $DATADIR \
--do_predict \
--num_beams 8 \
--evaluation_strategy=steps --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 \
--logging_first_step --logging_steps 1000 --max_source_length $SEQLEN --max_target_length $SEQLEN --num_train_epochs $EPOCHS \
--overwrite_output_dir --per_device_eval_batch_size $BS --per_device_train_batch_size $BS \
--predict_with_generate --sortish_sampler \
--test_max_target_length $SEQLEN --val_max_target_length $SEQLEN \
--warmup_steps 5 \
--deepspeed ds_config.json --fp16

Error

[2021-02-12 01:02:55,207] [WARNING] [runner.py:117:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2021-02-12 01:02:55,861] [INFO] [runner.py:355:main] cmd = /home/pajansen/anaconda3/envs/transformers-feb4-2020/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 ./finetune_trainer.py --model_name_or_path allenai/unifiedqa-t5-11b --output_dir output_dir_compexpl-feb10-epoch1-uqa-11b-pretrain-teacher-min6-max8-step2-beam --adam_eps 1e-06 --data_dir /home/pajansen/github/compositional-expl/pretrain/min-6-max-8-noduptest/ --do_predict --num_beams 8 --evaluation_strategy=steps --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 256 --max_target_length 256 --num_train_epochs 1 --overwrite_output_dir --per_device_eval_batch_size 1 --per_device_train_batch_size 1 --predict_with_generate --sortish_sampler --test_max_target_length 256 --val_max_target_length 256 --warmup_steps 5 --deepspeed ds_config.json --fp16
[2021-02-12 01:02:56,753] [INFO] [launch.py:78:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2021-02-12 01:02:56,753] [INFO] [launch.py:84:main] nnodes=1, num_local_procs=4, node_rank=0
[2021-02-12 01:02:56,753] [INFO] [launch.py:99:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2021-02-12 01:02:56,753] [INFO] [launch.py:100:main] dist_world_size=4
[2021-02-12 01:02:56,753] [INFO] [launch.py:102:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
[2021-02-12 01:02:59,580] [INFO] [distributed.py:39:init_distributed] Initializing torch distributed with backend: nccl
[2021-02-12 01:02:59,723] [INFO] [distributed.py:39:init_distributed] Initializing torch distributed with backend: nccl
[2021-02-12 01:02:59,828] [INFO] [distributed.py:39:init_distributed] Initializing torch distributed with backend: nccl
[2021-02-12 01:02:59,976] [INFO] [distributed.py:39:init_distributed] Initializing torch distributed with backend: nccl
Traceback (most recent call last):
  File "./finetune_trainer.py", line 367, in <module>
    main()
  File "./finetune_trainer.py", line 160, in main
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
  File "/home/pajansen/github/transformers-feb4-2021/transformers/src/transformers/hf_argparser.py", line 189, in parse_args_into_dataclasses
    raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}")
ValueError: Some specified arguments are not used by the HfArgumentParser: ['--num_beams', '8']
(The same traceback is repeated by the remaining three ranks; their interleaved output is omitted.)
@stas00
Contributor

stas00 commented Feb 12, 2021

It's --eval_beams in that particular script:

./finetune_trainer.py -h | grep beams
                           [--tgt_lang TGT_LANG] [--eval_beams EVAL_BEAMS]
  --eval_beams EVAL_BEAMS
                        # num_beams to use for evaluation.
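
To illustrate why the parser complained: HfArgumentParser only accepts flags that are declared on the dataclasses it is given, and anything left over raises the ValueError you saw. A minimal sketch - the dataclass here is hypothetical, just for illustration:

```python
from dataclasses import dataclass, field
from transformers import HfArgumentParser

@dataclass
class EvalArgs:  # hypothetical stand-in for the script's real argument dataclasses
    eval_beams: int = field(default=1)

parser = HfArgumentParser((EvalArgs,))

# A declared flag parses fine:
(args,) = parser.parse_args_into_dataclasses(args=["--eval_beams", "8"])
print(args.eval_beams)  # 8

# An undeclared flag raises:
#   ValueError: Some specified arguments are not used by the HfArgumentParser: ['--num_beams', '8']
# parser.parse_args_into_dataclasses(args=["--num_beams", "8"])
```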

This script is going to be retired soon and run_seq2seq.py is the replacement; there, at my suggestion, we switched to num_beams to match model.config.num_beams.
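
The practical effect is that generate() falls back to model.config.num_beams when the argument isn't passed explicitly - roughly like this (sketch, with t5-small as a stand-in):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

model.config.num_beams = 8  # picked up by generate() when num_beams isn't passed
inputs = tok("translate English to German: The house is small.", return_tensors="pt")
out = model.generate(**inputs, max_length=32)
print(tok.decode(out[0], skip_special_tokens=True))
```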

@PeterAJansen
Author

Thanks -- I was doing it the complicated way, looking through the seq2seq trainer to verify that num_beams was being passed, when really I should have started with finetune_trainer.py to verify that the argument name was the same. :)

That did get rid of the argument error. But I am now seeing different errors:

  1. I received the "RuntimeError: Input, output and indices must be on the current device" error, but then realized that was fixed in [trainer] deepspeed bug fixes and tests #10039 , so I did a pull of master.

  2. Then I was getting OOM errors when calling the trainer with just --do_predict. I tried reducing eval_beams to 1, then excluding the argument altogether, and the OOM is still thrown.

  3. To figure out whether this was a broader issue from the pull, I went back to rerunning my fine-tuning script, but it's also now throwing OOM on T5-11B (it worked okay on my pull from ~Feb 4th). I'm running a few more tests to rule out whether it's something I accidentally changed (so far nothing). I should probably start a fresh issue.

@stas00
Contributor

stas00 commented Feb 12, 2021

You probably need to start transitioning to run_seq2seq.py, as finetune_trainer.py is about to be demoted into the legacy underworld.

I haven't fully figured out how to do it, as not everything was ported, but I'm updating notes here: #10036 as I learn new nuances - one of the main changes is that datasets are now handled in a completely different way.


> To figure out whether this was a broader issue from the pull, I went back to rerunning my fine-tuning script, but it's also now throwing OOM on T5-11B

Yes, I remember I had encountered that too - I went back to the original scripts that I knew worked (#9996), then compared the changes I had made and discovered which of them led to higher GPU memory usage.

Also note that since the merge of #10114, the DeepSpeed integration is completely contained in the train() stage (since it doesn't have anything to offer during eval at the moment). I think this would then impact the ability to load the 45GB t5-11b model onto a 40GB GPU, because DeepSpeed was loading it in fp16 (22GB) and the HF trainer can't do that. But this is a very recent change. I started looking at doing fp16 during eval in the HF Trainer, but it looks like this is a wildcard and many models fail to deliver when .half()'ed.

Before this PR was merged, if you were to train and then eval, the smaller (fp16) model would be available for eval. I'm not yet sure how best to proceed - surely if one can train a model, they should be able to eval it too.

edit: looking closer, self.model will remain as it was after train anyway, so this PR actually shouldn't have affected the eval stage - i.e. the model should remain in fp16 if train set it. But if train wasn't run, it surely won't be able to load in fp32 (45GB > 40GB).
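
For the sizes above, the back-of-the-envelope arithmetic for the weights alone (optimizer states and activations excluded), assuming roughly 11.3B parameters for t5-11b:

```python
n_params = 11.3e9  # approximate t5-11b parameter count
gb = 1e9
print(f"fp32: {n_params * 4 / gb:.0f} GB")  # ~45 GB -> doesn't fit on a 40GB card
print(f"fp16: {n_params * 2 / gb:.0f} GB")  # ~23 GB -> fits
```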

@PeterAJansen
Author

Thanks -- I migrated to run_seq2seq.py and I'm now able to replicate the OOM error on the README examples (assuming I have DeepSpeed configured correctly). So it does seem like a broader issue, and we may be back to not being able to train T5-11B on the 40GB cards on the current master (though I can always go back and try to find a commit from the past week that's post-eval-issue-fix and pre-new-issue).

Since this is unrelated to the `--num_beams` argument, I put it in a new issue (#10161), and we can probably close this one.
