Trainer is using DataParallel on parallelized models #9577
Comments
How is your model parallelized? Without that piece of code we can't reproduce the bug and help you.
Thanks @sgugger. In my test, I'm using some code originally derived from the run_clm.py example. I'm trying to fine-tune a GPT2 model I've trained from scratch. The model was parallelized with the following lines, and this exact fine-tuning script ran successfully yesterday in 4.1.1, using the --model_parallel training arg.
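A minimal sketch of the kind of parallelization code being referred to (the actual lines were not captured in this copy of the thread; the checkpoint path and device map below are placeholders, assuming the GPT2LMHeadModel.parallelize() API):

```python
from transformers import GPT2LMHeadModel

# Placeholder checkpoint path for a GPT-2 model trained from scratch.
model = GPT2LMHeadModel.from_pretrained("path/to/gpt2-from-scratch")

# Illustrative split of the 12 transformer blocks across two GPUs.
device_map = {
    0: list(range(0, 6)),
    1: list(range(6, 12)),
}
model.parallelize(device_map)
```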
The error I'm getting now looks a lot like what would happen if I left out the --model_parallel arg in the previous version.
Please post the full trace. I have only experimented with t5 and bart MP so far, but gpt2 is supposed to be very similar. Most likely the outputs aren't being copied back to the 0th gpu on return, so this won't have anything to do with the trainer. Most likely the issue you encountered has to do with evaluation and not training. I had to fix t5-MP to do that, but the PR with the fix hasn't been merged (see transformers/src/transformers/models/t5/modeling_t5.py, lines 1263 to 1266 at commit 58d047a).
I won't be surprised if gpt2 is missing that too.
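As a rough illustration of the pattern being described (a paraphrase of the idea, not the actual modeling_t5.py or modeling_gpt2.py code): in a model-parallel forward pass, the final hidden states need to be moved back to the device that holds the LM head on the first GPU before the logits are computed and returned.

```python
import torch
from torch import nn

def project_to_vocab(hidden_states: torch.Tensor, lm_head: nn.Linear,
                     model_parallel: bool, first_device: torch.device) -> torch.Tensor:
    """Illustrative only: in model-parallel mode, copy the final hidden states back
    to the device holding the LM head (placed on the first GPU) before computing
    logits, so callers such as the Trainer receive outputs on the expected device."""
    if model_parallel:
        torch.cuda.set_device(first_device)
        hidden_states = hidden_states.to(lm_head.weight.device)
    return lm_head(hidden_states)
```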
The current MP implementations are very limited and at the moment I highly recommend you look at DeepSpeed instead. We also removed the --model_parallel training argument.
That's easy then. The error, though, very much reminded me of the issue I described in my comment above.
Thanks both! @stas00 Definitely excited to check out DeepSpeed – that's the reason I started testing my code in 4.2.0.
Environment info

transformers version: 4.2.0

Who can help

@sgugger @stas00
Information
I'm trying out the 4.2.0 release with a training script that had been working in 4.1.1.
I'm parallelizing my model over two GPUs, and I had been using the
--model_parallel
training arg in the previous version. Now that it's no longer used, I removed the arg from my training command, but I'm getting an error as though the DataParallel is being used and the model isn't being detected as parallelized:RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:1
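For reference, a minimal sketch of what produces that error, assuming a machine with at least two GPUs (the toy model is a stand-in, not the actual GPT-2 setup): torch.nn.DataParallel requires every parameter and buffer to live on device_ids[0], so wrapping a model whose layers are spread across cuda:0 and cuda:1 fails on the first forward pass.

```python
import torch
from torch import nn

# Toy two-GPU stand-in for a parallelized model (requires at least 2 visible GPUs).
model = nn.Sequential(
    nn.Linear(8, 8).to("cuda:0"),
    nn.Linear(8, 8).to("cuda:1"),
)

# Roughly what the Trainer does when args.n_gpu > 1.
dp = nn.DataParallel(model)

x = torch.randn(4, 8, device="cuda:0")
dp(x)  # RuntimeError: module must have its parameters and buffers on device cuda:0 ...
```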
I did some debugging, and everything seems okay with my model (trainer.is_model_parallel returns True), but trainer.args.n_gpu is still 2. I admit that I don't totally understand what's happening in the trainer code, but it might be an error on line 289?

self.args._n_gpu = 1

Should that be self.args.n_gpu = 1, without the leading underscore?
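One possible reason the underscore is intentional, shown with a generic sketch (an assumption about the pattern, not the real TrainingArguments source): if n_gpu is exposed as a read-only @property backed by a private _n_gpu field, then writing the private field is the only way to override the detected GPU count.

```python
class Args:
    """Generic read-only-property pattern (for illustration only,
    not the actual TrainingArguments implementation)."""

    def __init__(self, detected_gpus: int):
        self._n_gpu = detected_gpus  # private backing field

    @property
    def n_gpu(self) -> int:
        return self._n_gpu  # public attribute is read-only


args = Args(detected_gpus=2)
# args.n_gpu = 1   # would raise AttributeError: can't set attribute
args._n_gpu = 1    # writing the backing field is the only way to override it
print(args.n_gpu)  # -> 1
```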
To reproduce
Steps to reproduce the behavior: