torchvision classification train.py script fails with DistributedDataParallel and --apex #1119
Comments
Here is the relevant section of the train.py script: vision/references/classification/train.py, lines 174 to 188 at commit 2b73a48.
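Roughly, the ordering in that section looks like the sketch below (a paraphrase, not the verbatim source at that commit; `model`, `optimizer`, and `args`, as well as the argument name `args.apex_opt_level`, are assumed from the surrounding script):

```python
# Paraphrased sketch of the problematic ordering (not the verbatim train.py source);
# `model`, `optimizer`, and `args` are defined earlier in the script.
import torch
from apex import amp

model_without_ddp = model
if args.distributed:
    # The model is wrapped in DistributedDataParallel first...
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu])
    model_without_ddp = model.module

if args.apex:
    # ...so amp.initialize receives the already-wrapped model, which Apex rejects.
    model, optimizer = amp.initialize(model, optimizer,
                                      opt_level=args.apex_opt_level)
```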
@vinhngx given that you originally added support for APEX mixed precision, could you have a look? I have no experience with APEX and I'm busy with other things right now, so I won't have a chance to look at this myself.
Glad someone spotted this :)
In addition:
Therefore, we also need to add
I've created a PR here: #1124
@vinhngx
Indeed it is. Thanks for pointing this out. So we don't need to do it in the Apex initialization code.
@vinhngx One question: is this a change in behavior in APEX, or was it a bug from the beginning?
@fmassa I suppose it was a bug from the beginning.
Fixed in #1124
Environment:
torch 1.1, CUDA 10.0, cuDNN 7.5, torchvision 0.3, apex built from master (commit 574fe2449cbe6ae4c8af53c6ecb1b5fc13877234)
Summary:
The torchvision references train.py script fails when used with DistributedDataParallel and --apex. The error indicates that "the parallel wrappers should only be applied to the model(s) AFTER the model(s) have been returned from amp.initialize".
Commandline:
Output:
Fix:
Simply moving the torch.nn.parallel.DistributedDataParallel call down a few lines in the script, so that it happens after the amp.initialize call, seems to fix the issue, though I have not yet tested it thoroughly with different combinations of command-line arguments.
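For illustration, a minimal sketch of the reordered section (assuming the script's variable and argument names, e.g. `args.apex`, `args.apex_opt_level`, `args.gpu`; this is not the exact patch from #1124):

```python
# Paraphrased sketch of the fix: amp.initialize runs on the bare model,
# and the DistributedDataParallel wrapper is applied only afterwards,
# as the Apex error message requires.
import torch
from apex import amp

if args.apex:
    # Initialize Apex AMP on the unwrapped model and optimizer first.
    model, optimizer = amp.initialize(model, optimizer,
                                      opt_level=args.apex_opt_level)

model_without_ddp = model
if args.distributed:
    # Only now apply the parallel wrapper around the AMP-initialized model.
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu])
    model_without_ddp = model.module
```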