Fix apex distributed training #1124
Conversation
Codecov Report
@@ Coverage Diff @@
## master #1124 +/- ##
=========================================
+ Coverage 64.79% 64.89% +0.1%
=========================================
Files 68 68
Lines 5417 5413 -4
Branches 835 835
=========================================
+ Hits 3510 3513 +3
+ Misses 1648 1643 -5
+ Partials 259 257 -2
Continue to review full report at Codecov.
Thanks for the quick fix!
I have a few comments. Also, lint seems to be failing.
references/classification/train.py
Outdated
utils.init_distributed_mode(args)
print(args)
if args.distributed:
    torch.cuda.set_device(args.gpu)
This is normally not needed, because this is already handled by vision/references/classification/utils.py (line 249 in 8d580a1):
torch.cuda.set_device(args.gpu)
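For context, such a helper typically looks roughly like this (a hedged sketch of the idea, not the verbatim contents of utils.py):
import os
import torch
import torch.distributed as dist

def init_distributed_mode(args):
    # Read rank / world size / local device from the launcher's environment
    # (illustrative only; the real helper also covers other launch modes and
    # a non-distributed fallback).
    args.rank = int(os.environ["RANK"])
    args.world_size = int(os.environ["WORLD_SIZE"])
    args.gpu = int(os.environ["LOCAL_RANK"])
    args.distributed = True

    # The set_device call mentioned above lives here, so train.py does not
    # need to repeat it.
    torch.cuda.set_device(args.gpu)
    dist.init_process_group(backend="nccl", init_method="env://",
                            world_size=args.world_size, rank=args.rank)
    dist.barrier()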
Correct, I've removed this.
references/classification/train.py
Outdated
@@ -80,20 +80,21 @@ def _get_cache_path(filepath):
    cache_path = os.path.expanduser(cache_path)
    return cache_path

Can you keep the newline for the linter?
Done
@@ -186,6 +182,10 @@ def main(args):
    model, optimizer = amp.initialize(model, optimizer,
Does this change the optimizer in a way that breaks lr_scheduler (i.e., lr_scheduler would point to a different optimizer and thus not update the optimizer's learning rate properly)?
very good question. I don't have a definitive answer, but maybe a simple approach is just to swap the order of these:
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=args.lr_step_size, gamma=args.lr_gamma)
if args.apex:
    model, optimizer = amp.initialize(model, optimizer,
                                      opt_level=args.apex_opt_level)
thoughts?
With the current code order, at epoch 195, I've got this:
Epoch: [195] [ 0/1252] eta: 2:27:28 lr: 0.0008934519439965433 img/s: 755.8400387443562 loss: 0.6304 (0.6304) acc1: 86.7188 (86.7188) acc5: 94.5312 (94.5312) time: 7.0678 data: 6.6480 max mem: 14467
Looks like the learning rate is behaving just fine.
The amp.initialize documentation does seem to indicate that the returned optimizer is the same object that was passed as an argument, but I do not see any reason to assume this is true or will continue to be true. As @fmassa suggested, we could just define the scheduler after the call to amp.initialize.
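For concreteness, a minimal sketch of that ordering (assuming apex is installed; the model and hyperparameter values are illustrative, not the ones in train.py):
import torch
import torchvision
from apex import amp  # assumes NVIDIA apex is installed

model = torchvision.models.resnet50().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Run amp.initialize first, then build the scheduler from whatever optimizer
# object amp returns, so the scheduler can never end up bound to a stale one.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)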
agreed
references/classification/train.py
Outdated
def main(args):
    if args.output_dir:
        utils.mkdir(args.output_dir)
Why was this code block moved?
I wanted to include torch.cuda.set_device(args.gpu) in the apex initialization. This has to be done after utils.init_distributed_mode(args), since args.gpu is not set before that call. But since we don't need this anymore, the apex initialization code can be moved back to the beginning of main().
…_distributed_mode
…Though, doing apex initialization after lr_scheduler seems to work fine as well
I have one more concern that I'd like to clarify before merging this.
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=args.lr_step_size, gamma=args.lr_gamma)

model_without_ddp = model
if args.distributed:
I'm not very comfortable moving this block after having created the optimizer.
The reason is that I'm not sure if there are any guarantees that the references to the original parameters of the model will be the same before and after DDP has been applied.
This means that, potentially, the optimizer would be seeing an old set of parameters from the model, and training would not work as expected.
A classical example is when the user moves the model to the GPU after having constructed the optimizer, which then still points to the CPU tensors, and training doesn't happen at all.
@pietern do you think this concern is justified or do we have guarantees that DDP (in all its flavours, single or multi GPU per process) will not modify the references to the original parameters after construction?
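To make the classical example concrete, a minimal sketch of the safe ordering (illustrative model and hyperparameters, not this PR's code):
import torch
import torchvision

model = torchvision.models.resnet18()

# Pitfall: building the optimizer here would bind it to the CPU parameters,
# which a later model.cuda() may leave behind.
# optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Safer ordering: finish moving/mutating the model first, then build the
# optimizer from the parameters the training loop will actually update.
model = model.cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)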
I think it's a valid concern. DDP doesn't modify the model it wraps. It only replicates it in the case of using multiple replicas per process. But it never mutates the underlying model.
Regardless, you'll have a problem when you move it to GPU, or change precision, or do any other destructive mutation. To compartmentalize, I think it's best to just delay creating the optimizer until you've done all of this, including wrapping in DDP.
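Sketched out, that suggestion looks roughly like this (illustrative only, assuming the script is launched with torchrun so RANK/WORLD_SIZE/LOCAL_RANK are set in the environment; this is not the code in this PR):
import os
import torch
import torchvision

torch.distributed.init_process_group(backend="nccl", init_method="env://")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Apply every "destructive" mutation first: device placement, any precision
# change, and the DDP wrapping itself ...
model = torchvision.models.resnet50().cuda()
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])

# ... and only then create the optimizer, so it references the final parameters.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
Note that with apex this has to be reconciled with amp.initialize, which itself takes the optimizer (see the ordering quoted below).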
According to https://github.com/NVIDIA/apex/tree/master/examples/imagenet
To use DDP with apex.amp, the only gotcha is that
model, optimizer = amp.initialize(model, optimizer, flags...)
must precede
model = DDP(model)
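Put together, the ordering being discussed ends up roughly as follows (a hedged sketch, assuming apex is installed and distributed initialization has already set the current CUDA device; the model and hyperparameter values are illustrative):
import torch
import torchvision
from apex import amp  # assumes NVIDIA apex is installed

# Assumes the process group is initialized and the current device is set.
local_rank = torch.cuda.current_device()

model = torchvision.models.resnet50().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)

# amp.initialize needs the optimizer and must come before the DDP wrapping ...
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])

# ... and the lr scheduler is built from the optimizer amp returned.
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)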
I have tested this ordering, i.e. DDP after optimizer creation & with APEX wrapping, till (almost) convergence with 4 GPUs:
Epoch: [236] [1000/1252] eta: 0:01:31 lr: 0.0003902476545030211 img/s: 788.6827109884629 loss: 0.5897 (0.6214) acc1: 85.5469 (85.1801) acc5: 94.9219 (94.7428) time: 0.3561 data: 0.0002 max mem: 14467
Next I'm testing DDP after optimizer creation but without APEX on 4 GPUs. A few initial iterations show that the model is learning well:
Epoch: [0] [1300/2503] eta: 0:07:57 lr: 0.045 img/s: 340.7334779101479 loss: 5.3542 (6.1315) acc1: 5.4688 (2.4344) acc5: 17.1875 (8.0179) time: 0.3745 data: 0.0002 max mem: 13502
Will see how it fares in the end.
Please let us know how it goes
Midway through the 4-GPU training, it seems like it's converging just fine:
Epoch: [39] [1300/2503] eta: 0:07:41 lr: 0.020883504974794645 img/s: 345.68365366660873 loss: 1.3453 (1.3118) acc1: 67.1875 (68.6227) acc5: 85.9375 (87.3265) time: 0.3707 data: 0.0002 max mem: 13502
My training was cut short at epoch 80 due to an out-of-disk-quota error:
Epoch: [80] [2500/2503] eta: 0:00:01 lr: 0.009121630871114108 img/s: 346.88059463297736 loss: 1.0024 (0.9745) acc1: 75.0000 (76.1389) acc5: 90.6250 (91.3506) time: 0.8924 data: 0.0001 max mem: 13502
But overall, the empirical evidence suggests that moving DDP to after optimizer creation works just fine (with or without APEX).
Thanks!
One last question: is this a regression that was introduced in APEX recently?
You mean the throughput? Last time I think I missed the forward pass in the timing & throughput calculation :)
Got it, thanks @andravin. This was a bug from the beginning, introduced while incorporating APEX into vision.
Fixing issue #1119