
Fix apex distributed training #1124

Merged (11 commits into pytorch:master, Jul 19, 2019)
Conversation

@vinhngx (Contributor) commented Jul 16, 2019

Fixing issue #1119

@codecov-io commented Jul 16, 2019

Codecov Report

Merging #1124 into master will increase coverage by 0.1%.
The diff coverage is n/a.


@@            Coverage Diff            @@
##           master    #1124     +/-   ##
=========================================
+ Coverage   64.79%   64.89%   +0.1%     
=========================================
  Files          68       68             
  Lines        5417     5413      -4     
  Branches      835      835             
=========================================
+ Hits         3510     3513      +3     
+ Misses       1648     1643      -5     
+ Partials      259      257      -2
Impacted Files                           Coverage Δ
torchvision/transforms/transforms.py     80.94% <0%> (+0.58%) ⬆️
torchvision/datasets/fakedata.py         26.92% <0%> (+3.58%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8d580a1...df5eefa.

@fmassa (Member) left a comment

Thanks for the quick fix!

I have a few comments. Also, lint seems to be failing.

    utils.init_distributed_mode(args)
    print(args)
    if args.distributed:
        torch.cuda.set_device(args.gpu)
Member:

This is normally not needed: torch.cuda.set_device(args.gpu) is already handled by utils.init_distributed_mode(args).
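
For context, utils.init_distributed_mode in the classification reference does roughly the following (a condensed sketch, not the exact file; the SLURM branch and logging helpers are omitted):

import os

import torch
import torch.distributed as dist


def init_distributed_mode(args):
    # Pick up rank / world size / local GPU index from the launcher's environment.
    if 'RANK' in os.environ and 'WORLD_SIZE' in os.environ:
        args.rank = int(os.environ['RANK'])
        args.world_size = int(os.environ['WORLD_SIZE'])
        args.gpu = int(os.environ['LOCAL_RANK'])
    else:
        print('Not using distributed mode')
        args.distributed = False
        return

    args.distributed = True
    # This is the call that makes an extra torch.cuda.set_device in train.py redundant.
    torch.cuda.set_device(args.gpu)
    dist.init_process_group(backend='nccl', init_method=args.dist_url,
                            world_size=args.world_size, rank=args.rank)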

Contributor (author):

Correct, I've removed this.

@@ -80,20 +80,21 @@ def _get_cache_path(filepath):
    cache_path = os.path.expanduser(cache_path)
    return cache_path


Member:

Can you keep the newline for the linter?

Contributor (author):

Done

@@ -186,6 +182,10 @@ def main(args):
    model, optimizer = amp.initialize(model, optimizer,
Member:

Does this change the optimizer in a way that breaks lr_scheduler (i.e., the scheduler would point to a different optimizer and thus not update the optimizer's learning rate properly)?

Contributor (author):

Very good question. I don't have a definitive answer, but maybe a simple approach is to just swap the order of these:

    lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=args.lr_step_size, gamma=args.lr_gamma)

    if args.apex:
        model, optimizer = amp.initialize(model, optimizer,
                                          opt_level=args.apex_opt_level
                                          )

Thoughts?

@vinhngx (Contributor, author), Jul 17, 2019:

With the current code order, at epoch 195, I've got this:
Epoch: [195] [ 0/1252] eta: 2:27:28 lr: 0.0008934519439965433 img/s: 755.8400387443562 loss: 0.6304 (0.6304) acc1: 86.7188 (86.7188) acc5: 94.5312 (94.5312) time: 7.0678 data: 6.6480 max mem: 14467
Looks like the learning rate is behaving just fine.

Comment:

The amp.initialize documentation does seem to indicate that the returned optimizer is the same object that was passed as an argument, but I do not see any reason to assume this is true or will continue to be true. As @fmassa suggested, we could just define the scheduler after the call to amp.initialize.
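
Concretely, the suggested ordering would look something like this (a minimal sketch based on the snippet above, assuming model, optimizer, args, and apex's amp are already set up as in the reference script):

if args.apex:
    # Let amp.initialize return its (possibly wrapped) optimizer first...
    model, optimizer = amp.initialize(model, optimizer,
                                      opt_level=args.apex_opt_level)

# ...and only then build the scheduler, so it is guaranteed to hold
# whatever optimizer object amp.initialize returned.
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=args.lr_step_size,
                                               gamma=args.lr_gamma)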

Contributor (author):

Agreed.

def main(args):
    if args.output_dir:
        utils.mkdir(args.output_dir)

Comment:

Why was this code block moved?

Contributor (author):

I wanted to include torch.cuda.set_device(args.gpu) in the apex initialization. This needs to be done after utils.init_distributed_mode(args), since args.gpu is not set before that call. But since we don't need this anymore, the apex initialization code can be moved back to the beginning of main().

@fmassa (Member) left a comment

I have one more concern that I'd like to clarify before merging this.

    lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=args.lr_step_size, gamma=args.lr_gamma)

    model_without_ddp = model
    if args.distributed:
Member:

I'm not very comfortable moving this block to after the optimizer has been created.

The reason is that I'm not sure there are any guarantees that the references to the original parameters of the model will be the same before and after DDP has been applied.
This means that, potentially, the optimizer would be seeing an old set of parameters from the model, and training would not work as expected.
A classic example is when the user moves the model to the GPU after constructing the optimizer, which then still points to the CPU tensors, and training doesn't happen at all.

@pietern do you think this concern is justified or do we have guarantees that DDP (in all its flavours, single or multi GPU per process) will not modify the references to the original parameters after construction?

Comment:

I think it's a valid concern. DDP doesn't modify the model it wraps. It only replicates it in the case of using multiple replicas per process. But it never mutates the underlying model.

Regardless, you'll have a problem when you move it to GPU, or change precision, or do any other destructive mutation. To compartmentalize, I think it's best to just delay creating the optimizer until you've done all of this, including wrapping in DDP.
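
To illustrate the pitfall discussed above with a minimal, hypothetical example (toy model and sizes; PyTorch's documentation likewise recommends moving the model to the target device before constructing its optimizer):

import torch
import torch.nn as nn

model = nn.Linear(128, 10)                               # parameters are created on CPU
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # optimizer captures those references

model.cuda()  # the model is mutated *after* the optimizer was built

# Safer pattern: apply every model-mutating step first (device move,
# amp.initialize, DDP wrapping, ...) and only then build the optimizer,
# so it is guaranteed to see the final set of parameters.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)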

@vinhngx (Contributor, author), Jul 18, 2019:

According to https://github.com/NVIDIA/apex/tree/master/examples/imagenet

To use DDP with apex.amp, the only gotcha is that

model, optimizer = amp.initialize(model, optimizer, flags...)

must precede

model = DDP(model)

I have tested this ordering, i.e. DDP after optimizer creation and with APEX wrapping, until (almost) convergence on 4 GPUs:

Epoch: [236] [1000/1252] eta: 0:01:31 lr: 0.0003902476545030211 img/s: 788.6827109884629 loss: 0.5897 (0.6214) acc1: 85.5469 (85.1801) acc5: 94.9219 (94.7428) time: 0.3561 data: 0.0002 max mem: 14467
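
Put together, the ordering being tested here looks roughly like the following sketch (argument names taken from the reference script; this is an outline, not the exact diff):

model.to(device)

optimizer = torch.optim.SGD(model.parameters(), lr=args.lr,
                            momentum=args.momentum, weight_decay=args.weight_decay)

if args.apex:
    # Per the apex imagenet example, amp.initialize must run before DDP wrapping.
    model, optimizer = amp.initialize(model, optimizer,
                                      opt_level=args.apex_opt_level)

lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=args.lr_step_size,
                                               gamma=args.lr_gamma)

model_without_ddp = model
if args.distributed:
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu])
    model_without_ddp = model.module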

@vinhngx (Contributor, author), Jul 18, 2019:

Next I'm testing DDP after optimizer creation but without APEX on 4 GPUs. A few initial iterations show that the model is learning well.

Epoch: [0] [1300/2503] eta: 0:07:57 lr: 0.045 img/s: 340.7334779101479 loss: 5.3542 (6.1315) acc1: 5.4688 (2.4344) acc5: 17.1875 (8.0179) time: 0.3745 data: 0.0002 max mem: 13502
We'll see how it fares in the end.

Member:

Please let us know how it goes.

Contributor (author):

Midway through the 4-GPU training, it seems like it's converging just fine:
Epoch: [39] [1300/2503] eta: 0:07:41 lr: 0.020883504974794645 img/s: 345.68365366660873 loss: 1.3453 (1.3118) acc1: 67.1875 (68.6227) acc5: 85.9375 (87.3265) time: 0.3707 data: 0.0002 max mem: 13502

Contributor (author):

My training was cut short at epoch 80 due to an out-of-disk-quota error:
Epoch: [80] [2500/2503] eta: 0:00:01 lr: 0.009121630871114108 img/s: 346.88059463297736 loss: 1.0024 (0.9745) acc1: 75.0000 (76.1389) acc5: 90.6250 (91.3506) time: 0.8924 data: 0.0001 max mem: 13502
But overall, the empirical evidence suggests that moving DDP to after optimizer creation works just fine (with or without APEX).

@fmassa (Member) left a comment

Thanks!

One last question: is this a regression that was introduced in APEX recently?

@fmassa merged commit c187c2b into pytorch:master on Jul 19, 2019
@vinhngx (Contributor, author) commented Jul 22, 2019

> Thanks!
> One last question: is this a regression that was introduced in APEX recently?

You mean the throughput? Last time, I think I missed the forward pass in the timing & throughput calculation :)

@andravin:
@vinhngx I think the question is, when did #1119 first appear? Was it always broken, or did a recent update to apex break it?

@vinhngx (Contributor, author) commented Jul 23, 2019

Got it, thanks @andravin. This was a bug from the beginning, introduced while incorporating APEX into vision.
