Fix apex distributed training #1124
Conversation
Codecov Report
@@ Coverage Diff @@
## master #1124 +/- ##
=========================================
+ Coverage 64.79% 64.89% +0.1%
=========================================
Files 68 68
Lines 5417 5413 -4
Branches 835 835
=========================================
+ Hits 3510 3513 +3
+ Misses 1648 1643 -5
+ Partials 259 257 -2
Continue to review full report at Codecov.
Thanks for the quick fix!
I have a few comments. Also, lint seems to be failing.
references/classification/train.py
Outdated
utils.init_distributed_mode(args)
print(args)
if args.distributed:
    torch.cuda.set_device(args.gpu)
This is normally not needed, because this is already handled by vision/references/classification/utils.py (line 249 in 8d580a1):
torch.cuda.set_device(args.gpu)
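For context, such a helper typically looks roughly like this (a hedged sketch of the idea, not the verbatim contents of utils.py):
import os
import torch
import torch.distributed as dist

def init_distributed_mode(args):
    # Read rank / world size / local device from the launcher's environment
    # (illustrative only; the real helper also covers other launch modes and
    # a non-distributed fallback).
    args.rank = int(os.environ["RANK"])
    args.world_size = int(os.environ["WORLD_SIZE"])
    args.gpu = int(os.environ["LOCAL_RANK"])
    args.distributed = True

    # The set_device call mentioned above lives here, so train.py does not
    # need to repeat it.
    torch.cuda.set_device(args.gpu)
    dist.init_process_group(backend="nccl", init_method="env://",
                            world_size=args.world_size, rank=args.rank)
    dist.barrier()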
Correct, I've removed this.
references/classification/train.py
Outdated
@@ -80,20 +80,21 @@ def _get_cache_path(filepath):
    cache_path = os.path.expanduser(cache_path)
    return cache_path

Can you keep the newline for the linter?
Done
@@ -186,6 +182,10 @@ def main(args):
    model, optimizer = amp.initialize(model, optimizer,
Does this change the optimizer in a way that breaks lr_scheduler (i.e., lr_scheduler would point to a different optimizer and thus not update the optimizer's learning rate properly)?
very good question. I don't have a definitive answer, but maybe a simple approach is just to swap the order of these:
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=args.lr_step_size, gamma=args.lr_gamma)
if args.apex:
    model, optimizer = amp.initialize(model, optimizer,
                                      opt_level=args.apex_opt_level)
thoughts?
With the current code order, at epoch 195, I've got this:
Epoch: [195] [ 0/1252] eta: 2:27:28 lr: 0.0008934519439965433 img/s: 755.8400387443562 loss: 0.6304 (0.6304) acc1: 86.7188 (86.7188) acc5: 94.5312 (94.5312) time: 7.0678 data: 6.6480 max mem: 14467
Looks like the learning rate is behaving just fine.
The amp.initialize documentation does seem to indicate that the returned optimizer is the same object that was passed as an argument, but I do not see any reason to assume this is true or will continue to be true. As @fmassa suggested, we could just define the scheduler after the call to amp.initialize.
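For concreteness, a minimal sketch of that ordering (assuming apex is installed; the model and hyperparameter values are illustrative, not the ones in train.py):
import torch
import torchvision
from apex import amp  # assumes NVIDIA apex is installed

model = torchvision.models.resnet50().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Run amp.initialize first, then build the scheduler from whatever optimizer
# object amp returns, so the scheduler can never end up bound to a stale one.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)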
agreed
references/classification/train.py
Outdated
def main(args):
    if args.output_dir:
        utils.mkdir(args.output_dir)
Why was this code block moved?
I wanted to include torch.cuda.set_device(args.gpu) in the apex initialization. This has to be done after utils.init_distributed_mode(args), since args.gpu is not set before that call. But since we don't need this anymore, the apex initialization code can be moved back to the beginning of main().
…_distributed_mode
…Though, doing apex initialization after lr_scheduler seems to work fine as well
I have one more concern that I'd like to clarify before merging this.
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=args.lr_step_size, gamma=args.lr_gamma)

model_without_ddp = model
if args.distributed:
I'm not very comfortable moving this block after having created the optimizer.
The reason is that I'm not sure if there are any guarantees that the references to the original parameters of the model will be the same before and after DDP has been applied.
This means that, potentially, the optimizer would be seeing an old set of parameters from the model, and training would not work as expected.
A classical example is when the user moves the model to the GPU after having constructed the optimizer, which then still points to the CPU tensors, and training doesn't happen at all.
@pietern do you think this concern is justified or do we have guarantees that DDP (in all its flavours, single or multi GPU per process) will not modify the references to the original parameters after construction?
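To make the classical example concrete, a minimal sketch of the safe ordering (illustrative model and hyperparameters, not this PR's code):
import torch
import torchvision

model = torchvision.models.resnet18()

# Pitfall: building the optimizer here would bind it to the CPU parameters,
# which a later model.cuda() may leave behind.
# optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Safer ordering: finish moving/mutating the model first, then build the
# optimizer from the parameters the training loop will actually update.
model = model.cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)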
I think it's a valid concern. DDP doesn't modify the model it wraps. It only replicates it in the case of using multiple replicas per process. But it never mutates the underlying model.
Regardless, you'll have a problem when you move it to GPU, or change precision, or do any other destructive mutation. To compartmentalize, I think it's best to just delay creating the optimizer until you've done all of this, including wrapping in DDP.
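Sketched out, that suggestion looks roughly like this (illustrative only, assuming the script is launched with torchrun so RANK/WORLD_SIZE/LOCAL_RANK are set in the environment; this is not the code in this PR):
import os
import torch
import torchvision

torch.distributed.init_process_group(backend="nccl", init_method="env://")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Apply every "destructive" mutation first: device placement, any precision
# change, and the DDP wrapping itself ...
model = torchvision.models.resnet50().cuda()
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])

# ... and only then create the optimizer, so it references the final parameters.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
Note that with apex this has to be reconciled with amp.initialize, which itself takes the optimizer (see the ordering quoted below).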
According to https://github.com/NVIDIA/apex/tree/master/examples/imagenet
To use DDP with apex.amp, the only gotcha is that
model, optimizer = amp.initialize(model, optimizer, flags...)
must precede
model = DDP(model)
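Put together, the ordering being discussed ends up roughly as follows (a hedged sketch, assuming apex is installed and distributed initialization has already set the current CUDA device; the model and hyperparameter values are illustrative):
import torch
import torchvision
from apex import amp  # assumes NVIDIA apex is installed

# Assumes the process group is initialized and the current device is set.
local_rank = torch.cuda.current_device()

model = torchvision.models.resnet50().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)

# amp.initialize needs the optimizer and must come before the DDP wrapping ...
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])

# ... and the lr scheduler is built from the optimizer amp returned.
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)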
I have tested this ordering, i.e. DDP after optimizer creation & with APEX wrapping, till (almost) convergence with 4 GPUs:
Epoch: [236] [1000/1252] eta: 0:01:31 lr: 0.0003902476545030211 img/s: 788.6827109884629 loss: 0.5897 (0.6214) acc1: 85.5469 (85.1801) acc5: 94.9219 (94.7428) time: 0.3561 data: 0.0002 max mem: 14467
Next I'm testing DDP after optimizer creation but without APEX on 4 GPUs. A few initial iterations show that the model is learning well:
Epoch: [0] [1300/2503] eta: 0:07:57 lr: 0.045 img/s: 340.7334779101479 loss: 5.3542 (6.1315) acc1: 5.4688 (2.4344) acc5: 17.1875 (8.0179) time: 0.3745 data: 0.0002 max mem: 13502
Will see how it fares in the end.
Please let us know how it goes
Midway through the 4-GPU training, it seems like it's converging just fine:
Epoch: [39] [1300/2503] eta: 0:07:41 lr: 0.020883504974794645 img/s: 345.68365366660873 loss: 1.3453 (1.3118) acc1: 67.1875 (68.6227) acc5: 85.9375 (87.3265) time: 0.3707 data: 0.0002 max mem: 13502
My training was cut short at epoch 80 due to an out-of-disk-quota error:
Epoch: [80] [2500/2503] eta: 0:00:01 lr: 0.009121630871114108 img/s: 346.88059463297736 loss: 1.0024 (0.9745) acc1: 75.0000 (76.1389) acc5: 90.6250 (91.3506) time: 0.8924 data: 0.0001 max mem: 13502
But overall, the empirical evidence suggests that moving DDP to after optimizer creation works just fine (with or without APEX).
Thanks!
One last question: is this a regression that was introduced in APEX recently?
You mean the throughput? Last time I think I missed the forward pass in the timing & throughput calculation :)
Got it, thanks @andravin. This was a bug from the beginning, introduced while incorporating APEX into vision.
Fixing issue #1119