Data parallel error with O2 and not O1 #227
Comments
I'm running into the same error for O0, O2, and O3; O1 is working as expected. |
Same. |
Same here, at least for O2. O1 does work. |
In general we strongly recommend DistributedDataParallel over DataParallel. That being said, I don't think DataParallel is fundamentally incompatible with Amp control flow. I see one potential problem with your code above: you are calling .cuda on the model after it's been returned from amp.initialize.
Try moving the .cuda call so it happens before amp.initialize, and let me know if it works. |
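As a minimal sketch of that ordering (the toy model, optimizer, and opt level below are illustrative, not from this thread):

```python
import torch
from apex import amp

# Illustrative model/optimizer; the point is the ordering of the calls.
model = torch.nn.Linear(128, 10).cuda()                    # .cuda() before amp.initialize
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
model, optimizer = amp.initialize(model, optimizer, opt_level='O1')
model = torch.nn.DataParallel(model)                       # wrap only after amp.initialize
```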
@mcarilli I tried the fix above, but it still produces the same error. I will transition to DistributedDataParallel. |
Historically we only test with DistributedDataParallel because performance tends to be better, but the dataset sharing issue raised by @seongwook-ham in #269 is a compelling use case. @ptrblck and I will look into it. Current to-do list is better fused optimizers, checkpointing, sparse gradients, and then DataParallel, so it may be a couple weeks before I can give it undivided attention. |
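A rough sketch of that DistributedDataParallel path, assuming a single-process-per-GPU launch (the toy model, learning rate, and launch details below are assumptions for illustration, not taken from the thread):

```python
import argparse
import torch
from apex import amp
from apex.parallel import DistributedDataParallel as DDP

# Assumes launch via: python -m torch.distributed.launch --nproc_per_node=N train.py
parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)
args = parser.parse_args()

torch.distributed.init_process_group(backend='nccl', init_method='env://')
torch.cuda.set_device(args.local_rank)

model = torch.nn.Linear(128, 10).cuda()                    # toy model for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
model, optimizer = amp.initialize(model, optimizer, opt_level='O2')
model = DDP(model)                                         # apex's DDP wrapper handles allreduce
```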
I find that the old API (FP16_Optimizer) works well with nn.DataParallel. |
Please don't use the old FP16_Optimizer API, that may break at any time. It might already be broken. In general, O1 is preferable over O2 anyway, so if O1 works with DataParallel currently, using DataParallel + O1 with the current API is a much better workaround than DataParallel + reverting to the old FP16_Optimizer. We do want to support O2 + DataParallel but haven't gotten a chance to look at it yet. @seongwook-ham I am interested to hear if people have compelling reasons for requiring O2 (or FP16_Optimizer) over O1, because O1 is safer in general. Is O1 significantly slower? |
Yes, the new API with O1 is significantly slower than the old API (FP16_Optimizer with .half()) in the nn.DataParallel case. |
Same |
@mcarilli Unfortunately I'm using FusedAdam, which requires O2. It seems to be a deadlock, so I have to revert to FP16_Optimizer... |
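For context, a minimal sketch of the FusedAdam + O2 combination referred to above (the toy model and hyperparameters are illustrative):

```python
import torch
from apex import amp
from apex.optimizers import FusedAdam

model = torch.nn.Linear(128, 10).cuda()
optimizer = FusedAdam(model.parameters(), lr=1e-3)
# FusedAdam is typically paired with O2-style master weights.
model, optimizer = amp.initialize(model, optimizer, opt_level='O2')
```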
@mcarilli thanks |
I'm hitting the same problem, sadly. |
same issue |
In my case, torch.nn.parallel.DistributedDataParallel doesn't work with any opt level except O1. |
Same issue with DataParallel and O2, for O1 I get a |
Same |
There's a version of DistributedDataParallel that acts like DP within a node and DDP across nodes (https://pytorch-lightning.readthedocs.io/en/latest/Trainer/Distributed%20training/#distributeddataparallel-2-ddp2). However, this is incompatible with apex because of the issue above. What happens is that the casting done here has a bug. After digging into the code, it looks like the forward call is being patched as a "convenience" to cast inputs to .half() and outputs back to float32. A good alternative might be to remove this patching and detect 16-bit in PyTorch to do the casting there. This would avoid any patching of forward as well.

```python
for model in models:
    # Patch the forward method to cast incoming data to the correct type, and
    # outgoing data to float32, so "the user never needs to call .half()."
    # I like writing things explicitly more than decorators.
    def patch_forward(old_fwd):
        def new_fwd(*args, **kwargs):
            output = old_fwd(*applier(args, input_caster),
                             **applier(kwargs, input_caster))
            return applier(output, output_caster)
        return new_fwd

    model.forward = patch_forward(model.forward)
```

The workaround I'm using in Lightning right now is to do this:

```python
def training_step(self, batch, batch_nb):
    x, y = batch
    if self.trainer.use_amp:
        x = x.half()
        y = y.half()
    # process the rest without using forward()
    out = self.model(x)
    ...
```

Whereas normally I'd do this:

```python
def training_step(self, batch, batch_nb):
    x, y = batch
    # process the input without using forward()
    out = self.forward(x)  # <------------ ONLY CHANGE + NO CASTING
    ...
```
|
the same problem |
Same problem with level O0 and |
I have the same problem. O1 works well, but the others do not. |
Any solution to the problem? |
For now I use the old method, "FP16_Optimizer", to work around this problem temporarily. |
@ewrfcas I saw that solution, but it was not working for me. Is there any specific version of Apex I should install to use FP16_Optimizer? |
FP16_Optimizer is in apex.contrib now, and you can simply copy the code from the repository and use it directly. |
The problem seems to be that DataParallel's replication mechanism doesn't work well with forward-method patching. The patched method still refers to the original model copy (via the old forward method, which captures the old self), not to the replica it should apply to, hence the type/device mismatch. Everything would work if the forward patching that does the tensor casting were applied after DataParallel initialization.
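To make that capture problem concrete, here is a small sketch (a hypothetical module, not apex's actual code) of how a forward patched before replication keeps a reference to the original module:

```python
import torch

class Net(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(4, 4)

    def forward(self, x):
        return self.fc(x)

def patch_forward(old_fwd):
    # old_fwd stays bound to the *original* module, so (as described above)
    # replicas on other devices end up calling into the GPU-0 parameters.
    def new_fwd(*args, **kwargs):
        return old_fwd(*[a.half() for a in args], **kwargs).float()
    return new_fwd

model = Net().cuda().half()
model.forward = patch_forward(model.forward)   # patched before replication
model = torch.nn.DataParallel(model)           # replicas inherit a forward that
                                               # still points at the GPU-0 copy
```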
Workaround:

```python
model = apex.amp.initialize(torch.nn.Sequential(model), opt_level='O2')[0]
model = torch.nn.DataParallel(model, device_ids=args.devices)
model.forward = lambda *args, \
    old_fwd=model.forward, \
    input_caster=lambda tensor: tensor.to(
        apex.amp._amp_state.opt_properties.options['cast_model_type']), \
    output_caster=lambda tensor: tensor.to(
        apex.amp._amp_state.opt_properties.options['cast_model_outputs']
        if apex.amp._amp_state.opt_properties.options.get('cast_model_outputs') is not None
        else torch.float32), \
    **kwargs: apex.amp._initialize.applier(
        old_fwd(*apex.amp._initialize.applier(args, input_caster),
                **apex.amp._initialize.applier(kwargs, input_caster)),
        output_caster)
```

In the case of DataParallel, forward must be patched after the DataParallel(...) call. |
Right now I'm working hard on native PyTorch support for mixed precision, which will accommodate DistributedDataParallel, DataParallel, and model parallel training, targeting the 1.5 release. Apex as a source for mixed precision is not a future-proof path; it's annoying for people to install something separate. If Apex helps, that's great, but the sooner we get something that's packaged and tested as a native component of PyTorch, the better. If Apex does not work for you currently, my best advice is to wait for the upstream support. See #269 (comment). |
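For readers landing here later: that upstream support shipped as torch.cuda.amp. A rough sketch of the native API (the toy model, data, and hyperparameters below are illustrative):

```python
import torch

# Toy model and synthetic data purely for illustration.
model = torch.nn.Linear(128, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

for _ in range(10):
    inputs = torch.randn(32, 128, device='cuda')
    targets = torch.randint(0, 10, (32,), device='cuda')

    optimizer.zero_grad()
    with torch.cuda.amp.autocast():                # ops run in float16 where safe
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
    scaler.scale(loss).backward()                  # scale to avoid fp16 gradient underflow
    scaler.step(optimizer)                         # unscales grads, then steps
    scaler.update()
```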
Problem solved. In my case, problem was I was passing the model through |
@phosseini What are your APEX and Torch versions? |
@vadimkantorov Thank you so much! The workaround you proposed works well for my issue! It indeed helped me a lot. |
When using O2, data parallel does not work:

```
RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_convolution)
```

However, with O1 everything works just fine.