
DataParallel doesn't work #1

Open
gerwang opened this issue Aug 12, 2019 · 0 comments

gerwang commented Aug 12, 2019

Because the operator functions are stored as dynamic attributes, when I wrap the model in nn.DataParallel the following error arises during the forward pass:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-10-f58217f83c02> in <module>()
     28                           shapedata=shapedata,
     29                           metadata_dir=checkpoint_path, samples_dir=samples_path,
---> 30                           checkpoint_path = args['checkpoint_file'])

/mnt/Data2/jingwang/git-task/Neural3DMM/train_funcs.py in train_autoencoder_dataloader(dataloader_train, dataloader_val, device, model, optim, loss_fn, bsize, start_epoch, n_epochs, eval_freq, scheduler, writer, save_recons, shapedata, metadata_dir, samples_dir, checkpoint_path)
     25             cur_bsize = tx.shape[0]
     26 
---> 27             tx_hat = model(tx)
     28             loss = loss_fn(tx, tx_hat)
     29 

${HOME}/Data/anaconda3/envs/py2-spiral/lib/python2.7/site-packages/torch/nn/modules/module.pyc in __call__(self, *input, **kwargs)
    491             result = self._slow_forward(*input, **kwargs)
    492         else:
--> 493             result = self.forward(*input, **kwargs)
    494         for hook in self._forward_hooks.values():
    495             hook_result = hook(self, input, result)

${HOME}/Data/anaconda3/envs/py2-spiral/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.pyc in forward(self, *inputs, **kwargs)
    150             return self.module(*inputs[0], **kwargs[0])
    151         replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
--> 152         outputs = self.parallel_apply(replicas, inputs, kwargs)
    153         return self.gather(outputs, self.output_device)
    154 

${HOME}/Data/anaconda3/envs/py2-spiral/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.pyc in parallel_apply(self, replicas, inputs, kwargs)
    160 
    161     def parallel_apply(self, replicas, inputs, kwargs):
--> 162         return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
    163 
    164     def gather(self, outputs, output_device):

${HOME}/Data/anaconda3/envs/py2-spiral/lib/python2.7/site-packages/torch/nn/parallel/parallel_apply.pyc in parallel_apply(modules, inputs, kwargs_tup, devices)
     81         output = results[i]
     82         if isinstance(output, Exception):
---> 83             raise output
     84         outputs.append(output)
     85     return outputs

RuntimeError: binary_op(): expected both inputs to be on same device, but input a is on cuda:1 and input b is on cuda:0

This behavior seems to be explained in pytorch/pytorch#8637.
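
For reference, here is a minimal sketch (not the actual Neural3DMM code; the attribute name is made up) that reproduces the same class of failure: a tensor stored as a plain attribute is shared across all DataParallel replicas and stays on cuda:0, while the replicated parameters and the scattered inputs move to the other GPUs.

```python
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.linear = nn.Linear(4, 4)
        # Plain tensor attribute (hypothetical stand-in for the operator
        # tensors): it is neither a Parameter nor a registered buffer, so
        # DataParallel.replicate() leaves it on cuda:0 for every replica.
        self.offset = torch.ones(4, device='cuda:0')

    def forward(self, x):
        # On the replica running on cuda:1, x is on cuda:1 while
        # self.offset is still on cuda:0 -> cross-device binary op.
        return self.linear(x) + self.offset

model = nn.DataParallel(ToyModel().cuda(), device_ids=[0, 1])
# With two or more GPUs this raises a device-mismatch RuntimeError
# similar to the one in the traceback above.
out = model(torch.randn(8, 4, device='cuda:0'))
```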

Is there any way to work around this PyTorch limitation so that we can still use DataParallel?
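
One possible workaround, sketched below on a hypothetical layer (the class name, shapes, and gather logic are assumptions, not the repo's actual code): register the operator tensors as buffers so DataParallel.replicate() copies them onto each replica's device, and/or move them to the input's device inside forward.

```python
import torch
import torch.nn as nn

class SpiralLayerSketch(nn.Module):
    """Hypothetical layer illustrating two workarounds."""
    def __init__(self, in_channels, out_channels, spiral_indices):
        super(SpiralLayerSketch, self).__init__()
        # Workaround 1: register the index tensor as a buffer so that
        # nn.DataParallel replicates it onto every device.
        self.register_buffer('spiral_indices', spiral_indices)  # LongTensor (n, k)
        self.linear = nn.Linear(in_channels * spiral_indices.shape[1], out_channels)

    def forward(self, x):  # x: (batch, n, in_channels)
        # Workaround 2: explicitly move the indices to the input's device,
        # which also covers tensors that were left as plain attributes.
        idx = self.spiral_indices.to(x.device)
        b, n, _ = x.shape
        neighbours = x[:, idx.reshape(-1), :].reshape(b, n, -1)
        return self.linear(neighbours)
```

With the indices registered as buffers, each replica gets its own copy on its assigned GPU, so the cross-device binary_op error should no longer occur; the .to(x.device) call is then redundant but harmless.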
