
RunTimeError using multiple GPUs. #11

Open
meera-m-t opened this issue Feb 28, 2022 · 5 comments

Comments

@meera-m-t

meera-m-t commented Feb 28, 2022

Hello,

Thank you for this library. I have been using the mozafari.py network to train my spiking network model. Since the dataset is ImageNet, I wanted to use a multi-GPU setup. I used PyTorch's DataParallel module to train with 8 GPUs as follows:

mozafari = torch.nn.DataParallel(mozafari, device_ids=[1, 7])

Training then fails with the following traceback:

  File "/data-mount/spiking-CVT/SpykeTorch/snn.py", line 219, in forward
    lr[f] = torch.where(pairings[i], *(self.learning_rate[f]))
RuntimeError: Expected condition, x and y to be on the same device, but condition is on cuda:7 and x and y are on cuda:0 and cuda:0 respectively
@miladmozafari
Owner

Hello,

Sorry that I missed this issue. I am not an expert in DataParallel, but I think the reason is that I have used the batch dimension to simulate time. Since there are operations over the time dimension, that dimension cannot be split across multiple GPUs. I will try to explore the problem and find a fix, but I guess it won't be an easy one!

Sorry again and thank you for reporting this important issue.
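For illustration, here is a minimal sketch (it requires two visible GPUs, and the ShapeProbe module is made up) of how nn.DataParallel scatters dimension 0 of its input. In SpykeTorch that dimension encodes time for a single sample, so each replica only sees part of the spike-wave and the STDP update ends up mixing tensors from different devices:

    # Hedged illustration: DataParallel always splits dim 0 across devices.
    import torch
    import torch.nn as nn

    class ShapeProbe(nn.Module):
        def forward(self, x):
            print(x.shape, x.device)   # each replica prints its own slice of the time axis
            return x

    probe = nn.DataParallel(ShapeProbe().cuda(), device_ids=[0, 1])
    spike_wave = torch.rand(15, 6, 28, 28).cuda()   # (time, channels, H, W) for one sample
    probe(spike_wave)   # e.g. (8, 6, 28, 28) on cuda:0 and (7, 6, 28, 28) on cuda:1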

@meera-m-t
Author

meera-m-t commented Mar 31, 2022

My solution was to make sure both of them are on the same device:

torch.where(
    torch.tensor(pairings[i], device=torch.tensor(self.learning_rate[f]).device),
    *(torch.tensor(self.learning_rate[f], device=torch.tensor(self.learning_rate[f]).device))
)

However, this is not efficient. Please let me know if you have any other suggestions. Thank you so much for replying.
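A possibly lighter alternative, sketched only (it assumes, as the `*` unpacking in the original line suggests, that self.learning_rate[f] holds the two LTP/LTD tensors on a common device), would be to move just the condition tensor instead of re-wrapping everything with torch.tensor:

    # Hedged sketch, intended for the same line inside snn.py's STDP forward:
    # move only the boolean condition onto the learning-rate device, avoiding
    # the extra copies made by re-wrapping existing tensors in torch.tensor().
    ltp, ltd = self.learning_rate[f]   # assumed: a pair of tensors/Parameters
    lr[f] = torch.where(pairings[i].to(ltp.device), ltp, ltd)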

@miladmozafari
Owner

I have an idea in mind (as a quick temporary fix), but I cannot check whether it works at the moment. Right now, the script iterates over the samples in a batch and passes them to the network one by one. What if we move this iteration into the network's forward pass? This means we pass a 5-dimensional input to the network with shape (batch, time, channels, w, h), and then, inside the forward function, we iterate over the batch dimension and process each sample separately. Please let me know if my explanation is not clear.
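A rough sketch of that idea (BatchedWrapper is hypothetical, not part of SpykeTorch; net is assumed to be the existing network that expects a single sample of shape (time, channels, H, W) and returns a tensor):

    import torch
    import torch.nn as nn

    class BatchedWrapper(nn.Module):
        def __init__(self, net):
            super().__init__()
            self.net = net                          # existing per-sample network

        def forward(self, batch):                   # batch: (batch, time, C, H, W)
            outputs = [self.net(sample) for sample in batch]   # each sample keeps its full time axis
            return torch.stack(outputs)             # re-assemble along the batch dimension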

@tmasquelier

Hi Milad & Meera,
I think what you can do is reshape (batch, time, channels, w, h) to (batch*time, channels, w, h), then process this tensor (with Conv2d etc.), then reshape it back to (batch, time, channels, w, h).
This is how it's done in SpikingJelly.
But I'm not sure whether it's compatible with winner-take-all?
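For reference, a minimal standalone sketch of that reshape round-trip with a plain nn.Conv2d (all shapes and sizes are illustrative):

    import torch
    import torch.nn as nn

    x = torch.rand(4, 15, 2, 28, 28)                  # (batch, time, C, H, W)
    conv = nn.Conv2d(2, 32, kernel_size=5)

    out = conv(x.flatten(0, 1))                       # fold time into the batch: (batch*time, C, H, W)
    out = out.unflatten(0, (x.shape[0], x.shape[1]))  # back to (batch, time, 32, 24, 24)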

@meera-m-t
Author

meera-m-t commented Apr 1, 2022

Thank you so much for all your answers!
At first I tried reshaping the input and using Conv3d with a (1, 5, 5) filter, for example, but with torch.manual_seed(0) I got different winners for the same output compared to Milad's code. When I instead tried a batched 2D convolution over the sequential data, similar to tmasquelier's suggestion, it works. I made some minor changes in SpykeTorch and am still working on it. My answer is here:

    def forward(self, input):
        # input: (batch, time, in_channels, height, width)
        # fold the time dimension into the batch so an ordinary 2D convolution can be applied
        flattened = input.flatten(0, 1)
        conv2d_out = fn.conv2d(flattened, self.weight, self.bias, self.stride, self.padding, self.dilation, self.groups)
        # restore the batch dimension: (batch, time, out_channels, out_height, out_width)
        return conv2d_out.reshape(input.shape[0], -1, conv2d_out.shape[1], conv2d_out.shape[2], conv2d_out.shape[3])
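
If the forward above is adopted, usage might look like the following sketch (the layer name and constructor arguments follow SpykeTorch's snn.Convolution, but treat the exact signature and the DataParallel wrapping as assumptions to verify):

    # Hedged usage sketch: assumes snn.Convolution.forward now accepts a 5-D
    # (batch, time, channels, H, W) tensor; all sizes here are illustrative.
    import torch
    from SpykeTorch import snn

    conv = snn.Convolution(6, 30, 5).cuda()
    conv = torch.nn.DataParallel(conv, device_ids=[0, 1])   # now splits the batch dim, not time

    spike_waves = torch.rand(8, 15, 6, 28, 28).cuda()       # (batch, time, C, H, W)
    potentials = conv(spike_waves)                          # (batch, time, 30, 24, 24)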
