
RunTimeError using multiple GPUs. #11

Open
meera-m-t opened this issue Feb 28, 2022 · 5 comments

Comments

@meera-m-t

meera-m-t commented Feb 28, 2022

Hello,

Thank you for this library. I have been using the mozafari.py network to train my spiking network model. Since the dataset is ImageNet, I wanted to use a multi-GPU setup. I used PyTorch's DataParallel module to train with 8 GPUs as follows:

mozafari = torch.nn.DataParallel(mozafari, device_ids=[1, 7])

Training then fails with the following traceback:

  File "/data-mount/spiking-CVT/SpykeTorch/snn.py", line 219, in forward
    lr[f] = torch.where(pairings[i], *(self.learning_rate[f]))
RuntimeError: Expected condition, x and y to be on the same device, but condition is on cuda:7 and x and y are on cuda:0 and cuda:0 respectively
@miladmozafari
Owner

Hello,

Sorry that I missed this issue. I am not an expert in DataParallel, but I think the reason is that I have used the batch dimension to simulate time. Since there are operations over the time dimension, that dimension cannot be split across multiple GPUs. I will try to explore the problem and find a fix, but I guess it won't be an easy one!

Sorry again and thank you for reporting this important issue.
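For illustration, here is a minimal sketch (it requires two visible GPUs, and the ShapeProbe module is made up) of how nn.DataParallel scatters dimension 0 of its input. In SpykeTorch that dimension encodes time for a single sample, so each replica only sees part of the spike-wave and the STDP update ends up mixing tensors from different devices:

    # Hedged illustration: DataParallel always splits dim 0 across devices.
    import torch
    import torch.nn as nn

    class ShapeProbe(nn.Module):
        def forward(self, x):
            print(x.shape, x.device)   # each replica prints its own slice of the time axis
            return x

    probe = nn.DataParallel(ShapeProbe().cuda(), device_ids=[0, 1])
    spike_wave = torch.rand(15, 6, 28, 28).cuda()   # (time, channels, H, W) for one sample
    probe(spike_wave)   # e.g. (8, 6, 28, 28) on cuda:0 and (7, 6, 28, 28) on cuda:1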

@meera-m-t
Author

meera-m-t commented Mar 31, 2022

My solution was to make sure both of them are on the same device:

torch.where(
    torch.tensor(pairings[i], device=torch.tensor(self.learning_rate[f]).device),
    *(torch.tensor(self.learning_rate[f], device=torch.tensor(self.learning_rate[f]).device))
)

However, this is not efficient. Please let me know if you have any other suggestions. Thank you so much for replying.
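A possibly lighter alternative, sketched only (it assumes, as the `*` unpacking in the original line suggests, that self.learning_rate[f] holds the two LTP/LTD tensors on a common device), would be to move just the condition tensor instead of re-wrapping everything with torch.tensor:

    # Hedged sketch, intended for the same line inside snn.py's STDP forward:
    # move only the boolean condition onto the learning-rate device, avoiding
    # the extra copies made by re-wrapping existing tensors in torch.tensor().
    ltp, ltd = self.learning_rate[f]   # assumed: a pair of tensors/Parameters
    lr[f] = torch.where(pairings[i].to(ltp.device), ltp, ltd)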

@miladmozafari
Owner

I have an idea in mind (as a quick temporary fix), but I cannot check whether it works at the moment. Right now, the script iterates over the samples in a batch and passes them to the network one by one. What if we move this iteration into the network's forward pass? This means we pass a 5-dimensional input to the network with shape (batch, time, channels, w, h), and then, inside the forward function, we iterate over the batch dimension and process each sample separately. Please let me know if my explanation is not clear.
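A rough sketch of that idea (BatchedWrapper is hypothetical, not part of SpykeTorch; net is assumed to be the existing network that expects a single sample of shape (time, channels, H, W) and returns a tensor):

    import torch
    import torch.nn as nn

    class BatchedWrapper(nn.Module):
        def __init__(self, net):
            super().__init__()
            self.net = net                          # existing per-sample network

        def forward(self, batch):                   # batch: (batch, time, C, H, W)
            outputs = [self.net(sample) for sample in batch]   # each sample keeps its full time axis
            return torch.stack(outputs)             # re-assemble along the batch dimension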

@tmasquelier

Hi Milad & Meera,
I think what you can do is reshape (batch, time, channels, w, h) to (batch*time, channels, w, h), then process this tensor (with Conv2d etc.), then reshape it back to (batch, time, channels, w, h).
This is how it's done in SpikingJelly.
But I'm not sure whether it's compatible with winner-take-all?
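For reference, a minimal standalone sketch of that reshape round-trip with a plain nn.Conv2d (all shapes and sizes are illustrative):

    import torch
    import torch.nn as nn

    x = torch.rand(4, 15, 2, 28, 28)                  # (batch, time, C, H, W)
    conv = nn.Conv2d(2, 32, kernel_size=5)

    out = conv(x.flatten(0, 1))                       # fold time into the batch: (batch*time, C, H, W)
    out = out.unflatten(0, (x.shape[0], x.shape[1]))  # back to (batch, time, 32, 24, 24)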

@meera-m-t
Author

meera-m-t commented Apr 1, 2022

Thank you so much for all your answers!
At first I tried reshaping the input and using Conv3d with a (1, 5, 5) filter, for example, but with torch.manual_seed(0) I got different winners for the same output compared to Milad's code. When I instead tried a batched 2D convolution over the sequential data, similar to tmasquelier's suggestion, it works. I made some minor changes in SpykeTorch and am still working on it. My answer is here:

    def forward(self, input):
        # input: (batch, time, in_channels, height, width)
        # fold the time dimension into the batch so an ordinary 2D convolution can be applied
        flattened = input.flatten(0, 1)
        conv2d_out = fn.conv2d(flattened, self.weight, self.bias, self.stride, self.padding, self.dilation, self.groups)
        # restore the batch dimension: (batch, time, out_channels, out_height, out_width)
        return conv2d_out.reshape(input.shape[0], -1, conv2d_out.shape[1], conv2d_out.shape[2], conv2d_out.shape[3])
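
If the forward above is adopted, usage might look like the following sketch (the layer name and constructor arguments follow SpykeTorch's snn.Convolution, but treat the exact signature and the DataParallel wrapping as assumptions to verify):

    # Hedged usage sketch: assumes snn.Convolution.forward now accepts a 5-D
    # (batch, time, channels, H, W) tensor; all sizes here are illustrative.
    import torch
    from SpykeTorch import snn

    conv = snn.Convolution(6, 30, 5).cuda()
    conv = torch.nn.DataParallel(conv, device_ids=[0, 1])   # now splits the batch dim, not time

    spike_waves = torch.rand(8, 15, 6, 28, 28).cuda()       # (batch, time, C, H, W)
    potentials = conv(spike_waves)                          # (batch, time, 30, 24, 24)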
