aspect ratio grouping error #55
Hi, it is a bit difficult to understand where the problem might be without a bit more information. A few questions:
@fmassa Thank you very much for your quick response!
Do you also handle the case where there are no masks present in the batch? If you have an early return from the losses and you don't backpropagate through the whole model, you might face deadlocks (or maybe errors in the newest version, I don't know). This means that the loss needs to be linked to the whole model, even if it is zero.
I also use
How do you return the loss early? Use `return mask_logits.sum() * 0` instead of `return torch.tensor(0, requires_grad=True, device=device)`.
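A minimal sketch of the difference, assuming a mask head that produces `mask_logits`; the helper name `maskrcnn_loss` is illustrative, not the repo's actual function:

```python
import torch
import torch.nn.functional as F

def maskrcnn_loss(mask_logits, mask_targets):
    # Hypothetical mask loss used only to illustrate the two early returns.
    if mask_targets.numel() == 0:
        # Graph-connected zero: gradients (all zeros) still flow to every
        # parameter that produced mask_logits, so DistributedDataParallel
        # receives a gradient for each of them.
        return mask_logits.sum() * 0
        # By contrast, torch.tensor(0., requires_grad=True, device=...) is a
        # fresh leaf tensor detached from the model; the parameters behind the
        # mask head get no gradient on that GPU, which is what can make the
        # distributed backward hang or fail.
    return F.binary_cross_entropy_with_logits(mask_logits, mask_targets)
```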
Yes, I use the code like
It is difficult to say what else could be the problem without seeing the code.
Can you share the modifications you made? It would be easier to help you in that case.
Also note that what I mentioned applies everywhere in the model.
The related modified files are here: In the
There is a mistake in this line. I have corrected it, but the error is the same. I do not experience deadlocks.
I noticed another error at the top of the error log, which may be the actual cause of this problem.
Oh, there might indeed be a problem with the
That's right. When setting the
Single GPU:
I don't know exactly where the issue might come from, but during multi-GPU training we mask the indices so that each GPU sees a different subset of the data.
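This is not the repo's actual sampler code, just a minimal sketch of the idea, assuming the default process group is already initialized: every rank builds the same permutation and then keeps a disjoint slice of it.

```python
import torch
import torch.distributed as dist

def indices_for_this_rank(dataset_size, seed=0):
    # Every process uses the same seed, so the permutation is identical
    # everywhere; only the slice below differs per GPU.
    g = torch.Generator()
    g.manual_seed(seed)
    perm = torch.randperm(dataset_size, generator=g)

    rank = dist.get_rank()
    world_size = dist.get_world_size()
    # Strided slice: rank r keeps indices r, r + world_size, r + 2*world_size, ...
    return perm[rank::world_size].tolist()
```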
I have the same issue.
If you manage to isolate the problem with a minimal example, it would be very helpful, as for now I don't know where to start looking.
Is there any update on this?
Edit: When I run with multi-GPU and leave aspect_grouping on, it shows the error as follows:
I am running two experiments (the first with a single GPU, the second with a single GPU and aspect_grouping off) and so far (17,000 iterations) no error has been encountered.
In addition, the parameter `ASPECT_RATIO_GROUPING` is in the file `.\maskrcnn-benchmark\maskrcnn_benchmark\config\defaults.py`.
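If you want to switch it off without editing `defaults.py`, a minimal sketch, assuming the key sits under the `DATALOADER` namespace of the repo's yacs config (check `defaults.py` for the exact path):

```python
from maskrcnn_benchmark.config import cfg

# Assumption: the flag lives under the DATALOADER namespace, as is common
# for this yacs-style config; verify the exact key in config/defaults.py.
cfg.merge_from_list(["DATALOADER.ASPECT_RATIO_GROUPING", False])
```

The same key/value pair can usually be appended to the training command as an override if the training script forwards extra arguments to `cfg.merge_from_list`.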
I have met the same issue when training on my custom dataset with 2 GPUs. On 1 GPU, the value
On a similar dataset with multi-GPU training, I don't have this issue. It is weird. Setting
Awesome, that solved it.
❓ Questions and Help
I added a new loss and it works fine if I use a single GPU.
However, it fails on `losses.backward()` if I use multiple GPUs. It seems this error is related to `torch.distributed`.
The error information is below: