aspect ratio grouping error #55

Open
MhLiao opened this issue Oct 29, 2018 · 21 comments
Labels: bug (Something isn't working), contributions welcome

Comments

@MhLiao commented Oct 29, 2018

❓ Questions and Help

I added a new loss and it works fine if I use a single GPU.
However, it fails at losses.backward() if I use multiple GPUs. It seems this error is related to torch.distributed.
The error trace is below:

File "tools/train_net.py", line 170, in <module>
    main()
  File "tools/train_net.py", line 163, in main
    model = train(cfg, args.local_rank, args.distributed)
  File "tools/train_net.py", line 73, in train
    arguments,
  File "/home/maskrcnn_benchmark/engine/trainer.py", line 77, in do_train
    losses.backward()
  File "/usr/local/lib/python3.5/dist-packages/torch/tensor.py", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/usr/local/lib/python3.5/dist-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/parallel/deprecated/distributed.py", line 342, in reduction_fn_nccl
    group=self.nccl_reduction_group_id)
  File "/usr/local/lib/python3.5/dist-packages/torch/distributed/deprecated/__init__.py", line 317, in all_reduce_multigpu
    return torch._C._dist_all_reduce_multigpu(tensor_list, op, group)
fmassa added the question (Further information is requested) label on Oct 29, 2018
@fmassa (Contributor) commented Oct 29, 2018

Hi,

It is difficult to understand where the problem might be without more information.

A few questions:

  • is your loss written in Python using only PyTorch operations?
  • do you create a new tensor inside your loss, and if so, do you set its device properly?
  • does your loss involve double-backwards?
  • is it a loss function (like MSE loss) or a new FasterRCNNLossComputation-style class that you wrote?

@MhLiao (Author) commented Oct 29, 2018

@fmassa Thank you very much for your quick response!
The new loss is an F.cross_entropy for another mask prediction branch. I create a new target tensor for the loss and set its device the same way project_masks_on_boxes() in maskrcnn_benchmark/modeling/roi_heads/mask_head/loss.py does.
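
For reference, a minimal sketch of the kind of extra branch described here; the names char_mask_logits and char_mask_targets are hypothetical, not taken from the repository:

import torch
import torch.nn.functional as F

def char_mask_loss(char_mask_logits, char_mask_targets):
    # Put the targets on the same device (and integer dtype) as the predictions,
    # mirroring what project_masks_on_boxes() does for the original mask loss.
    char_mask_targets = char_mask_targets.to(
        dtype=torch.int64, device=char_mask_logits.device
    )
    return F.cross_entropy(char_mask_logits, char_mask_targets)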

@fmassa (Contributor) commented Oct 29, 2018

Do you also handle the case where there are no masks present in the batch?

If you have an early return from the losses and you don't backpropagate through the whole model, you might face deadlocks (or maybe errors in the newest version, I'm not sure).
That's why I have parts like the following in the code: https://github.com/facebookresearch/maskrcnn-benchmark/blob/master/maskrcnn_benchmark/modeling/roi_heads/mask_head/loss.py#L124-L125

This means that the loss needs to be linked to the whole model, even if it is zero.

@MhLiao (Author) commented Oct 29, 2018

I also use if mask_targets.numel() == 0: to return the loss early. It runs stably on a single GPU, so I guess the problem is related to torch.distributed. Maybe I should register the new loss function or modify some of the distributed code?

@fmassa (Contributor) commented Oct 29, 2018

How do you return the loss early?
It should be something like

return mask_logits.sum() * 0

instead of

return torch.tensor(0, requires_grad=True, device=device)
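
The first form returns a zero that is still a function of mask_logits, and therefore of the whole model, so backward() produces (zero) gradients for every parameter and the distributed all-reduce sees the same parameter set on every GPU; the second form is a detached leaf tensor. A minimal sketch of the pattern, with illustrative names:

import torch.nn.functional as F

def mask_branch_loss(mask_logits, mask_targets):
    if mask_targets.numel() == 0:
        # Zero loss that stays in the autograd graph, so this GPU still
        # contributes gradients for the mask branch instead of skipping it.
        return mask_logits.sum() * 0
    return F.cross_entropy(mask_logits, mask_targets)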

@MhLiao (Author) commented Oct 29, 2018

Yes, I use the code like return mask_logits.sum() * 0.

@fmassa (Contributor) commented Oct 29, 2018

It is difficult to say what else could be the problem without seeing the code.
If it works on a single GPU but fails on multiple GPUs, the possibilities I can think of are the following:

  • you selectively return one loss or the other depending on the batch: you need to return all losses, even if one is zero, using the approach I mentioned before. So if one branch is not used, you still need to make its loss mask_logits.sum() * 0 or something like that.

Can you share the modifications that you made? That would make it easier to help you.

@fmassa (Contributor) commented Oct 29, 2018

Also note that what I mentioned applies everywhere in the model.
So if somewhere else in your model you have an early return (for example in the box_heads.py), you should make sure that everything stays linked to the model (via the forward / backward), or else you might face deadlocks in NCCL.

@MhLiao (Author) commented Oct 29, 2018

The related modified files are here:
https://github.com/MhLiao/debug/blob/master/loss.py
https://github.com/MhLiao/debug/blob/master/mask_head.py

In loss.py, I added a cross_entropy loss function and kept the steps almost the same as for the original loss.
In mask_head.py, I return a dict with two keys instead of the original one-key dict.
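
Roughly, the dict returned by the mask head has this shape; the key names and helper logic below are a sketch, not the exact code from the linked files:

import torch.nn.functional as F

def mask_head_losses(mask_logits, mask_targets, char_mask_logits, char_mask_targets):
    # Both keys are present on every iteration; when a branch has nothing to
    # predict, its loss is a graph-connected zero rather than being dropped.
    loss_mask = (mask_logits.sum() * 0 if mask_targets.numel() == 0
                 else F.cross_entropy(mask_logits, mask_targets))
    loss_char_mask = (char_mask_logits.sum() * 0 if char_mask_targets.numel() == 0
                      else F.cross_entropy(char_mask_logits, char_mask_targets))
    return dict(loss_mask=loss_mask, loss_char_mask=loss_char_mask)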

@fmassa (Contributor) commented Oct 29, 2018

@MhLiao can you try changing this line to

if mask_targets.numel() == 0 or char_mask_targets.numel() == 0:

and let me know?

@MhLiao (Author) commented Oct 29, 2018

There was a mistake in that line. I have corrected it, but the error is the same. I am not hitting any deadlocks.

@MhLiao (Author) commented Oct 29, 2018

I noticed another error at the top of the error logs, which may be the actual cause of this problem.
There may be something wrong in the data sampler when I use multiple GPUs.

Traceback (most recent call last):
  File "tools/train_net.py", line 170, in <module>
    main()
  File "tools/train_net.py", line 163, in main
    model = train(cfg, args.local_rank, args.distributed)
  File "tools/train_net.py", line 73, in train
    arguments,
  File "/unsullied/sharefs/_csg_algorithm/Interns/liaominghui/data/masktextspotter/maskrcnn-benchmark/maskrcnn_benchmark
/engine/trainer.py", line 56, in do_train
    for iteration, (images, targets, _) in enumerate(data_loader, start_iter):
  File "/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py", line 614, in __next__
    indices = next(self.sample_iter)  # may raise StopIteration
  File "/unsullied/sharefs/_csg_algorithm/Interns/liaominghui/data/masktextspotter/maskrcnn-benchmark/maskrcnn_benchmark
/data/samplers/iteration_based_batch_sampler.py", line 24, in __iter__
    for batch in self.batch_sampler:
  File "/unsullied/sharefs/_csg_algorithm/Interns/liaominghui/data/masktextspotter/maskrcnn-benchmark/maskrcnn_benchmark
/data/samplers/grouped_batch_sampler.py", line 107, in __iter__
    batches = self._prepare_batches()
  File "/unsullied/sharefs/_csg_algorithm/Interns/liaominghui/data/masktextspotter/maskrcnn-benchmark/maskrcnn_benchmark
/data/samplers/grouped_batch_sampler.py", line 79, in _prepare_batches
    first_element_of_batch = [t[0].item() for t in merged]
  File "/unsullied/sharefs/_csg_algorithm/Interns/liaominghui/data/masktextspotter/maskrcnn-benchmark/maskrcnn_benchmark
/data/samplers/grouped_batch_sampler.py", line 79, in <listcomp>
    first_element_of_batch = [t[0].item() for t in merged]
IndexError: index 0 is out of bounds for dimension 0 with size 0

@fmassa (Contributor) commented Oct 29, 2018

Oh, there might be indeed a problem with the GroupedBatchSampler.
As a quick workaround, I'd recommend setting the ASPECT_RATIO_GROUPING to False in the config.
I'll need to dig a bit further to identify in which contexts the issue you are facing arises.
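
Concretely, with the yacs-based config this project uses, the workaround looks roughly like this (the config file path is a placeholder; double-check the key against maskrcnn_benchmark/config/defaults.py):

from maskrcnn_benchmark.config import cfg

cfg.merge_from_file("path/to/your_config.yaml")   # your experiment config
cfg.DATALOADER.ASPECT_RATIO_GROUPING = False      # fall back to plain batching

Equivalently, add ASPECT_RATIO_GROUPING: False under the DATALOADER section of the yaml config file.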

fmassa added the bug (Something isn't working) label and removed the question (Further information is requested) label on Oct 29, 2018
@MhLiao (Author) commented Oct 30, 2018

That's right. When I set ASPECT_RATIO_GROUPING to False, everything is OK.
I printed the value of merged at this line, but I cannot find any difference between using a single GPU and using multiple GPUs.
Multiple GPUs:

(tensor([9]), tensor([187]), tensor([63]), tensor([48]), tensor([159]), tensor([172]), tensor([176]), tensor([75]), tensor(
[221]), tensor([131]), tensor([56]), tensor([191]), tensor([99]), tensor([46]), tensor([80]), tensor([124]), tensor([161]),
 tensor([184]), tensor([166]), tensor([141]), tensor([155]), tensor([175]), tensor([214]), tensor([89]), tensor([93]), tens
or([144]), tensor([64]), tensor([69]), tensor([174]))
(tensor([109]), tensor([200]), tensor([211]), tensor([189]), tensor([17]), tensor([59]), tensor([104]), tensor([31]), tenso
r([180]), tensor([137]), tensor([51]), tensor([5]), tensor([183]), tensor([44]), tensor([60]), tensor([138]), tensor([158])
, tensor([15]), tensor([185]), tensor([30]), tensor([142]), tensor([204]), tensor([216]), tensor([206]), tensor([190]), ten
sor([165]), tensor([164]), tensor([24]), tensor([111]))
(tensor([122]), tensor([121]), tensor([209]), tensor([133]), tensor([162]), tensor([81]), tensor([227]), tensor([128]), ten
sor([57]), tensor([68]), tensor([218]), tensor([169]), tensor([21]), tensor([149]), tensor([47]), tensor([156]), tensor([8]
), tensor([148]), tensor([18]), tensor([207]), tensor([62]), tensor([210]), tensor([73]), tensor([12]), tensor([192]), tens
or([103]), tensor([96]), tensor([107]), tensor([152]))
(tensor([123]), tensor([130]), tensor([113]), tensor([153]), tensor([32]), tensor([181]), tensor([170]), tensor([222]), ten
sor([7]), tensor([115]), tensor([91]), tensor([61]), tensor([199]), tensor([43]), tensor([22]), tensor([19]), tensor([26]),
 tensor([145]), tensor([49]), tensor([127]), tensor([88]), tensor([28]), tensor([53]), tensor([208]), tensor([114]), tensor
([100]), tensor([194]), tensor([215]), tensor([39]))
(tensor([114]), tensor([100]), tensor([194]), tensor([151]), tensor([92]), tensor([224]), tensor([219]), tensor([182]), ten
sor([116]), tensor([72]), tensor([87]), tensor([71]), tensor([90]), tensor([52]), tensor([117]), tensor([27]), tensor([157]
), tensor([45]), tensor([97]), tensor([112]), tensor([220]), tensor([140]), tensor([84]), tensor([193]), tensor([173]), ten
sor([78]), tensor([34]), tensor([226]), tensor([79]), tensor([], dtype=torch.int64))
(tensor([177]), tensor([106]), tensor([14]), tensor([203]), tensor([83]), tensor([205]), tensor([74]), tensor([129]), tenso
r([86]), tensor([38]), tensor([225]), tensor([201]), tensor([147]), tensor([120]), tensor([101]), tensor([217]), tensor([20
]), tensor([160]), tensor([23]), tensor([29]), tensor([6]), tensor([65]), tensor([212]), tensor([171]), tensor([198]), tens
or([40]), tensor([10]), tensor([94]), tensor([126]))
(tensor([146]), tensor([167]), tensor([95]), tensor([2]), tensor([36]), tensor([3]), tensor([35]), tensor([119]), tensor([4
2]), tensor([41]), tensor([1]), tensor([82]), tensor([228]), tensor([143]), tensor([196]), tensor([50]), tensor([33]), tens
or([195]), tensor([202]), tensor([54]), tensor([150]), tensor([58]), tensor([0]), tensor([16]), tensor([135]), tensor([125]
), tensor([188]), tensor([163]), tensor([108]))
(tensor([197]), tensor([37]), tensor([178]), tensor([118]), tensor([98]), tensor([4]), tensor([67]), tensor([136]), tensor(
[132]), tensor([168]), tensor([186]), tensor([77]), tensor([13]), tensor([223]), tensor([11]), tensor([134]), tensor([66]),
 tensor([179]), tensor([55]), tensor([70]), tensor([154]), tensor([102]), tensor([213]), tensor([110]), tensor([76]), tenso
r([139]), tensor([105]), tensor([25]), tensor([85]))

Single GPU:

(tensor([67]), tensor([104]), tensor([44]), tensor([59]), tensor([190]), tensor([187]), tensor([12]), tensor([65]), tensor(
[2]), tensor([26]), tensor([92]), tensor([221]), tensor([198]), tensor([34]), tensor([32]), tensor([61]), tensor([71]), ten
sor([156]), tensor([131]), tensor([178]), tensor([49]), tensor([121]), tensor([136]), tensor([188]), tensor([135]), tensor(
[123]), tensor([64]), tensor([179]), tensor([142]), tensor([83]), tensor([79]), tensor([109]), tensor([127]), tensor([48]),
 tensor([11]), tensor([163]), tensor([118]), tensor([52]), tensor([66]), tensor([170]), tensor([84]), tensor([63]), tensor(
[186]), tensor([87]), tensor([96]), tensor([207]), tensor([195]), tensor([191]), tensor([103]), tensor([211]), tensor([101]
), tensor([138]), tensor([75]), tensor([114]), tensor([20]), tensor([201]), tensor([143]), tensor([141]), tensor([177]), te
nsor([76]), tensor([95]), tensor([113]), tensor([112]), tensor([51]), tensor([23]), tensor([46]), tensor([157]), tensor([19
6]), tensor([228]), tensor([199]), tensor([153]), tensor([145]), tensor([205]), tensor([159]), tensor([45]), tensor([9]), t
ensor([224]), tensor([4]), tensor([144]), tensor([100]), tensor([81]), tensor([214]), tensor([154]), tensor([173]), tensor(
[150]), tensor([7]), tensor([91]), tensor([42]), tensor([184]), tensor([164]), tensor([213]), tensor([62]), tensor([115]),
tensor([53]), tensor([148]), tensor([18]), tensor([110]), tensor([133]), tensor([89]), tensor([47]), tensor([158]), tensor(
[200]), tensor([217]), tensor([220]), tensor([194]), tensor([5]), tensor([175]), tensor([226]), tensor([28]), tensor([222])
, tensor([19]), tensor([29]), tensor([146]), tensor([82]), tensor([204]), tensor([60]), tensor([15]), tensor([165]), tensor
([192]), tensor([223]), tensor([202]), tensor([90]), tensor([203]), tensor([225]), tensor([68]), tensor([216]), tensor([30]
), tensor([149]), tensor([209]), tensor([210]), tensor([77]), tensor([6]), tensor([193]), tensor([116]), tensor([78]), tens
or([122]), tensor([147]), tensor([168]), tensor([180]), tensor([160]), tensor([128]), tensor([72]), tensor([93]), tensor([2
2]), tensor([55]), tensor([139]), tensor([13]), tensor([182]), tensor([212]), tensor([73]), tensor([10]), tensor([130]), te
nsor([137]), tensor([98]), tensor([183]), tensor([86]), tensor([125]), tensor([151]), tensor([169]), tensor([197]), tensor(
[107]), tensor([172]), tensor([161]), tensor([124]), tensor([102]), tensor([41]), tensor([185]), tensor([132]), tensor([140
]), tensor([35]), tensor([57]), tensor([166]), tensor([181]), tensor([40]), tensor([50]), tensor([88]), tensor([227]), tens
or([74]), tensor([58]), tensor([97]), tensor([208]), tensor([56]), tensor([176]), tensor([36]), tensor([206]), tensor([171]
), tensor([33]), tensor([117]), tensor([105]), tensor([155]), tensor([17]), tensor([219]), tensor([54]), tensor([70]), tens
or([21]), tensor([16]), tensor([43]), tensor([129]), tensor([119]), tensor([167]), tensor([0]), tensor([80]), tensor([120])
, tensor([38]), tensor([1]), tensor([189]), tensor([218]), tensor([106]), tensor([99]), tensor([27]), tensor([162]), tensor
([37]), tensor([3]), tensor([8]), tensor([134]), tensor([31]), tensor([14]), tensor([152]), tensor([111]), tensor([25]), te
nsor([85]), tensor([69]), tensor([24]), tensor([39]), tensor([174]), tensor([108]), tensor([215]), tensor([126]), tensor([9
4]))

@fmassa (Contributor) commented Oct 30, 2018

I don't know exactly where the issue might come from, but during multi-GPU training we mask the indices so that each GPU sees a different subset of the data.
Maybe there is an edge case there that I'm not taking into account. I'll need to investigate further.
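
A hedged illustration of the suspected edge case, not the sampler's actual code: after the indices are partitioned across ranks, an aspect-ratio cluster can end up empty on one rank, and t[0].item() then fails on an empty tensor. A defensive variant of the failing list comprehension would simply skip empty clusters (whether dropping them is the right behaviour for the sampler is a separate question):

import torch

def first_elements(merged):
    # Defensive version of line 79 in grouped_batch_sampler.py:
    # skip clusters that are empty on this rank instead of indexing into them.
    return [t[0].item() for t in merged if t.numel() > 0]

merged = (torch.tensor([3, 7]), torch.tensor([], dtype=torch.int64), torch.tensor([5]))
print(first_elements(merged))  # -> [3, 5]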

@gxd1994 commented Jan 24, 2019

Oh, there might be indeed a problem with the GroupedBatchSampler.
As a quick workaround, I'd recommend setting the ASPECT_RATIO_GROUPING to False in the config.
I'll need to dig a bit further to identify in which contexts the issue you are facing arises.

I have the same issue.

MhLiao changed the title from "Add a new loss but get an error when using multiple gpus" to "aspect ratio grouping error" on Jan 24, 2019
@fmassa (Contributor) commented Jan 24, 2019

If you manage to isolate the problem with a minimal example, that would be very helpful, as for now I don't know where to start looking.

@mikelam14 commented Apr 4, 2019

Is there any update to this?

Edit: When I run with multiple GPUs and leave aspect ratio grouping on, it shows the following error:

/data/samplers/grouped_batch_sampler.py", line 79, in _prepare_batches
    first_element_of_batch = [t[0].item() for t in merged]
  File "/unsullied/sharefs/_csg_algorithm/Interns/liaominghui/data/masktextspotter/maskrcnn-benchmark/maskrcnn_benchmark/data/samplers/grouped_batch_sampler.py", line 79, in <listcomp>
    first_element_of_batch = [t[0].item() for t in merged]
IndexError: index 0 is out of bounds for dimension 0 with size 0

I am running two experiments (the first with a single GPU, the second with a single GPU and aspect ratio grouping off), and so far (17,000 iterations) no error has occurred.

@zsc1220 commented Aug 26, 2019

Oh, there might be indeed a problem with the GroupedBatchSampler.
As a quick workaround, I'd recommend setting the ASPECT_RATIO_GROUPING to False in the config.
I'll need to dig a bit further to identify in which contexts the issue you are facing arises.

In addition, this parameter, ASPECT_RATIO_GROUPING, is defined in the file .\maskrcnn-benchmark\maskrcnn_benchmark\config\defaults.py.

@jdhao commented Dec 28, 2019

I have met the same issue when training on my custom dataset with 2 GPUs. On the first GPU (rank 0) the value of merged is normal, but on the second GPU (rank 1) there is an empty tensor in merged:

rank: 0, type(merged): <class 'tuple'>, len(merged): 125
rank: 1, type(merged): <class 'tuple'>, len(merged): 126
(tensor([228, 496]), tensor([355, 383]), tensor([465, 116]), tensor([169, 150]), tensor([324, 212]), tensor([394,   2]), tensor([238,  84]), tensor([471, 245]), tensor([411, 125]), tensor([231, 128]), tensor([316, 184]), tensor([88, 79]), tensor([267, 482]), tensor([ 90, 336]), tensor([414, 137]), tensor([29, 33]), tensor([40, 97]), tensor([ 77, 200]), tensor([58, 96]), tensor([190, 287]), tensor([165, 356]), tensor([ 98, 172]), tensor([282, 454]), tensor([ 39, 342]), tensor([149, 152]), tensor([492,  56]), tensor([175, 138]), tensor([345, 257]), tensor([358, 403]), tensor([189, 106]), tensor([ 75, 307]), tensor([ 80, 359]), tensor([338, 205]), tensor([181, 129]), tensor([301, 221]), tensor([ 13, 232]), tensor([182, 313]), tensor([ 83, 340]), tensor([ 15, 333]), tensor([350, 397]), tensor([490, 208]), tensor([  1, 216]), tensor([230, 368]), tensor([357,  62]), tensor([199, 151]), tensor([335, 332]), tensor([ 67, 135]), tensor([239, 253]), tensor([102, 132]), tensor([277, 323]), tensor([ 99, 275]), tensor([163, 286]), tensor([265, 447]), tensor([276, 448]), tensor([153, 249]), tensor([ 52, 193]), tensor([ 66, 421]), tensor([73, 18]), tensor([270, 177]), tensor([ 54, 269]), tensor([429, 296]), tensor([360, 422]), tensor([327, 481]), tensor([449, 386]), tensor([486, 180]), tensor([406, 312]), tensor([134, 387]), tensor([211, 480]), tensor([46, 68]), tensor([ 35, 235]), tensor([72, 53]), tensor([ 71, 101]), tensor([244, 161]), tensor([ 48, 466]), tensor([ 23, 168]), tensor([154, 197]), tensor([464, 436]), tensor([120, 372]), tensor([63, 85]), tensor([ 31, 272]), tensor([279, 110]), tensor([179, 317]), tensor([370, 404]), tensor([380, 401]), tensor([ 87, 437]), tensor([413, 477]), tensor([155, 311]), tensor([  8, 443]), tensor([469, 218]), tensor([405, 415]), tensor([241, 251]), tensor([ 17, 305]), tensor([183, 364]), tensor([104, 304]), tensor([331, 322]), tensor([113, 111]), tensor([ 60, 130]), tensor([297, 157]), tensor([474, 487]), tensor([407, 426]), tensor([227,   9]), tensor([363, 434]), tensor([460, 424]), tensor([431, 337]), tensor([281, 159]), tensor([32,  7]), tensor([475, 488]), tensor([ 55, 295]), tensor([220, 293]), tensor([146, 451]), tensor([385,   6]), tensor([224,  34]), tensor([348, 167]), tensor([395, 427]), tensor([366, 278]), tensor([141, 484]), tensor([369, 213]), tensor([410, 377]), tensor([463,  19]), tensor([351, 346]), tensor([362,  24]), tensor([103,  81]), tensor([352, 491]), tensor([145, 318]), tensor([59]))
(tensor([349, 396]), tensor([25, 82]), tensor([391, 248]), tensor([115, 450]), tensor([440, 124]), tensor([156, 389]), tensor([334, 166]), tensor([259, 271]), tensor([107, 176]), tensor([126,  49]), tensor([ 89, 143]), tensor([420, 388]), tensor([258, 384]), tensor([ 61, 341]), tensor([185, 247]), tensor([419, 290]), tensor([428, 162]), tensor([198, 382]), tensor([472, 347]), tensor([ 94, 192]), tensor([326, 237]), tensor([289, 148]), tensor([444, 459]), tensor([303, 409]), tensor([343, 374]), tensor([456, 204]), tensor([ 37, 376]), tensor([393,   0]), tensor([91, 65]), tensor([164, 186]), tensor([261, 329]), tensor([441, 268]), tensor([ 78, 108]), tensor([252, 430]), tensor([320, 105]), tensor([274, 207]), tensor([206, 226]), tensor([461, 173]), tensor([ 30, 242]), tensor([ 76, 122]), tensor([256, 412]), tensor([273, 294]), tensor([209, 196]), tensor([321, 123]), tensor([ 64, 119]), tensor([ 44, 371]), tensor([489, 435]), tensor([285, 147]), tensor([392,  27]), tensor([ 14, 300]), tensor([375, 240]), tensor([280,  36]), tensor([92, 12]), tensor([353, 446]), tensor([402,  22]), tensor([478, 442]), tensor([158, 479]), tensor([263, 339]), tensor([308, 390]), tensor([325, 373]), tensor([314,  41]), tensor([188, 117]), tensor([109, 400]), tensor([142, 178]), tensor([418, 191]), tensor([458, 476]), tensor([445, 423]), tensor([365, 260]), tensor([470, 136]), tensor([399, 233]), tensor([ 69, 398]), tensor([319, 222]), tensor([194, 379]), tensor([250, 495]), tensor([133, 202]), tensor([225, 298]), tensor([195, 234]), tensor([170, 330]), tensor([416, 433]), tensor([361,  10]), tensor([284, 378]), tensor([  3, 219]), tensor([467,  26]), tensor([ 93, 439]), tensor([100, 174]), tensor([462, 288]), tensor([243, 160]), tensor([ 86, 140]), tensor([ 74, 215]), tensor([283, 408]), tensor([171,  16]), tensor([302, 187]), tensor([309,  43]), tensor([112,  21]), tensor([344, 494]), tensor([485, 310]), tensor([291, 457]), tensor([381, 328]), tensor([432, 425]), tensor([417, 264]), tensor([266,  51]), tensor([114,  11]), tensor([  5, 255]), tensor([473,  70]), tensor([236,  50]), tensor([121,  95]), tensor([367, 455]), tensor([229, 306]), tensor([299,  42]), tensor([292, 139]), tensor([223,  45]), tensor([214, 493]), tensor([354, 203]), tensor([246, 210]), tensor([468, 217]), tensor([452, 118]), tensor([127, 144]), tensor([47, 20]), tensor([  4, 453]), tensor([38, 28]), tensor([ 57, 315]), tensor([438, 254]), tensor([201, 262]), tensor([483, 131]), tensor([228]), tensor([], dtype=torch.int64))

On a similar dataset with multi-GPU training, I don't have this issue. It is weird.

Setting ASPECT_RATIO_GROUPING to false in config.yml seems to fix this issue.

@Dawn-LX commented Sep 4, 2020

Oh, there might be indeed a problem with the GroupedBatchSampler.
As a quick workaround, I'd recommend setting the ASPECT_RATIO_GROUPING to False in the config.
I'll need to dig a bit further to identify in which contexts the issue you are facing arises.

Awesome, that solved it!
