
Training of CMU dataset gets stuck on batch 1 #79

Closed · Samleo8 opened this issue May 31, 2020 · 5 comments
Samleo8 commented May 31, 2020

Hi, I'm trying to train the volumetric model on the CMU dataset, based on the train/val splits noted in issue #19. I am using 4 RTX 2080 Ti GPUs.

Training runs fine, but when evaluation reaches batch 1, it hangs for a very long time (almost a day) before I have to kill it. The problem is reproducible: you can try running it from my forked repository here, following the CMU preprocessing instructions and running ./scripts/train_cmu.

Interestingly, if training is skipped and only evaluation is run, batch 1 takes a while (say 15 minutes) but eventually completes and evaluation continues. I am not sure why the problem seems to lie only with batch 1. When combined with training, however, the evaluation hangs at batch 1 consistently and indefinitely.

At first I suspected a memory issue, so I reduced the batch size to 1 (for both train and val) and num_workers to 3 and 2 respectively. That did not solve the problem. Right now I am testing with simply skipping the batch.

However, this still does not address the root of the problem:

  1. Did you guys encounter similar issues during your training?
  2. What do you guys think may be the actual issue here?

Thank you!
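A minimal sketch of the reduced-size run described above, assuming the DataLoaders were built directly in Python; in practice the repository drives these settings through its experiment config, so the dataset variables below are placeholders, not names from the code:

```python
# Hypothetical sketch, not the repository's code: shrink both loaders to
# rule out memory pressure as the cause of the hang.
from torch.utils.data import DataLoader

train_loader = DataLoader(train_dataset, batch_size=1, shuffle=True, num_workers=3, pin_memory=True)
val_loader = DataLoader(val_dataset, batch_size=1, shuffle=False, num_workers=2, pin_memory=True)
```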

karfly (Owner) commented May 31, 2020

Hi, @Samleo8!
We didn't encounter such problems, so it looks like a technical issue. I'd try setting the train epoch steps to 1 and checking whether the problem remains; if it does, that will let you debug it faster.
I suspect something goes wrong when switching from train mode to eval. I'd also try to reproduce the problem with a single GPU.
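One hedged way to act on this suggestion, shown as a toy, self-contained loop with a stand-in model; the repository actually controls the epoch length through its config, so none of the names below come from the code:

```python
# Illustrative only: cap training at a single step so the train -> eval
# transition is reached within seconds and can be debugged in isolation.
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

# Stand-ins for the real model and data.
model = nn.Linear(10, 1)
optimizer = optim.SGD(model.parameters(), lr=0.01)
train_loader = DataLoader(TensorDataset(torch.randn(8, 10), torch.randn(8, 1)), batch_size=1)

MAX_TRAIN_STEPS = 1  # the "train epoch steps to 1" idea

model.train()
for step, (x, y) in enumerate(train_loader):
    if step >= MAX_TRAIN_STEPS:
        break
    loss = nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

model.eval()  # the reported hang appears around this train -> eval switch
```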

Samleo8 (Author) commented Jun 1, 2020

Update: it seems that even after manually skipping batch 1, the problem simply moves to batch 2, further hinting at a memory or cache issue. I'm still perplexed as to how to fix this:

  1. Should I be calling torch.cuda.empty_cache() or something similar after training and before eval? (This is already being done with no effect; perhaps there is another cache-clearing method?)
  2. Should I be calling model.share_memory()? (Although the [PyTorch docs](https://pytorch.org/docs/stable/distributed.html) seem to say this is not advised for DistributedDataParallel.)
  3. Could it be an issue with the NCCL backend? Would there be a difference if I switched to gloo (which supports a timeout)? (A sketch of points 1 and 3 follows below.)
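A minimal sketch of points 1 and 3, assuming a standard torch.distributed setup; the single-process defaults exist only so the snippet runs standalone, and nothing here is taken from the repository:

```python
# Illustrative sketch of points 1 and 3; not the repository's code.
import datetime
import os

import torch
import torch.distributed as dist

# Point 3: the gloo backend accepts a timeout, so a hung collective fails
# loudly instead of blocking forever. A real multi-GPU run would take
# RANK/WORLD_SIZE/MASTER_* from the launcher's environment.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(
    backend="gloo",
    rank=int(os.environ.get("RANK", 0)),
    world_size=int(os.environ.get("WORLD_SIZE", 1)),
    timeout=datetime.timedelta(minutes=30),
)

# Point 1: release cached CUDA memory between the train and eval phases.
# Calling it is harmless even if it turns out not to be the culprit.
if torch.cuda.is_available():
    torch.cuda.empty_cache()

dist.destroy_process_group()
```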

> I'd try to set train epoch steps to 1 and check if the problem remains. If yes, it'll let you debug the problem faster.

Thank you! Just to confirm, are you referring to changing n_objects_per_epoch: 15000 to n_objects_per_epoch: 1?

Samleo8 closed this as completed on Jun 1, 2020
Samleo8 reopened this on Jun 1, 2020
Samleo8 (Author) commented Jun 1, 2020

> I'd try to set train epoch steps to 1 and check if the problem remains. If yes, it'll let you debug the problem faster.

> Thank you! Just to confirm, are you referring to changing n_objects_per_epoch: 15000 to n_objects_per_epoch: 1?

So I tried this and set n_objects_per_epoch to 10. As you suspected, it still got stuck. More interestingly, unfinished training work from one of the processes overflowed into the evaluation phase: while batch 8 was still training, batch 0 had already started evaluating. This looks like a problem with the multi-GPU setup. How do we fix this? @karfly
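One standard way to keep a rank that finishes early from drifting into evaluation is an explicit synchronization point between the two phases. This is a hedged sketch, assuming torch.distributed is already initialized; the function and loop structure below are illustrative, not taken from the repository:

```python
import torch
import torch.distributed as dist

def train_then_eval(model, train_loader, val_loader):
    """Illustrative epoch driver: no rank may start eval before all finish training."""
    model.train()
    for batch in train_loader:
        pass  # ... real per-batch training step for this rank goes here ...

    # Block every process here until all ranks have finished their training batches.
    if dist.is_available() and dist.is_initialized():
        dist.barrier()

    model.eval()
    with torch.no_grad():
        for batch in val_loader:
            pass  # ... real per-batch evaluation goes here ...
```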

Samleo8 (Author) commented Jun 1, 2020

Update 2: I believe the problem lies in the fact that once one of the sub-processes on a GPU finishes (so that GPU is free), it moves on to loading the eval DataLoader instead of waiting for the other processes.

Update 3: It runs fine on a single GPU, but I would really like to train on our multiple GPUs, otherwise it will take too long.

Note a possibly related issue: pytorch/pytorch#19996
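If one GPU's process really is running out of training batches before the others, a quick diagnostic is to compare how many batches each rank sees. A hedged sketch, assuming an NCCL process group is already initialized (with gloo the tensors would stay on CPU); the function name is illustrative:

```python
import torch
import torch.distributed as dist

def report_batches_per_rank(loader):
    """Illustrative diagnostic: print the per-rank batch count from rank 0."""
    device = torch.device("cuda", torch.cuda.current_device())
    local = torch.tensor([len(loader)], dtype=torch.long, device=device)
    counts = [torch.zeros_like(local) for _ in range(dist.get_world_size())]
    dist.all_gather(counts, local)
    if dist.get_rank() == 0:
        print("batches per rank:", [int(c.item()) for c in counts])
```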

Samleo8 (Author) commented Jun 1, 2020

Update 4: It seems to work after upgrading to the latest version of PyTorch (1.5.0).
