
Training of CMU dataset gets stuck on batch 1 #79

Closed · Samleo8 opened this issue May 31, 2020 · 5 comments
Samleo8 commented May 31, 2020

Hi, I'm trying to train the volumetric model on the CMU dataset, based on the train/val splits noted in issue #19. I am using 4 RTX 2080 Ti GPUs.

Training runs fine, but when evaluation reaches batch 1, it hangs for a very long time (almost a day) before I have to kill it. The problem is reproducible: you can try running it from my forked repository here, following the CMU preprocessing instructions and running ./scripts/train_cmu.

Interestingly, if training is skipped and only evaluation is run, batch 1 takes a while (say 15 minutes) but eventually completes and evaluation continues. I am not sure why the problem seems to lie only with batch 1. When combined with training, however, the evaluation hangs at batch 1 consistently and indefinitely.

At first I suspected a memory issue, so I reduced the batch size to 1 (for both train and val) and num_workers to 3 and 2 respectively. That did not solve the problem. Right now I am testing with simply skipping the batch.

However, this still does not address the root of the problem:

  1. Did you guys encounter similar issues during your training?
  2. What do you guys think may be the actual issue here?

Thank you!
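A minimal sketch of the reduced-size run described above, assuming the DataLoaders were built directly in Python; in practice the repository drives these settings through its experiment config, so the dataset variables below are placeholders, not names from the code:

```python
# Hypothetical sketch, not the repository's code: shrink both loaders to
# rule out memory pressure as the cause of the hang.
from torch.utils.data import DataLoader

train_loader = DataLoader(train_dataset, batch_size=1, shuffle=True, num_workers=3, pin_memory=True)
val_loader = DataLoader(val_dataset, batch_size=1, shuffle=False, num_workers=2, pin_memory=True)
```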

karfly (Owner) commented May 31, 2020

Hi, @Samleo8!
We didn't encounter such problems, so it looks like a technical issue. I'd try setting the train epoch steps to 1 and checking whether the problem remains; if it does, that will let you debug it faster.
I suspect something goes wrong when switching from train mode to eval. I'd also try to reproduce the problem with a single GPU.
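One hedged way to act on this suggestion, shown as a toy, self-contained loop with a stand-in model; the repository actually controls the epoch length through its config, so none of the names below come from the code:

```python
# Illustrative only: cap training at a single step so the train -> eval
# transition is reached within seconds and can be debugged in isolation.
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

# Stand-ins for the real model and data.
model = nn.Linear(10, 1)
optimizer = optim.SGD(model.parameters(), lr=0.01)
train_loader = DataLoader(TensorDataset(torch.randn(8, 10), torch.randn(8, 1)), batch_size=1)

MAX_TRAIN_STEPS = 1  # the "train epoch steps to 1" idea

model.train()
for step, (x, y) in enumerate(train_loader):
    if step >= MAX_TRAIN_STEPS:
        break
    loss = nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

model.eval()  # the reported hang appears around this train -> eval switch
```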

Samleo8 (Author) commented Jun 1, 2020

Update: it seems that even after manually skipping batch 1, the problem simply moves to batch 2, further hinting at a memory or cache issue. I'm still perplexed as to how to fix this:

  1. Should I be calling torch.cuda.empty_cache() or something similar after training and before eval? (This is already being done with no effect; perhaps there is another cache-clearing method?)
  2. Should I be calling model.share_memory()? (Although the [PyTorch docs](https://pytorch.org/docs/stable/distributed.html) seem to say this is not advised for DistributedDataParallel.)
  3. Could it be an issue with the NCCL backend? Would there be a difference if I switched to gloo (which supports a timeout)? (A sketch of points 1 and 3 follows below.)
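A minimal sketch of points 1 and 3, assuming a standard torch.distributed setup; the single-process defaults exist only so the snippet runs standalone, and nothing here is taken from the repository:

```python
# Illustrative sketch of points 1 and 3; not the repository's code.
import datetime
import os

import torch
import torch.distributed as dist

# Point 3: the gloo backend accepts a timeout, so a hung collective fails
# loudly instead of blocking forever. A real multi-GPU run would take
# RANK/WORLD_SIZE/MASTER_* from the launcher's environment.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(
    backend="gloo",
    rank=int(os.environ.get("RANK", 0)),
    world_size=int(os.environ.get("WORLD_SIZE", 1)),
    timeout=datetime.timedelta(minutes=30),
)

# Point 1: release cached CUDA memory between the train and eval phases.
# Calling it is harmless even if it turns out not to be the culprit.
if torch.cuda.is_available():
    torch.cuda.empty_cache()

dist.destroy_process_group()
```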

> I'd try to set train epoch steps to 1 and check if the problem remains. If yes, it'll let you debug the problem faster.

Thank you! Just to confirm, are you referring to changing n_objects_per_epoch: 15000 to n_objects_per_epoch: 1?

Samleo8 closed this as completed on Jun 1, 2020
Samleo8 reopened this on Jun 1, 2020
Samleo8 (Author) commented Jun 1, 2020

> I'd try to set train epoch steps to 1 and check if the problem remains. If yes, it'll let you debug the problem faster.

> Thank you! Just to confirm, are you referring to changing n_objects_per_epoch: 15000 to n_objects_per_epoch: 1?

So I tried this and set n_objects_per_epoch to 10. As you suspected, it still got stuck. More interestingly, unfinished training work from one of the processes overflowed into the evaluation phase: while batch 8 was still training, batch 0 had already started evaluating. This looks like a problem with the multi-GPU setup. How do we fix this? @karfly
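One standard way to keep a rank that finishes early from drifting into evaluation is an explicit synchronization point between the two phases. This is a hedged sketch, assuming torch.distributed is already initialized; the function and loop structure below are illustrative, not taken from the repository:

```python
import torch
import torch.distributed as dist

def train_then_eval(model, train_loader, val_loader):
    """Illustrative epoch driver: no rank may start eval before all finish training."""
    model.train()
    for batch in train_loader:
        pass  # ... real per-batch training step for this rank goes here ...

    # Block every process here until all ranks have finished their training batches.
    if dist.is_available() and dist.is_initialized():
        dist.barrier()

    model.eval()
    with torch.no_grad():
        for batch in val_loader:
            pass  # ... real per-batch evaluation goes here ...
```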

Samleo8 (Author) commented Jun 1, 2020

Update 2: I believe the problem lies in the fact that once one of the sub-processes on a GPU finishes (so that GPU is free), it moves on to loading the eval DataLoader instead of waiting for the other processes.

Update 3: It runs fine on a single GPU, but I would really like to train on our multiple GPUs, otherwise it will take too long.

Note a possibly related issue: pytorch/pytorch#19996
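If one GPU's process really is running out of training batches before the others, a quick diagnostic is to compare how many batches each rank sees. A hedged sketch, assuming an NCCL process group is already initialized (with gloo the tensors would stay on CPU); the function name is illustrative:

```python
import torch
import torch.distributed as dist

def report_batches_per_rank(loader):
    """Illustrative diagnostic: print the per-rank batch count from rank 0."""
    device = torch.device("cuda", torch.cuda.current_device())
    local = torch.tensor([len(loader)], dtype=torch.long, device=device)
    counts = [torch.zeros_like(local) for _ in range(dist.get_world_size())]
    dist.all_gather(counts, local)
    if dist.get_rank() == 0:
        print("batches per rank:", [int(c.item()) for c in counts])
```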

Samleo8 (Author) commented Jun 1, 2020

Update 4: It seems to work after upgrading to the latest version of PyTorch (1.5.0).
