Training of CMU dataset gets stuck on batch 1 #79
Comments
Hi, @Samleo8!
Update: It seems that even after manually skipping batch 1, the problem propagates down to batch 2, which further hints at a memory or cache issue. I'm still not sure how to fix this.
Thank you! Just to confirm, are you referring to setting …?
So I tried this and set …
Update 2: I believe the problem lies in the fact that once one of the sub-processes on one GPU finishes (so that GPU is free), it moves on to loading the eval DataLoader process instead?

Update 3: It runs fine on a single GPU, but I would really like to train on our multiple GPUs, otherwise it will take too long. Note a possibly related issue: pytorch/pytorch#19996
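For anyone hitting the same hang before upgrading, the usual mitigations for DataLoader worker deadlocks of the kind discussed in pytorch/pytorch#19996 are to run the eval loader without worker processes, or to switch its multiprocessing start method. This is only a generic sketch; the dataset and loader below are placeholders, not the actual objects from this repository's training script:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset standing in for the CMU eval dataset used in this repo.
eval_dataset = TensorDataset(torch.randn(16, 3, 32, 32))

# Workaround 1: load eval data in the main process (num_workers=0) so no
# worker process can deadlock while the training workers are still alive.
eval_loader = DataLoader(eval_dataset, batch_size=1, shuffle=False, num_workers=0)

# Workaround 2 (alternative): keep workers but use the 'spawn' start method,
# which avoids some fork-related deadlocks when CUDA is already initialized.
eval_loader_spawn = DataLoader(
    eval_dataset,
    batch_size=1,
    shuffle=False,
    num_workers=2,
    multiprocessing_context="spawn",
)

for (batch,) in eval_loader:
    pass  # run evaluation on `batch` here
```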
Update 4: Seems to work after upgrading to the latest version of PyTorch (1.5.0).
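For anyone reproducing this, a quick sanity check of the environment before and after the upgrade (the exact version and CUDA strings will of course differ per install):

```python
# Quick environment check; upgrade first with e.g. `pip install --upgrade torch`.
import torch

print(torch.__version__)           # should report 1.5.0 (or newer) after the upgrade
print(torch.version.cuda)          # CUDA build the installed wheel was compiled against
print(torch.cuda.is_available())   # True if the driver/runtime are usable
print(torch.cuda.device_count())   # should show all 4 GPUs on this machine
```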
Hi, I'm trying to train the volumetric model on the CMU dataset, based on the train/val splits noted in issue #19. I am using 4 RTX 2080Ti GPUs.
Training is perfectly fine, but when evaluation reaches batch 1, the entire evaluation halts and hangs for a very long time (almost a day) before I have to stop it. The problem is reproducible: you can try running it from my forked repository here, following the CMU preprocessing instructions and running ./scripts/train_cmu.

Interestingly, if training is skipped and only evaluation is run, batch 1 takes a while (say 15 min) but eventually completes and continues. I am not sure why the problem seems to lie only with batch 1. However, when combined with training, the evaluation at batch 1 hangs consistently and indefinitely.
At first, I suspected a memory issue and so reduced the batch size to 1 (for both train and val) and num_workers to 3 and 2 respectively. This still did not solve the problem. Right now, I am testing with simply skipping the batch (a rough sketch of what I mean is below).
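Something along these lines; the model, loader, and metric handling are placeholders, not names from the actual codebase:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholders standing in for the real volumetric model and CMU eval dataset.
model = nn.Identity()
eval_loader = DataLoader(TensorDataset(torch.randn(8, 10)), batch_size=1, num_workers=2)

SKIP_BATCH_INDICES = {1}  # eval batches that consistently hang

model.eval()
with torch.no_grad():
    for batch_idx, (batch,) in enumerate(eval_loader):
        if batch_idx in SKIP_BATCH_INDICES:
            print(f"Skipping eval batch {batch_idx}")
            continue
        outputs = model(batch)
        # ... accumulate evaluation metrics on `outputs` here ...
```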
However, this still does not address the root of the problem.
Thank you!