Describe the bug
During distributed training, the script runs fine for 3-5 epochs and then hangs. The GPUs remain active and no error or stack trace is printed, but no further output appears. I cannot tell what causes it: I have rerun many times with the same configuration and environment, and the script stops at irregular intervals. It always happens early on; the latest it has hung is after 5 epochs.
Reproduction
./tools/dist_train.sh /home/ec2-user/vfnetx_config.py 8
(The config file is the same as the one in the repo; I just renamed it.)
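Since the hang produces no stack trace, one way to find out where each worker is stuck is Python's standard `faulthandler` module. The following is a minimal sketch, not part of the repo: the helper names `enable_hang_diagnostics`, `arm_watchdog`, and `disable_hang_diagnostics` are hypothetical, and it assumes you can add a few lines to the training entry point and to the training loop.

```python
import faulthandler
import signal
import sys


def enable_hang_diagnostics(log_file=sys.stderr):
    """Install handlers that print Python stacks for every thread."""
    # Dump tracebacks on fatal signals such as SIGSEGV or SIGABRT.
    faulthandler.enable(file=log_file, all_threads=True)
    # On Unix, `kill -USR1 <pid>` then dumps the stacks on demand
    # without terminating the process.
    if hasattr(faulthandler, "register") and hasattr(signal, "SIGUSR1"):
        faulthandler.register(signal.SIGUSR1, file=log_file, all_threads=True)


def arm_watchdog(timeout_s, log_file=sys.stderr):
    """(Re)start a timer that dumps all stacks if it is not re-armed in time.

    Calling this again replaces the previous timer, so calling it once per
    training iteration means a dump only happens when an iteration stalls
    for longer than timeout_s seconds.
    """
    faulthandler.dump_traceback_later(timeout_s, repeat=False, file=log_file)


def disable_hang_diagnostics():
    """Undo everything installed by the two helpers above."""
    faulthandler.cancel_dump_traceback_later()
    if hasattr(faulthandler, "unregister") and hasattr(signal, "SIGUSR1"):
        faulthandler.unregister(signal.SIGUSR1)
    faulthandler.disable()
```

Call `enable_hang_diagnostics()` once at startup and `arm_watchdog(...)` at the start of every iteration, with a timeout well above the normal iteration time. When the job hangs, each rank's log should then show the Python frame it is blocked in, for example a data loader wait or a collective call.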
Hi @thomas-ames, thanks for reporting this problem. I think the issue you ran into is the same as #10, which seems to have been solved by @oym050922021.
Hi @oym050922021, could you please share your solution to help @thomas-ames fix it? Thank you.
Hi,
Sorry, I haven't solved the problem yet.
I used this config: https://github.com/hyz-xmaster/VarifocalNet/blob/master/configs/vfnet/vfnetx_r2_101_fpn_mdconv_c3-c5_mstrain_59e_coco.py
The only difference was the data: I used custom COCO-format datasets.
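Because the custom datasets are the only change from the stock config, a quick sanity check of the annotation files can help rule the data in or out. The sketch below is an assumption-laden illustration, not a confirmed diagnosis: `summarize_coco` is a hypothetical helper, it expects the standard COCO layout (`images`, `annotations`, `categories`, with `bbox` as `[x, y, width, height]`), and nothing in this thread establishes that any of these conditions causes the hang.

```python
import json
from collections import Counter


def summarize_coco(coco):
    """Count common problems in a COCO-style annotation dict."""
    image_ids = {img["id"] for img in coco.get("images", [])}
    category_ids = {cat["id"] for cat in coco.get("categories", [])}
    anns_per_image = Counter()
    degenerate_boxes = 0
    unknown_categories = 0
    orphan_annotations = 0

    for ann in coco.get("annotations", []):
        # Annotation that points at an image id missing from "images".
        if ann["image_id"] not in image_ids:
            orphan_annotations += 1
            continue
        anns_per_image[ann["image_id"]] += 1
        # COCO boxes are [x, y, width, height]; flag non-positive sizes.
        _, _, width, height = ann["bbox"]
        if width <= 0 or height <= 0:
            degenerate_boxes += 1
        if ann["category_id"] not in category_ids:
            unknown_categories += 1

    empty_images = sum(1 for i in image_ids if anns_per_image[i] == 0)
    return {
        "images": len(image_ids),
        "images_without_annotations": empty_images,
        "degenerate_boxes": degenerate_boxes,
        "unknown_categories": unknown_categories,
        "orphan_annotations": orphan_annotations,
    }


def summarize_coco_file(path):
    """Load a COCO JSON file from `path` and summarize it."""
    with open(path, "r", encoding="utf-8") as f:
        return summarize_coco(json.load(f))
```

Running `summarize_coco_file` on the training annotation file gives a small dictionary of counts; non-zero values for anything other than `images` are worth inspecting before looking further into the training code.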
Environment