Program turns into zombie process when killed using ctrl-c
#659
Comments
How did you launch your 2-GPU job? This behavior is not expected.
Also, I just noticed that you have two different GPUs. What might be happening is that the faster GPU is waiting for the slower GPU to finish its iteration. It also seems that the 2080 Ti does not have peer-to-peer enabled, which can make multi-GPU training much slower because memory transfers between GPUs have to pass through the CPU: https://www.pugetsystems.com/labs/hpc/P2P-peer-to-peer-on-NVIDIA-RTX-2080Ti-vs-GTX-1080Ti-GPUs-1331/
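For reference, a quick way to check the peer-to-peer point from PyTorch itself; a minimal sketch, assuming a build that exposes `torch.cuda.can_device_access_peer`:

```python
# Quick check of GPU-to-GPU peer access; if this prints False, transfers between
# the two cards go through host memory, which matches the slowdown described above.
import torch

if torch.cuda.device_count() >= 2:
    for i in range(torch.cuda.device_count()):
        print(i, torch.cuda.get_device_name(i))
    print("P2P 0 -> 1:", torch.cuda.can_device_access_peer(0, 1))
```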
I reinstalled the NVIDIA driver and installed the latest pytorch-nightly, and the problem disappeared.
@fmassa My previous assessment of the problem was wrong. The actual problem is that the program often turns into a zombie process when I run my GPU job using the command
One of the config files I used is as follows:
I have tried testing with other configs as well, and the problem remains. I am quite sure there is a bug in the code because this has happened on 2 different computers (I tried running it on AWS using 2x P100s as well). Environment on AWS:
I thought I had solved it, but apparently not.
This is a problem with the cleanup in the PyTorch distributed launch utility: when one of the processes dies, the others might not be killed. cc'ing @pietern to see if he has ideas on how to avoid this situation.
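Until the launcher handles this itself, one workaround pattern is to wrap the launch command so that Ctrl-C reaches the whole process group; a rough sketch under that assumption (script path and arguments are placeholders, not taken from this issue):

```python
# Workaround sketch (not the PyTorch launcher itself): run the training command
# in its own process group and forward Ctrl-C to the whole group, so that no
# orphaned workers are left behind.
import os
import signal
import subprocess
import sys

def main():
    cmd = [sys.executable, "-m", "torch.distributed.launch",
           "--nproc_per_node=2", "tools/train_net.py"]  # placeholder command
    # start_new_session=True gives the launcher and its workers a fresh process group.
    proc = subprocess.Popen(cmd, start_new_session=True)

    def forward(signum, frame):
        try:
            os.killpg(os.getpgid(proc.pid), signum)  # signal every process in the group
        except ProcessLookupError:
            pass  # the group is already gone

    signal.signal(signal.SIGINT, forward)
    signal.signal(signal.SIGTERM, forward)
    proc.wait()

if __name__ == "__main__":
    main()
```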
If you use
@chengyangfu My expectation is that the … I was browsing through the issues, and it seems like issue #58 is related to the problem discussed here. The root problem is probably the same: the coordination and communication among the many launched processes are problematic.
I'm having the same problem.
Same here
I ran into a similar problem. I trained the model with 4 GPUs. After training for a few thousand mini-batches, one process died (I cannot tell when or how it died), while the utilization of the other three GPUs stayed at 100%, but training had stopped.
As shown above, the process whose PID should be 65247 has been killed for some reason. How should I fix this problem? I cannot reinstall the NVIDIA driver because I do not have root rights.
@Marcovaldong This is not related to the zombie process problem tracked in this issue. What you're seeing is that a single process crashing causes the remaining processes to launch NCCL kernels that will never complete. This is a known problem with NCCL and has been addressed in the most recent minor release (2.4). There is work in progress to add error detection to the NCCL bindings in PyTorch in pytorch/pytorch#22907. Once that is done and merged, the remaining processes will raise an error once one of their peers is no longer reachable or has crashed.
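A sketch of how the surviving ranks can be made to fail instead of hanging, assuming a PyTorch version that supports a process-group timeout and the NCCL async-error-handling environment variable (neither option is taken from this thread):

```python
# Sketch: give the NCCL process group a finite timeout so a crashed peer
# eventually surfaces as an error in the surviving ranks instead of a silent
# hang at 100% GPU utilization.
import datetime
import os
import torch.distributed as dist

# Opt-in asynchronous NCCL error handling in newer PyTorch releases; ignored by
# versions that do not know the variable.
os.environ.setdefault("NCCL_ASYNC_ERROR_HANDLING", "1")

# Assumes the usual launcher-provided environment variables (MASTER_ADDR, RANK, ...).
dist.init_process_group(
    backend="nccl",
    init_method="env://",
    timeout=datetime.timedelta(minutes=10),  # fail loudly instead of waiting forever
)
```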
@pietern Thanks for your reply. I have fixed my problem. There was a dirty sample in my 700k training dataset; I have filtered it out.
I'm still having this issue in 2022. It occurs when my training process goes awry and a tensor of NaN values is fed to torch.nn.functional.binary_cross_entropy. I then have to close the terminal window and cannot kill the resulting zombie process; the only solution seems to be to restart the server. P.S. I'm training with two different GPUs using nn.DataParallel. Has anyone found a solution yet? None of the solutions above work for me. CUDA version: 11.7
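One hedged way to keep a NaN batch from wedging the whole job is to validate the loss inputs first; a minimal sketch with made-up names, not the reporter's code:

```python
# Illustrative guard: detect NaNs before they reach binary_cross_entropy so the
# worker fails with a normal Python exception instead of hanging the multi-GPU job.
import torch
import torch.nn.functional as F

def safe_bce(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    if torch.isnan(pred).any():
        raise ValueError("NaN detected in model output; aborting this step")
    # binary_cross_entropy expects probabilities in [0, 1]; clamping avoids log(0).
    pred = pred.clamp(1e-7, 1 - 1e-7)
    return F.binary_cross_entropy(pred, target)
```

Running with `torch.autograd.set_detect_anomaly(True)` while debugging can also help point at the operation that first produced the NaNs.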
Same here |
🐛 Bug
0% utilization on the second GPU in 2-GPU training
Is the second GPU only used to store tensors? Is multi-GPU training in this codebase specially implemented, such that it differs from standard multi-GPU training in PyTorch?
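For context, and as far as I can tell from the rest of the thread, multi-GPU training here goes through `torch.distributed.launch` with one process per GPU, so both GPUs should be doing real work. A minimal sketch of that standard pattern (placeholder model, not this repository's actual training loop):

```python
# Sketch of the one-process-per-GPU pattern used with torch.distributed.launch.
import argparse
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)  # injected by the launcher
    args = parser.parse_args()

    torch.cuda.set_device(args.local_rank)                 # each process drives one GPU
    dist.init_process_group(backend="nccl", init_method="env://")

    model = torch.nn.Linear(10, 10).cuda()                 # placeholder model
    model = DistributedDataParallel(model, device_ids=[args.local_rank])
    # ... per-rank data loader and training loop go here ...

if __name__ == "__main__":
    main()
```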
To Reproduce
Run training code with 2 GPUs
Expected behavior
Comparable utilization on both GPUs?
Environment
UPDATE: Note that this is actually an incorrect description of the problem, but it is kept here to preserve the flow of the thread. The correct description of the problem is in the post below.