Program turns into zombie process when killed using ctrl-c
#659
Comments
How did you launch your 2-GPU job? This behavior is not expected.
Also, I just noticed that you have two different GPUs. What might be happening is that the faster GPU is waiting for the slower GPU to finish its iteration. It also seems that the 2080 Ti does not have peer-to-peer enabled, which can make multi-GPU training much slower because memory transfers between GPUs have to pass through the CPU: https://www.pugetsystems.com/labs/hpc/P2P-peer-to-peer-on-NVIDIA-RTX-2080Ti-vs-GTX-1080Ti-GPUs-1331/
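For reference, a quick way to check the peer-to-peer point from PyTorch itself; a minimal sketch, assuming a build that exposes `torch.cuda.can_device_access_peer`:

```python
# Quick check of GPU-to-GPU peer access; if this prints False, transfers between
# the two cards go through host memory, which matches the slowdown described above.
import torch

if torch.cuda.device_count() >= 2:
    for i in range(torch.cuda.device_count()):
        print(i, torch.cuda.get_device_name(i))
    print("P2P 0 -> 1:", torch.cuda.can_device_access_peer(0, 1))
```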
I reinstalled the NVIDIA driver and installed the latest pytorch-nightly, and the problem disappeared.
@fmassa My previous assessment of the problem was wrong. The actual problem is that the program often turns into a zombie process when I run my GPU job using the command
One of the config files I used is as follows:
I have tried testing with other configs as well, and the problem remains. I am quite sure there is a bug in the code because this has happened on 2 different computers (I tried running it on AWS using 2x P100s as well). Environment on AWS:
I thought I had solved it, but apparently not.
This is a problem with the cleanup in the PyTorch distributed launch utility: when one of the processes dies, the others might not be killed. cc'ing @pietern to see if he has ideas on how to avoid this situation.
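Until the launcher handles this itself, one workaround pattern is to wrap the launch command so that Ctrl-C reaches the whole process group; a rough sketch under that assumption (script path and arguments are placeholders, not taken from this issue):

```python
# Workaround sketch (not the PyTorch launcher itself): run the training command
# in its own process group and forward Ctrl-C to the whole group, so that no
# orphaned workers are left behind.
import os
import signal
import subprocess
import sys

def main():
    cmd = [sys.executable, "-m", "torch.distributed.launch",
           "--nproc_per_node=2", "tools/train_net.py"]  # placeholder command
    # start_new_session=True gives the launcher and its workers a fresh process group.
    proc = subprocess.Popen(cmd, start_new_session=True)

    def forward(signum, frame):
        try:
            os.killpg(os.getpgid(proc.pid), signum)  # signal every process in the group
        except ProcessLookupError:
            pass  # the group is already gone

    signal.signal(signal.SIGINT, forward)
    signal.signal(signal.SIGTERM, forward)
    proc.wait()

if __name__ == "__main__":
    main()
```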
If you use
@chengyangfu My expectation is that the … I was browsing through the issues, and it seems like issue #58 is related to the problem discussed here. The root problem is probably the same: the coordination and communication among the many launched processes are problematic.
I'm having the same problem.
Same here
I ran into a similar problem. I trained the model with 4 GPUs. After training for a few thousand mini-batches, one process died (I cannot tell when or how it died), while the utilization of the other three GPUs stayed at 100%, but training had stopped.
As shown above, the process whose PID should be 65247 has been killed for some reason. How should I fix this problem? I cannot reinstall the NVIDIA driver because I do not have root rights.
@Marcovaldong This is not related to the zombie process problem tracked in this issue. What you're seeing is that a single process crashing causes the remaining processes to launch NCCL kernels that will never complete. This is a known problem with NCCL and has been addressed in the most recent minor release (2.4). There is work in progress to add error detection to the NCCL bindings in PyTorch in pytorch/pytorch#22907. Once that is done and merged, the remaining processes will raise an error once one of their peers is no longer reachable or has crashed.
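A sketch of how the surviving ranks can be made to fail instead of hanging, assuming a PyTorch version that supports a process-group timeout and the NCCL async-error-handling environment variable (neither option is taken from this thread):

```python
# Sketch: give the NCCL process group a finite timeout so a crashed peer
# eventually surfaces as an error in the surviving ranks instead of a silent
# hang at 100% GPU utilization.
import datetime
import os
import torch.distributed as dist

# Opt-in asynchronous NCCL error handling in newer PyTorch releases; ignored by
# versions that do not know the variable.
os.environ.setdefault("NCCL_ASYNC_ERROR_HANDLING", "1")

# Assumes the usual launcher-provided environment variables (MASTER_ADDR, RANK, ...).
dist.init_process_group(
    backend="nccl",
    init_method="env://",
    timeout=datetime.timedelta(minutes=10),  # fail loudly instead of waiting forever
)
```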
@pietern Thanks for your reply. I have fixed my problem. There was a dirty sample in my 700k training dataset; I have filtered it out.
I'm still having this issue in 2022. It occurs when my training process goes awry and a tensor of NaN values is fed to torch.nn.functional.binary_cross_entropy. I then have to close the terminal window and cannot kill the resulting zombie process; the only solution seems to be to restart the server. P.S. I'm training with two different GPUs using nn.DataParallel. Has anyone found a solution yet? None of the solutions above work for me. CUDA version: 11.7
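One hedged way to keep a NaN batch from wedging the whole job is to validate the loss inputs first; a minimal sketch with made-up names, not the reporter's code:

```python
# Illustrative guard: detect NaNs before they reach binary_cross_entropy so the
# worker fails with a normal Python exception instead of hanging the multi-GPU job.
import torch
import torch.nn.functional as F

def safe_bce(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    if torch.isnan(pred).any():
        raise ValueError("NaN detected in model output; aborting this step")
    # binary_cross_entropy expects probabilities in [0, 1]; clamping avoids log(0).
    pred = pred.clamp(1e-7, 1 - 1e-7)
    return F.binary_cross_entropy(pred, target)
```

Running with `torch.autograd.set_detect_anomaly(True)` while debugging can also help point at the operation that first produced the NaNs.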
Same here |
🐛 Bug
0% utilization on the second GPU in 2-GPU training
Is the second GPU only used to store tensors? Is multi-GPU training in this codebase specially implemented, such that it differs from standard multi-GPU training in PyTorch?
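For context, and as far as I can tell from the rest of the thread, multi-GPU training here goes through `torch.distributed.launch` with one process per GPU, so both GPUs should be doing real work. A minimal sketch of that standard pattern (placeholder model, not this repository's actual training loop):

```python
# Sketch of the one-process-per-GPU pattern used with torch.distributed.launch.
import argparse
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)  # injected by the launcher
    args = parser.parse_args()

    torch.cuda.set_device(args.local_rank)                 # each process drives one GPU
    dist.init_process_group(backend="nccl", init_method="env://")

    model = torch.nn.Linear(10, 10).cuda()                 # placeholder model
    model = DistributedDataParallel(model, device_ids=[args.local_rank])
    # ... per-rank data loader and training loop go here ...

if __name__ == "__main__":
    main()
```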
To Reproduce
Run training code with 2 GPUs
Expected behavior
Comparable utilization on both GPUs?
Environment
UPDATE: Note that this is actually an incorrect description of the problem, but it is kept here to preserve the flow of the thread. The correct description of the problem is in the post below.