This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

Program turns into zombie process when killed using ctrl-c #659

Open
yxchng opened this issue Apr 10, 2019 · 14 comments

@yxchng

yxchng commented Apr 10, 2019

🐛 Bug

0% utilization on the second GPU during 2-GPU training

[Screenshot from 2019-04-10 17-25-07]

Is the second GPU only used to store tensors? Is multi-GPU training in this codebase specially implemented, such that it differs from standard multi-GPU training in PyTorch?

To Reproduce

Run training code with 2 GPUs

Expected behavior

Comparable utilization on both GPUs?

Environment

PyTorch version: 1.0.0.dev20190409
Is debug build: No
CUDA used to build PyTorch: 10.0.130

OS: Ubuntu 16.04.4 LTS
GCC version: (Ubuntu 5.5.0-12ubuntu1~16.04) 5.5.0 20171010
CMake version: version 3.5.1

Python version: 3.5
Is CUDA available: Yes
CUDA runtime version: 10.0.130
GPU models and configuration:
GPU 0: GeForce RTX 2080 Ti
GPU 1: TITAN X (Pascal)

Nvidia driver version: 418.39
cuDNN version: Could not collect

Versions of relevant libraries:
[pip] Could not collect
[conda] Could not collect
Pillow (6.0.0)

UPDATE: Note that this is actually an incorrect description of the problem, but it is kept here to preserve the flow of the thread. The correct description of the problem is in the post below.

@fmassa
Contributor

fmassa commented Apr 10, 2019

How did you launch your 2-GPU job? This behavior is not expected.

@fmassa
Contributor

fmassa commented Apr 10, 2019

Also, I just noticed that you have two different GPUs. What might be happening is that the faster GPU is waiting for the slower GPU to finish its iteration.

It also seems that the 2080 Ti does not have peer-to-peer (P2P) enabled, which can make multi-GPU training much slower because memory transfers between GPUs then have to go through the CPU:

https://www.pugetsystems.com/labs/hpc/P2P-peer-to-peer-on-NVIDIA-RTX-2080Ti-vs-GTX-1080Ti-GPUs-1331/

@yxchng
Author

yxchng commented Apr 11, 2019

I reinstalled the NVIDIA driver, installed the latest pytorch-nightly, and the problem disappeared.

@yxchng yxchng closed this as completed Apr 11, 2019
@yxchng yxchng reopened this Apr 11, 2019
@yxchng yxchng changed the title 0% utilization in second GPU in 2x GPUs training Program turns into zombie process Apr 11, 2019
@yxchng
Author

yxchng commented Apr 11, 2019

@fmassa My previous assessment of the problem was wrong. The actual problem is that the program often turns into a zombie process when I ctrl-c to kill it: it is no longer running, but it still hogs memory and appears in top and nvidia-smi. The 100% utilization shown in nvidia-smi is misleading, because the program has already stopped. I always have to kill each spawned process manually by PID with the kill command, and sometimes even that doesn't work; in those cases I can only reboot the machine.

I run my GPU job using the following command:

NGPU=2
python -m torch.distributed.launch --nproc_per_node=$NGPU tools/train_net.py --config-file configs/<...>

One of the config files I used is the following:

MODEL:
  META_ARCHITECTURE: "GeneralizedRCNN"
  WEIGHT: "catalog://ImageNetPretrained/FAIR/20171220/X-101-32x8d"
  BACKBONE:
    CONV_BODY: "R-101-FPN"
  RPN:
    USE_FPN: True
    ANCHOR_STRIDE: (4, 8, 16, 32, 64)
    PRE_NMS_TOP_N_TRAIN: 2000
    PRE_NMS_TOP_N_TEST: 1000
    POST_NMS_TOP_N_TEST: 1000
    FPN_POST_NMS_TOP_N_TEST: 1000
  ROI_HEADS:
    USE_FPN: True
  ROI_BOX_HEAD:
    POOLER_RESOLUTION: 7
    POOLER_SCALES: (0.25, 0.125, 0.0625, 0.03125)
    POOLER_SAMPLING_RATIO: 2
    FEATURE_EXTRACTOR: "FPN2MLPFeatureExtractor"
    PREDICTOR: "FPNPredictor"
  RESNETS:
    BACKBONE_OUT_CHANNELS: 256
    STRIDE_IN_1X1: False
    NUM_GROUPS: 32
    WIDTH_PER_GROUP: 8
DATASETS:
  TRAIN: ("crowdhuman_train", )
  TEST: ("crowdhuman_val",)
DATALOADER:
  SIZE_DIVISIBILITY: 32
SOLVER:
  BASE_LR: 0.02
  WEIGHT_DECAY: 0.0001
  STEPS: (60000, 80000)
  MAX_ITER: 90000
  IMS_PER_BATCH: 2
TEST:
  IMS_PER_BATCH: 2
INPUT:
  MIN_SIZE_TRAIN: (800,)
  MAX_SIZE_TRAIN: 1333
  MIN_SIZE_TEST: 800
  MAX_SIZE_TEST: 1333
OUTPUT_DIR: "results/exp2"

I have tried testing with other configs as well and the problem remains.

I am quite sure there is a bug in the code, because this has happened on 2 different computers (I tried running it on AWS with 2x P100s as well).

Environment on AWS

PyTorch version: 1.1.0a0+be364ac
Is debug build: No
CUDA used to build PyTorch: 10.1.105

OS: Ubuntu 16.04.5 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
CMake version: version 3.5.1

Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 10.1.105
GPU models and configuration:
GPU 0: Tesla P100-PCIE-16GB
GPU 1: Tesla P100-PCIE-16GB

Nvidia driver version: 410.104
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.5.0

Versions of relevant libraries:
[pip] msgpack-numpy==0.4.3.2
[pip] numpy==1.16.2
[pip] torch==1.1.0a0+be364ac
[pip] torchtext==0.4.0
[pip] torchvision==0.2.1
[conda] blas 1.0 mkl anaconda
[conda] magma-cuda100 2.1.0 5 local
[conda] mkl 2019.1 144
[conda] mkl-include 2019.1 144
[conda] mkl_fft 1.0.10 py36ha843d7b_0 anaconda
[conda] mkl_random 1.0.2 py36hd81dba3_0 anaconda
[conda] torch 1.1.0a0+be364ac pypi_0 pypi
[conda] torchtext 0.4.0 pypi_0 pypi
[conda] torchvision 0.2.1 pypi_0 pypi
Pillow (5.3.0.post0)

I thought I had solved it, but apparently not.

@yxchng yxchng changed the title Program turns into zombie process Program turns into zombie process for some reasons Apr 11, 2019
@yxchng yxchng changed the title Program turns into zombie process for some reasons Program turns into zombie process when it is killed using ctrl-c Apr 11, 2019
@yxchng yxchng changed the title Program turns into zombie process when it is killed using ctrl-c Program turns into zombie process when killed using ctrl-c Apr 11, 2019
@fmassa
Contributor

fmassa commented Apr 11, 2019

This is a problem with the cleanup in PyTorch's distributed launch utility: when one of the processes dies, the others might not be killed.

cc'ing @pietern to see if he has ideas on how to avoid this situation.
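
A possible stopgap until the launch utility cleans up properly: start torch.distributed.launch in its own process group from a small wrapper and forward Ctrl-C to the whole group, so the workers and their dataloader children are interrupted together. This is only an illustrative sketch, not code from this repository; the config path is a placeholder.

import os
import signal
import subprocess
import sys

# Run torch.distributed.launch in a new session, so the launcher and all of
# its children share a process group whose pgid we control.
cmd = [
    sys.executable, "-m", "torch.distributed.launch",
    "--nproc_per_node=2",
    "tools/train_net.py",
    "--config-file", "configs/my_config.yaml",  # placeholder config path
]
proc = subprocess.Popen(cmd, start_new_session=True)

def forward_sigint(signum, frame):
    # Forward Ctrl-C to the whole group: launcher, workers, dataloader children.
    os.killpg(os.getpgid(proc.pid), signal.SIGINT)

signal.signal(signal.SIGINT, forward_sigint)
sys.exit(proc.wait())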

@chengyangfu
Contributor

If you use ctrl-c to stop the program, be careful to kill every remaining process. In your case (2 GPUs), there are around 2 + 8 (data loading) processes. I usually run ps aux | grep python to find and kill everything related to the training program.
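
A rough scripted version of that tip, assuming psutil is installed; the patterns below match the launch command shown earlier and are only an illustration, not part of this codebase.

import psutil

# Terminate any leftover process whose command line mentions the training
# entry point or the distributed launcher; escalate to SIGKILL if needed.
PATTERNS = ("tools/train_net.py", "torch.distributed.launch")

victims = []
for proc in psutil.process_iter(attrs=["pid", "cmdline"]):
    cmdline = " ".join(proc.info["cmdline"] or [])
    if any(p in cmdline for p in PATTERNS):
        try:
            proc.terminate()  # SIGTERM first
            victims.append(proc)
        except psutil.NoSuchProcess:
            pass

gone, alive = psutil.wait_procs(victims, timeout=5)
for proc in alive:
    proc.kill()  # SIGKILL anything that ignored SIGTERM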

@yxchng
Author

yxchng commented Apr 12, 2019

@chengyangfu My expectation is that the ctrl-c signal should propagate to every process, so I shouldn't have to kill them all manually. A good multiprocessing implementation should not have this problem, so I would consider it a bug. I have not had time to read through the code yet, but is this library just using the tools provided by PyTorch, such that the problem lies in PyTorch itself? It is still strange, though, because I have been using PyTorch's DataParallel all along in my other multi-GPU training code and have never hit this problem.

I was browsing through the issues, and it seems that issue #58 is related to the problem discussed here. The root cause is probably the same: the coordination and communication among the many launched processes is problematic.
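
For reference, a minimal worker-side sketch of that expectation (placeholder training loop, not maskrcnn-benchmark code): catch KeyboardInterrupt in the entry point and tear down the process group before exiting, so an interrupted rank releases its resources instead of lingering.

import sys
import torch.distributed as dist

def train_loop():
    # Placeholder for the real training loop in tools/train_net.py.
    while True:
        pass

def main():
    dist.init_process_group(backend="nccl", init_method="env://")
    try:
        train_loop()
    except KeyboardInterrupt:
        # Release communicator resources on Ctrl-C instead of leaving the
        # rank alive holding GPU memory.
        dist.destroy_process_group()
        sys.exit(130)  # conventional exit code for SIGINT

if __name__ == "__main__":
    main()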

@ray-lee-94

I'm seeing the same problem.

@sanshibayuan

Same here

@Marcovaldong

I met a similar problem. I trained the model with 4 GPUs. After training for thousands of mini-batches, one process died (I cannot tell when or how it died); the utilization of the other three GPUs stayed at 100%, but the training had stopped.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 387.26                 Driver Version: 387.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:04:00.0 Off |                  N/A |
| 20%   27C    P8    16W / 250W |      0MiB / 11172MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:05:00.0 Off |                  N/A |
| 20%   32C    P8    16W / 250W |      0MiB / 11172MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108...  Off  | 00000000:08:00.0 Off |                  N/A |
| 20%   28C    P8    16W / 250W |      0MiB / 11172MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 108...  Off  | 00000000:09:00.0 Off |                  N/A |
| 20%   30C    P8    16W / 250W |      0MiB / 11172MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  GeForce GTX 108...  Off  | 00000000:85:00.0 Off |                  N/A |
| 24%   58C    P2    77W / 250W |   3764MiB / 11172MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   5  GeForce GTX 108...  Off  | 00000000:86:00.0 Off |                  N/A |
| 20%   54C    P2    78W / 250W |   4110MiB / 11172MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   6  GeForce GTX 108...  Off  | 00000000:89:00.0 Off |                  N/A |
| 20%   29C    P8    15W / 250W |     41MiB / 11172MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  GeForce GTX 108...  Off  | 00000000:8A:00.0 Off |                  N/A |
| 24%   58C    P2    74W / 250W |   3906MiB / 11172MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    4     65245      C   ...ongsq/environments/anaconda3/bin/python  3731MiB |
|    5     65246      C   ...ongsq/environments/anaconda3/bin/python  4077MiB |
|    7     65248      C   ...ongsq/environments/anaconda3/bin/python  3873MiB |
+-----------------------------------------------------------------------------+

As shown above, the process whose PID should be 65247 has been killed for some reason. How can I fix this problem? I cannot reinstall the NVIDIA driver because I don't have root access.

@pietern

pietern commented Jul 22, 2019

@Marcovaldong This is not related to the zombie process problem tracked in this issue.

What you're seeing is that a single process crashing causes the remaining processes to launch NCCL kernels that will never complete. This is a known problem with NCCL and has been addressed in the most recent minor release (2.4). There is work in progress to add error detection to the NCCL bindings in PyTorch in pytorch/pytorch#22907. Once that is done and merged, the remaining processes will raise an error once one of their peers is no longer reachable or has crashed.
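
Once that work is available, the mitigation looks roughly like the sketch below. This is hedged: the exact environment variable and supported arguments depend on the PyTorch release, so check the docs for your version. The idea is to give the NCCL process group a finite timeout so surviving ranks raise an error instead of spinning at 100% forever.

import datetime
import os
import torch.distributed as dist

# Ask the NCCL backend to block on collectives and honor the timeout below,
# so a crashed peer surfaces as an error instead of an indefinite hang.
# (Environment variable name as used by later PyTorch releases.)
os.environ.setdefault("NCCL_BLOCKING_WAIT", "1")

dist.init_process_group(
    backend="nccl",
    init_method="env://",
    timeout=datetime.timedelta(minutes=10),
)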

@Marcovaldong

@pietern Thanks for your reply. I have fixed my problem: there was a bad sample in my 700k training dataset, and I have weeded it out.

@jrsykes

jrsykes commented May 20, 2022

I'm still having this issue in 2022. It occurs when my training process goes awry and a tensor of NaN values is fed to torch.nn.functional.binary_cross_entropy. I then have to close the terminal window and cannot kill the resulting zombie process. The only solution seems to be to restart the server.
It may be a coincidence, but this behaviour is new since I upgraded the NVIDIA driver a few days ago.

P.S. I am training with two different GPUs using nn.DataParallel.

Has anyone found a solution yet? None of the solutions above work for me.

CUDA version: 11.7
PyTorch: 1.11.0
Python 3.7.13
Ubuntu 18.04.6 LTS
NVIDIA-SMI 515.43.04
Driver Version: 515.43.04
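
Not a fix for the zombie processes themselves, but a guard against the trigger described above: check that predictions are finite before they reach binary_cross_entropy. The names below (safe_bce, probs, targets) are placeholders, not anyone's actual training code.

import torch
import torch.nn.functional as F

def safe_bce(probs, targets):
    # Fail fast when predictions are no longer finite, rather than feeding
    # NaNs into the loss and wedging the run.
    if not torch.isfinite(probs).all():
        raise FloatingPointError("non-finite values in predictions before BCE")
    return F.binary_cross_entropy(probs, targets)

# Example usage with dummy tensors:
probs = torch.sigmoid(torch.randn(4, 1))
targets = torch.randint(0, 2, (4, 1)).float()
loss = safe_bce(probs, targets)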

@ilml

ilml commented Oct 6, 2023

Same here
