
Training GPU+CPU Utilization Stops #20

Closed
vjsrinivas opened this issue Jan 12, 2021 · 6 comments
Labels
bug Something isn't working

Comments

@vjsrinivas

I'm following the training procedure laid out in the README and ran CUDA_VISIBLE_DEVICES=0 python main.py &>log &.

It seems to run fine until near the end of the first epoch, at which point GPU and CPU utilization completely stops. The utilization never recovers, so the first epoch never actually finishes.

Here is the output from my log:

Namespace(batch_size=4, cache_dir='./checkpoints', data_root='hicodet', human_thresh=0.2,
learning_rate=0.001, lr_decay=0.1, milestones=[10], model_path='', momentum=0.9, 
num_epochs=15, num_iter=2, num_workers=4, object_thresh=0.2, print_interval=2000,
random_seed=1, train_detection_dir='hicodet/detections/train2015', 
val_detection_dir='hicodet/detections/test2015', weight_decay=0.0001)
Epoch [1/15], Iter. [2000/9409], Loss: 1.3726, Time[Data/Iter.]: [3.80s/1123.74s]
Epoch [1/15], Iter. [4000/9409], Loss: 1.1580, Time[Data/Iter.]: [3.53s/1105.00s]
Epoch [1/15], Iter. [6000/9409], Loss: 1.0998, Time[Data/Iter.]: [3.52s/1102.66s]
Epoch [1/15], Iter. [8000/9409], Loss: 1.0792, Time[Data/Iter.]: [3.72s/1140.21s]

My system specs as well:
OS: Pop!_OS 20.04 LTS x86_64
CPU: AMD Ryzen 7 2700X (16) @ 3.700GHz
GPU: NVIDIA GeForce RTX 2070 SUPER
Memory: 16017MiB
CUDA: 10.2

@vjsrinivas
Author

I believe it might be related to another issue I was having with evaluating detections on the hicodet repo.

When I run python eval_detections.py --detection-root ./test2015, it loads all 9k detections but does nothing afterward (i.e. no CPU utilization). I believe this issue and the hicodet problem are the same, since DetectionAPMeter is run at the end of each training epoch as well as in eval_detections.py.

@fredzzhang added the bug label on Jan 13, 2021
@fredzzhang
Owner

Hi, @vjsrinivas

Thanks for reporting the problem.

I did encounter something similar in a new environment, different from the one the repo was developed in. You are correct: the problem seems to be related to DetectionAPMeter. This class invokes multiprocessing when computing mAP, which hangs the process for some reason. Since the DetectionAPMeter class and all of its computation are CPU-based, I have ruled out the possibility of CUDA having unknown issues.

The code was developed under Ubuntu 14.04 LTS with Python 3.7.9. I think it might have something to do with one of these two things. What Python version are you running?
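
If it's quick to check, the default multiprocessing start method on your system might also be relevant. A throwaway diagnostic snippet (not part of the repo) to print the details that matter here:

import multiprocessing
import platform
import sys

# Print the Python version, the OS, and the default multiprocessing
# start method ("fork" on Linux; "spawn" on Windows, and on macOS
# since Python 3.8).
print(sys.version)
print(platform.platform())
print(multiprocessing.get_start_method())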

Cheers,
Fred

@vjsrinivas
Author

Thanks for the quick response @fredzzhang
I am running Python 3.7.9 as well, but I'm on Pop!_OS 20.04.

@fredzzhang
Owner

The OS could be the reason. I had the same issue when I was running on Ubuntu 20.04 LTS.

See if disabling the multiprocessing works. Start with the eval_detections.py script, since this one runs really fast. Make the following change:

@@ -28,7 +28,7 @@ def compute_map(
     num_pairs_object = torch.zeros(80)
     associate = BoxAssociation(min_iou=min_iou)
     meter = DetectionAPMeter(
-        80, algorithm='INT', nproc=10
+        80, algorithm='INT', nproc=1
     )
     # Skip images without valid human-object pairs
     valid_idx = dataset._idx

Specifying nproc=1 internally disables multiprocessing. If the process no longer hangs, make the same change in the training script. Go to spatio-attentive-graphs/utils.py, where the learning engine is defined, and make the following change:

@@ -172,7 +172,7 @@ class CustomisedLE(LearningEngine):
         self.num_classes = num_classes
 
     def _on_start(self):
-        self.meter = DetectionAPMeter(self.num_classes, algorithm='11P')
+        self.meter = DetectionAPMeter(self.num_classes, algorithm='11P', nproc=1)
         self.timer = HandyTimer(maxlen=2)
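
For context, the reason nproc=1 avoids the hang is that the per-class AP computation then runs in the current process rather than in a pool of workers. A rough sketch of the pattern (not the actual pocket implementation; ap_for_class and the data layout are made up for illustration):

import multiprocessing

def ap_for_class(args):
    # Hypothetical per-class AP computation; placeholder maths only.
    scores, labels = args
    return sum(labels) / max(len(labels), 1)

def compute_all_ap(per_class_data, nproc=1):
    # nproc=1: do everything in the current process, so no worker
    # processes are created and there is nothing to hang on.
    if nproc == 1:
        return [ap_for_class(d) for d in per_class_data]
    # nproc>1: farm the per-class work out to a process pool, which is
    # the code path that hangs on some systems.
    with multiprocessing.Pool(nproc) as pool:
        return pool.map(ap_for_class, per_class_data)

if __name__ == '__main__':
    print(compute_all_ap([([0.9, 0.1], [1, 0]), ([0.8], [1])], nproc=1))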

Let me know if this solves the issue.

Cheers,
Fred

@vjsrinivas
Author

Thank you for your solution. It seems to have worked!

I was also working on a quick solution for this, and got the multiprocessing to work by modifying meters.py in pocket as follows:

...
"""
The Australian National University
Australian Centre for Robotic Vision
"""
import multiprocessing
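# Force the "spawn" start method globally, so pool workers start in
# fresh interpreters instead of being forked from the parent process.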
multiprocessing.set_start_method("spawn", force=True)

import time
...

and replacing the pool creation in compute_ap, i.e.

    with multiprocessing.Pool(nproc) as pool:

with

    with multiprocessing.get_context("spawn").Pool(nproc) as pool:

(roughly as sketched below).

I have no idea whether this fix will cause problems on other distros, though. I used this as reference.
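
Roughly, the resulting pool usage looks like this (a standalone sketch with a made-up worker function, not the actual compute_ap internals from pocket):

import multiprocessing

def _ap_worker(args):
    # Hypothetical per-class AP computation; placeholder maths only.
    scores, labels = args
    return sum(labels) / max(len(labels), 1)

def compute_ap_spawn(per_class_data, nproc=10):
    # A "spawn" context starts fresh interpreter processes instead of
    # forking the parent, which is what avoided the hang for me.
    with multiprocessing.get_context("spawn").Pool(nproc) as pool:
        return pool.map(_ap_worker, per_class_data)

if __name__ == '__main__':
    # With "spawn", the worker must be importable from the module's top
    # level, and the entry point needs the __main__ guard.
    print(compute_ap_spawn([([0.9], [1]), ([0.2], [0])], nproc=2))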

@fredzzhang
Owner

Thank you very much for the reference! I'll update meters.py to override the method that starts the subprocesses.

Fred
