
Training GPU+CPU Utilization Stops #20

Closed
vjsrinivas opened this issue Jan 12, 2021 · 6 comments
Labels
bug Something isn't working

Comments

@vjsrinivas

I'm following the training procedure laid out in the README and ran CUDA_VISIBLE_DEVICES=0 python main.py &>log &.

It seems to run fine until near the end of the first epoch, at which point GPU and CPU utilization completely stops. The utilization never recovers, so the first epoch never actually finishes.

Here is the output from my log:

Namespace(batch_size=4, cache_dir='./checkpoints', data_root='hicodet', human_thresh=0.2,
learning_rate=0.001, lr_decay=0.1, milestones=[10], model_path='', momentum=0.9, 
num_epochs=15, num_iter=2, num_workers=4, object_thresh=0.2, print_interval=2000,
random_seed=1, train_detection_dir='hicodet/detections/train2015', 
val_detection_dir='hicodet/detections/test2015', weight_decay=0.0001)
Epoch [1/15], Iter. [2000/9409], Loss: 1.3726, Time[Data/Iter.]: [3.80s/1123.74s]
Epoch [1/15], Iter. [4000/9409], Loss: 1.1580, Time[Data/Iter.]: [3.53s/1105.00s]
Epoch [1/15], Iter. [6000/9409], Loss: 1.0998, Time[Data/Iter.]: [3.52s/1102.66s]
Epoch [1/15], Iter. [8000/9409], Loss: 1.0792, Time[Data/Iter.]: [3.72s/1140.21s]

My system specs as well:
OS: Pop!_OS 20.04 LTS x86_64
CPU: AMD Ryzen 7 2700X (16) @ 3.700GHz
GPU: NVIDIA GeForce RTX 2070 SUPER
Memory: 16017MiB
CUDA: 10.2

@vjsrinivas
Author

I believe it might be related to another issue I was having with evaluating detections on the hicodet repo.

When I run python eval_detections.py --detection-root ./test2015, it loads all 9k detections but does nothing afterward (i.e. no CPU utilization). I believe this issue and the hicodet problem are the same, since DetectionAPMeter is run at the end of each training epoch as well as in eval_detections.py.

@fredzzhang added the bug label on Jan 13, 2021
@fredzzhang
Owner

Hi, @vjsrinivas

Thanks for reporting the problem.

I did encounter something similar in a new environment, different from the one the repo was developed in. You are correct: the problem seems to be related to DetectionAPMeter. This class invokes multiprocessing when computing mAP, which hangs the process for some reason. Since the DetectionAPMeter class and all of its computation are CPU-based, I have ruled out the possibility of CUDA having unknown issues.

The code was developed under Ubuntu 14.04 LTS with Python 3.7.9. I think it might have something to do with one of these two things. What Python version are you running?
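
If it's quick to check, the default multiprocessing start method on your system might also be relevant. A throwaway diagnostic snippet (not part of the repo) to print the details that matter here:

import multiprocessing
import platform
import sys

# Print the Python version, the OS, and the default multiprocessing
# start method ("fork" on Linux; "spawn" on Windows, and on macOS
# since Python 3.8).
print(sys.version)
print(platform.platform())
print(multiprocessing.get_start_method())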

Cheers,
Fred

@vjsrinivas
Author

Thanks for the quick response @fredzzhang
I am running Python 3.7.9 as well, but I'm on Pop!_OS 20.04.

@fredzzhang
Owner

The OS could be the reason. I had the same issue when I was running on Ubuntu 20.04 LTS.

See if disabling the multiprocessing works. Start with the eval_detections.py script, since this one runs really fast. Make the following change:

@@ -28,7 +28,7 @@ def compute_map(
     num_pairs_object = torch.zeros(80)
     associate = BoxAssociation(min_iou=min_iou)
     meter = DetectionAPMeter(
-        80, algorithm='INT', nproc=10
+        80, algorithm='INT', nproc=1
     )
     # Skip images without valid human-object pairs
     valid_idx = dataset._idx

Specifying nproc=1 internally disables multiprocessing. If the process no longer hangs, make the same change in the training script. Go to spatio-attentive-graphs/utils.py, where the learning engine is defined, and make the following change:

@@ -172,7 +172,7 @@ class CustomisedLE(LearningEngine):
         self.num_classes = num_classes
 
     def _on_start(self):
-        self.meter = DetectionAPMeter(self.num_classes, algorithm='11P')
+        self.meter = DetectionAPMeter(self.num_classes, algorithm='11P', nproc=1)
         self.timer = HandyTimer(maxlen=2)
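
For context, the reason nproc=1 avoids the hang is that the per-class AP computation then runs in the current process rather than in a pool of workers. A rough sketch of the pattern (not the actual pocket implementation; ap_for_class and the data layout are made up for illustration):

import multiprocessing

def ap_for_class(args):
    # Hypothetical per-class AP computation; placeholder maths only.
    scores, labels = args
    return sum(labels) / max(len(labels), 1)

def compute_all_ap(per_class_data, nproc=1):
    # nproc=1: do everything in the current process, so no worker
    # processes are created and there is nothing to hang on.
    if nproc == 1:
        return [ap_for_class(d) for d in per_class_data]
    # nproc>1: farm the per-class work out to a process pool, which is
    # the code path that hangs on some systems.
    with multiprocessing.Pool(nproc) as pool:
        return pool.map(ap_for_class, per_class_data)

if __name__ == '__main__':
    print(compute_all_ap([([0.9, 0.1], [1, 0]), ([0.8], [1])], nproc=1))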

Let me know if this solves the issue.

Cheers,
Fred

@vjsrinivas
Author

Thank you for your solution. It seems to have worked!

I was also working on a quick solution for this, and got the multiprocessing to work by modifying meters.py in pocket as follows:

...
"""
The Australian National University
Australian Centre for Robotic Vision
"""
import multiprocessing
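# Force the "spawn" start method globally, so pool workers start in
# fresh interpreters instead of being forked from the parent process.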
multiprocessing.set_start_method("spawn", force=True)

import time
...

and replacing the pool creation in compute_ap, i.e.

    with multiprocessing.Pool(nproc) as pool:

with

    with multiprocessing.get_context("spawn").Pool(nproc) as pool:

(roughly as sketched below).

I have no idea whether this fix will cause problems on other distros, though. I used this as reference.
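
Roughly, the resulting pool usage looks like this (a standalone sketch with a made-up worker function, not the actual compute_ap internals from pocket):

import multiprocessing

def _ap_worker(args):
    # Hypothetical per-class AP computation; placeholder maths only.
    scores, labels = args
    return sum(labels) / max(len(labels), 1)

def compute_ap_spawn(per_class_data, nproc=10):
    # A "spawn" context starts fresh interpreter processes instead of
    # forking the parent, which is what avoided the hang for me.
    with multiprocessing.get_context("spawn").Pool(nproc) as pool:
        return pool.map(_ap_worker, per_class_data)

if __name__ == '__main__':
    # With "spawn", the worker must be importable from the module's top
    # level, and the entry point needs the __main__ guard.
    print(compute_ap_spawn([([0.9], [1]), ([0.2], [0])], nproc=2))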

@fredzzhang
Owner

Thank you very much for the reference! I'll update meters.py to override the method that starts the subprocesses.

Fred
