
[Enhance] Continue to speed up training. #6974

Merged: 2 commits merged into open-mmlab:dev from save_life_v2 on Jan 17, 2022

Conversation

RangiLyu (Member) commented on Jan 10, 2022

Motivation

This PR continues the work in #6867.

In addition to limiting OpenCV's internal threading, it sets the OMP and MKL thread counts to 1 when they are not already set in the environment, and switches the multiprocessing start method from spawn to fork to reduce startup time.
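For reference, here is a minimal sketch of the kind of setup described above. The helper name and default values are illustrative only; the actual mmdetection implementation and its config keys may differ.

```python
import os
import platform

import cv2
import torch.multiprocessing as mp


def limit_worker_threads(mp_start_method='fork', opencv_num_threads=0):
    """Restrict per-process threading and pick a faster start method (illustrative)."""
    # fork is only available on POSIX systems; keep the platform default elsewhere.
    if platform.system() != 'Windows':
        mp.set_start_method(mp_start_method, force=True)

    # Disable OpenCV's internal thread pool so that dataloader workers do not
    # each spawn their own threads and oversubscribe the CPU.
    cv2.setNumThreads(opencv_num_threads)

    # Cap OMP and MKL to one thread each unless the user set them explicitly.
    os.environ.setdefault('OMP_NUM_THREADS', '1')
    os.environ.setdefault('MKL_NUM_THREADS', '1')
```

Using fork also avoids re-importing and re-initializing Python in each worker, which is why the start time drops from minutes to seconds in the tables below.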

Comparison

V100 x8

system info:

sys.platform: linux
Python: 3.8.11 (default, Aug  3 2021, 15:09:35) [GCC 7.5.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: Tesla V100-SXM2-32GB
CUDA_HOME: /mnt/lustre/share/cuda-11.2
NVCC: Build cuda_11.2.r11.2/compiler.29618528_0
GCC: gcc (GCC) 5.4.0
PyTorch: 1.9.0
TorchVision: 0.10.0
OpenCV: 4.5.3
MMCV: 1.4.0
MMCV Compiler: GCC 5.4
MMCV CUDA Compiler: 11.2
MMDetection: 2.20.0+ff9bc39

YOLOX-s

launcher: slurm
workers per GPU: 8
file client: s3

| OMP & MKL threads | OpenCV threads | MP start method | Start time | data_time | time | ETA @ epoch 3, iter 1400 |
| --- | --- | --- | --- | --- | --- | --- |
| default | default | spawn | 10 min | 0.030 | 0.334 | 2 days, 11:10:22 |
| default | 0 | spawn | 10 min | 0.034 | 0.275 | 2 days, 3:03:15 |
| 1 | default | spawn | 9 min | 0.027 | 0.297 | 2 days, 5:27:36 |
| default | default | fork | 36 s | 0.027 | 0.327 | 2 days, 1:39:53 |
| 1 | 0 | fork | 24 s | 0.025 | 0.268 | 1 day, 18:31:02 |

Faster R-CNN

launcher: slurm
workers per GPU: 2
file client: s3

| OMP & MKL threads | OpenCV threads | MP start method | Start time | data_time | time | ETA @ epoch 1, iter 4000 |
| --- | --- | --- | --- | --- | --- | --- |
| default | default | spawn | 1 min 19 s | 0.011 | 0.483 | 11:40:39 |
| 1 | 0 | fork | 12 s | 0.010 | 0.232 | 5:32:23 |

A100 x8

sys.platform: linux
Python: 3.7.11 (default, Jul 27 2021, 14:32:16) [GCC 7.5.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: A100-SXM-80GB
CUDA_HOME: /usr/local/cuda
NVCC: Build cuda_11.2.r11.2/compiler.29618528_0
GCC: gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
PyTorch: 1.10.1
TorchVision: 0.11.2
OpenCV: 4.5.5
MMCV: 1.4.2
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 11.1
MMDetection: 2.20.0+ff9bc39

YOLOX-s

launcher: torch
workers per GPU: 8
file client: hard disk

| OMP & MKL threads | OpenCV threads | MP start method | Start time | data_time | time | ETA @ epoch 3, iter 1400 |
| --- | --- | --- | --- | --- | --- | --- |
| num CPU cores | default | spawn | 7 min | 0.039 | 0.317 | 2 days, 5:46:18 |
| 1 | default | spawn | 5 min | 0.036 | 0.308 | 2 days, 3:07:02 |
| 1 | default | fork | 10 s | 0.035 | 0.298 | 2 days, 1:57:50 |
| 1 | 0 | fork | 10 s | 0.016 | 0.189 | 1 day, 5:08:11 |

RangiLyu added the enhancement label on Jan 10, 2022
RangiLyu self-assigned this on Jan 10, 2022
Review comment on tools/train.py (outdated, resolved)
codecov bot commented on Jan 14, 2022

Codecov Report

Merging #6974 (114d089) into dev (6f2e6d1) will decrease coverage by 0.00%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##              dev    #6974      +/-   ##
==========================================
- Coverage   62.34%   62.34%   -0.01%     
==========================================
  Files         327      327              
  Lines       26129    26129              
  Branches     4424     4424              
==========================================
- Hits        16290    16289       -1     
- Misses       8970     8971       +1     
  Partials      869      869              
| Flag | Coverage Δ |
| --- | --- |
| unittests | 62.32% <ø> (-0.01%) ⬇️ |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ |
| --- | --- |
| mmdet/models/dense_heads/base_dense_head.py | 88.70% <0.00%> (-1.70%) ⬇️ |
| mmdet/core/bbox/samplers/random_sampler.py | 80.55% <0.00%> (+5.55%) ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

shinya7y (Contributor) commented:

Does setup_multi_processes also work for tools/test.py?

RangiLyu (Member, Author) replied:

> Does setup_multi_processes also work for tools/test.py?

Yes, it works. I'll add it to both train.py and test.py.
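A hedged sketch of how such a helper could be wired into the entry points, reusing the limit_worker_threads sketch from the description above; the real function and config keys in mmdetection may differ (the settings were later made configurable, see the "set in cfg" commits below).

```python
def main(cfg):
    # Apply the thread and start-method settings before datasets, dataloaders
    # or the runner are built, so they take effect in every worker process.
    limit_worker_threads(
        mp_start_method=cfg.get('mp_start_method', 'fork'),
        opencv_num_threads=cfg.get('opencv_num_threads', 0),
    )
    # ... build the model and datasets, then run training or testing as usual ...
```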

ZwwWayne merged commit 4b87ddc into open-mmlab:dev on Jan 17, 2022
chhluo pushed a commit to chhluo/mmdetection that referenced this pull request Feb 21, 2022
* [Enhance] Speed up training time.

* set in cfg
ZwwWayne pushed a commit that referenced this pull request Jul 18, 2022
* [Enhance] Speed up training time.

* set in cfg
ZwwWayne pushed a commit to ZwwWayne/mmdetection that referenced this pull request Jul 19, 2022
* [Enhance] Speed up training time.

* set in cfg
RangiLyu deleted the save_life_v2 branch on December 17, 2022