-
Hi, I trained the same model as the original and only changed the dataset. When I test on GPU, my model's test time is similar to the original's; however, on CPU my model's test time is much longer than the original's. Why?
Replies: 18 comments
-
Hey, this is the MXNet Label Bot.
-
@mxnet-label-bot add [Question]
-
@xianyujie did you install the mxnet-cuXXmkl package? http://mxnet.incubator.apache.org/versions/master/faq/perf.html#intel-cpu
-
@pengzhao-intel the MKL build brings great performance to both models, but the issue still exists.
-
Could you share the log or a reproducible case with us?
-
Here, I tested the run time of the different network layers on one image, and I found that the problem occurs in most of the conv2_weight layers.
-
I can't find much useful information in the log.
-
@pengzhao-intel Can you find out what the problem is? The shape of the input image is (1, 3, 112, 112).
-
That makes sense, because your input image size is about 10X larger in both the H and W directions (from 12 to 112). I suggest you switch to the MKLDNN build as a starting point.
-
@pengzhao-intel I think you misunderstood the result. Take a look at the following results: different inputs have a great influence on the run time of the convolution layer. What could be the reason for this? With the same image as input, I get the outputs (pre_output1, pre_output2) from the stage1_unit1_relu1 layer.
-
Got it. It's interesting that the run times are so different. Are the results of your models correct?
-
@pengzhao-intel Yeah, I've tested it many times. Here is my test file link: |
-
I've saved pre_output1, pre_output2, conv1_weight, and conv2_weight into the param file, and reloaded them to test.
-
Thanks for the data; I can reproduce your issue now. Debugging now, and I will get back to you soon.
-
After the analysis, I think the problem is that the input data underflows (it contains denormal values), so the computation is very slow.
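For context (an addition, not from the thread itself): underflowed values are subnormal (denormal) floats, which many CPUs handle in slow microcode paths instead of the fast FPU path. A minimal NumPy sketch for checking whether a tensor contains such values; the helper name and the example array are illustrative assumptions:

```python
import numpy as np

def count_denormals(arr):
    """Count subnormal (denormal) values in an array.

    A float32 value is subnormal when it is nonzero but its
    magnitude is below the smallest normal float32 (~1.18e-38).
    """
    a = np.asarray(arr, dtype=np.float32)
    tiny = np.finfo(np.float32).tiny  # smallest positive *normal* float32
    return int(np.count_nonzero((a != 0) & (np.abs(a) < tiny)))

# Example: only 1e-40 is subnormal in float32
x = np.array([1.0, 1e-40, 0.0, -2.5], dtype=np.float32)
print(count_denormals(x))  # prints 1
```

If a layer's output reports a large count here, the downstream convolution is a candidate for the slowdown described in this thread.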
-
Many thanks, but the input data is obtained from the real data through many layers, and the results for all the real pictures are the same as above. So perhaps the very small calculation results could be set directly to 0?
-
Makes sense. I will contact the MKL-DNN team and see if any improvement can be done. In the short term, please try setting the very small values to 0 :(
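The suggested workaround can be sketched in plain NumPy. This is an illustrative helper (the function name and the choice of threshold are assumptions), not code from the thread's actual test file:

```python
import numpy as np

def flush_denormals_to_zero(arr):
    """Return a float32 copy with subnormal values replaced by 0.

    Uses the smallest normal float32 as the cutoff, so exact zeros
    and all normal values pass through unchanged.
    """
    a = np.array(arr, dtype=np.float32, copy=True)
    tiny = np.finfo(np.float32).tiny
    a[np.abs(a) < tiny] = 0.0  # zeros stay zero; subnormals are flushed
    return a

y = flush_denormals_to_zero([1e-40, 1.0, 0.0])
print(y)  # the subnormal 1e-40 becomes 0.0
```

Applying such a flush to the intermediate activations before the slow convolution is what the poster reports doing in the next comment; on x86 the same effect can also be achieved in hardware via the FTZ/DAZ flags.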
-
After I set the minimum values to 0, the test results of the two models are the same. With the same image as input, I get the outputs (pre_output1, pre_output2) from the stage1_unit1_relu1 layer of the two models, then test the time of the Conv layer, with pre_output1 as the input of my model and pre_output2 as the input of the original model:
(0.004521, 0.004521)
With pre_output1 as the input of both models, the time of the Conv layer:
(0.004198, 0.004238)
With pre_output2 as the input of both models, the time of the Conv layer:
(0.004196, 0.004184)
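The paired timings above read like averaged wall-clock measurements of a single layer. A minimal sketch of such a timing harness, using a NumPy matmul as a stand-in workload (the warm-up count, repeat count, and array sizes are illustrative assumptions, not the thread's actual setup):

```python
import time
import numpy as np

def time_op(fn, *args, warmup=2, repeats=10):
    """Average wall-clock seconds per call of fn(*args).

    Warm-up runs are discarded so one-time costs (allocation,
    kernel selection, caches) do not skew the average.
    """
    for _ in range(warmup):
        fn(*args)
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    return (time.perf_counter() - start) / repeats

a = np.random.rand(64, 64).astype(np.float32)
b = np.random.rand(64, 64).astype(np.float32)
t = time_op(np.dot, a, b)
print(f"avg matmul time: {t:.6f}s")
```

Timing the same layer with both inputs, as done above, is a clean way to isolate a data-dependent slowdown: identical code and weights, only the input tensor differs.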