OneDNN hardswish integration #30211

Merged: 19 commits into PaddlePaddle:develop on Feb 25, 2021
Conversation

jakpiase (Contributor) commented Jan 7, 2021

PR types

New features

PR changes

OPs

Describe

Added support for oneDNN hardswish activation function. Conv + activation and fc + activation fuse passes can now also fuse with hardswish activation.
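
For reference, here is a minimal sketch of the activation being fused. This is the common hard_swish definition (with the usual defaults threshold = 6, scale = 6, offset = 3), not Paddle's or oneDNN's kernel code:

#include <algorithm>

// hard_swish(x) = x * min(max(x + offset, 0), threshold) / scale
float hard_swish(float x, float threshold = 6.f, float scale = 6.f,
                 float offset = 3.f) {
  return x * std::min(std::max(x + offset, 0.f), threshold) / scale;
}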

Profiled on Intel(R) Xeon(R) Gold 6348H CPU @ 2.30GHz
warmup = 10, repeat = 100

  • CPU native: [profiler output screenshot]

  • oneDNN without hardswish: [profiler output screenshot]

  • oneDNN with hardswish: [profiler output screenshot]

Total times:
oneDNN without hardswish / oneDNN with hardswish = 1.19
CPU native / oneDNN with hardswish = 2.76

paddle-bot-old bot commented Jan 7, 2021

Thanks for your contribution!
Please wait for the result of CI first. See the Paddle CI Manual for details.

jakpiase (Contributor, Author) commented Jan 7, 2021

@jczaja Could you please take a look?

jakpiase marked this pull request as ready for review on January 26, 2021, 17:39.
jakpiase changed the title from "[DO NOT MERGE] OneDNN hardswish integration" to "OneDNN hardswish integration" on Jan 26, 2021.
lidanqing-intel (Contributor) commented:

@jakpiase Please do profiling with the following config, thanks:

void PrepareConfig(AnalysisConfig *config, int threads) {
  ...
  // Enable oneDNN (MKL-DNN) kernels for CPU inference.
  config->EnableMKLDNN();
  // Run the interpolate ops through oneDNN as well.
  auto pass_builder = config->pass_builder();
  pass_builder->AppendPass("interpolate_mkldnn_pass");
}
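
A hedged sketch of how such a config is typically consumed end to end with the Paddle C++ inference API; the model path and thread count below are placeholders, not values from this PR:

#include "paddle_inference_api.h"

int main() {
  paddle::AnalysisConfig config;
  config.SetModel("./model_dir");        // placeholder model path
  config.SetCpuMathLibraryNumThreads(4); // placeholder thread count
  config.EnableMKLDNN();
  config.pass_builder()->AppendPass("interpolate_mkldnn_pass");

  // Graph passes, including the conv + hardswish fuse, run when the
  // predictor is created.
  auto predictor = paddle::CreatePaddlePredictor(config);
  return 0;
}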

lidanqing-intel (Contributor) commented Jan 27, 2021

I profiled on my i9 machine, warmup = 10, repeat = 100

  • CPU Native
Total time: 29079.5
  Computation time       Total: 28937.1     Ratio: 99.5103%
  Framework overhead     Total: 142.393     Ratio: 0.48967%
-------------------------     GpuMemCpy Summary     -------------------------
GpuMemcpy                Calls: 0           Total: 0           Ratio: 0%
-------------------------       Event Summary       -------------------------
Event                            Calls       Total       Min.        Max.        Ave.        Ratio.
thread0::conv2d                  4510        13317.5     0.116527    67.3944     2.95288     0.457968
thread0::depthwise_conv2d        1650        8328.15     0.78423     12.7155     5.04736     0.286392
thread0::elementwise_add         5280        2598.19     0.026334    7.5885      0.492081    0.0893477
thread0::nearest_interp          660         1365.26     0.276725    11.2304     2.06858     0.0469493
thread0::conv2d_transpose        220         1134.83     2.17559     8.3012      5.15834     0.0390252
thread0::relu                    1540        1097.54     0.177974    6.24986     0.71269     0.0377428
thread0::hard_swish              2200        471.471     0.080303    1.49519     0.214305    0.0162131
thread0::batch_norm              1650        406.157     0.096041    1.57076     0.246156    0.0139671
thread0::concat                  110         258.622     2.32048     2.42123     2.35111     0.00889361
thread0::sigmoid                 110         71.0762     0.619227    0.693527    0.646148    0.0024442
thread0::scale                   110         29.1552     0.250245    0.313208    0.265047    0.0010026
thread0::load_combine            1           1.56114     1.56114     1.56114     1.56114     5.36851e-05
  • oneDNN without hard_swish
Total time: 15882.3
  Computation time       Total: 14388.5     Ratio: 90.5948%
  Framework overhead     Total: 1493.76     Ratio: 9.40523%
-------------------------     GpuMemCpy Summary     -------------------------
GpuMemcpy                Calls: 0           Total: 0           Ratio: 0%
-------------------------       Event Summary       -------------------------
Event                            Calls       Total       Min.        Max.        Ave.        Ratio.
thread0::conv2d                  6160        7719.16     0.147497    30.6164     1.25311     0.486023
  int_reorder                    2476        891.356     0.002698    4.12293     0.030945    0.115473*
thread0::conv2d_transpose        220         3396.88     11.9331     34.8972     15.4403     0.213878
  int_reorder                    2           0.211463    0.014027    0.197436    0.197436    6.22522e-05*
thread0::sigmoid                 110         1372.17     12.2731     20.0907     12.4743     0.0863962
thread0::hard_swish              2200        1104.43     0.170863    9.0632      0.502012    0.0695382
  ext_reorder                    2200        525.534     0.066959    4.2296      4.2296      0.475844*
thread0::scale                   110         709.301     6.35583     6.61771     6.44819     0.0446599
  ext_reorder                    110         357.398     3.18237     3.34164     3.29843     0.503874*
thread0::relu                    110         443.658     3.94713     7.9087      4.03325     0.0279341
thread0::elementwise_add         330         432.945     0.178199    5.61183     1.31196     0.0272596
thread0::concat                  110         391.932     3.49597     6.85412     3.56302     0.0246773
thread0::nearest_interp          660         310.28      0.103491    5.83737     0.470121    0.0195362
thread0::load_combine            1           1.54059     1.54059     1.54059     1.54059     9.70004e-05
  • oneDNN with hard_swish
Total time: 11505.1
  Computation time       Total: 11058.6     Ratio: 96.1186%
  Framework overhead     Total: 446.56      Ratio: 3.88139%
-------------------------     GpuMemCpy Summary     -------------------------
GpuMemcpy                Calls: 0           Total: 0           Ratio: 0%
-------------------------       Event Summary       -------------------------
Event                            Calls       Total       Min.        Max.        Ave.        Ratio.
thread0::conv2d                  6160        5998.91     0.160963    20.5096     0.973849    0.521411
  int_reorder                    166         136.692     0.004007    6.61985     6.61985     0.0227862*
thread0::conv2d_transpose        220         2864.84     11.1971     32.519      13.022      0.249005
  int_reorder                    112         115.58      0.018428    1.95845     1.95845     0.0403442*
thread0::sigmoid                 110         1015.53     9.02599     18.5955     9.2321      0.0882675
thread0::elementwise_add         330         429.875     0.177166    5.57589     1.30265     0.0373637
thread0::relu                    110         374.766     3.33516     7.332       3.40697     0.0325738
thread0::scale                   110         360.637     3.23246     3.97534     3.27852     0.0313457
  ext_reorder                    110         318.88      2.85873     3.06138     3.06138     0.884211*
thread0::nearest_interp          660         232.959     0.102815    5.89546     0.352969    0.0202483
thread0::concat                  110         226.047     2.00791     4.95641     2.05498     0.0196475
thread0::load_combine            1           1.58235     1.58235     1.58235     1.58235     1.37535e-04

Performance improvement:
This PR (oneDNN without hard_swish / oneDNN with hard_swish, computation time): 14388 / 11058 = 1.30X
CPU native / oneDNN with hard_swish = 2.6X

paddle-bot-old bot commented Feb 5, 2021

Sorry to inform you that the CIs for d4e524e passed more than 7 days ago. To prevent PR conflicts, you need to re-run all CIs manually.

juncaipeng (Contributor) commented:

@lidanqing-intel Please verify the accuracy with OneDNN hardswish enabled.

jczaja (Contributor) left a comment:

LGTM

jczaja (Contributor) commented Feb 24, 2021

@luotao1 Could you please start your review?

luotao1 merged commit 2f11653 into PaddlePaddle:develop on Feb 25, 2021.
lidanqing-intel (Contributor) commented Feb 25, 2021

@jakpiase Juncai asked to cherry-pick this PR to release/2.0.

lidanqing-intel (Contributor) commented:

@jakpiase
Update: Since cherry-picking this PR requires upgrading to oneDNN 2.2, and release/2.0 is far from oneDNN 2.2, do not cherry-pick.

lidanqing-intel pushed a commit to lidanqing-intel/Paddle that referenced this pull request Mar 25, 2021
Superjomn pushed a commit that referenced this pull request Mar 31, 2021
* OneDNN hardswish integration (#30211)

* keep only conv + hardswish in this PR

Co-authored-by: jakpiase <[email protected]>