Investigate performance regression in 0.4 #510

chfast · 2020-08-24T21:22:42Z

This partly reverts 67fd4c8#diff-47ed36245217950f22b14dcb2472adbaR60

Benchmark                                                            Time             CPU      Time Old      Time New       CPU Old       CPU New
fizzy/execute/blake2b/512_bytes_rounds_1_mean                     -0.1249         -0.1249            88            77            88            77
fizzy/execute/blake2b/512_bytes_rounds_16_mean                    -0.1254         -0.1254          1333          1166          1333          1166
fizzy/execute/ecpairing/onepoint_mean                             -0.1190         -0.1190        440627        388179        440631        388183
fizzy/execute/keccak256/512_bytes_rounds_1_mean                   -0.0935         -0.0935           104            94           104            94
fizzy/execute/keccak256/512_bytes_rounds_16_mean                  -0.1024         -0.1024          1526          1370          1526          1370
fizzy/execute/memset/256_bytes_mean                               -0.1131         -0.1131             7             6             7             6
fizzy/execute/memset/60000_bytes_mean                             -0.1102         -0.1102          1576          1402          1576          1403
fizzy/execute/mul256_opt0/input0_mean                             -0.1387         -0.1387            29            25            29            25
fizzy/execute/mul256_opt0/input1_mean                             -0.1375         -0.1375            29            25            29            25
fizzy/execute/ramanujan_pi/33_runs_mean                           -0.0849         -0.0849           131           120           131           120
fizzy/execute/sha1/512_bytes_rounds_1_mean                        -0.0964         -0.0964            93            84            93            84
fizzy/execute/sha1/512_bytes_rounds_16_mean                       -0.0919         -0.0919          1299          1179          1299          1179
fizzy/execute/sha256/512_bytes_rounds_1_mean                      -0.1047         -0.1047            95            85            95            85
fizzy/execute/sha256/512_bytes_rounds_16_mean                     -0.1063         -0.1063          1304          1165          1304          1165
fizzy/execute/taylor_pi/pi_1000000_runs_mean                      -0.0375         -0.0375         42757         41154         42758         41154
fizzy/execute/micro/eli_interpreter/halt_mean                     -0.0371         -0.0371             0             0             0             0
fizzy/execute/micro/eli_interpreter/exec105_mean                  -0.0951         -0.0951             5             4             5             4
fizzy/execute/micro/factorial/10_mean                             -0.0405         -0.0405             0             0             0             0
fizzy/execute/micro/factorial/20_mean                             -0.0413         -0.0413             1             1             1             1
fizzy/execute/micro/fibonacci/24_mean                             -0.0382         -0.0382          7507          7221          7508          7221
fizzy/execute/micro/host_adler32/1_mean                           -0.0133         -0.0133             0             0             0             0
fizzy/execute/micro/host_adler32/100_mean                         -0.0006         -0.0006             3             3             3             3
fizzy/execute/micro/host_adler32/1000_mean                        +0.0058         +0.0058            29            29            29            29
fizzy/execute/micro/spinner/1_mean                                -0.0340         -0.0340             0             0             0             0
fizzy/execute/micro/spinner/1000_mean                             -0.1275         -0.1275            10             9            10             9

codecov · 2020-08-24T21:22:45Z

Codecov Report

Merging #510 into master will not change coverage.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master     #510   +/-   ##
=======================================
  Coverage   99.69%   99.69%           
=======================================
  Files          54       54           
  Lines       17183    17183           
=======================================
  Hits        17131    17131           
  Misses         52       52

axic · 2020-08-24T22:07:06Z

test/utils/fizzy_engine.cpp

-    const auto ret = fizzy::test::adler32(
-        bytes_view{*instance.memory}.substr(args[0].as<uint32_t>(), args[1].as<uint32_t>()));
+    const auto ret =
+        fizzy::test::adler32(bytes_view{*instance.memory}.substr(args[0].i64, args[1].i64));


This should have zero effect on the benchmarks, except for micro/host_adler32.
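To make the reviewed change easier to follow, here is a minimal sketch of the two argument-access forms from the diff. The `Value` union below is a hypothetical stand-in (only the names `i64` and `as<T>()` appear in the diff), `as<T>()` is assumed to be a plain narrowing cast, and `std::string_view` replaces `bytes_view`. Under those assumptions both variants return the same slice whenever the arguments fit in 32 bits, so any difference can only come from the generated code.

```cpp
#include <cassert>
#include <cstdint>
#include <string_view>

// Hypothetical stand-in for the interpreter's value type. Only the member
// name `i64` and the accessor name `as<T>()` come from the diff above.
union Value
{
    std::uint64_t i64;

    template <typename T>
    constexpr T as() const noexcept
    {
        return static_cast<T>(i64);  // Assumed behaviour: plain narrowing.
    }
};

// Variant from the removed lines: both operands are narrowed to 32 bits first.
inline std::string_view slice_u32(std::string_view memory, const Value* args)
{
    assert(args != nullptr);
    return memory.substr(args[0].as<std::uint32_t>(), args[1].as<std::uint32_t>());
}

// Variant from the added lines: the full 64-bit members are passed through.
inline std::string_view slice_u64(std::string_view memory, const Value* args)
{
    assert(args != nullptr);
    return memory.substr(args[0].i64, args[1].i64);
}
```

For example, with `const Value args[] = {{6}, {5}};` both `slice_u32("hello world", args)` and `slice_u64("hello world", args)` yield `"world"`.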

chfast · 2020-08-25T07:28:22Z

--- asm.bad	2020-08-25 09:24:08.712147961 +0200
+++ asm.good	2020-08-25 09:23:36.355214339 +0200
@@ -1,4 +1,4 @@
-      3e:	8b 48 08             	mov    0x8(%rax),%ecx
-      41:	48 8b 7a 08          	mov    0x8(%rdx),%rdi
-      45:	48 8b 32             	mov    (%rdx),%rsi
-      48:	8b 10                	mov    (%rax),%edx
+      3e:	48 8b 48 08          	mov    0x8(%rax),%rcx
+      42:	48 8b 7a 08          	mov    0x8(%rdx),%rdi
+      46:	48 8b 32             	mov    (%rdx),%rsi
+      49:	48 8b 10             	mov    (%rax),%rdx

The only functional difference is that %ecx/%edx become %rcx/%rdx in two mov instructions. The instructions for the wider registers are each encoded with one additional byte (the 0x48 REX.W prefix visible in the diff). These two bytes push the following function by ~16 bytes. This seems to be the only explanation for the performance difference.
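How two extra bytes can turn into a ~16 byte shift is easier to see with the alignment arithmetic written out. This is only a sketch: the 16-byte function alignment and the offsets are assumptions chosen for illustration, not values taken from the binaries compared here.

```cpp
#include <cassert>
#include <cstdint>

// Round `offset` up to the next multiple of `alignment` (must be a power of two).
constexpr std::uint64_t align_up(std::uint64_t offset, std::uint64_t alignment) noexcept
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (offset + alignment - 1) & ~(alignment - 1);
}

// Start offset of the next function, given the (exclusive) end of the previous one.
constexpr std::uint64_t next_function_start(std::uint64_t prev_end) noexcept
{
    constexpr std::uint64_t function_alignment = 16;  // Assumption, not from the thread.
    return align_up(prev_end, function_alignment);
}

// Hypothetical offsets: the shorter encoding ends exactly on a 16-byte boundary,
// the longer one spills two bytes (two REX.W prefixes) past it.
constexpr std::uint64_t end_short = 0x100;
constexpr std::uint64_t end_long = end_short + 2;

static_assert(next_function_start(end_short) == 0x100);
static_assert(next_function_start(end_long) == 0x110);
```

If the previous function ends elsewhere within its 16-byte block, the same two bytes may not move the next function at all, which would fit how sensitive the results are to unrelated changes.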

chfast · 2020-08-25T07:45:47Z

And these are the benchmarks on a Skylake CPU, after the "fix":

Benchmark                                                         Time             CPU      Time Old      Time New       CPU Old       CPU New
fizzy/execute/sha1/512_bytes_rounds_1_mean                     +0.1875         +0.1875           185           220           185           220
fizzy/execute/sha1/512_bytes_rounds_16_mean                    +0.1764         +0.1764          2580          3035          2580          3035
fizzy/execute/sha256/512_bytes_rounds_1_mean                   +0.1351         +0.1352           201           228           201           228
fizzy/execute/sha256/512_bytes_rounds_16_mean                  +0.1263         +0.1264          2775          3126          2775          3126

chfast · 2020-08-25T08:21:32Z

So I tried LTO builds for both Haswell (the original source of the fix) and Skylake. With LTO, the benchmark results before and after the "fix" are the same (i.e. stable), but the absolute values are ~6% slower than in the non-LTO build.

I think I will use LTO builds for performance comparisons to get a more stable environment.

chfast · 2020-08-25T12:25:49Z

So the LTO builds make the fix obsolete, but there is still a performance regression of up to 10% on Haswell.

0.3:

 Performance counter stats for 'bin/fizzy-bench-0.3 ../../test/benchmarks --benchmark_min_time=0.5 --benchmark_filter=fizzy/execute/sha256/512_bytes_rounds_16' (10 runs):

            835,21 msec task-clock                #    0,998 CPUs utilized            ( +-  0,21% )
                 2      context-switches          #    0,002 K/sec                    ( +- 29,15% )
                 0      cpu-migrations            #    0,000 K/sec                  
               451      page-faults               #    0,540 K/sec                    ( +-  0,09% )
     3 336 801 786      cycles                    #    3,995 GHz                      ( +-  0,21% )  (30,74%)
     8 789 190 078      instructions              #    2,63  insn per cycle           ( +-  0,15% )  (38,55%)
     1 956 881 345      branches                  # 2342,977 M/sec                    ( +-  0,16% )  (38,60%)
        23 380 000      branch-misses             #    1,19% of all branches          ( +-  0,39% )  (38,65%)
     2 395 030 936      L1-dcache-loads           # 2867,574 M/sec                    ( +-  0,19% )  (37,54%)
           246 144      L1-dcache-load-misses     #    0,01% of all L1-dcache hits    ( +- 21,98% )  (15,37%)
            11 214      LLC-loads                 #    0,013 M/sec                    ( +- 34,16% )  (15,33%)
               305      LLC-load-misses           #    2,72% of all LL-cache hits     ( +- 79,03% )  (22,99%)
   <not supported>      L1-icache-loads                                             
            51 236      L1-icache-load-misses                                         ( +- 18,51% )  (30,65%)
     2 394 291 505      dTLB-loads                # 2866,689 M/sec                    ( +-  0,23% )  (29,69%)
            61 804      dTLB-load-misses          #    0,00% of all dTLB cache hits   ( +- 22,60% )  (15,33%)
             1 757      iTLB-loads                #    0,002 M/sec                    ( +- 40,63% )  (15,33%)
            12 300      iTLB-load-misses          #  700,09% of all iTLB cache hits   ( +- 64,69% )  (22,95%)
   <not supported>      L1-dcache-prefetches                                        
   <not supported>      L1-dcache-prefetch-misses                                   

           0,83700 +- 0,00253 seconds time elapsed  ( +-  0,30% )

0.4:

 Performance counter stats for 'bin/fizzy-bench-0.4 ../../test/benchmarks --benchmark_min_time=0.5 --benchmark_filter=fizzy/execute/sha256/512_bytes_rounds_16' (10 runs):

            850,14 msec task-clock                #    0,997 CPUs utilized            ( +-  0,17% )
                 2      context-switches          #    0,003 K/sec                    ( +- 21,55% )
                 0      cpu-migrations            #    0,000 K/sec
               452      page-faults               #    0,532 K/sec                    ( +-  0,06% )
     3 396 397 508      cycles                    #    3,995 GHz                      ( +-  0,17% )  (30,39%)
     8 152 994 133      instructions              #    2,40  insn per cycle           ( +-  0,17% )  (38,38%)
     1 815 337 217      branches                  # 2135,350 M/sec                    ( +-  0,17% )  (38,85%)
        24 674 522      branch-misses             #    1,36% of all branches          ( +-  1,27% )  (39,33%)
     2 221 661 239      L1-dcache-loads           # 2613,301 M/sec                    ( +-  0,17% )  (38,37%)
           308 740      L1-dcache-load-misses     #    0,01% of all L1-dcache hits    ( +- 16,53% )  (15,42%)
            19 707      LLC-loads                 #    0,023 M/sec                    ( +- 35,50% )  (15,21%)
             1 854      LLC-load-misses           #    9,41% of all LL-cache hits     ( +- 94,26% )  (22,77%)
   <not supported>      L1-icache-loads
            44 572      L1-icache-load-misses                                         ( +- 18,91% )  (30,30%)
     2 220 953 935      dTLB-loads                # 2612,469 M/sec                    ( +-  0,18% )  (29,26%)
           268 863      dTLB-load-misses          #    0,01% of all dTLB cache hits   ( +-  2,53% )  (15,06%)
               847      iTLB-loads                #    0,996 K/sec                    ( +- 41,12% )  (15,06%)
            22 657      iTLB-load-misses          # 2674,97% of all iTLB cache hits   ( +- 48,67% )  (22,58%)
   <not supported>      L1-dcache-prefetches
   <not supported>      L1-dcache-prefetch-misses

           0,85228 +- 0,00230 seconds time elapsed  ( +-  0,27% )

The counter that stands out is LLC-load-misses, which counts how many times the CPU must go to RAM to get the data it needs. My interpretation is that this happens much more often in 0.4 because of the increased size of the execute() function.

The iTLB-load-misses count is also insanely high. Intel suggests using PGO to place hot functions together: https://software.intel.com/content/www/us/en/develop/documentation/vtune-help/top/reference/cpu-metrics-reference/front-end-bound/itlb-overhead.html
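The cycle and instruction counters from the two perf runs above can also be compared directly. The sketch below only redoes the arithmetic on the reported numbers; the struct and function names are made up for illustration. It shows that 0.4 retires roughly 7% fewer instructions than 0.3 yet needs about 1.8% more cycles, so instructions per cycle drop from about 2.63 to 2.40. That would fit a stall-related explanation rather than extra work, although it does not prove one.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstdio>

// Counters copied from the perf output above (sha256/512_bytes_rounds_16, 10 runs).
struct PerfSample
{
    const char* version;
    std::uint64_t cycles;
    std::uint64_t instructions;
};

constexpr PerfSample v03{"0.3", 3'336'801'786, 8'789'190'078};
constexpr PerfSample v04{"0.4", 3'396'397'508, 8'152'994'133};

// Instructions retired per cycle.
inline double ipc(const PerfSample& s) noexcept
{
    assert(s.cycles != 0);
    return static_cast<double>(s.instructions) / static_cast<double>(s.cycles);
}

// Relative change from `before` to `after`, e.g. -0.07 means 7% fewer.
inline double relative_change(std::uint64_t before, std::uint64_t after) noexcept
{
    assert(before != 0);
    return (static_cast<double>(after) - static_cast<double>(before)) /
           static_cast<double>(before);
}

// Prints IPC for both versions and the relative change of both counters.
inline void print_summary()
{
    std::printf("IPC %s: %.2f, IPC %s: %.2f\n", v03.version, ipc(v03), v04.version, ipc(v04));
    std::printf("instructions: %+.1f%%, cycles: %+.1f%%\n",
        100 * relative_change(v03.instructions, v04.instructions),
        100 * relative_change(v03.cycles, v04.cycles));
}
```

Note that LLC-load-misses and iTLB-load-misses carry large run-to-run variation in the output above (roughly ±50% to ±94%), so ratios computed from them are much less reliable than the cycle and instruction counts.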

gumb0 · 2020-09-01T13:20:17Z

Here are my results on GCC 10.2

Without LTO

Benchmark                                                            Time             CPU      Time Old      Time New       CPU Old       CPU New
fizzy/execute/blake2b/512_bytes_rounds_1_mean                     -0.1367         -0.1353           206           178           206           178
fizzy/execute/blake2b/512_bytes_rounds_16_mean                    -0.1329         -0.1326          3107          2694          3107          2695
fizzy/execute/ecpairing/onepoint_mean                             -0.1004         -0.1007       1038819        934491       1038782        934223
fizzy/execute/keccak256/512_bytes_rounds_1_mean                   -0.1199         -0.1199           262           231           262           231
fizzy/execute/keccak256/512_bytes_rounds_16_mean                  -0.1105         -0.1105          3845          3420          3845          3420
fizzy/execute/memset/256_bytes_mean                               -0.1796         -0.1796            17            14            17            14
fizzy/execute/memset/60000_bytes_mean                             -0.1814         -0.1814          3629          2971          3629          2971
fizzy/execute/mul256_opt0/input0_mean                             -0.1811         -0.1811            65            53            65            53
fizzy/execute/mul256_opt0/input1_mean                             -0.1844         -0.1844            65            53            65            53
fizzy/execute/ramanujan_pi/33_runs_mean                           -0.2135         -0.2135           322           253           322           253
fizzy/execute/sha1/512_bytes_rounds_1_mean                        -0.1692         -0.1693           219           182           219           182
fizzy/execute/sha1/512_bytes_rounds_16_mean                       -0.1655         -0.1655          3040          2537          3040          2537
fizzy/execute/sha256/512_bytes_rounds_1_mean                      -0.1475         -0.1475           236           201           236           201
fizzy/execute/sha256/512_bytes_rounds_16_mean                     -0.1490         -0.1490          3242          2759          3242          2759
fizzy/execute/taylor_pi/pi_1000000_runs_mean                      -0.0996         -0.0996         93583         84259         93580         84256
fizzy/execute/micro/eli_interpreter/halt_mean                     -0.0507         -0.0507             0             0             0             0
fizzy/execute/micro/eli_interpreter/exec105_mean                  -0.0949         -0.0949            11            10            11            10
fizzy/execute/micro/factorial/10_mean                             +0.0015         +0.0015             1             1             1             1
fizzy/execute/micro/factorial/20_mean                             -0.0056         -0.0056             2             2             2             2
fizzy/execute/micro/fibonacci/24_mean                             -0.0301         -0.0301         17449         16923         17448         16923
fizzy/execute/micro/host_adler32/1_mean                           -0.0236         -0.0235             0             0             0             0
fizzy/execute/micro/host_adler32/100_mean                         -0.1064         -0.1064             7             6             7             6
fizzy/execute/micro/host_adler32/1000_mean                        -0.1181         -0.1181            71            63            71            63
fizzy/execute/micro/spinner/1_mean                                -0.0018         -0.0018             0             0             0             0
fizzy/execute/micro/spinner/1000_mean                             -0.0728         -0.0727            21            20            21            20

With LTO

Benchmark                                                            Time             CPU      Time Old      Time New       CPU Old       CPU New
fizzy/execute/blake2b/512_bytes_rounds_1_mean                     -0.1302         -0.1301           203           177           203           177
fizzy/execute/blake2b/512_bytes_rounds_16_mean                    -0.1151         -0.1150          3001          2656          3001          2656
fizzy/execute/ecpairing/onepoint_mean                             -0.0932         -0.0932       1031369        935240       1031296        935198
fizzy/execute/keccak256/512_bytes_rounds_1_mean                   +0.0108         +0.0109           244           247           244           247
fizzy/execute/keccak256/512_bytes_rounds_16_mean                  +0.0063         +0.0064          3569          3591          3568          3591
fizzy/execute/memset/256_bytes_mean                               -0.1163         -0.1162            17            15            17            15
fizzy/execute/memset/60000_bytes_mean                             -0.1370         -0.1370          3723          3213          3722          3212
fizzy/execute/mul256_opt0/input0_mean                             -0.1741         -0.1740            63            52            63            52
fizzy/execute/mul256_opt0/input1_mean                             -0.1698         -0.1697            63            52            63            52
fizzy/execute/ramanujan_pi/33_runs_mean                           -0.1576         -0.1575           316           266           316           266
fizzy/execute/sha1/512_bytes_rounds_1_mean                        -0.1425         -0.1425           218           187           218           187
fizzy/execute/sha1/512_bytes_rounds_16_mean                       -0.1444         -0.1443          3026          2589          3026          2589
fizzy/execute/sha256/512_bytes_rounds_1_mean                      -0.1615         -0.1614           243           204           243           204
fizzy/execute/sha256/512_bytes_rounds_16_mean                     -0.1671         -0.1671          3356          2795          3356          2795
fizzy/execute/taylor_pi/pi_1000000_runs_mean                      -0.0988         -0.0988         93472         84233         93463         84228
fizzy/execute/micro/eli_interpreter/halt_mean                     -0.0767         -0.0767             0             0             0             0
fizzy/execute/micro/eli_interpreter/exec105_mean                  -0.1280         -0.1279            11            10            11            10
fizzy/execute/micro/factorial/10_mean                             -0.0053         -0.0052             1             1             1             1
fizzy/execute/micro/factorial/20_mean                             -0.0124         -0.0124             2             2             2             2
fizzy/execute/micro/fibonacci/24_mean                             -0.0162         -0.0162         17336         17055         17335         17054
fizzy/execute/micro/host_adler32/1_mean                           -0.0613         -0.0613             0             0             0             0
fizzy/execute/micro/host_adler32/100_mean                         -0.1289         -0.1289             7             6             7             6
fizzy/execute/micro/host_adler32/1000_mean                        -0.1355         -0.1355            72            62            72            62
fizzy/execute/micro/spinner/1_mean                                -0.0379         -0.0379             0             0             0             0
fizzy/execute/micro/spinner/1000_mean                             -0.2153         -0.2152            24            19            24            19

chfast · 2020-09-02T10:05:23Z

I propose we recheck it again before releasing 0.5.0

axic · 2020-09-30T15:36:18Z

> I propose we recheck it again before releasing 0.5.0

I suppose this was not done?

chfast · 2020-10-09T13:16:54Z

It still has some impact on performance, but it falls into the "small unrelated code change affects performance somewhere else" basket.

axic reviewed Aug 24, 2020

chfast force-pushed the fix_performance branch from 0785a12 to ed89bcf Compare August 25, 2020 07:51

chfast changed the title ~~Fix performance regression~~ Investigate performance regression in 0.4 Aug 26, 2020

axic added the optimization Performance optimization label Oct 9, 2020

chfast force-pushed the fix_performance branch from ed89bcf to 9454010 Compare October 9, 2020 12:59

Fix performance regression

41e7bbc

chfast force-pushed the fix_performance branch from 9454010 to 41e7bbc Compare October 9, 2020 13:07

chfast closed this Oct 9, 2020

axic deleted the fix_performance branch March 12, 2021 19:09
