x86_64: align loop headers to 64 bytes #5004
base: main
Conversation
506aa59 to a0c63f8
// Unaligned loop headers can cause severe performance problems.
// See https://github.com/bytecodealliance/wasmtime/issues/4883.
// Here we use a conservative 64-byte alignment.
align_to(offset, 64)
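For reference, here is a minimal sketch of the rounding arithmetic an `align_to` helper like the one above typically performs; this is an illustration under that assumption, not necessarily Cranelift's exact implementation:

```rust
/// Round `offset` up to the next multiple of `align`, where `align` must be
/// a power of two. With `align = 64` the result lands on a cache-line boundary.
fn align_to(offset: u32, align: u32) -> u32 {
    debug_assert!(align.is_power_of_two());
    (offset + align - 1) & !(align - 1)
}

fn main() {
    assert_eq!(align_to(1, 64), 64);
    assert_eq!(align_to(64, 64), 64);
    assert_eq!(align_to(65, 64), 128);
}
```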
Could you disable this when optimizing for size? Also, won't this cause a large code-size blowup when a function contains a lot of tiny loops, multiple of which could fit in a single cache line?
Also maybe disable it for cold blocks?
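Taken together, the two review comments above suggest gating the padding on both the optimization level and block temperature. A minimal sketch of that idea, using hypothetical names that are not Cranelift's actual API:

```rust
/// Hypothetical helper deciding how much alignment to request for a block
/// that is about to be emitted: only hot loop headers get cache-line
/// alignment, while size-optimized builds and cold blocks skip the padding.
fn requested_block_alignment(
    is_loop_header: bool,
    is_cold: bool,
    optimize_for_size: bool,
) -> u32 {
    if is_loop_header && !is_cold && !optimize_for_size {
        64 // pad up to the next cache-line boundary
    } else {
        1 // no extra alignment requested
    }
}
```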
@@ -106,6 +106,8 @@ pub struct BlockLoweringOrder {
    /// which is used by VCode emission to sink the blocks at the last
    /// moment (when we actually emit bytes into the MachBuffer).
    cold_blocks: FxHashSet<BlockIndex>,
    /// These are loop headers. Used for alignment.
    loop_headers: FxHashSet<BlockIndex>,
Would it make sense to merge `cold_blocks` and `loop_headers` into a single `SecondaryMap` with a bitflag as element? Or are both still too infrequently used, such that it would cause a large increase in memory usage?
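A self-contained sketch of that suggestion, replacing the two hash sets with one dense per-block bitflag table; the names are illustrative and merely stand in for what a `SecondaryMap<BlockIndex, u8>` might look like in Cranelift itself:

```rust
/// Per-block property flags packed into one byte per block.
const COLD_BLOCK: u8 = 1 << 0;
const LOOP_HEADER: u8 = 1 << 1;

/// One byte per lowered block, indexed by the block's position in the
/// lowering order.
struct BlockFlags {
    flags: Vec<u8>,
}

impl BlockFlags {
    fn new(num_blocks: usize) -> Self {
        Self { flags: vec![0; num_blocks] }
    }

    fn mark_loop_header(&mut self, block: usize) {
        self.flags[block] |= LOOP_HEADER;
    }

    fn mark_cold(&mut self, block: usize) {
        self.flags[block] |= COLD_BLOCK;
    }

    fn is_loop_header(&self, block: usize) -> bool {
        self.flags[block] & LOOP_HEADER != 0
    }

    fn is_cold(&self, block: usize) -> bool {
        self.flags[block] & COLD_BLOCK != 0
    }
}
```

Whether this saves memory depends on sparsity: the dense table pays one byte per block unconditionally, while the hash sets only pay for blocks that are actually cold or loop headers.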
@pepyakin, thanks for this change. However, I'm not sure that we want to take it as-is; or at least, I'd like to see more data. The Sightglass runs you provide are all over the place -- in some cases this is better by 5-10%, in other cases […]. What's more concerning, the […]. Would you be willing to take a few of the larger benchmarks ([…])?

I share @bjorn3's concern about code size here as well. Aligning every loop to a 64-byte granularity is likely to bloat some cases nontrivially. Could you report the effect on code size (the […])?
That's a good catch. I must admit that I did not notice that it was about instantiation! Probably something was running on my test machine and ruined the measurement. I also agree with the points made by @bjorn3; in retrospect, that seems obvious 🤦 I had the impression that this PR would be a quick one, but it seems I was wrong = ) I will try to push it further, but it won't be my primary focus, so it will likely take some time.
Doing some necromancy here, but I found a nice article detailing how the .NET JIT pads its loops, and adopting a similar strategy might be useful. It's quite a bit more involved than this PR.
@lpereira how are the loop / basic-block weights computed in the .NET JIT? Are they based on profiling counters or some ahead-of-time estimate? It'd be helpful to see a summary of the technique given your background with the .NET JIT.

The heuristics seem to be the crux of things here: we can't align all loops because it would lead to (presumably) a significant increase in code size (how much? someone needs to do the experiments). We are also an AOT compiler (even in JIT mode, we are morally AOT because we compile once at load-time), so we can't pick based on profiling counters. So we need some heuristics for loops that are hot enough for this to matter, yet infrequent enough that the size impact (and fetch bandwidth impact on the entry to the loop) is acceptable.

(EDIT: to clarify, I'm hoping to learn more about e.g. "block-weight meets a certain weight threshold" in that article: is a block-weight static or dynamic, and how is it measured?)
I don't know! What I know, however, is that CoreCLR has a tiered JIT: AOT compilation is possible and usually performed to speed up startup, but the compiler in AOT mode does about as much work optimizing code as Winch, IIRC. Once it detects that a function needs a second look (and I don't know how it detects that; maybe a per-function counter and some tiny code in the epilogue that decrements it and calls into a "jit this harder" function, which then patches the caller to become a trampoline? I really don't know; it's probably something a whole lot more clever than that), it spends more time recompiling things. This is pretty complicated to implement, and wouldn't help with the pure AOT scenarios we have, of course (unless we had some kind of PGO thing). I'll spend some time looking through CoreCLR later to figure out exactly what they do and how they do it. I'll look at how GCC and Clang do it too.
Right. I've given this some thought, and we could experiment with some heuristics, based on: […]

It's all a bit fiddly, of course, and we'd need to empirically adjust these scores, but I think it's a doable strategy to identify hot loops. Then, when lowering, at least on x86, two things need to happen if a loop is considered hot: […]
Closes #4883
This PR introduces 64-byte alignment for loop headers on x86_64. I benchmarked this on an AMD Zen 3 5950X and it produced the following results: […]
I haven't figured out how to write a test though, since `test compile precise-output` seems to be ignoring `nop`s. I assume that's intentional. If so, what would be the best way to introduce one?