32bit base #317

Merged 5 commits on Nov 29, 2021

doyougnu
Contributor

@doyougnu doyougnu commented Aug 26, 2021

This PR does two things:

  1. Swap the 16 bit base for a 32 bit base
  2. Reorder many pattern matches to hit hot code paths with fewer comparisons

Regarding 1., this should improve performance since the length of a path in the tree becomes log_32(n) rather than log_16(n). Similarly, it should mean that the HAMT stays shallow for longer.
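
To make the change in point 1 concrete, here is a minimal sketch of how a HAMT can pick a child slot from a hash at each level, parameterised by the number of hash bits consumed per level (4 for a 16-way node, 5 for a 32-way node). The names `subkeyMask`, `slotAt` and `path` are chosen for this illustration and are not claimed to match the library's internals.

```haskell
import Data.Bits (shiftL, shiftR, (.&.))
import Data.Word (Word64)

-- | Mask selecting the lowest @bits@ bits of a hash.
subkeyMask :: Int -> Word64
subkeyMask bits = (1 `shiftL` bits) - 1

-- | Child slot used at @level@ (0 = root) when every level consumes
-- @bits@ bits of the hash.
slotAt :: Int -> Int -> Word64 -> Int
slotAt bits level h =
  fromIntegral ((h `shiftR` (bits * level)) .&. subkeyMask bits)

-- | Slots visited along the first @depth@ levels of the tree.
path :: Int -> Int -> Word64 -> [Int]
path bits depth h = [slotAt bits level h | level <- [0 .. depth - 1]]
```

For the hash 0xAB, `path 4 2 0xAB` is `[11,10]` while `path 5 2 0xAB` is `[11,5]`: the wider subkey consumes more of the hash per level, so fewer levels are needed to tell keys apart.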

Note that the benchmark suite will not show the difference since n = 2^12 = 4096 which means the HAMT doesn't get very nested.
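
A back-of-the-envelope check of that note, assuming an idealised tree in which hashes are spread perfectly evenly (real maps will deviate from this): the sketch below counts how many levels a tree with a given branching factor needs before it offers at least n slots. `levelsNeeded` is a hypothetical helper, not part of the library, and it assumes a branching factor of at least 2.

```haskell
-- | Smallest d such that branching ^ d >= n, i.e. the number of levels an
-- idealised, perfectly balanced tree needs to offer n leaf slots.
-- Assumes branching >= 2 (otherwise the powers never reach n).
levelsNeeded :: Integer -> Integer -> Int
levelsNeeded branching n =
  length (takeWhile (< n) (iterate (* branching) 1))
```

Under this model, n = 2^12 needs 3 levels for both branching factors, which is consistent with the default benchmark size not separating the two. A larger size such as n = 2^18 needs 5 levels with 16-way nodes but 4 with 32-way nodes.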

I've benchmarked the PR with n = 4 * (2^16) elements. This number is significant because, with a 16 bit base, the tree should gain a level after 2^16 elements are inserted (I believe). So this number of elements stress-tests the HAMT, and a difference between implementations is more noticeable. Here is a table of the comparison:

   Name                                master    branch Difference PctDifference
   <chr>                                <dbl>     <dbl>      <dbl>         <dbl>
 1 HashMap/insert/String            2.0030e-1 2.6837e-1  6.8074e-2      33.987  
 2 HashMap/alterFInsert/String      2.1753e-1 2.7419e-1  5.6655e-2      26.045  
 3 HashMap/isSubmapOf/ByteString    1.1734e-2 1.4180e-2  2.4460e-3      20.845  
 4 HashMap/lookup-miss/String       3.4322e-2 4.0775e-2  6.4534e-3      18.803  
 5 HashMap/fromList/short/Int       4.9360e-2 5.6371e-2  7.0111e-3      14.204  
 6 HashMap/alterDelete-miss/Str…    3.9093e-2 4.4139e-2  5.0463e-3      12.909  
 7 HashMap/alterInsert/Int          1.6629e-1 1.8363e-1  1.7341e-2      10.428  
 8 HashMap/fromListWith/short/I…    5.6768e-2 6.2198e-2  5.4307e-3       9.5665 
 9 HashMap/insert/ByteString        1.9137e-1 2.0795e-1  1.6575e-2       8.6611 
10 HashMap/alterFInsert/Int         1.6701e-1 1.7653e-1  9.5194e-3       5.6998 
11 HashMap/alterInsert/ByteStri…    2.1202e-1 2.2374e-1  1.1721e-2       5.5284 
12 HashMap/alterFInsert/ByteStr…    1.9252e-1 2.0284e-1  1.0322e-2       5.3617 
13 HashMap/insert/Int               1.6066e-1 1.6753e-1  6.8664e-3       4.2739 
14 HashMap/lookup/String            1.8751e-1 1.9298e-1  5.4667e-3       2.9154 
15 HashMap/alterFDelete/Int         1.2597e-1 1.2557e-1 -3.9674e-4      -0.31494
16 HashMap/delete/ByteString        1.9134e-1 1.8940e-1 -1.9429e-3      -1.0154 
17 HashMap/isSubmapOfNaive/Byte…    1.8026e-2 1.7503e-2 -5.2318e-4      -2.9024 
18 HashMap/delete/Int               1.3296e-1 1.2747e-1 -5.4821e-3      -4.1233 
19 HashMap/alterFInsert-dup/Int     1.3616e-1 1.3029e-1 -5.8736e-3      -4.3138 
20 HashMap/alterFDelete/ByteStr…    1.9870e-1 1.8918e-1 -9.5198e-3      -4.7910 
21 HashMap/alterFInsert-dup/Str…    1.0951e-1 1.0404e-1 -5.4631e-3      -4.9889 
22 HashMap/alterDelete/ByteStri…    2.0143e-1 1.9126e-1 -1.0166e-2      -5.0472 
23 HashMap/insert-dup/String        1.0905e-1 1.0261e-1 -6.4429e-3      -5.9079 
24 HashMap/map                      1.7489e-2 1.6364e-2 -1.1255e-3      -6.4352 
25 HashMap/alterDelete/Int          1.4028e-1 1.3015e-1 -1.0133e-2      -7.2233 
26 HashMap/difference               2.9006e-2 2.6571e-2 -2.4352e-3      -8.3956 
27 HashMap/fromList/long/String     1.2460e-1 1.1381e-1 -1.0790e-2      -8.6593 
28 HashMap/alterInsert/String       2.3628e-1 2.1523e-1 -2.1044e-2      -8.9067 
29 HashMap/lookup/ByteString        9.0018e-2 8.1684e-2 -8.3336e-3      -9.2577 
30 HashMap/alterInsert-dup/Int      1.3935e-1 1.2553e-1 -1.3822e-2      -9.9193 
31 HashMap/fromList/long/Int        7.9622e-2 7.1558e-2 -8.0643e-3     -10.128  
32 HashMap/insert-dup/Int           1.3674e-1 1.2280e-1 -1.3940e-2     -10.194  
33 HashMap/alterFDelete-miss/Int    7.3169e-2 6.4553e-2 -8.6158e-3     -11.775  
34 HashMap/fromListWith/short/S…    2.6485e-2 2.3176e-2 -3.3088e-3     -12.493  
35 HashMap/fromListWith/long/By…    9.8163e-2 8.5733e-2 -1.2430e-2     -12.663  
36 HashMap/alterDelete-miss/Int     7.8475e-2 6.8417e-2 -1.0058e-2     -12.817  
37 HashMap/intersection             2.9143e-2 2.5119e-2 -4.0241e-3     -13.808  
38 HashMap/fromListWith/long/Int    8.8757e-2 7.6069e-2 -1.2688e-2     -14.295  
39 HashMap/size/ByteString          3.6860e-3 3.0649e-3 -6.2114e-4     -16.851  
40 HashMap/lookup/Int               5.8667e-2 4.8705e-2 -9.9625e-3     -16.981  
41 HashMap/delete-miss/Int          7.1066e-2 5.8884e-2 -1.2182e-2     -17.141  
42 HashMap/size/String              3.1747e-3 2.5993e-3 -5.7544e-4     -18.126  
43 HashMap/fromList/short/ByteS…    2.3714e-2 1.9378e-2 -4.3362e-3     -18.285  
44 HashMap/filterWithKey            6.7425e-3 5.4882e-3 -1.2543e-3     -18.603  
45 HashMap/alterInsert-dup/Byte…    9.9778e-2 8.1030e-2 -1.8747e-2     -18.789  
46 HashMap/fromListWith/short/B…    2.2141e-2 1.7960e-2 -4.1809e-3     -18.883  
47 HashMap/foldl'                   3.8933e-3 3.1439e-3 -7.4942e-4     -19.249  
48 HashMap/filter                   1.3389e-2 1.0733e-2 -2.6563e-3     -19.839  
49 HashMap/fromList/long/ByteSt…    1.1492e-1 9.1809e-2 -2.3114e-2     -20.112  
50 HashMap/fromListWith/long/St…    1.2993e-1 1.0327e-1 -2.6664e-2     -20.521  
51 HashMap/insert-dup/ByteString    9.4205e-2 7.4592e-2 -1.9613e-2     -20.820  
52 HashMap/alterFInsert-dup/Byt…    9.4634e-2 7.4397e-2 -2.0237e-2     -21.384  
53 HashMap/lookup-miss/Int          3.3997e-2 2.6709e-2 -7.2885e-3     -21.438  
54 HashMap/union                    1.1522e-2 9.0485e-3 -2.4736e-3     -21.468  
55 HashMap/size/Int                 1.1961e-3 9.2380e-4 -2.7234e-4     -22.768  
56 HashMap/delete-miss/String       3.4340e-2 2.6487e-2 -7.8529e-3     -22.868  
57 HashMap/alterFDelete-miss/St…    3.4657e-2 2.6152e-2 -8.5043e-3     -24.539  
58 HashMap/alterFDelete-miss/By…    2.6543e-2 1.9803e-2 -6.7402e-3     -25.393  
59 HashMap/fromList/short/String    3.9645e-2 2.9554e-2 -1.0091e-2     -25.454  
60 HashMap/alterDelete-miss/Byt…    2.8437e-2 2.1007e-2 -7.4294e-3     -26.126  
61 HashMap/delete-miss/ByteStri…    2.7414e-2 1.9998e-2 -7.4164e-3     -27.053  
62 HashMap/lookup-miss/ByteStri…    2.4133e-2 1.7398e-2 -6.7349e-3     -27.907  
63 HashMap/alterFDelete/String      2.9774e-1 2.1209e-1 -8.5647e-2     -28.765  
64 HashMap/delete/String            2.9461e-1 2.0885e-1 -8.5761e-2     -29.110  
65 HashMap/alterDelete/String       3.0336e-1 2.1491e-1 -8.8451e-2     -29.157  
66 HashMap/alterInsert-dup/Stri…    1.9579e-1 1.0271e-1 -9.3082e-2     -47.541  
67 HashMap/foldr                    4.7994e-3 2.1380e-3 -2.6615e-3     -55.454  
68 HashMap/isSubmapOfNaive/Stri…    5.4429e-2 2.0382e-2 -3.4047e-2     -62.552  
69 HashMap/isSubmapOf/Int           3.3608e-7 1.2071e-7 -2.1537e-7     -64.084  
70 HashMap/isSubmapOf/String        5.2910e-2 1.7840e-2 -3.5070e-2     -66.283  
71 HashMap/isSubmapOfNaive/Int      1.6283e-7 3.8168e-8 -1.2467e-7     -76.560  

Positive numbers are slowdowns and negative numbers are speedups. I'm unsure why some of the larger slowdowns occur and am hoping you might have some insight here. In any case, I think the speedups are worth the time investment: for example, HashMap/foldr is 55% faster and HashMap/lookup/Int is about 17% faster by these benchmarks!

@konsumlamm
Contributor

konsumlamm commented Aug 26, 2021

> Regarding 1. this should improve performance since a path in the tree becomes log_32 rather than log_16. Similarly it should mean that the HAMT stays more shallow for longer.

It shouldn't necessarily improve performance (the complexity for write operations should be O(m * log_m(n)), where m is the branching factor). For example, in my rrb-vector package (RRB-Vectors are based on a structure similar to HAMTs), changing the branching factor from 16 to 32 didn't have any noticeable impact (at least in my naive benchmarks).
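
As a rough illustration of that bound, the following sketch evaluates m * log_m(n) for both branching factors, next to log_m(n) on its own as a stand-in for the depth a lookup has to walk. It models only the asymptotic expressions; constant factors are ignored, and the names `writeCost` and `readCost` are assumptions made for the example.

```haskell
-- | The expression m * log_m(n) from the bound above, as a plain number.
writeCost :: Double -> Double -> Double
writeCost m n = m * logBase m n

-- | The depth log_m(n), assuming lookup cost is proportional to depth.
readCost :: Double -> Double -> Double
readCost m n = logBase m n
```

For n = 2^18 this gives a depth of about 4.5 and a write cost of about 72 for m = 16, versus a depth of about 3.6 and a write cost of about 115 for m = 32: fewer levels to walk, but a larger node to copy at each level.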

>    Name                                master    branch Difference PctDifference
>    <chr>                                <dbl>     <dbl>      <dbl>         <dbl>

What exactly do these columns mean? This is not clear to me.

@treeowl
Collaborator

treeowl commented Aug 26, 2021

Broadly speaking, I'd expect a larger branching factor to be good for reads and bad for writes. I'm surprised that not all the regressions involve building/modifying maps.

@doyougnu
Contributor Author

> > Regarding 1. this should improve performance since a path in the tree becomes log_32 rather than log_16. Similarly it should mean that the HAMT stays more shallow for longer.
>
> It shouldn't necessarily improve performance, for example in my rrb-vector package (RRB-Vectors are based on a similar structure as HAMTs), changing the branching factor from 16 to 32 didn't have any noticeable impact (at least in my naive benchmarks).
>
>    Name                                master    branch Difference PctDifference
>    <chr>                                <dbl>     <dbl>      <dbl>         <dbl>
>
> What exactly do these columns mean? This is not clear to me.

I used an R script to generate the table. Name is the name from the bgroup + bench label in the criterion benchmark. The master column is the Mean from master for that benchmark (so this is the baseline), and branch is the Mean for the 32bit branch implementation. Difference is the raw difference between the two, i.e., branch - master, and PctDifference is the percent difference maintaining the sign, i.e., ((branch - master) / master) * 100. The second row, <chr> ... <dbl>, gives the types of these columns as read by R and can be ignored.
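
For reference, the two derived columns can be reproduced as follows. This is a Haskell sketch rather than the R script itself, and it uses the sign convention visible in the table (positive means the branch is slower); the function names are illustrative.

```haskell
-- | Raw difference in mean run time; positive means the branch is slower.
difference :: Double -> Double -> Double
difference master branch = branch - master

-- | Signed percent change of the branch relative to master.
pctDifference :: Double -> Double -> Double
pctDifference master branch = difference master branch / master * 100
```

With the first row of the table (master = 2.0030e-1, branch = 2.6837e-1), `pctDifference` gives roughly 34, which agrees with the printed PctDifference up to rounding of the displayed means.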

@sjakobi sjakobi linked an issue Aug 27, 2021 that may be closed by this pull request
@sgraf812

I'd strongly suggest compiling with -fproc-alignment=64 to avoid code layout flukes. I've seen improvements/variations of more than 30% by doing so.

@doyougnu
Contributor Author

doyougnu commented Aug 27, 2021

> I'd strongly suggest compiling with -fproc-alignment=64 to avoid code layout flukes. I've seen improvements/variations of more than 30% by doing so.

Updated table with both master and branch using this option. I also converted the raw values to milliseconds, so HashMap/lookup-miss/String is 32.7ms in the raw measurement for master.

Lastly, I added the Faster column, which is just (old / new) * 100. So HashMap/alterFDelete-miss/String is 78% "faster"; since 78 < 100, that's slower by 22%, whereas HashMap/foldl' is 119.69% "faster", which is just a speedup of ~20%.

                                     Name            master             branch          Difference  PctDifference       Faster
 1:      HashMap/alterFDelete-miss/String  32.9696290684921  42.39932566484126    9.42969659634921  28.6011607130  77.75979583
 2:           HashMap/alterFInsert/String 215.3235145868651 275.82584350734123   60.50232892047613  28.0983379992  78.06502532
 3:         HashMap/isSubmapOf/ByteString  11.5999402036724  14.22520081024892    2.62526060657648  22.6316736163  81.54500143
 4:        HashMap/fromListWith/short/Int  55.6070548667460  67.94833102138890   12.34127615464286  22.1937237716  81.83726374
 5:            HashMap/fromList/short/Int  48.6182299250794  59.06959459456349   10.45136466948412  21.4968020958  82.30669308
 6:       HashMap/alterDelete-miss/String  37.3981422372222  44.85334607960317    7.45520384238095  19.9346903252  83.37871197
 7:                 HashMap/insert/String 226.1796030594048 265.25639724075393   39.07679418134919  17.2768868867  85.26829340
 8:             HashMap/insert/ByteString 185.3351694798413 209.08809480531747   23.75292532547618  12.8161996410  88.63975237
 9:               HashMap/alterInsert/Int 158.9757773168254 178.35194024710316   19.37616293027777  12.1881227803  89.13599543
10:       HashMap/alterFInsert/ByteString 183.1856731308730 204.64208483416667   21.45641170329363  11.7129311133  89.51515192
11:            HashMap/alterInsert/String 204.8049832366270 222.53991815833334   17.73493492170633   8.6594254893  92.03067249
12:                    HashMap/insert/Int 156.6099045867460 166.19426213369047    9.58435754694445   6.1198923352  94.23303944
13:        HashMap/alterInsert/ByteString 214.4329253058333 222.34184867607146    7.90892337023813   3.6882971022  96.44289934
14:              HashMap/alterFInsert/Int 163.5382111585318 168.70286593051586    5.16465477198411   3.1580721933  96.93860875
15:       HashMap/alterFDelete/ByteString 185.6910126868254 191.27941759849205    5.58840491166665   3.0095182480  97.07840761
16:               HashMap/alterDelete/Int 124.9060165442460 126.09339499865079    1.18737845440478   0.9506175021  99.05833414
17:             HashMap/delete/ByteString 183.2686007890079 184.99019341785714    1.72159262884922   0.9393822081  99.06936006
18:                    HashMap/delete/Int 124.1260541469444 124.87457840769841    0.74852426075396   0.6030355721  99.40057915
19:              HashMap/alterFDelete/Int 121.9434335053571 121.66067152765874   -0.28276197769841  -0.2318796261 100.23241856
20:                 HashMap/lookup/String 189.7499243107143 188.51114347603175   -1.23878083468254  -0.6528491851 100.65713931
21:    HashMap/isSubmapOfNaive/ByteString  17.6835346280844  17.37498814341991   -0.30854648466450  -1.7448235952 101.77580832
22:      HashMap/fromList/long/ByteString  96.4228993230952  94.48711075726190   -1.93578856583333  -2.0076025295 102.04873294
23:                           HashMap/map  16.8987048340729  16.44513348254329   -0.45357135152958  -2.6840598495 102.75808860
24:  HashMap/fromListWith/long/ByteString  94.2372655196032  91.41055523039684   -2.82671028920635  -2.9995673937 103.09232373
25:      HashMap/fromListWith/long/String 111.7402754723016 107.75569247095237   -3.98458300134921  -3.5659326814 103.69779351
26:          HashMap/alterFInsert-dup/Int 131.4835226894841 126.06226711468254   -5.42125557480158  -4.1231444548 104.30045857
27:          HashMap/fromList/long/String 123.0455838120635 117.13239325996032   -5.91319055210318  -4.8056910040 105.04829654
28:             HashMap/fromList/long/Int  78.2291642735714  74.07452190690476   -4.15464236666667  -5.3108612437 105.60873329
29:                HashMap/insert-dup/Int 131.0241765444048 123.93787064400793   -7.08630590039683  -5.4083956773 105.71762760
30:         HashMap/fromList/short/String  33.4538726243254  31.55743524325397   -1.89643738107143  -5.6688127033 106.00947880
31:         HashMap/fromListWith/long/Int  85.2876245838492  80.34853845690476   -4.93908612694445  -5.7910935508 106.14707650
32:        HashMap/alterDelete/ByteString 193.6223339088889 182.08000879123014  -11.54232511765874  -5.9612570950 106.33915013
33:             HashMap/lookup/ByteString  86.0702200210714  80.42905462166667   -5.64116539940476  -6.5541431148 107.01384024
34:           HashMap/alterInsert-dup/Int 133.8595860533730 124.46328831214286   -9.39629774123017  -7.0195180026 107.54945323
35:         HashMap/alterFDelete-miss/Int  68.6103543238095  62.53517027047620   -6.07518405333334  -8.8546169353 109.71482772
36:                    HashMap/lookup/Int  56.5716651693254  51.19610434095237   -5.37556082837302  -9.5022142486 110.49994115
37:     HashMap/fromList/short/ByteString  23.6204318465152  21.23089338493867   -2.38953846157648 -10.1164046327 111.25500665
38:                  HashMap/intersection  28.1832533207576  25.27620454062410   -2.90704878013348 -10.3148091068 111.50112856
39:     HashMap/fromListWith/short/String  27.2837814395671  24.22216661195166   -3.06161482761544 -11.2213727939 112.63972326
40:                    HashMap/difference  29.7513369804654  26.15295276516595   -3.59838421529943 -12.0948655775 113.75899788
41:               HashMap/delete-miss/Int  68.4862471348413  59.07717965416666   -9.40906748067461 -13.7386232628 115.92673776
42:          HashMap/alterDelete-miss/Int  75.5412594859127  65.02767487952381  -10.51358460638889 -13.9176718497 116.16786180
43:    HashMap/alterInsert-dup/ByteString  95.5499394222619  81.68777050087301  -13.86216892138889 -14.5077736367 116.96969918
44:         HashMap/insert-dup/ByteString  89.1090322516270  75.89503456515872  -13.21399768646826 -14.8290216520 117.41088566
45:                        HashMap/filter  12.8951041891522  10.93289972755772   -1.96220446159452 -15.2166623302 117.94770382
46: HashMap/fromListWith/short/ByteString  22.1765061313709  18.65256742211039   -3.52393870926046 -15.8904143348 118.89251292
47:               HashMap/lookup-miss/Int  32.7953670787302  27.54795450480158   -5.24741257392857 -16.0004690947 119.04828387
48:                        HashMap/foldl'   3.8794029730199   3.24104510120498   -0.63835787181497 -16.4550544569 119.69605025
49:   HashMap/alterFInsert-dup/ByteString  90.0987970607936  75.18020872607143  -14.91858833472222 -16.5580327611 119.84377084
50:               HashMap/size/ByteString   3.6285647652010   3.02743852101209   -0.60112624418887 -16.5665017186 119.85593564
51:            HashMap/lookup-miss/String  32.7155993463889  26.91275673910534   -5.80284260728355 -17.7372346013 121.56168045
52:                 HashMap/filterWithKey   6.6671753330803   5.42487430954795   -1.24230102353230 -18.6330936486 122.90008860
53:            HashMap/delete-miss/String  32.8114436606349  26.68327159129870   -6.12817206933622 -18.6769351959 122.96634447
54:             HashMap/insert-dup/String 132.8011339937301 106.71809397821427  -26.08304001551587 -19.6406756713 124.44106622
55:                      HashMap/size/Int   1.1796196262251   0.93487723845725   -0.24474238776785 -20.7475683116 126.17909365
56:        HashMap/lookup-miss/ByteString  22.6591833992063  17.51742733745671   -5.14175606174964 -22.6917094547 129.35223285
57:                         HashMap/union  11.6413152333117   8.97725471822205   -2.66406051508963 -22.8845320455 129.67567033
58:        HashMap/delete-miss/ByteString  25.7546491999820  19.79242582878067   -5.96222337120129 -23.1500857376 130.12376261
59:  HashMap/alterFDelete-miss/ByteString  25.6176592040224  19.43847852450938   -6.17918067951299 -24.1207857061 131.78839677
60:   HashMap/alterDelete-miss/ByteString  27.3690040863492  20.72142463432900   -6.64757945202020 -24.2887151869 132.08070666
61:                   HashMap/size/String   3.1031935541833   2.28104357661803   -0.82214997756530 -26.4936737980 136.04271247
62:                 HashMap/delete/String 291.2628381117063 208.91650081448412  -82.34633729722221 -28.2721743121 139.41590874
63:            HashMap/alterDelete/String 300.6992605804762 214.01811996396827  -86.68114061650792 -28.8265227022 140.50177650
64:           HashMap/alterFDelete/String 290.0470790068651 203.62735095567461  -86.41972805119050 -29.7950692512 142.44013766
65:       HashMap/alterFInsert-dup/String 132.8972828945238  83.72508800170634  -49.17219489281744 -37.0001506591 158.73053832
66:        HashMap/alterInsert-dup/String 196.0290256689286 101.35723343103173  -94.67179223789685 -48.2947828337 193.40408083
67:                         HashMap/foldr   4.5503249296860   2.15623200491741   -2.39409292476860 -52.6136696118 211.03132313
68:        HashMap/isSubmapOfNaive/String  52.9287307866270  19.65998594744228  -33.26874483918471 -62.8557389243 269.22059318
69:             HashMap/isSubmapOf/String  51.4585838697619  19.10838016412338  -32.35020370563851 -62.8664865468 269.29851420
70:                HashMap/isSubmapOf/Int   0.0003249014255   0.00011550580476   -0.00020939562073 -64.4489695330 281.28579871
71:           HashMap/isSubmapOfNaive/Int   0.0001321604130   0.00003822007067   -0.00009394034228 -71.0805453635 345.78798687
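
The Faster column and the PctDifference column describe the same ratio from two directions, which the sketch below makes explicit. The function names are invented for this illustration; only the formula (old / new) * 100 comes from the comment above.

```haskell
-- | The Faster column: old (master) time over new (branch) time, in percent.
faster :: Double -> Double -> Double
faster master branch = master / branch * 100

-- | Convert a Faster value back into a signed percent change in run time.
pctFromFaster :: Double -> Double
pctFromFaster f = (100 / f - 1) * 100
```

For row 1, `faster 32.9696290684921 42.39932566484126` is about 77.76, and `pctFromFaster 77.75979583` is about 28.6, matching the PctDifference column: a Faster value of 77.76 means the run takes about 28.6% longer. Likewise, the 119.70 for HashMap/foldl' corresponds to about 16.5% less run time.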

@doyougnu
Contributor Author

doyougnu commented Aug 27, 2021

Just pushed some fixes that improve the performance of some of the benchmarks, notably insert-dup/String and insert/String. The fixes simply preserve the bang patterns in the leading pattern matches of most of the go closures in the code. When I reordered the pattern matches I overlooked this, and it has a (rightfully) large impact on performance.

I've updated the data table in my previous comment with data that reflects this change.
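
For readers unfamiliar with what a bang pattern does to an argument that a clause does not otherwise inspect, here is a small self-contained sketch. `lazyGo`, `strictGo` and `forcesFirst` are hypothetical names; the real go loops in the library are more involved.

```haskell
{-# LANGUAGE BangPatterns #-}
import Control.Exception (SomeException, evaluate, try)

-- | Never inspects its first argument, so it is lazy in it.
lazyGo :: Int -> Int -> Int
lazyGo _h y = y + 1

-- | Same body, but the bang forces the first argument on entry.
strictGo :: Int -> Int -> Int
strictGo !_h y = y + 1

-- | True when calling the function with an undefined first argument
-- throws, i.e. when the function is strict in that argument.
forcesFirst :: (Int -> Int -> Int) -> IO Bool
forcesFirst f = do
  r <- try (evaluate (f undefined 0)) :: IO (Either SomeException Int)
  pure (either (const True) (const False) r)
```

Here `forcesFirst lazyGo` yields `False` and `forcesFirst strictGo` yields `True`. Making an argument strict in every clause is what allows GHC's demand analysis to treat it as strict, which can in turn change the generated Core.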

@sjakobi
Member

sjakobi commented Aug 28, 2021 via email

@sjakobi
Member

sjakobi commented Sep 14, 2021

I think these numbers look pretty good. I'm surprised about the rearranging of pattern matches though. So far I had believed that GHC ignores this and simply uses the order of the constructors from the data definition.

I tried to check this with a simplistic example of two identical functions except for different orders of pattern matches. GHC seems to treat these functions as identical and CSE's them:

module M where

data D
  = A [Int]
  | B
  | C Int

f d = case d of
  A xs -> sum xs
  B -> 0
  C x -> x

-- In the Core this definition is CSE'd as
-- g = f
g d = case d of
  C x -> x
  B -> 0
  A xs -> sum xs

So I either need some convincing that the reordering of pattern matches actually affects performance, or I'd like these changes to be reverted.

@doyougnu If you need anything from me, please shout! :)

@doyougnu
Copy link
Contributor Author

doyougnu commented Sep 23, 2021

@sjakobi Apologies for taking so long to reply. You're absolutely right regarding the CSE! I had to double-check the Core to be sure: reordering the definition of the data constructors does change the pattern match order in Core and STG. I've had some success with this, but I'll open a new PR if I feel that it's worth it.

Regarding the performance change from my first to my second version: I think this probably has more to do with the bang patterns I added on the BitmapIndexed cases. Should I revert these bang patterns as well?

     go h k _ t@(Leaf hy (L ky _))
         | hy == h && ky == k = Empty
         | otherwise = t
-    go h k s t@(BitmapIndexed b ary)
+    go !h !k !s t@(BitmapIndexed b ary)
         | b .&. m == 0 = t
         | otherwise =
             let !st = A.index ary i
Collaborator

What's the net effect of all these bang pattern changes? Do they change compiled Core? Do they change demand signatures, unfoldings, or unfolding guidance?

Contributor Author

These do change the compiled Core. I just checked insert and delete. There are positive changes with insert. The branch's insert loop fuses:

Data.HashMap.Internal.$winsert'
  = \ (@ k_sGwF)
      (@ v_sGwG)
      (w_sGwH :: Eq k_sGwF)
      (ww_sGwO :: Exts.Word#)
      (w1_sGwJ :: k_sGwF)
      (w2_sGwK :: v_sGwG)
      (w3_sGwL :: HashMap k_sGwF v_sGwG) ->
      letrec {
        $s$wgo_sP6x [Occ=LoopBreaker]
          :: HashMap k_sGwF v_sGwG
             -> Int# -> v_sGwG -> k_sGwF -> Exts.Word# -> HashMap k_sGwF v_sGwG
        [LclId, Arity=5, Str=<S,1*U><L,U><L,U><L,U><L,U>, Unf=OtherCon []]
        $s$wgo_sP6x
          = \ (sc_sP6w :: HashMap k_sGwF v_sGwG)
              (sc1_sP6v :: Int#)
              (sc2_sP6u :: v_sGwG)
              (sc3_sP6t :: k_sGwF)
              (sc4_sP6s :: Exts.Word#) ->
              case sc_sP6w of wild_X8H {
...
                  }
              }; } in
      $s$wgo_sP6x w3_sGwL 0# w2_sGwK w1_sGwJ ww_sGwO

Note that only one wgo.... is defined in the letrec here; master uses two:

Data.HashMap.Internal.$winsert'
  = \ (@ k_sGhe)
      (@ v_sGhf)
      (w_sGhg :: Eq k_sGhe)
      (ww_sGhn :: Exts.Word#)
      (w1_sGhi :: k_sGhe)
      (w2_sGhj :: v_sGhf)
      (w3_sGhk :: HashMap k_sGhe v_sGhf) ->
      letrec {
        $s$wgo_sOLH [Occ=LoopBreaker]
          :: Exts.Word#
             -> Exts.SmallArray# (HashMap k_sGhe v_sGhf)
             -> Int#
             -> v_sGhf
             -> k_sGhe
             -> Exts.Word#
             -> HashMap k_sGhe v_sGhf
        [LclId,
         Arity=6,
         Str=<L,U><L,U><L,U><L,U><S,1*U><L,U>,
         Unf=OtherCon []]
        $s$wgo_sOLH
          = \ (sc_sOLF :: Exts.Word#)
              (sc1_sOLG :: Exts.SmallArray# (HashMap k_sGhe v_sGhf))
              (sc2_sOLE :: Int#)
              (sc3_sOLD :: v_sGhf)
              (sc4_sOLC :: k_sGhe)
              (sc5_sOLB :: Exts.Word#) ->
              case sc4_sOLC of k1_XcAI { __DEFAULT ->
...
        $wgo1_sGhd
          = \ (ww1_sGh7 :: Exts.Word#)
              (w4_sGh1 :: k_sGhe)
              (w5_sGh2 :: v_sGhf)
              (ww2_sGhb :: Int#)
              (w6_sGh4 :: HashMap k_sGhe v_sGhf) ->
              case w4_sGh1 of k1_XcAI { __DEFAULT ->
              case w6_sGh4 of wild_X8X {
...
              }
              }; } in
      $wgo1_sGhd ww_sGhn w1_sGhi w2_sGhj 0# w3_sGhk

where $wgo1_sGhd calls $s$wgo_sOLH in the Collision case. This is the case that gets fused on the branch.

For whatever reason the bang on delete stops the worker/wrapper transformation:
branch:

delete'
  = \ (@ k_agXw)
      (@ v_agXx)
      ($dEq_agXz :: Eq k_agXw)
      (h0_acCw :: Hash)
      (k0_acCx :: k_agXw)
      (m0_acCy :: HashMap k_agXw v_agXx) ->
      case k0_acCx of k1_XcG7 { __DEFAULT ->
      letrec {
        $sgo_sOOZ [Occ=LoopBreaker]
          :: HashMap k_agXw v_agXx
             -> Int# -> k_agXw -> Exts.Word# -> HashMap k_agXw v_agXx
        [LclId, Arity=4, Str=<S,1*U><L,U><L,U><L,U>, Unf=OtherCon []]
        $sgo_sOOZ

master:

Data.HashMap.Internal.$wdelete'
  = \ (@ k_sGya)
      (@ v_sGyb)
      (w_sGyc :: Eq k_sGya)
      (ww_sGyi :: Exts.Word#)
      (w1_sGye :: k_sGya)
      (w2_sGyf :: HashMap k_sGya v_sGyb) ->
      letrec {
        $wgo1_sGy9 [InlPrag=NOUSERINLINE[2], Occ=LoopBreaker]
          :: Exts.Word#
             -> k_sGya -> Int# -> HashMap k_sGya v_sGyb -> HashMap k_sGya v_sGyb
        [LclId, Arity=4, Str=<L,U><S,1*U><L,U><S,1*U>, Unf=OtherCon []]
        $wgo1_sGy9

Notice also that on master the Hash input gets unboxed due to the WW transformation.

I think that making sure delete gets Worker/Wrapper'd before merging is a good idea. I'm also fine with just reverting the bangs and opening another PR for any more changes.
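
For context, the worker/wrapper transformation splits a function into a thin wrapper that unboxes its strict arguments and a worker that loops over the raw values. The hand-written sketch below mimics that split for a toy function; `sumTo` and `sumToWorker` are invented for illustration and have nothing to do with delete itself. GHC normally derives such a split on its own when it can see that an argument is strict.

```haskell
{-# LANGUAGE MagicHash #-}
import GHC.Exts (Int (I#), Int#, isTrue#, (+#), (-#), (<=#))

-- | Wrapper: unboxes the argument once, then hands off to the worker.
sumTo :: Int -> Int
sumTo (I# n) = I# (sumToWorker n 0#)

-- | Worker: the loop itself runs entirely on unboxed Int# values, so no
-- boxed Int is allocated or inspected per iteration.
sumToWorker :: Int# -> Int# -> Int#
sumToWorker n acc
  | isTrue# (n <=# 0#) = acc
  | otherwise = sumToWorker (n -# 1#) (acc +# n)
```

This is the shape visible in the master Core above, where the worker takes an `Exts.Word#` instead of a boxed `Hash`.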

@treeowl
Collaborator

treeowl commented Sep 24, 2021

I think it would be better to do a separate PR, so we don't confuse where each perf change comes from. Sometimes GHC doesn't W/W something because it decides to inline it always; haven't looked at whether that applies here.

@doyougnu doyougnu force-pushed the 32bit-base branch 3 times, most recently from 2d0928f to af3f3ff Compare September 24, 2021 20:58
@doyougnu
Contributor Author

I believe I've made the requested changes. Please let me know if there is anything else for this PR. I'll open another for the reordering/bang changes soon.

And as always, thanks for the reviews!

@treeowl
Collaborator

treeowl commented Sep 24, 2021

I'd love to see some attempt to understand the regressions. Which benchmarks get better and which get worse seem just a tad arbitrary. The alterFDelete-miss and isSubmapOf regressions are particularly surprising to me. I'd have expected those to improve.

@doyougnu
Contributor Author

doyougnu commented Sep 28, 2021

@treeowl I tried to dig into the regression for alterFDelete-miss, but I can't reproduce it, so I suspect that the regression was due to the added bang patterns changing the Core. To be sure, I verified that the Core for both alterFDelete and alterF is identical between this branch and master, and it is (attached: Internal-32bit.dump-simpl.txt, Internal-master.dump-simpl.txt).

So I ran only that benchmark through perf stat -dr5 (which runs the benchmark 5 times), with both master and the 32bit-base branch built with -fproc-alignment=64 and frequency scaling disabled. Furthermore, I killed all other applications running on my system.
Here's master:

         Benchmark benchmarks: FINISH
     
     Performance counter stats for 'cabal bench --benchmark-options=-m exact HashMap/alterDelete-miss/ByteString' (5 runs):
     
             25,095.34 msec task-clock:u              #    0.998 CPUs utilized            ( +-  7.85% )
                     0      context-switches:u        #    0.000 K/sec
                     0      cpu-migrations:u          #    0.000 K/sec
               457,847      page-faults:u             #    0.018 M/sec                    ( +- 18.54% )
        96,971,467,839      cycles:u                  #    3.864 GHz                      ( +-  7.36% )  (75.02%)
         4,702,198,414      stalled-cycles-frontend:u #    4.85% frontend cycles idle     ( +-  8.26% )  (75.02%)
        44,504,481,735      stalled-cycles-backend:u  #   45.89% backend cycles idle      ( +-  3.95% )  (75.02%)
       121,184,360,339      instructions:u            #    1.25  insn per cycle
                                                      #    0.37  stalled cycles per insn  ( +-  6.84% )  (75.03%)
        23,780,537,078      branches:u                #  947.608 M/sec                    ( +-  6.70% )  (75.04%)
           441,198,832      branch-misses:u           #    1.86% of all branches          ( +- 15.55% )  (75.03%)
        60,914,366,137      L1-dcache-loads:u         # 2427.318 M/sec                    ( +-  6.62% )  (75.03%)
         1,882,327,510      L1-dcache-load-misses:u   #    3.09% of all L1-dcache accesses  ( +-  7.64% )  (75.02%)
       <not supported>      LLC-loads:u
       <not supported>      LLC-load-misses:u
     
                 25.15 +- 2.01 seconds time elapsed ( +- 8.01% )

And the branch:

        Performance counter stats for 'cabal bench --benchmark-options=-m exact HashMap/alterDelete-miss/ByteString' (5 runs):
     
           24,234.97 msec task-clock:u              #    0.998 CPUs utilized            ( +-  7.76% )
                   0      context-switches:u        #    0.000 K/sec
                   0      cpu-migrations:u          #    0.000 K/sec
             454,182      page-faults:u             #    0.019 M/sec                    ( +- 18.55% )
      93,478,538,634      cycles:u                  #    3.857 GHz                      ( +-  7.31% )  (75.03%)
       3,641,546,852      stalled-cycles-frontend:u #    3.90% frontend cycles idle     ( +- 11.64% )  (75.01%)
      43,841,865,823      stalled-cycles-backend:u  #   46.90% backend cycles idle      ( +-  3.16% )  (75.02%)
     118,756,310,576      instructions:u            #    1.27  insn per cycle
                                                    #    0.37  stalled cycles per insn  ( +-  6.98% )  (75.03%)
      23,388,366,084      branches:u                #  965.067 M/sec                    ( +-  6.81% )  (75.03%)
         417,847,294      branch-misses:u           #    1.79% of all branches          ( +- 16.32% )  (75.03%)
      59,336,453,501      L1-dcache-loads:u         # 2448.382 M/sec                    ( +-  6.76% )  (75.04%)
       1,830,916,326      L1-dcache-load-misses:u   #    3.09% of all L1-dcache accesses  ( +-  7.85% )  (75.02%)
     <not supported>      LLC-loads:u
     <not supported>      LLC-load-misses:u
     
               24.28 +- 1.91 seconds time elapsed ( +- 7.85% )

Notice that master clocks in slower (25.15 s elapsed) than 32bit-base (24.28 s elapsed).
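
One quick way to judge whether such a gap stands out from the noise is to compare it with the "+-" spreads that perf prints. The sketch below is a crude interval-overlap check, not a proper statistical test, and the type and function names are made up for the example. It also recomputes the "insn per cycle" figure from the raw counters.

```haskell
-- | A mean together with the "+-" spread reported next to it.
data Measurement = Measurement
  { mean :: Double
  , spread :: Double
  }

-- | Crude check: do the two "mean +- spread" intervals overlap?
overlaps :: Measurement -> Measurement -> Bool
overlaps a b = abs (mean a - mean b) <= spread a + spread b

-- | Instructions per cycle, as in perf's "insn per cycle" column.
ipc :: Double -> Double -> Double
ipc instructions cycles = instructions / cycles
```

With the elapsed times above, `overlaps (Measurement 25.15 2.01) (Measurement 24.28 1.91)` is `True`, so on this crude reading the 0.87 s gap is within run-to-run variation. The raw counters give about 1.25 instructions per cycle for master and about 1.27 for the branch, matching the perf output.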

I think we would expect a lower instruction count for the 32bit base because fewer jmp instructions would be needed, since the hashmap should stay shallow for longer. This data does show a reduction of roughly 2.4B instructions but also a reduction of roughly 3.5B cycles, so I'm not sure if it is a real signal. Another thing that stuck out to me was the number of stalled backend cycles for both master and 32bit-base, although this seems to be a separate issue and is probably related to how gauge and cabal are implementing the benchmarks. After all, the environment must be loaded into memory before the benchmark can run. Another thing I noticed is that 32bit-base gets flagged for outliers more frequently than master for the alterFDelete-miss/ByteString benchmark. Here is the gauge output from the perf runs:

master:

benchmarking HashMap/alterDelete-miss/ByteString ... took 13.64 s, total 56 iterations
benchmarked HashMap/alterDelete-miss/ByteString
time                 40.14 ms   (39.76 ms .. 40.94 ms)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 39.99 ms   (39.77 ms .. 40.25 ms)
std dev              395.0 μs   (258.1 μs .. 620.5 μs)

Benchmark benchmarks: FINISH
Build profile: -w ghc-8.10.4 -O1
In order, the following will be built (use -v for more details):
 - unordered-containers-0.2.14.0 (bench:benchmarks) (first run)
Preprocessing benchmark 'benchmarks' for unordered-containers-0.2.14.0..
Building benchmark 'benchmarks' for unordered-containers-0.2.14.0..
Running 1 benchmarks...
Benchmark benchmarks: RUNNING...
benchmarking HashMap/alterDelete-miss/ByteString ... took 13.44 s, total 56 iterations
benchmarked HashMap/alterDelete-miss/ByteString
time                 39.52 ms   (39.36 ms .. 39.65 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 39.88 ms   (39.75 ms .. 40.15 ms)
std dev              303.3 μs   (185.8 μs .. 437.5 μs)

Benchmark benchmarks: FINISH
Build profile: -w ghc-8.10.4 -O1
In order, the following will be built (use -v for more details):
 - unordered-containers-0.2.14.0 (bench:benchmarks) (first run)
Preprocessing benchmark 'benchmarks' for unordered-containers-0.2.14.0..
Building benchmark 'benchmarks' for unordered-containers-0.2.14.0..
Running 1 benchmarks...
Benchmark benchmarks: RUNNING...
benchmarking HashMap/alterDelete-miss/ByteString ... took 13.33 s, total 56 iterations
benchmarked HashMap/alterDelete-miss/ByteString
time                 39.45 ms   (39.33 ms .. 39.55 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 39.58 ms   (39.52 ms .. 39.68 ms)
std dev              140.7 μs   (70.19 μs .. 218.5 μs)

Benchmark benchmarks: FINISH
Build profile: -w ghc-8.10.4 -O1
In order, the following will be built (use -v for more details):
 - unordered-containers-0.2.14.0 (bench:benchmarks) (first run)
Preprocessing benchmark 'benchmarks' for unordered-containers-0.2.14.0..
Building benchmark 'benchmarks' for unordered-containers-0.2.14.0..
Running 1 benchmarks...
Benchmark benchmarks: RUNNING...
benchmarking HashMap/alterDelete-miss/ByteString ... took 13.36 s, total 56 iterations
benchmarked HashMap/alterDelete-miss/ByteString
time                 39.35 ms   (39.23 ms .. 39.52 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 39.60 ms   (39.48 ms .. 39.86 ms)
std dev              288.4 μs   (94.09 μs .. 476.1 μs)

Benchmark benchmarks: FINISH
Build profile: -w ghc-8.10.4 -O1
In order, the following will be built (use -v for more details):
 - unordered-containers-0.2.14.0 (bench:benchmarks) (first run)
Preprocessing benchmark 'benchmarks' for unordered-containers-0.2.14.0..
Building benchmark 'benchmarks' for unordered-containers-0.2.14.0..
Running 1 benchmarks...
Benchmark benchmarks: RUNNING...
benchmarking HashMap/alterDelete-miss/ByteString ... took 13.40 s, total 56 iterations
benchmarked HashMap/alterDelete-miss/ByteString
time                 40.35 ms   (39.63 ms .. 41.22 ms)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 39.67 ms   (39.46 ms .. 39.99 ms)
std dev              428.4 μs   (259.0 μs .. 642.5 μs)

Notice that there are no heavy outliers reported and master is ~40ms. Now for 32bit-base:

benchmarking HashMap/alterDelete-miss/ByteString ... took 12.43 s, total 56 iterations
benchmarked HashMap/alterDelete-miss/ByteString
time                 19.29 ms   (17.82 ms .. 20.77 ms)
                     0.991 R²   (0.976 R² .. 1.000 R²)
mean                 23.13 ms   (20.92 ms .. 26.49 ms)
std dev              4.634 ms   (104.2 μs .. 5.815 ms)
variance introduced by outliers: 68% (severely inflated)

Benchmark benchmarks: FINISH
Build profile: -w ghc-8.10.4 -O1
In order, the following will be built (use -v for more details):
 - unordered-containers-0.2.14.0 (bench:benchmarks) (first run)
Preprocessing benchmark 'benchmarks' for unordered-containers-0.2.14.0..
Building benchmark 'benchmarks' for unordered-containers-0.2.14.0..
Running 1 benchmarks...
Benchmark benchmarks: RUNNING...
benchmarking HashMap/alterDelete-miss/ByteString ... took 12.42 s, total 56 iterations
benchmarked HashMap/alterDelete-miss/ByteString
time                 19.44 ms   (18.09 ms .. 20.94 ms)
                     0.991 R²   (0.974 R² .. 1.000 R²)
mean                 23.16 ms   (20.98 ms .. 27.47 ms)
std dev              4.578 ms   (165.0 μs .. 5.737 ms)
variance introduced by outliers: 68% (severely inflated)

Benchmark benchmarks: FINISH
Build profile: -w ghc-8.10.4 -O1
In order, the following will be built (use -v for more details):
 - unordered-containers-0.2.14.0 (bench:benchmarks) (first run)
Preprocessing benchmark 'benchmarks' for unordered-containers-0.2.14.0..
Building benchmark 'benchmarks' for unordered-containers-0.2.14.0..
Running 1 benchmarks...
Benchmark benchmarks: RUNNING...
benchmarking HashMap/alterDelete-miss/ByteString ... took 12.45 s, total 56 iterations
benchmarked HashMap/alterDelete-miss/ByteString
time                 19.90 ms   (18.09 ms .. 21.71 ms)
                     0.989 R²   (0.972 R² .. 0.998 R²)
mean                 23.44 ms   (21.32 ms .. 26.93 ms)
std dev              4.564 ms   (454.2 μs .. 5.767 ms)
variance introduced by outliers: 59% (severely inflated)

Benchmark benchmarks: FINISH
Build profile: -w ghc-8.10.4 -O1
In order, the following will be built (use -v for more details):
 - unordered-containers-0.2.14.0 (bench:benchmarks) (first run)
Preprocessing benchmark 'benchmarks' for unordered-containers-0.2.14.0..
Building benchmark 'benchmarks' for unordered-containers-0.2.14.0..
Running 1 benchmarks...
Benchmark benchmarks: RUNNING...
benchmarking HashMap/alterDelete-miss/ByteString ... took 12.43 s, total 56 iterations
benchmarked HashMap/alterDelete-miss/ByteString
time                 19.48 ms   (17.91 ms .. 21.06 ms)
                     0.988 R²   (0.964 R² .. 1.000 R²)
mean                 23.36 ms   (21.24 ms .. 27.81 ms)
std dev              4.706 ms   (692.9 μs .. 5.990 ms)
variance introduced by outliers: 68% (severely inflated)

Benchmark benchmarks: FINISH
Build profile: -w ghc-8.10.4 -O1
In order, the following will be built (use -v for more details):
 - unordered-containers-0.2.14.0 (bench:benchmarks) (first run)
Preprocessing benchmark 'benchmarks' for unordered-containers-0.2.14.0..
Building benchmark 'benchmarks' for unordered-containers-0.2.14.0..
Running 1 benchmarks...
Benchmark benchmarks: RUNNING...
benchmarking HashMap/alterDelete-miss/ByteString ... took 12.40 s, total 56 iterations
benchmarked HashMap/alterDelete-miss/ByteString
time                 19.37 ms   (17.87 ms .. 20.95 ms)
                     0.991 R²   (0.972 R² .. 1.000 R²)
mean                 23.16 ms   (20.93 ms .. 27.51 ms)
std dev              4.683 ms   (166.7 μs .. 5.930 ms)
variance introduced by outliers: 68% (severely inflated)

Each benchmark reports severe outliers, which seem to skew the measurement low because 32bit-base consistently shows ~20ms. This is so consistent that I almost wonder if it's an artifact of the benchmarking setup. My only change to the benchmarking environment was changing n from 4096 to let n = 4 * 2^(16 :: Int), to construct a comparison between master and 32bit-base that would force 32bit-base to grow more than one level. So I'm not sure how much to trust the benchmarks or where the noise would be coming from. It is possible that the raw data for this benchmark is bimodal, i.e., very fast until the 32bit-base HAMT grows, after which it regresses to a mean, and gauge picks this up as outliers.
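To make the depth argument concrete, here is a rough sketch of how many levels an idealized tree needs for a given element count. It assumes perfectly uniform hashes and completely full nodes, which a real HAMT will not have, so treat the numbers as a lower bound rather than a measurement. The name `levelsNeeded` is made up for this illustration and is not part of the library.

```haskell
-- Idealized model only: assumes uniform hashes and full nodes.
-- `levelsNeeded` is a hypothetical helper, not a library function.

-- Smallest k such that (2^bitsPerLevel)^k >= n.
levelsNeeded :: Int -> Integer -> Int
levelsNeeded bitsPerLevel n =
  length (takeWhile (< n) (iterate (* branching) 1))
  where
    -- 4 bits per level gives 16 children, 5 bits gives 32.
    branching :: Integer
    branching = 2 ^ bitsPerLevel
```

Under this model, `levelsNeeded 4 4096` and `levelsNeeded 5 4096` both give 3, while `levelsNeeded 4 (4 * 2^16)` gives 5 and `levelsNeeded 5 (4 * 2^16)` gives 4. That is consistent with the default n = 4096 not separating the two bases while the larger n does.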

In any case, I compiled more regression statistics. These are up to date for my latest commit (32bit base, no bangs, no reordering):

                                     Name      master       branch   Difference  PctDifference Faster
 1:        HashMap/isSubmapOfNaive/String     53.7884      65.0119   11.2235       20.86        82.74
 2:            HashMap/fromList/short/Int     50.3661      60.1439    9.7778       19.41        83.74
 3:        HashMap/fromListWith/short/Int     57.0969      66.9412    9.8443       17.24        85.29
 4:  HashMap/alterFDelete-miss/ByteString     26.4369      28.8611    2.4242        9.16        91.60
 5:               HashMap/alterInsert/Int    153.8916     167.8327   13.9410        9.05        91.69
 6:       HashMap/alterFInsert/ByteString    172.5733     186.9022   14.3289        8.30        92.33
 7:             HashMap/insert/ByteString    174.3504     188.4732   14.1227        8.10        92.51
 8:      HashMap/fromList/long/ByteString     98.4718     104.7043    6.2324        6.32        94.05
 9:        HashMap/alterInsert/ByteString    189.8528     201.0552   11.2024        5.90        94.43
10:         HashMap/fromList/short/String     34.3230      36.2094    1.8863        5.49        94.79
11:                    HashMap/insert/Int    150.7048     156.3881    5.6833        3.77        96.37
12:            HashMap/alterInsert/String    195.1582     202.1771    7.0188        3.59        96.53
13:              HashMap/alterFInsert/Int    156.1919     157.4893    1.2973        0.83        99.18
14:           HashMap/alterFInsert/String    210.3690     210.2779   -0.0910       -0.04       100.04
15:      HashMap/fromListWith/long/String    115.6609     115.5865   -0.0743       -0.06       100.06
16:         HashMap/isSubmapOf/ByteString     11.9745      11.8290   -0.1454       -1.21       101.23
17:          HashMap/alterFInsert-dup/Int    129.1681     125.0134   -4.1547       -3.21       103.32
18:              HashMap/alterFDelete/Int    125.4869     120.6289   -4.8580       -3.87       104.03
19:     HashMap/fromList/short/ByteString     23.7682      22.8258   -0.9424       -3.96       104.13
20:                 HashMap/insert/String    219.4855     210.7733   -8.7121       -3.96       104.13
21:             HashMap/delete/ByteString    182.3416     172.9844   -9.3571       -5.13       105.41
22:               HashMap/alterDelete/Int    131.6931     123.8543   -7.8388       -5.95       106.33
23:          HashMap/fromList/long/String    123.8589     115.8239   -8.0349       -6.48       106.94
24:                           HashMap/map     17.5870      16.4131   -1.1739       -6.67       107.15
25:                    HashMap/delete/Int    129.5927     120.9106   -8.6820       -6.69       107.18
26:                HashMap/insert-dup/Int    131.7386     122.7031   -9.0355       -6.85       107.36
27:       HashMap/alterDelete-miss/String     37.9127      35.2327   -2.6799       -7.06       107.61
28:             HashMap/fromList/long/Int     80.9759      74.9517   -6.0241       -7.43       108.04
29:     HashMap/fromListWith/short/String     26.2560      24.1738   -2.0821       -7.93       108.61
30:       HashMap/alterFDelete/ByteString    187.7723     171.3584  -16.4139       -8.74       109.58
31:        HashMap/alterDelete/ByteString    188.3785     171.5909  -16.7876       -8.91       109.78
32:         HashMap/fromListWith/long/Int     86.6287      78.5805   -8.0481       -9.29       110.24
33:           HashMap/alterInsert-dup/Int    133.4730     121.0694  -12.4036       -9.29       110.25
34:  HashMap/fromListWith/long/ByteString     96.9519      86.5336  -10.4183      -10.74       112.04
35:         HashMap/alterFDelete-miss/Int     69.7377      62.0982   -7.6395      -10.95       112.30
36:             HashMap/lookup/ByteString     87.9955      78.1580   -9.8374      -11.17       112.59
37: HashMap/fromListWith/short/ByteString     20.8472      18.4706   -2.3766      -11.40       112.87
38:                 HashMap/lookup/String    192.2641     169.7583  -22.5058      -11.70       113.26
39:    HashMap/isSubmapOfNaive/ByteString     17.8681      15.7542   -2.1138      -11.83       113.42
40:                  HashMap/intersection     27.9015      24.1269   -3.7745      -13.52       115.64
41:               HashMap/size/ByteString      3.6008       3.1089   -0.4919      -13.66       115.82
42:                    HashMap/difference     29.1113      24.9238   -4.1874      -14.38       116.80
43:          HashMap/alterDelete-miss/Int     78.2321      65.5912  -12.6408      -16.15       119.27
44:         HashMap/insert-dup/ByteString     90.6601      75.8275  -14.8326      -16.36       119.56
45:                        HashMap/foldl'      3.9046       3.2459   -0.6586      -16.86       120.29
46:                    HashMap/lookup/Int     57.9009      47.8670  -10.0338      -17.32       120.96
47:               HashMap/delete-miss/Int     69.4378      57.0186  -12.4191      -17.88       121.78
48:    HashMap/alterInsert-dup/ByteString     96.6344      79.0962  -17.5382      -18.14       122.17
49:   HashMap/alterFInsert-dup/ByteString     91.8934      75.0882  -16.8051      -18.28       122.38
50:                 HashMap/filterWithKey      6.7572       5.5201   -1.2371      -18.30       122.41
51:               HashMap/lookup-miss/Int     33.2655      27.0819   -6.1836      -18.58       122.83
52:                        HashMap/filter     13.4468      10.7097   -2.7371      -20.35       125.56
53:                      HashMap/size/Int      1.1857       0.9361   -0.2495      -21.04       126.65
54:            HashMap/lookup-miss/String     33.6928      26.5011   -7.1916      -21.34       127.14
55:      HashMap/alterFDelete-miss/String     33.5248      26.3396   -7.1852      -21.43       127.28
56:            HashMap/delete-miss/String     33.2863      26.0768   -7.2094      -21.65       127.65
57:                         HashMap/union     11.6485       8.9965   -2.6520      -22.76       129.48
58:             HashMap/insert-dup/String    137.4933     106.1026  -31.3906      -22.83       129.59
59:       HashMap/alterFInsert-dup/String    133.7493     102.6985  -31.0508      -23.21       130.23
60:        HashMap/delete-miss/ByteString     26.3743      20.0588   -6.3155      -23.94       131.49
61:        HashMap/lookup-miss/ByteString     22.9625      17.4523   -5.5101      -23.99       131.57
62:                   HashMap/size/String      3.0384       2.2222   -0.8162      -26.86       136.73
63:   HashMap/alterDelete-miss/ByteString     29.0384      20.6113   -8.4271      -29.02       140.89
64:            HashMap/alterDelete/String    297.3323     192.6111 -104.7211      -35.22       154.37
65:                 HashMap/delete/String    290.7659     184.7439 -106.0220      -36.46       157.39
66:           HashMap/alterFDelete/String    295.6333     184.4920 -111.1413      -37.59       160.24
67:        HashMap/alterInsert-dup/String    198.6804      94.0286 -104.6518      -52.67       211.30
68:                         HashMap/foldr      4.6686       2.1455   -2.5231      -54.04       217.60
69:                HashMap/isSubmapOf/Int      0.0003       0.0001   -0.0001      -61.41       259.14
70:             HashMap/isSubmapOf/String     51.9250      18.1427  -33.7822      -65.05       286.20
71:           HashMap/isSubmapOfNaive/Int      0.0001       0.0000   -0.0001      -77.63       447.04
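For anyone reading the derived columns: the rows above are consistent with the formulas below (checked by hand against a couple of rows). This is only a sketch of that arithmetic; the `Row` type and `compareTimes` are names invented here, and the script that actually produced the table is not shown in this thread.

```haskell
-- Hypothetical reconstruction of the derived table columns.
-- Row and compareTimes are illustrative names, not from the PR.
data Row = Row
  { difference    :: Double  -- branch - master (negative means branch is faster)
  , pctDifference :: Double  -- difference as a percentage of master
  , faster        :: Double  -- master / branch, as a percentage (> 100 means branch is faster)
  } deriving Show

compareTimes :: Double -> Double -> Row
compareTimes master branch = Row
  { difference    = branch - master
  , pctDifference = 100 * (branch - master) / master
  , faster        = 100 * master / branch
  }
```

For example, `compareTimes 297.3323 192.6111` reproduces the HashMap/alterDelete/String row to within rounding.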

Apologies for the novel!

@treeowl
Collaborator

treeowl commented Oct 27, 2021

This is generally looking promising. Can you make it pass CI, and maybe do one more run of the benchmarks to see how stable the results are?

@sjakobi
Member

sjakobi commented Nov 1, 2021

Needs a rebase as well.

I'll try to reproduce the benchmark results.

@doyougnu
Contributor Author

doyougnu commented Nov 3, 2021

I've rebased on top of master and rerun the benchmarks. Here are the latest results: master is the 16bit base and branch is the 32bit base. All benchmarks were run with frequency scaling disabled, with -O2 -fproc-alignment=64, and with n=2^18.

If this looks good then I'll push a commit that either adds a comment to the benchmark suite as requested in #317 (comment) or revert to n=4096 under the assumption that benchmarking with n=2^18 was only important for observing the effect of the base change as I argue above.

Data

                                  Name           master           branch      Difference     PctDifference        Faster 
            HashMap/fromList/short/Int     4.910742e+01     6.049123e+01    1.138382e+01         23.181459      81.18105 
         HashMap/isSubmapOf/ByteString     1.177911e+01     1.400347e+01    2.224358e+00         18.883924      84.11566
        HashMap/fromListWith/short/Int     5.599026e+01     6.517981e+01    9.189551e+00         16.412766      85.90123
               HashMap/alterInsert/Int     1.599502e+02     1.811201e+02    2.116986e+01         13.235280      88.31170
             HashMap/insert/ByteString     1.773215e+02     1.977108e+02    2.038937e+01         11.498532      89.68728
       HashMap/alterFInsert/ByteString     1.767804e+02     1.949816e+02    1.820116e+01         10.295915      90.66519
            HashMap/alterInsert/String     1.973812e+02     2.090380e+02    1.165679e+01          5.905724      94.42360
        HashMap/alterInsert/ByteString     1.954438e+02     1.995942e+02    4.150414e+00          2.123585      97.92057
              HashMap/alterFDelete/Int     1.206649e+02     1.230326e+02    2.367745e+00          1.962249      98.07551
                    HashMap/insert/Int     1.540457e+02     1.567231e+02    2.677426e+00          1.738073      98.29162
              HashMap/alterFInsert/Int     1.591382e+02     1.589622e+02   -1.760100e-01         -0.110602     100.11072
                    HashMap/delete/Int     1.241880e+02     1.225961e+02   -1.591856e+00         -1.281812     101.29846
               HashMap/alterDelete/Int     1.257897e+02     1.216033e+02   -4.186441e+00         -3.328126     103.44270
       HashMap/alterFDelete/ByteString     1.854743e+02     1.784279e+02   -7.046408e+00         -3.799129     103.94916
             HashMap/delete/ByteString     1.818201e+02     1.743219e+02   -7.498266e+00         -4.124002     104.30139
                           HashMap/map     1.691531e+01     1.614763e+01   -7.676762e-01         -4.538353     104.75411
        HashMap/alterDelete/ByteString     1.920451e+02     1.804542e+02   -1.159099e+01         -6.035555     106.42323
           HashMap/alterInsert-dup/Int     1.315102e+02     1.233881e+02   -8.122156e+00         -6.176065     106.58261
                 HashMap/lookup/String     1.818030e+02     1.700636e+02   -1.173938e+01         -6.457198     106.90293
       HashMap/alterDelete-miss/String     3.765475e+01     3.507113e+01   -2.583619e+00         -6.861337     107.36680
          HashMap/alterFInsert-dup/Int     1.302699e+02     1.210880e+02   -9.181942e+00         -7.048399     107.58287
          HashMap/fromList/long/String     1.225441e+02     1.138631e+02   -8.680970e+00         -7.083958     107.62404
      HashMap/fromList/long/ByteString     1.080684e+02     1.004003e+02   -7.668144e+00         -7.095638     107.63757
  HashMap/fromListWith/long/ByteString     9.450552e+01     8.721944e+01   -7.286078e+00         -7.709686     108.35373
    HashMap/isSubmapOfNaive/ByteString     1.732665e+01     1.598408e+01   -1.342572e+00         -7.748597     108.39944
      HashMap/fromListWith/long/String     1.117847e+02     1.028493e+02   -8.935472e+00         -7.993463     108.68793
     HashMap/fromList/short/ByteString     2.295171e+01     2.106367e+01   -1.888043e+00         -8.226155     108.96351
                HashMap/insert-dup/Int     1.306649e+02     1.198167e+02   -1.084825e+01         -8.302346     109.05404
                  HashMap/intersection     2.872114e+01     2.630418e+01   -2.416958e+00         -8.415258     109.18849
         HashMap/fromListWith/long/Int     8.538829e+01     7.780994e+01   -7.578351e+00         -8.875164     109.73957
             HashMap/fromList/long/Int     7.986437e+01     7.260388e+01   -7.260495e+00         -9.091031     110.00015
             HashMap/lookup/ByteString     8.644567e+01     7.855436e+01   -7.891309e+00         -9.128634     110.04567
         HashMap/alterFDelete-miss/Int     7.022211e+01     6.283875e+01   -7.383353e+00        -10.514286     111.74968
         HashMap/fromList/short/String     3.488598e+01     3.027198e+01   -4.613999e+00        -13.225939     115.24181
          HashMap/alterDelete-miss/Int     7.348141e+01     6.367238e+01   -9.809024e+00        -13.348987     115.40546
     HashMap/fromListWith/short/String     2.546586e+01     2.199018e+01   -3.475681e+00        -13.648394     115.80561
               HashMap/size/ByteString     3.564036e+00     3.076971e+00   -4.870659e-01        -13.666131     115.82940
                    HashMap/lookup/Int     5.697039e+01     4.908035e+01   -7.890039e+00        -13.849368     116.07576
 HashMap/fromListWith/short/ByteString     2.102717e+01     1.808464e+01   -2.942527e+00        -13.993929     116.27086
                   HashMap/size/String     3.020379e+00     2.596332e+00   -4.240462e-01        -14.039505     116.33251
               HashMap/lookup-miss/Int     3.239016e+01     2.777617e+01   -4.613989e+00        -14.245033     116.61132
                        HashMap/filter     1.306421e+01     1.109199e+01   -1.972221e+00        -15.096372     117.78060
               HashMap/delete-miss/Int     7.129732e+01     6.030180e+01   -1.099552e+01        -15.422065     118.23415
                    HashMap/difference     2.843293e+01     2.401328e+01   -4.419659e+00        -15.544154     118.40507
         HashMap/insert-dup/ByteString     9.035914e+01     7.630520e+01   -1.405394e+01        -15.553423     118.41806
             HashMap/insert-dup/String     1.236098e+02     1.035281e+02   -2.008167e+01        -16.246022     119.39731
  HashMap/alterFDelete-miss/ByteString     2.513166e+01     2.094475e+01   -4.186906e+00        -16.659888     119.99024
           HashMap/alterFInsert/String     2.664760e+02     2.218358e+02   -4.464014e+01        -16.752034     120.12305
   HashMap/alterFInsert-dup/ByteString     8.905277e+01     7.409451e+01   -1.495825e+01        -16.797069     120.18807
    HashMap/alterInsert-dup/ByteString     9.588865e+01     7.848076e+01   -1.740788e+01        -18.154270     122.18108
                 HashMap/filterWithKey     6.630574e+00     5.356519e+00   -1.274054e+00        -19.214840     123.78511
            HashMap/lookup-miss/String     3.294323e+01     2.642682e+01   -6.516405e+00        -19.780712     124.65830
                      HashMap/size/Int     1.181516e+00     9.342195e-01   -2.472967e-01        -20.930452     126.47094
                 HashMap/insert/String     2.799710e+02     2.205630e+02   -5.940796e+01        -21.219325     126.93468
      HashMap/alterFDelete-miss/String     3.305971e+01     2.594345e+01   -7.116258e+00        -21.525470     127.42988
            HashMap/delete-miss/String     3.324601e+01     2.587990e+01   -7.366105e+00        -22.156361     128.46265
                         HashMap/union     1.138301e+01     8.847915e+00   -2.535093e+00        -22.270852     128.65187
        HashMap/delete-miss/ByteString     2.531636e+01     1.960278e+01   -5.713578e+00        -22.568720     129.14677
       HashMap/alterFInsert-dup/String     1.300182e+02     9.981083e+01   -3.020737e+01        -23.233189     130.26463
                        HashMap/foldl'     4.103430e+00     3.149546e+00   -9.538835e-01        -23.246007     130.28638
   HashMap/alterDelete-miss/ByteString     2.733282e+01     2.072825e+01   -6.604575e+00        -24.163531     131.86268
        HashMap/lookup-miss/ByteString     2.260036e+01     1.708222e+01   -5.518140e+00        -24.416158     132.30341
            HashMap/alterDelete/String     2.903397e+02     1.893266e+02   -1.010131e+02        -34.791357     153.35391
           HashMap/alterFDelete/String     2.850219e+02     1.842162e+02   -1.008058e+02        -35.367719     154.72145
                 HashMap/delete/String     2.851995e+02     1.841144e+02   -1.010851e+02        -35.443647     154.90342
        HashMap/alterInsert-dup/String     1.994878e+02     9.226460e+01   -1.072232e+02        -53.749250     216.21271
                         HashMap/foldr     4.635728e+00     2.114410e+00   -2.521318e+00        -54.388828     219.24453
        HashMap/isSubmapOfNaive/String     5.360858e+01     2.002819e+01   -3.358039e+01        -62.639959     267.66566
                HashMap/isSubmapOf/Int     3.427139e-04     1.208895e-04   -2.218243e-04        -64.725814     283.49343
             HashMap/isSubmapOf/String     5.268743e+01     1.810605e+01   -3.458138e+01        -65.634979     290.99357
           HashMap/isSubmapOfNaive/Int     1.306414e-04     3.841304e-05   -9.222839e-05        -70.596589     340.09659

@sjakobi sjakobi mentioned this pull request Nov 7, 2021
@sjakobi
Member

sjakobi commented Nov 11, 2021

I'm happy with these numbers! :)

I also realized that this is a purely internal change, so if it should turn out that the old 16bit base was "better" overall, we can easily revert this change.

> If this looks good then I'll push a commit that either adds a comment to the benchmark suite as requested in #317 (comment) or revert to n=4096 under the assumption that benchmarking with n=2^18 was only important for observing the effect of the base change as I argue above.

Reverting to the smaller number seems good to me.

@doyougnu
Contributor Author

> Reverting to the smaller number seems good to me.

@sjakobi just pushed in f64e641

Please let me know if there is anything else before 0.2.15.0. I'd like to have this make it into the next release!

Member

@sjakobi sjakobi left a comment


Thank you, Jeffrey! :)

> Please let me know if there is anything else before 0.2.15.0. I'd like to have this make it into the next release!

v0.2.15.0 has already been released – I didn't want to delay the migration to hashable-1.4 any further. I'll make another release within this month though.

@sjakobi sjakobi requested a review from treeowl November 11, 2021 19:46
@treeowl
Collaborator

treeowl commented Nov 11, 2021

I'll have a look. One question: should we add a Data.HashMap16 or similar with the old version so people can choose that if it's best for their application? Ideally, it would be exactly the same file except for one line of CPP.
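A minimal sketch of what that one CPP line could look like, assuming the base is controlled by a single bits-per-level constant from which everything else is derived. The macro name `HAMT_BASE16` is purely hypothetical, and the definitions below are simplified stand-ins for illustration rather than the library's actual internals.

```haskell
{-# LANGUAGE CPP #-}
-- Hypothetical sketch: HAMT_BASE16 is an assumed macro name,
-- and these definitions are simplified stand-ins, not the real module.
import Data.Bits (shiftL, shiftR, (.&.))

-- The only definition that would differ between the two variants.
bitsPerSubkey :: Int
#if defined(HAMT_BASE16)
bitsPerSubkey = 4
#else
bitsPerSubkey = 5
#endif

-- Everything below is derived from bitsPerSubkey.

-- Maximum number of children per node (16 or 32).
maxChildren :: Int
maxChildren = 1 `shiftL` bitsPerSubkey

-- Mask selecting one subkey's worth of hash bits.
subkeyMask :: Word
subkeyMask = (1 `shiftL` bitsPerSubkey) - 1

-- Child slot for a hash at a given bit offset into the hash.
index :: Word -> Int -> Int
index hash offset = fromIntegral ((hash `shiftR` offset) .&. subkeyMask)
```

Compiled without the macro this gives 32 children per node; passing `-DHAMT_BASE16` to GHC would select the 16-child variant from the same source.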

@sjakobi
Member

sjakobi commented Nov 11, 2021

@treeowl I think we can consider something like that if people are unhappy with the base 32 version. I wouldn't want to offer it upfront, since it would double the size of the API and would probably come with a tradeoff in maintainability.

bump benchmark element count
@treeowl
Collaborator

treeowl commented Nov 15, 2021

Ooh, yuck. I think the usual advice is to wait for at least an alpha release, since otherwise things can shift around.

@sjakobi sjakobi merged commit 2c8a286 into haskell-unordered-containers:master Nov 29, 2021
@sjakobi
Member

sjakobi commented Nov 29, 2021

Thank you, @doyougnu! :)

Development

Successfully merging this pull request may close these issues.

Consider branching factor of 32 or 64
5 participants