Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize folds #273

Merged
merged 8 commits into from
Sep 9, 2020
Merged

Optimize folds #273

merged 8 commits into from
Sep 9, 2020

Conversation

Bodigrim
Copy link
Contributor

@Bodigrim Bodigrim commented Aug 27, 2020

This is #151 rebased, see explanations there. Closes #206. CC @sergv

@Bodigrim
Copy link
Contributor Author

Benchmarking appeared to be too tempting. Here are results, and I think they are hilarious! foldl' is roughly 5-10% faster and foldr' is 10-15% faster. Next, filter is 15-20% faster, mapAccumL, mapAccumR and scanl are 20-25% faster, while scanr is 25-30% faster. It would be fantastic to merge this before 0.11.0.0 release.

Benchmark          Before     After    Speedup
----------------------------------------------
foldl'/96         328.1 ns   328.0 ns   0.03%
foldl'/192        694.2 ns   671.5 ns   3.27%
foldl'/384        1.398 μs   1.333 μs   4.65%
foldl'/768        2.786 μs   2.658 μs   4.59%
foldl'/1536       5.556 μs   5.323 μs   4.19%
foldl'/3072       11.08 μs   10.43 μs   5.87%
foldl'/6144       22.15 μs   20.94 μs   5.46%
foldl'/12288      44.73 μs   41.36 μs   7.53%
foldl'/24576      89.31 μs   82.52 μs   7.60%
foldl'/49152      180.7 μs   164.2 μs   9.13%

foldr'/96         386.3 ns   339.9 ns  12.01%
foldr'/192        742.6 ns   658.3 ns  11.35%
foldr'/384        1.452 μs   1.297 μs  10.67%
foldr'/768        2.879 μs   2.569 μs  10.77%
foldr'/1536       5.746 μs   5.123 μs  10.84%
foldr'/3072       11.54 μs   10.33 μs  10.49%
foldr'/6144       23.23 μs   20.32 μs  12.53%
foldr'/12288      46.08 μs   40.24 μs  12.67%
foldr'/24576      92.97 μs   80.99 μs  12.89%
foldr'/49152      187.0 μs   162.0 μs  13.37%

filter/96         421.7 ns   352.0 ns  16.53%
filter/192        807.4 ns   670.6 ns  16.94%
filter/384        1.587 μs   1.298 μs  18.21%
filter/768        3.146 μs   2.618 μs  16.78%
filter/1536       6.314 μs   5.136 μs  18.66%
filter/3072       12.45 μs   10.20 μs  18.07%
filter/6144       24.72 μs   20.28 μs  17.96%
filter/12288      49.51 μs   40.25 μs  18.70%
filter/24576      98.68 μs   80.45 μs  18.47%
filter/49152      201.1 μs   161.9 μs  19.49%

mapAccumL/96      528.3 ns   402.5 ns  23.81%
mapAccumL/192     1.023 μs   777.7 ns  23.98%
mapAccumL/384     2.021 μs   1.508 μs  25.38%
mapAccumL/768     4.004 μs   3.033 μs  24.25%
mapAccumL/1536    8.027 μs   6.032 μs  24.85%
mapAccumL/3072    16.02 μs   11.97 μs  25.28%
mapAccumL/6144    32.08 μs   23.72 μs  26.06%
mapAccumL/12288   64.24 μs   47.95 μs  25.36%
mapAccumL/24576   125.7 μs   95.53 μs  24.00%
mapAccumL/49152   258.2 μs   195.4 μs  24.32%

mapAccumR/96      512.7 ns   397.0 ns  22.57%
mapAccumR/192     989.2 ns   761.9 ns  22.98%
mapAccumR/384     1.937 μs   1.506 μs  22.25%
mapAccumR/768     3.951 μs   2.995 μs  24.20%
mapAccumR/1536    7.999 μs   5.944 μs  25.69%
mapAccumR/3072    15.54 μs   11.76 μs  24.32%
mapAccumR/6144    31.44 μs   23.33 μs  25.80%
mapAccumR/12288   61.30 μs   46.41 μs  24.29%
mapAccumR/24576   120.9 μs   93.71 μs  22.49%
mapAccumR/49152   246.2 μs   194.0 μs  21.20%

scanl/96          477.3 ns   373.0 ns  21.85%
scanl/192         925.4 ns   702.9 ns  24.04%
scanl/384         1.830 μs   1.389 μs  24.10%
scanl/768         3.557 μs   2.760 μs  22.41%
scanl/1536        7.205 μs   5.453 μs  24.32%
scanl/3072        14.24 μs   11.06 μs  22.33%
scanl/6144        28.79 μs   22.24 μs  22.75%
scanl/12288       56.75 μs   43.64 μs  23.10%
scanl/24576       115.1 μs   86.57 μs  24.79%
scanl/49152       239.4 μs   181.1 μs  24.35%

scanr/96          469.5 ns   342.8 ns  26.99%
scanr/192         915.0 ns   653.5 ns  28.58%
scanr/384         1.797 μs   1.285 μs  28.49%
scanr/768         3.606 μs   2.539 μs  29.59%
scanr/1536        7.133 μs   5.110 μs  29.36%
scanr/3072        14.17 μs   10.08 μs  28.86%
scanr/6144        28.16 μs   20.22 μs  28.20%
scanr/12288       57.23 μs   40.76 μs  28.78%
scanr/24576       112.5 μs   81.62 μs  27.45%
scanr/49152       235.4 μs   165.6 μs  29.65%

@Bodigrim Bodigrim requested review from sjakobi and hsyl20 August 28, 2020 18:02
@chessai
Copy link
Member

chessai commented Aug 28, 2020 via email

Copy link
Member

@sjakobi sjakobi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for picking this up, @Bodigrim! :)

The speed-ups are amazing!

IIUC all these changes do is lift static arguments out of the inner worker functions. Isn't this something that GHC ought to be able to do automatically?! Is this a compiler bug?

I noticed that you didn't add benchmarks for the following functions:

  • foldl
  • foldr
  • any
  • all
  • spanByte
  • splitWith
  • zipWith'
  • findIndexOrEnd

No idea whether any of these need additional scrutiny though.

Regarding correctness, I assume that all the changed functions have reasonable test coverage, but I haven't checked.

BTW, how did you produce the table of benchmark results, @Bodigrim? This is a very nice format.

@Bodigrim
Copy link
Contributor Author

Isn't this something that GHC ought to be able to do automatically?!

I do not expect GHC being able to do it automatically, certainly not in the case when constant arguments are trailing ones. The closest optimization I know about is call-pattern specialization, but it does not cover our case.

One can check Core generated for these functions:

{-# LANGUAGE BangPatterns #-}

module Lib (countFromTo, countFromTo') where

countFromTo :: Int -> Int -> Int
countFromTo !from !to = go 0 from to
  where
    go !acc !f !t
      | f == t = acc
      | otherwise = go (acc + 1) (f + 1) t

countFromTo' :: Int -> Int -> Int
countFromTo' !from !to = go 0 from
  where
    go !acc !f
      | f == to = acc
      | otherwise = go (acc + 1) (f + 1)

The inner loop of countFromTo becomes

$wgo :: Int# -> Int# -> Int# -> Int#
$wgo
  = \ (ww_s1xz :: Int#) (ww1_s1xD :: Int#) (ww2_s1xH :: Int#) ->
      case ==# ww1_s1xD ww2_s1xH of {
        __DEFAULT -> $wgo (+# ww_s1xz 1#) (+# ww1_s1xD 1#) ww2_s1xH;
        1# -> ww_s1xz
      }

Notice ww2_s1xH remains passed around, no magic optimization happens.

I noticed that you didn't add benchmarks for the following functions:

It does not make much sense to benchmark lazy foldl and foldr. Other function require some clever choice of input strings to avoid short-circuiting and several variants of predicates. Like, should we benchmark any with a predicate (== 99) or (/= 99) or odd, etc.? I think provided benchmarks are representative enough to show that it is certainly beneficial to pass less arguments recursively. (I can add more benchmarks, if someone insists on additional scrutiny, of course)

Regarding correctness, I assume that all the changed functions have reasonable test coverage, but I haven't checked.

There are tests, but I still encourage reviewers to verify changes closely.

BTW, how did you produce the table of benchmark results, @Bodigrim? This is a very nice format.

By hand :)

@Bodigrim
Copy link
Contributor Author

Bodigrim commented Sep 5, 2020

Anyone else willing to take a look please? @hsyl20 @vdukhovni @fumieval

Data/ByteString.hs Show resolved Hide resolved
Data/ByteString.hs Outdated Show resolved Hide resolved
Data/ByteString.hs Outdated Show resolved Hide resolved
Data/ByteString.hs Outdated Show resolved Hide resolved
Data/ByteString.hs Show resolved Hide resolved
Data/ByteString.hs Outdated Show resolved Hide resolved
y <- peekByteOff p2 n
pokeByteOff r n (f x y)
zipWith_ (n+1) p1 p2 r
go p1 p2 = zipWith_ 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A general question. Under what conditions is it actually better to avoid passing fixed arguments on the stack in recursive calls, and instead allocate a closure around the fixed arguments, that avoids having to copy them into each stack frame?

I must admit I don't really understand what the compiler does here, but if the new version has to perform closure allocations that the old version does not, then perhaps the cost of carrying the extra arguments around is worth the ultimate reduction in GC pressure.

If the benchmarks only measure speed, but don't measure GC pressure, we may not be doing the right thing here (and elsewhere...). The benchmarks would have to run long enough to take the GC overhead into account...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is always preferable to avoid passing fixed arguments (unless it is a particularly degenerate case). An additional closure (if any) is allocated only once per call, which is not a big deal at all, but additional arguments are passed for each element of the string.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Closure allocation is however much more expensive than pushing a parameter onto the stack, so for sufficiently short strings the savings may be small, and the ultimate GC costs non-negligible if one processes a lot of short strings, or even a lot of strings period, sometimes it is better to pay the price as you go than to accumulate a lot of garbage to later clean up and stall.

Mind you, I'd like to hear from @hsyl20 et. al. on what the actual tradeoffs are here. I don't yet have a clear enough sense of when the various costs and benefits come into play...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes closure allocation can be costly. I have a patch for containers that removes this kind of closure for a lookup in IntMap.

In this specific zipWith' case, we don't allocate any closure as far as I can tell from looking at STG. create and zipWith_ are inlined and transformed into join points.

@maoe
Copy link
Member

maoe commented Sep 7, 2020

Isn't this something that GHC ought to be able to do automatically?!

I do not expect GHC being able to do it automatically, certainly not in the case when constant arguments are trailing ones. The closest optimization I know about is call-pattern specialization, but it does not cover our case.

This optimization is called Static Argument Transformation (SAT). It's implemented in GHC but is limited to be applied to functions that have two or more static arguments.

There's a performance measurement and some discussions in Compilation by transformation for non-strict functional languages by Andre Santos. As you can see in the thesis, SAT is not always a win.

EDIT: Fixed the broken link to the thesis.

@vdukhovni
Copy link
Contributor

Isn't this something that GHC ought to be able to do automatically?!

I do not expect GHC being able to do it automatically, certainly not in the case when constant arguments are trailing ones. The closest optimization I know about is call-pattern specialization, but it does not cover our case.

This optimization is called Static Argument Transformation (SAT). It's implemented in GHC but is limited to be applied to functions that have two or more static arguments.

There's a performance measurement and some discussions in Compilation by transformation for non-strict functional languages by Andre Santos. As you can see in the thesis, SAT is not always a win.

Thanks (though the above link no longer works, I was able to find another copy.
It raises in more details some of the concerns that I was only able to guess at. I don't know what's actually happening in this regard in GHC ~20 years since that work, and would be interested in learning more. Bottom line, I'd like to have "better theory" for this PR, so that we fully understand the costs/benefits (unless there somehow is no cost and it is all "benefit").

@Bodigrim
Copy link
Contributor Author

Bodigrim commented Sep 7, 2020

OK, so according to Santos' thesis, the only downside of SAT is a suspected increase in heap allocations, when we run a function many times on short inputs.

I don't know what's actually happening in this regard in GHC ~20 years since that work, and would be interested in learning more.

Join point optimization comes to the rescue. For example, foldl' (\acc x -> acc + fromIntegral x) 0 s, where foldl' is defined per this PR, compiles to

foo :: ByteString -> Int
foo
  = \ (w_svf0 :: ByteString) ->
      case w_svf0 of { BS ww1_svf3 ww2_svf4 ww3_svf5 ->
      let {
        end_svdU :: Addr#
        end_svdU = plusAddr# ww1_svf3 ww3_svf5 } in
      join {
        exit_Xj :: Int# -> State# RealWorld -> Int
        exit_Xj (ww4_sveT :: Int#) (w1_sveQ :: State# RealWorld)
          = case touch# ww2_svf4 w1_sveQ of { __DEFAULT -> I# ww4_sveT } } in
      joinrec {
        $wgo_sveZ :: Int# -> Addr# -> State# RealWorld -> Int
        $wgo_sveZ (ww4_sveT :: Int#)
                  (ww5_sveX :: Addr#)
                  (w1_sveQ :: State# RealWorld)
          = case eqAddr# ww5_sveX end_svdU of {
              __DEFAULT ->
                case readWord8OffAddr# ww5_sveX 0# w1_sveQ of
                { (# ipv1_a6d7, ipv2_a6d8 #) ->
                jump $wgo_sveZ
                  (+# ww4_sveT (word2Int# ipv2_a6d8))
                  (plusAddr# ww5_sveX 1#)
                  ipv1_a6d7
                };
              1# -> jump exit_Xj ww4_sveT w1_sveQ
            }; } in
      jump $wgo_sveZ 0# ww1_svf3 realWorld#
      }

Notice join and joinrec instead of lets, and jumps.

@Bodigrim
Copy link
Contributor Author

Bodigrim commented Sep 7, 2020

And here are benchmarks for short strings. The new version is again consistently faster:

Benchmark          Before     After   
------------------------------------
foldl'/1          12.22 ns  9.253 ns 
foldl'/2          16.06 ns  12.39 ns 
foldl'/4          23.30 ns  18.45 ns 
foldl'/8          42.77 ns  32.32 ns 
foldl'/16         78.52 ns  61.57 ns 
foldl'/32         136.0 ns  110.7 ns 
foldl'/64         253.2 ns  205.0 ns 
foldr'/1          12.42 ns  9.727 ns 
foldr'/2          15.99 ns  12.77 ns 
foldr'/4          23.85 ns  18.95 ns 
foldr'/8          43.01 ns  33.50 ns 
foldr'/16         78.72 ns  61.53 ns 
foldr'/32         134.2 ns  112.5 ns 
foldr'/64         251.8 ns  214.1 ns 
mapAccumL/1       24.08 ns  19.07 ns 
mapAccumL/2       30.70 ns  22.93 ns 
mapAccumL/4       41.47 ns  31.18 ns 
mapAccumL/8       67.68 ns  46.85 ns 
mapAccumL/16      107.2 ns  83.84 ns 
mapAccumL/32      188.0 ns  144.7 ns 
mapAccumL/64      343.3 ns  268.8 ns 
mapAccumR/1       24.67 ns  20.64 ns 
mapAccumR/2       30.21 ns  24.52 ns 
mapAccumR/4       41.00 ns  31.47 ns 
mapAccumR/8       66.84 ns  47.55 ns 
mapAccumR/16      106.2 ns  85.31 ns 
mapAccumR/32      186.8 ns  145.1 ns 
mapAccumR/64      337.9 ns  270.4 ns 
scanl/1           23.33 ns  19.55 ns 
scanl/2           28.32 ns  23.14 ns 
scanl/4           38.07 ns  30.99 ns 
scanl/8           62.14 ns  48.54 ns 
scanl/16          99.18 ns  82.06 ns 
scanl/32          174.0 ns  139.0 ns 
scanl/64          316.2 ns  251.5 ns 
scanr/1           23.21 ns  20.04 ns 
scanr/2           28.15 ns  23.43 ns 
scanr/4           37.48 ns  30.66 ns 
scanr/8           55.57 ns  50.68 ns 
scanr/16          98.10 ns  78.62 ns 
scanr/32          170.6 ns  131.7 ns 
scanr/64          311.5 ns  239.3 ns 
filter/1          30.60 ns  29.08 ns 
filter/2          35.00 ns  33.74 ns 
filter/4          42.83 ns  39.42 ns 
filter/8          58.78 ns  52.38 ns 
filter/16         87.78 ns  78.67 ns 
filter/32         150.2 ns  131.2 ns 
filter/64         283.7 ns  243.8 ns 

@vdukhovni
Copy link
Contributor

OK, so according to Santos' thesis, the only downside of SAT is a suspected increase in heap allocations, when we run a function many times on short inputs.

I don't know what's actually happening in this regard in GHC ~20 years since that work, and would be interested in learning more.

Join point optimization comes to the rescue. For example, foldl' (\acc x -> acc + fromIntegral x) 0 s, where foldl' is defined per this PR, compiles to

Thanks for that, some clarity is emerging. The only remaining question for me is whether all of the modified functions fall into this case, or whether for some reason some slip through the cracks.

@Bodigrim
Copy link
Contributor Author

Bodigrim commented Sep 7, 2020

The only remaining question for me is whether all of the modified functions fall into this case, or whether for some reason some slip through the cracks.

  • Changes to foldl and foldr save one static argument and do not introduce additional closures.
  • For foldl', foldr', any, all, spanByte, filter, findIndexOrEnd all introduced closures are joins (and we save one static argument).
  • We save two static argument in mapAccumL, mapAccumR, scanl, scanr, zipWith'. So, as explained in Santos's thesis, SAT optimization is certainly beneficial (not sure why GHC decided not to apply it automatically).
  • Finally, for splitWith we save one static argument and all additional closures are joins. Both versions of the function allocate one letrec closure, but previously it had 7 arguments and only 4 now.

@vdukhovni
Copy link
Contributor

The only remaining question for me is whether all of the modified functions fall into this case, or whether for some reason some slip through the cracks.

  • Changes to foldl and foldr save one static argument and do not introduce additional closures.
  • For foldl', foldr', any, all, spanByte, filter, findIndexOrEnd all introduced closures are joins (and we save one static argument).
  • We save two static argument in mapAccumL, mapAccumR, scanl, scanr, zipWith'. So, as explained in Santos's thesis, SAT optimization is certainly beneficial (not sure why GHC decided not to apply it automatically).
  • Finally, for splitWith we save one static argument and all additional closures are joins. Both versions of the function allocate one letrec closure, but previously it had 7 arguments and only 4 now.

Cool. I'm sold. Much appreciated.

@sjakobi
Copy link
Member

sjakobi commented Sep 8, 2020

Thanks for the background info and investigation, everyone! :)

Ping @hsyl20!

@vdukhovni
Copy link
Contributor

Thanks for the background info and investigation, everyone! :)

Ping @hsyl20!

Indeed it would now perhaps be interesting to know whether the compiler has enough smarts to notice that the values are just passed around unmodified (even at the ends of the argument list), and could/should perhaps perform the same optimisation automatically when it already knows that it is add join points, and there's no associated closure allocation cost.

Copy link
Contributor

@hsyl20 hsyl20 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for this work! LGTM

@Bodigrim
Copy link
Contributor Author

Bodigrim commented Sep 9, 2020

All credit goes to @sergv. Thanks, @sergv!

@Bodigrim Bodigrim merged commit 35796e7 into haskell:master Sep 9, 2020
@Bodigrim Bodigrim deleted the optimize-folds branch September 9, 2020 17:25
@Bodigrim Bodigrim added this to the 0.11.0.0 milestone Sep 9, 2020
@treeowl
Copy link

treeowl commented Sep 27, 2020

Do these all/any rules actually fire? RULES matching on operator sections (especially of the (== a) variety) smell flaky. And matching on class methods (== as opposed to eqWord8) is at least historically troublesome.

@Bodigrim
Copy link
Contributor Author

@treeowl They do.

{-# OPTIONS_GHC -ddump-rule-firings #-}

module AnyZero (anyZero) where

import Prelude hiding (any)
import Data.ByteString (ByteString, any)

anyZero :: ByteString -> Bool
anyZero = any (== 0)

{-
Rule fired: Class op fromInteger (BUILTIN)
Rule fired: integerToWord (BUILTIN)
Rule fired: narrow8Word# (BUILTIN)
Rule fired: Class op == (BUILTIN)
Rule fired: ByteString specialise any (== x) (Data.ByteString)

==================== Tidy Core ====================
...
-- RHS size: {terms: 2, types: 0, coercions: 0, joins: 0/0}
anyZero :: ByteString -> Bool
anyZero = anyByte anyZero1
-}

@treeowl
Copy link

treeowl commented Sep 27, 2020

Nice surprise!

@treeowl
Copy link

treeowl commented Sep 27, 2020

I remember rules on class methods not only not firing but also screwing up other optimizations somehow. I guess they fixed it.

@chessai
Copy link
Member

chessai commented Sep 27, 2020 via email

@Bodigrim
Copy link
Contributor Author

Cf. #70.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Optimization potential in the folding functions
8 participants