Optimize folds #273
Conversation
Force-pushed from abf36cb to 31e1122
Benchmarking appeared to be too tempting. Here are results, and I think they are hilarious!
Holy crap.
…On Fri, Aug 28, 2020, 13:02 Bodigrim ***@***.***> wrote:
Benchmarking appeared to be too tempting. Here are results, and I think
they are hilarious! foldl' is roughly 5-10% faster and foldr' is 10-15%
faster. Next, filter is 15-20% faster, mapAccumL, mapAccumR and scanl are
20-25% faster, while scanr is 25-30% faster. It would be fantastic to
merge this before 0.11.0.0 release.
Benchmark Before After Speedup
----------------------------------------------
foldl'/96 328.1 ns 328.0 ns 0.03%
foldl'/192 694.2 ns 671.5 ns 3.27%
foldl'/384 1.398 μs 1.333 μs 4.65%
foldl'/768 2.786 μs 2.658 μs 4.59%
foldl'/1536 5.556 μs 5.323 μs 4.19%
foldl'/3072 11.08 μs 10.43 μs 5.87%
foldl'/6144 22.15 μs 20.94 μs 5.46%
foldl'/12288 44.73 μs 41.36 μs 7.53%
foldl'/24576 89.31 μs 82.52 μs 7.60%
foldl'/49152 180.7 μs 164.2 μs 9.13%
foldr'/96 386.3 ns 339.9 ns 12.01%
foldr'/192 742.6 ns 658.3 ns 11.35%
foldr'/384 1.452 μs 1.297 μs 10.67%
foldr'/768 2.879 μs 2.569 μs 10.77%
foldr'/1536 5.746 μs 5.123 μs 10.84%
foldr'/3072 11.54 μs 10.33 μs 10.49%
foldr'/6144 23.23 μs 20.32 μs 12.53%
foldr'/12288 46.08 μs 40.24 μs 12.67%
foldr'/24576 92.97 μs 80.99 μs 12.89%
foldr'/49152 187.0 μs 162.0 μs 13.37%
filter/96 421.7 ns 352.0 ns 16.53%
filter/192 807.4 ns 670.6 ns 16.94%
filter/384 1.587 μs 1.298 μs 18.21%
filter/768 3.146 μs 2.618 μs 16.78%
filter/1536 6.314 μs 5.136 μs 18.66%
filter/3072 12.45 μs 10.20 μs 18.07%
filter/6144 24.72 μs 20.28 μs 17.96%
filter/12288 49.51 μs 40.25 μs 18.70%
filter/24576 98.68 μs 80.45 μs 18.47%
filter/49152 201.1 μs 161.9 μs 19.49%
mapAccumL/96 528.3 ns 402.5 ns 23.81%
mapAccumL/192 1.023 μs 777.7 ns 23.98%
mapAccumL/384 2.021 μs 1.508 μs 25.38%
mapAccumL/768 4.004 μs 3.033 μs 24.25%
mapAccumL/1536 8.027 μs 6.032 μs 24.85%
mapAccumL/3072 16.02 μs 11.97 μs 25.28%
mapAccumL/6144 32.08 μs 23.72 μs 26.06%
mapAccumL/12288 64.24 μs 47.95 μs 25.36%
mapAccumL/24576 125.7 μs 95.53 μs 24.00%
mapAccumL/49152 258.2 μs 195.4 μs 24.32%
mapAccumR/96 512.7 ns 397.0 ns 22.57%
mapAccumR/192 989.2 ns 761.9 ns 22.98%
mapAccumR/384 1.937 μs 1.506 μs 22.25%
mapAccumR/768 3.951 μs 2.995 μs 24.20%
mapAccumR/1536 7.999 μs 5.944 μs 25.69%
mapAccumR/3072 15.54 μs 11.76 μs 24.32%
mapAccumR/6144 31.44 μs 23.33 μs 25.80%
mapAccumR/12288 61.30 μs 46.41 μs 24.29%
mapAccumR/24576 120.9 μs 93.71 μs 22.49%
mapAccumR/49152 246.2 μs 194.0 μs 21.20%
scanl/96 477.3 ns 373.0 ns 21.85%
scanl/192 925.4 ns 702.9 ns 24.04%
scanl/384 1.830 μs 1.389 μs 24.10%
scanl/768 3.557 μs 2.760 μs 22.41%
scanl/1536 7.205 μs 5.453 μs 24.32%
scanl/3072 14.24 μs 11.06 μs 22.33%
scanl/6144 28.79 μs 22.24 μs 22.75%
scanl/12288 56.75 μs 43.64 μs 23.10%
scanl/24576 115.1 μs 86.57 μs 24.79%
scanl/49152 239.4 μs 181.1 μs 24.35%
scanr/96 469.5 ns 342.8 ns 26.99%
scanr/192 915.0 ns 653.5 ns 28.58%
scanr/384 1.797 μs 1.285 μs 28.49%
scanr/768 3.606 μs 2.539 μs 29.59%
scanr/1536 7.133 μs 5.110 μs 29.36%
scanr/3072 14.17 μs 10.08 μs 28.86%
scanr/6144 28.16 μs 20.22 μs 28.20%
scanr/12288 57.23 μs 40.76 μs 28.78%
scanr/24576 112.5 μs 81.62 μs 27.45%
scanr/49152 235.4 μs 165.6 μs 29.65%
Thanks for picking this up, @Bodigrim! :)
The speed-ups are amazing!
IIUC all these changes do is lift static arguments out of the inner worker functions. Isn't this something that GHC ought to be able to do automatically?! Is this a compiler bug?
I noticed that you didn't add benchmarks for the following functions:
foldl
foldr
any
all
spanByte
splitWith
zipWith'
findIndexOrEnd
No idea whether any of these need additional scrutiny though.
Regarding correctness, I assume that all the changed functions have reasonable test coverage, but I haven't checked.
BTW, how did you produce the table of benchmark results, @Bodigrim? This is a very nice format.
I do not expect GHC to be able to do this automatically, certainly not when the constant arguments are the trailing ones. The closest optimization I know of is call-pattern specialization, but it does not cover our case. One can check the Core generated for these functions:

{-# LANGUAGE BangPatterns #-}
module Lib (countFromTo, countFromTo') where
countFromTo :: Int -> Int -> Int
countFromTo !from !to = go 0 from to
  where
    go !acc !f !t
      | f == t = acc
      | otherwise = go (acc + 1) (f + 1) t

countFromTo' :: Int -> Int -> Int
countFromTo' !from !to = go 0 from
  where
    go !acc !f
      | f == to = acc
      | otherwise = go (acc + 1) (f + 1)

The inner loop of countFromTo compiles to:

$wgo :: Int# -> Int# -> Int# -> Int#
$wgo
= \ (ww_s1xz :: Int#) (ww1_s1xD :: Int#) (ww2_s1xH :: Int#) ->
case ==# ww1_s1xD ww2_s1xH of {
__DEFAULT -> $wgo (+# ww_s1xz 1#) (+# ww1_s1xD 1#) ww2_s1xH;
1# -> ww_s1xz
}

Notice that the worker still takes three arguments: the constant ww2_s1xH is threaded through every recursive call instead of being captured once.

It does not make much sense to benchmark the lazy foldl and foldr.
There are tests, but I still encourage reviewers to verify changes closely.
By hand :)
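As an editorial aside, the lifting discussed above can be sketched on a plain list fold (hypothetical names, not the actual bytestring code): the first version threads the static combining function through every recursive call, the second lets the inner loop close over it once.

```haskell
-- Sketch of the static-argument lifting under discussion
-- (hypothetical names, not the bytestring implementation).

-- Before: the static argument 'g' is pushed on every recursive call.
foldlPassing :: (b -> a -> b) -> b -> [a] -> b
foldlPassing f z0 xs0 = go z0 xs0 f
  where
    go z []     _ = z
    go z (x:xs) g = let z' = g z x in z' `seq` go z' xs g

-- After: 'f' is captured once in the closure of 'go'.
foldlCaptured :: (b -> a -> b) -> b -> [a] -> b
foldlCaptured f = go
  where
    go z []     = z
    go z (x:xs) = let z' = f z x in z' `seq` go z' xs
```

Both compute the same result; the difference only shows up in the Core and in the generated loop, as in the countFromTo example above.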
Anyone else willing to take a look, please? @hsyl20 @vdukhovni @fumieval
        y <- peekByteOff p2 n
        pokeByteOff r n (f x y)
        zipWith_ (n+1) p1 p2 r
    go p1 p2 = zipWith_ 0
A general question: under what conditions is it actually better to avoid passing fixed arguments on the stack in recursive calls, and instead allocate a closure around the fixed arguments, which avoids having to copy them into each stack frame?
I must admit I don't really understand what the compiler does here, but if the new version has to perform closure allocations that the old version does not, then perhaps the cost of carrying the extra arguments around is worth the ultimate reduction in GC pressure.
If the benchmarks only measure speed, but don't measure GC pressure, we may not be doing the right thing here (and elsewhere...). The benchmarks would have to run long enough to take the GC overhead into account...
I think it is always preferable to avoid passing fixed arguments (unless it is a particularly degenerate case). An additional closure (if any) is allocated only once per call, which is not a big deal at all, but additional arguments are passed for each element of the string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Closure allocation is, however, much more expensive than pushing a parameter onto the stack. So for sufficiently short strings the savings may be small, and the ultimate GC costs non-negligible if one processes a lot of short strings, or even a lot of strings, period. Sometimes it is better to pay the price as you go than to accumulate a lot of garbage to clean up later, stalling the program.
Mind you, I'd like to hear from @hsyl20 et al. on what the actual tradeoffs are here. I don't yet have a clear enough sense of when the various costs and benefits come into play...
Yes, closure allocation can be costly. I have a patch for containers that removes this kind of closure for a lookup in IntMap.
In this specific zipWith' case, we don't allocate any closure, as far as I can tell from looking at the STG: create and zipWith_ are inlined and transformed into join points.
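To make the loop shape concrete, here is a self-contained editorial sketch (simplified and hypothetical, not the library's actual zipWith') in which the inner loop closes over the three pointers instead of receiving them on each recursive call:

```haskell
import Data.Word (Word8)
import Foreign.Marshal.Alloc (allocaBytes)
import Foreign.Marshal.Array (peekArray, withArray)
import Foreign.Storable (peekByteOff, pokeByteOff)

-- Hypothetical sketch of the zipWith_ loop shape discussed above.
-- The pointers p1, p2, r are static arguments of the inner loop 'go',
-- so 'go' closes over them rather than passing them each iteration.
zipWithBytes :: (Word8 -> Word8 -> Word8) -> [Word8] -> [Word8] -> IO [Word8]
zipWithBytes f xs ys =
  withArray xs $ \p1 ->
  withArray ys $ \p2 ->
  allocaBytes len $ \r -> do
    let go n
          | n >= len  = pure ()
          | otherwise = do
              x <- peekByteOff p1 n :: IO Word8
              y <- peekByteOff p2 n :: IO Word8
              pokeByteOff r n (f x y)
              go (n + 1)
    go 0
    peekArray len r
  where
    len = min (length xs) (length ys)
```

With optimisation on, one would hope (as hsyl20 observes for the real code) that such a loop compiles to a join point with no closure allocation, but that should be verified with -ddump-stg rather than assumed.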
Force-pushed from 31e1122 to 6b1bb6e
This optimization is called the Static Argument Transformation (SAT). It is implemented in GHC, but is only applied to functions that have two or more static arguments. There is a performance measurement and some discussion in Compilation by Transformation for Non-Strict Functional Languages by André Santos. As you can see in the thesis, SAT is not always a win. EDIT: fixed the broken link to the thesis.
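For illustration, here is a hypothetical function (not from the patch) in the shape that GHC's SAT pass targets: the inner loop has two static arguments, passed unchanged through every recursive call.

```haskell
-- 'h' and 's' are static arguments of 'go': they never change across
-- recursive calls.  With two of them, GHC's SAT pass may lift them out
-- of the loop; with only one static argument, it leaves the code alone.
countUp :: Int -> Int -> Int -> Int
countUp lo hi step = go 0 lo hi step
  where
    go acc n h s
      | n >= h    = acc
      | otherwise = let acc' = acc + 1 in acc' `seq` go acc' (n + s) h s
```

The manual rewrites in this PR perform the same lifting by hand, which also covers the single-static-argument loops that GHC's pass skips.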
Thanks! (Though the above link no longer works, I was able to find another copy.)
OK, so according to Santos' thesis, the only downside of SAT is a suspected increase in heap allocations, when we run a function many times on short inputs.
Join point optimization comes to the rescue. For example:

foo :: ByteString -> Int
foo
= \ (w_svf0 :: ByteString) ->
case w_svf0 of { BS ww1_svf3 ww2_svf4 ww3_svf5 ->
let {
end_svdU :: Addr#
end_svdU = plusAddr# ww1_svf3 ww3_svf5 } in
join {
exit_Xj :: Int# -> State# RealWorld -> Int
exit_Xj (ww4_sveT :: Int#) (w1_sveQ :: State# RealWorld)
= case touch# ww2_svf4 w1_sveQ of { __DEFAULT -> I# ww4_sveT } } in
joinrec {
$wgo_sveZ :: Int# -> Addr# -> State# RealWorld -> Int
$wgo_sveZ (ww4_sveT :: Int#)
(ww5_sveX :: Addr#)
(w1_sveQ :: State# RealWorld)
= case eqAddr# ww5_sveX end_svdU of {
__DEFAULT ->
case readWord8OffAddr# ww5_sveX 0# w1_sveQ of
{ (# ipv1_a6d7, ipv2_a6d8 #) ->
jump $wgo_sveZ
(+# ww4_sveT (word2Int# ipv2_a6d8))
(plusAddr# ww5_sveX 1#)
ipv1_a6d7
};
1# -> jump exit_Xj ww4_sveT w1_sveQ
}; } in
jump $wgo_sveZ 0# ww1_svf3 realWorld#
}

Notice that end_svdU is not an argument of $wgo_sveZ at all: the join point simply refers to it from its environment, and join points do not allocate closures.
And here are benchmarks for short strings. The new version is again consistently faster:
[benchmark table not captured in this export]
Force-pushed from 066fbde to fe5f08d
Thanks for that; some clarity is emerging. The only remaining question for me is whether all of the modified functions fall into this case, or whether for some reason some slip through the cracks.
Cool. I'm sold. Much appreciated.
Thanks for the background info and investigation, everyone! :) Ping @hsyl20!
Indeed, it would now be interesting to know whether the compiler has enough smarts to notice that the values are just passed around unmodified (even at the ends of the argument list), and could, or should, perform the same optimisation automatically in the cases where it already knows it is adding join points and there is no associated closure-allocation cost.
Thanks a lot for this work! LGTM
Do these changes interact well with the rewrite rules?
@treeowl They do.

{-# OPTIONS_GHC -ddump-rule-firings #-}
module AnyZero (anyZero) where
import Prelude hiding (any)
import Data.ByteString (ByteString, any)
anyZero :: ByteString -> Bool
anyZero = any (== 0)
{-
Rule fired: Class op fromInteger (BUILTIN)
Rule fired: integerToWord (BUILTIN)
Rule fired: narrow8Word# (BUILTIN)
Rule fired: Class op == (BUILTIN)
Rule fired: ByteString specialise any (== x) (Data.ByteString)
==================== Tidy Core ====================
...
-- RHS size: {terms: 2, types: 0, coercions: 0, joins: 0/0}
anyZero :: ByteString -> Bool
anyZero = anyByte anyZero1
-}
Nice surprise!
I remember rules on class methods not only not firing but also screwing up other optimizations somehow. I guess they fixed it. |
I think the concern is reasonable, esp. on older GHCs. It would be
interesting if one could construct a minimal example now that causes a
relevant misfire.
Cf. #70.
This is #151 rebased, see explanations there. Closes #206. CC @sergv