-
Notifications
You must be signed in to change notification settings - Fork 141
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Convert folds to take two arguments #345
Conversation
I can certainly reproduce your benchmarks with roughly the same results. Before:
After:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good. Please add comments (not haddock ones) explaining reasons behind these lambdas + update .hlint.yaml
.
@Bodigrim I added comments, updated hlint and also I applied the same transformation to similar functions. Every benchmark improves quite a bit as functions are actually getting inlined now. |
Great! Could you please quote results, at least for the functions with pre-existing benchmarks? |
Here are the relevant benchmarks (just "folds" in BenchAll): Old results:
New results:
|
I added your other suggestions too @Bodigrim and have put it all into one commit. |
I do not quite understand why |
My only guess is that filter/scan getting inlined means that the predicate/accumulator can also get inlined and that can be much better optimized later in the pipeline that core. I really don't know enough to say though. |
Yeah, sounds plausible, probably inlining a predicate allows to avoid boxing/unboxing of |
What I'm trying to grasp is how much of the speed up can be attributed purely to the deficiencies of eta-reduced expressions used in benchmarks. It's probably correct that the most drastic impact is not due to inlining itself, but due to complete elimination of memory allocation. It would be nice to have a robust way to utilise this, not relying on fragile pragams. Is it an evidence to expose a set of functions, taking callbacks of form |
These two callbacks can be encoded as a 256 bit mask and a uint8_t[256*256] lookup table respectively and the implementation of some folds can be done in C entirely. |
@ethercrow yeah, interesting idea. One could have a Template Haskell routine, precomputing masks/lookup tables during compilation and generating the most efficient code for each particular case. @Boarders could you please check, whether replacing |
@Bodigrim: These are the results if we just replace INLINE by INLINABLE:
In particular, it looks like it works for |
OK, let's deal with folds first. Hopefully, it will provide us with some insight. To summarise discussed above and in tickets #23 and #329, we have definitions like foldl' f z bs = <...>
{-# INLINE foldl' #-} and benchmarks of form benchFoldl' = foldl' (+) 0 This synthetic benchmark does not perform well, because |
Curiouser and curiouser! In GHC 9.0 performance gains for |
If they've become slower than they were with GHC 8.10, that would probably be worth reporting to the GHC devs. |
I assume |
I've not looked in details, but it could be related to https://gitlab.haskell.org/ghc/ghc/-/issues/18993 |
This same issue has been posted about on GHC issue trackers for |
Let's merge this. Bottom line is that this patch increases inlining opportunities, is the most conservative way to do so, and does not make anything worse. Benchmarks are clearly in favor of it. And While it would be interesting to understand why When this is merged, we can raise a question about GHC 9.0 regression in GHC issue tracker. There is a bunch of inlining regressions in 9.0.1, I wonder if 9.0.2 would be better. |
@Bodigrim When I next get a free cycle(which shouldn't be too long) I will experiment with your unboxed callback idea and @ethercrow's suggestion to see what they do for the benchmarks. For the latte, I will merely say whether the improvement is particularly good using some naive approach and then make an issue where it can be decided how best to utilise that technique. |
Thanks, @Boarders! |
This pull request changes the fold functions of the form:
fold f z bs
tofold f z = \bs -> ...
fixing issue #329.On my machine it looked like this significantly improved the folds benchmarks but I am unsure if that is because I was running in a noisy environment or some other reason e.g.
Before:
After:
This was done using
ghc-8.10.2
.