Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

freeze sometimes performs better than unsafeFreeze #409

Closed
julmb opened this issue Aug 26, 2021 · 9 comments
Closed

freeze sometimes performs better than unsafeFreeze #409

julmb opened this issue Aug 26, 2021 · 9 comments

Comments

@julmb
Copy link

julmb commented Aug 26, 2021

I have encountered a situation where freeze is actually faster than unsafeFreeze. Minimal example:

import Control.Monad.ST
import Criterion.Main
import Data.Vector.Unboxed
import Data.Vector.Unboxed.Mutable

go :: Int -> Vector Double
go n = runST $ do
    pv <- unsafeNew n
    let w 0 i = return ()
        w n i = unsafeWrite pv i 42.0 >> w (n - 1) (i + 1)
    w n 0
    unsafeFreeze pv

main :: IO ()
main = defaultMain [bench "freeze" $ whnf go 1000000]

Compiled with ghc -O2 -fllvm freeze.hs, this gives me a mean running time of 985.8μs when using unsafeFreeze and 642.8μs when using freeze. How can this be? Shouldn't unsafeFreeze always be faster since it performs strictly less work?

I am using ghc 8.10.5 and llvm 12.0.1.

@lehins
Copy link
Contributor

lehins commented Aug 26, 2021

I can't exactly reproduce this problem on my machine, much will depend on llvm and ghc versions (please report yours).

When compiled with llvm I see that unsafeFreeze version is indeed slower, but barely:

$ stack exec -- ghc -O2 freeze.hs -fllvm -fforce-recomp
[1 of 1] Compiling Main             ( freeze.hs, freeze.o )
You are using an unsupported version of LLVM!
Currently only 7 is supported. System LLVM version: 6.0.1
We will try though...
Linking freeze ...
$ ./freeze 
benchmarking freeze
time                 1.434 ms   (1.383 ms .. 1.505 ms)
                     0.988 R²   (0.978 R² .. 0.997 R²)
mean                 1.428 ms   (1.402 ms .. 1.468 ms)
std dev              103.3 μs   (74.59 μs .. 144.5 μs)
variance introduced by outliers: 56% (severely inflated)

benchmarking unsafeFreeze
time                 1.559 ms   (1.532 ms .. 1.594 ms)
                     0.996 R²   (0.991 R² .. 1.000 R²)
mean                 1.537 ms   (1.527 ms .. 1.564 ms)
std dev              53.98 μs   (28.68 μs .. 108.7 μs)
variance introduced by outliers: 23% (moderately inflated)

However, when compiled natively the result is even better than llvm and unsafe version is almost double the performance:

$ stack exec -- ghc -O2 freeze.hs -fforce-recomp
[1 of 1] Compiling Main             ( freeze.hs, freeze.o )
Linking freeze ...
$ ./freeze 
benchmarking freeze
time                 1.490 ms   (1.469 ms .. 1.512 ms)
                     0.995 R²   (0.990 R² .. 0.998 R²)
mean                 1.558 ms   (1.526 ms .. 1.602 ms)
std dev              126.7 μs   (86.57 μs .. 164.1 μs)
variance introduced by outliers: 61% (severely inflated)

benchmarking unsafeFreeze
time                 895.3 μs   (878.4 μs .. 911.7 μs)
                     0.996 R²   (0.990 R² .. 0.999 R²)
mean                 910.7 μs   (896.4 μs .. 954.8 μs)
std dev              71.18 μs   (33.01 μs .. 144.1 μs)
variance introduced by outliers: 63% (severely inflated)

@julmb
Copy link
Author

julmb commented Aug 26, 2021

I have added version information to my original post.

I tried the native code generator and I get the following:

$ ghc -O2 freeze.hs && ./freeze
[1 of 1] Compiling Main             ( freeze.hs, freeze.o )
Linking freeze ...
benchmarking freeze
time                 819.5 μs   (806.3 μs .. 832.0 μs)
                     0.998 R²   (0.997 R² .. 0.999 R²)
mean                 830.8 μs   (822.5 μs .. 843.2 μs)
std dev              32.51 μs   (23.72 μs .. 52.26 μs)
variance introduced by outliers: 30% (moderately inflated)

benchmarking unsafeFreeze
time                 718.6 μs   (714.9 μs .. 721.6 μs)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 720.2 μs   (717.8 μs .. 723.8 μs)
std dev              9.737 μs   (6.709 μs .. 17.12 μs)

And with LLVM:

$ ghc -O2 -fllvm freeze.hs && ./freeze
[1 of 1] Compiling Main             ( freeze.hs, freeze.o )
You are using an unsupported version of LLVM!
Currently only 10 to 12 is supported. System LLVM version: 12.0.1
We will try though...
Linking freeze ...
benchmarking freeze
time                 658.7 μs   (650.8 μs .. 668.2 μs)
                     0.998 R²   (0.996 R² .. 0.999 R²)
mean                 670.7 μs   (663.4 μs .. 679.7 μs)
std dev              26.76 μs   (20.40 μs .. 36.87 μs)
variance introduced by outliers: 32% (moderately inflated)

benchmarking unsafeFreeze
time                 967.5 μs   (955.3 μs .. 975.9 μs)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 958.4 μs   (954.0 μs .. 963.7 μs)
std dev              16.22 μs   (13.12 μs .. 20.98 μs)

So the native code generator's unsafeFreeze is still slightly slower than freeze with LLVM. Although at least within the native code generator, things behave as expected. Also, we seem to be getting very different results courtesy of our different versions!

I am now even more puzzled. What could be causing such behavior and what can I do about it? I need LLVM as it performs much better than the native code generator on other parts of the software.

@andrewthad
Copy link
Contributor

The difference between freeze and unsafeFreeze is that freeze performs a memcpy and unsafeFreeze just flips a bit.

You may want to check alignment of the loop. One thing that can help here is marking go with a noinline pragma. Also, for LLVM, can you instruct GHC to dump the assembly (both for freeze and unsafeFreeze)? Just paste the assembly for the go function, not for main. You may need to add the noinline pragma before dumping assembly since otherwise, it'll get inlined, and you'll never be able to figure which code is for go.

@lehins
Copy link
Contributor

lehins commented Aug 26, 2021

I'd suggest to report this issue on ghc issue tracker however. There is nothing can be done in vector about this issue

FYI, just for my own sanity checked performance with PrimArray from primitive and it is identical

@lehins
Copy link
Contributor

lehins commented Aug 26, 2021

@julianbrunner Can you try this implementation instead. It fixed the issue for me:

go :: Int -> Vector Double
go n = runST $ do
    pv <- unsafeNew n
    let w 0 !i = return ()
        w n !i = unsafeWrite pv i 42.0 >> w (n - 1) (i + 1)
    w n 0
    unsafeFreeze pv

@Shimuuar
Copy link
Contributor

I can reproduce this with GHC8.10 and LLVM9. Results are summarized in table below. ! means version with strict accumulator as proposed by @lehins

|          | unsafeFreeze | freeze |
|----------+--------------+--------|
| NCG      |          550 |    730 |
| NCG(!)   |          390 |    740 |
| LLVM9    |         1550 |    730 |
| LLVM9(!) |          420 |    710 |

Performance with strict accumulator is essentially same for both NCG and LLVM. For lazy accumulator LLVM generates something bad. This I think bug in LLVM code generator and reason for poor performance could only be found by inspecting assembly.

@julmb
Copy link
Author

julmb commented Aug 27, 2021

@julianbrunner Can you try this implementation instead. It fixed the issue for me:

go :: Int -> Vector Double
go n = runST $ do
    pv <- unsafeNew n
    let w 0 !i = return ()
        w n !i = unsafeWrite pv i 42.0 >> w (n - 1) (i + 1)
    w n 0
    unsafeFreeze pv

Yes, this fixes the issue. I am still quite puzzled as to what is going on though. I might investigate some more when I get the chance.

@lehins
Copy link
Contributor

lehins commented Aug 27, 2021

@julianbrunner I suggest you open an issue on ghc issue tracker about this. It looks like ghc strictness analyzer is not interacting well with llvm, which I am more than sure is not Vector/ByteArray specific and can pop up in other scenarios.

I am closing this issue, as there is nothing we can do in vector to alleviate this, beside suggesting users to be more strict

@lehins lehins closed this as completed Aug 27, 2021
@julmb
Copy link
Author

julmb commented Aug 27, 2021

@julianbrunner I suggest you open an issue on ghc issue tracker about this. It looks like ghc strictness analyzer is not interacting well with llvm, which I am more than sure is not Vector/ByteArray specific and can pop up in other scenarios.

I am closing this issue, as there is nothing we can do in vector to alleviate this, beside suggesting users to be more strict

Done: https://gitlab.haskell.org/ghc/ghc/-/issues/20302

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants