break failing rewrite to breakByte and failing to eliminate boxing/unboxing #70

twhitehead · 2016-02-14T07:05:04Z

The following simple program (sucks in stdin in chunks of 32K and splits on newlines) demonstrates (I believe) some issues with break routine on GHC 7.10.3, 7.8.4, 7.6.3, 7.4.2, 7.2.2, and 7.0.4.

main :: IO ()
main = goEmpty stdin
  where

    goEmpty :: Handle -> IO ()
    goEmpty handle = do
      leftover <- DB.hGetSome handle 32768
      case DB.null leftover of
       True  -> return ()
       False -> goSome handle leftover

    goUnsure :: Handle -> ByteString -> IO ()
    goUnsure handle leftover = do
      case DB.null leftover of
       True  -> goEmpty handle
       False -> goSome  handle leftover

    goSome :: Handle -> ByteString -> IO ()
    goSome handle leftover = assert (not (DB.null leftover)) $ do
      let (prefix,suffix) = DB.break (== c2w '\n') leftover -- DB.breakByte (c2w '\n') leftover
      case DB.null suffix of
       True  -> goEmpty  handle
       False -> goUnsure handle (DB.tail suffix)

The rewrite rule is not translating the break (== c2w '\n') into breakByte (c2w '\n'). This is confirmed by -ddump-simpl or replacing break with breakByte and observing the performance difference.

{-# INLINE [1] break #-}
#endif

{-# RULES
"ByteString specialise break (x==)" forall x.
    break ((==) x) = breakByte x
"ByteString specialise break (==x)" forall x.
    break (==x) = breakByte x
  #-}

The optimizer is failing to eliminate the boxing/unboxing of the offset returned by the findIndexOrEnd call in the non-breakByte implementation. This is confirmed by -ddump-simpl or viewing the memory allocation stats on a run over a large amount of input.

break :: (Word8 -> Bool) -> ByteString -> (ByteString, ByteString)
break p ps = case findIndexOrEnd p ps of n -> (unsafeTake n ps, unsafeDrop n ps)

findIndexOrEnd :: (Word8 -> Bool) -> ByteString -> Int
findIndexOrEnd k (PS x s l) =
    accursedUnutterablePerformIO $
      withForeignPtr x $ \f ->
        go (f `plusPtr` s) 0
  where
    go !ptr !n | n >= l    = return l
               | otherwise = do w <- peek ptr
                                if k w
                                  then return n
                                  else go (ptr `plusPtr` 1) (n+1)

Note that the allocations from this boxing/unboxing can be quite significant. As an example, I have around 10.7GB of input data that my original code runs on. When split it on '\n', the system indicates the non-breakByte version does an additional 3GB of allocations over the 10.8GB used by the byteBreak version (which doesn't have the boxing/unboxing issues). Splitting on the more frequently occuring ':' pushes it up to 34GB of additional allocations (yes 34GB which is why I noticed it).

For completeness, here is the -O2 -ddump-simpl -dsuppress-all -dno-suppress-idinfo -output of the key loop for GHC 7.10.3 (ByteString 0.10.6.0). You can see it has failed to use the breakByte implementation. It has also failed to eliminate the boxing of the offset that findIndexOrEnd ($wa1_s3lI) returns (I# ww5_s3lG or l_a1r7 = I# ww3_s3lR) in the non-breakByte implementation`

$wa_r3nK
[GblId, Arity=6, Str=DmdType <L,U><L,U><L,U><L,U><L,U><L,U>]
$wa_r3nK =
  \ w_s3lJ ww_s3lO ww1_s3lP ww2_s3lQ ww3_s3lR w1_s3lL [OS=OneShot] ->
    let {
      l_a1r7
      [LclId, Str=DmdType m]
      l_a1r7 = I# ww3_s3lR } in
    letrec {
      $wa1_s3lI [InlPrag=[0], Occ=LoopBreaker]
      [LclId, Arity=3, Str=DmdType <L,U><L,U><L,U>]
      $wa1_s3lI =
        \ ww4_s3lC ww5_s3lG w2_s3lz [OS=OneShot] ->
          case tagToEnum# (>=# ww5_s3lG ww3_s3lR) of _ [Occ=Dead] {
            False ->
              case readWord8OffAddr# ww4_s3lC 0 w2_s3lz
              of _ [Occ=Dead] { (# ipv2_a1t8, ipv3_a1t9 #) ->
              case ipv3_a1t9 of _ [Occ=Dead] {
                __DEFAULT ->
                  $wa1_s3lI (plusAddr# ww4_s3lC 1) (+# ww5_s3lG 1) ipv2_a1t8;
                __word 10 -> (# ipv2_a1t8, I# ww5_s3lG #)
              }
              };
            True -> (# w2_s3lz, l_a1r7 #)
          }; } in
    case $wa1_s3lI (plusAddr# ww_s3lO ww2_s3lQ) 0 realWorld#
    of _ [Occ=Dead] { (# ipv_a1tv, ipv1_a1tw #) ->
    case touch# ww1_s3lP ipv_a1tv
    of _ [Occ=Dead, OS=OneShot] { __DEFAULT ->
    case ipv1_a1tw of _ [Occ=Dead] { I# y_a1xD ->
    let {
      dt3_a1r0
      [LclId, Str=DmdType]
      dt3_a1r0 = -# ww3_s3lR y_a1xD } in
    case tagToEnum# (<=# dt3_a1r0 0) of _ [Occ=Dead] {
      False ->
        let {
          dt1_X1rx
          [LclId, Str=DmdType]
          dt1_X1rx = -# dt3_a1r0 1 } in
        case tagToEnum# (<=# dt1_X1rx 0) of _ [Occ=Dead] {
          False ->
            $wa_r3nK
              w_s3lJ
              ww_s3lO
              ww1_s3lP
              (+# (+# ww2_s3lQ y_a1xD) 1)
              dt1_X1rx
              w1_s3lL;
          True -> main2 w_s3lJ w1_s3lL
        };
      True -> main2 w_s3lJ w1_s3lL
    }
    }
    }
    }

Thanks! -Tyson

PS: I don't see how this boxing/unboxing survived. Both I# ww5_s3lG and l_a1r7 = I# ww3_s3lR clearly cannot be bottom, and the $wa1_s3lI call site immediately turns around and unboxes the returned value. I also don't see why all the DmdType values are <L,U> as the functions clearly force many of their arguments to at least WHNF regardless of the path taken through them.

The text was updated successfully, but these errors were encountered:

twhitehead · 2016-02-17T04:05:38Z

I've spent some time digging into this and it seems the issue may be that the worker/wrapper transformation doesn't mange to unbox a value that occurs inside of the unboxed tuple.

Specifically, in this case, the boxing and unboxing associated with the offset computed by findIndexOrEnd is not eliminated by the worker/wrapper system as it is returned as part of an unboxed tupple due to being computed in an IO monad.

To see if this was possibly right I wrote a version of findIndexOrEnd that wraps the returned value outside of the IO monad instead of inside it. I couldn't figure out any way to do this other than to manually expand all the function definitions and write it using the various GHC.* modules as the IO monad is only parameterized over kind * (i.e., boxed and not unboxed types).

findIndexOrEnd :: (Word8 -> Bool) -> ByteString -> Int                                                 
findIndexOrEnd k (PS (ForeignPtr a# c) s (I# l#)) =                                                    
  case go a# l# 0# realWorld# of                                                                         
   (# s#, r #) -> case touch# c s# of                                                                    
     s# -> I# r                                                                                          
  where                                                                                                  
    go :: Addr# -> Int# -> Int# -> State# RealWorld -> (# State# RealWorld, Int# #)                      
    go a# l# n# s# = case tagToEnum# (n# >=# l#) :: Bool of                                              
                      True  -> (# s#, l# #)                                                              
                      False -> case readWord8OffAddr# a# n# s# of                                        
                                (# s#, w #) ->                                                           
                                  case k (W8# w) of                                                      
                                   True  -> (# s#, n# #)                                                 
                                   False -> go a# l# (n# +# 1#) s#

Compiling with -O2 and running against my 10.7G input file, the results are as follows

	Original	Unboxed	Optimized	Improvement
Time (ncg)	15.6s	12.7s	12.6s	1.2x
Time (llvm)	11.2s	8.3s	7.4s	1.5x
Memory	44.9G	10.8G	10.8G	4.2x

where optimized has go rewritten to count down instead of up to eliminate a register. I was also explicit about passing k but that makes no difference here as k is fully inlined and eliminated.

    go :: (Word8 -> Bool) -> Addr# -> Int# -> State# RealWorld -> (# State# RealWorld, Int# #)           
    go k a# n# s# = case tagToEnum# (n# <=# 0#) :: Bool of                                               
                   True  -> (# s#, n# #)                                                                 
                   False -> case readWord8OffAddr# a# 0# s# of                                           
                             (# s#, w #) ->                                                              
                               case k (W8# w) of                                                         
                                True  -> (# s#, n# #)                                                    
                                False -> go k (plusAddr# a# 1#) (n# -# 1#) s#

Assuming I'm correct that the underlying issue here is the compiler won't box and unbox under the unboxed tuple, I wonder if it would be possible to extend the worker/wrapper system to do so. Possibly there may be more code like this that could benefit from this sort of optimization?

dcoutts · 2016-03-04T15:05:21Z

Detailed writeup, thanks!

I don't have a whole lot of time at the moment to dig into this. I'd really appreciate if anyone else can uncover anything and suggest specific solutions (better, PRs).

alexbiehl · 2016-03-08T08:04:52Z

I filled a bug in GHC Trac: https://ghc.haskell.org/trac/ghc/ticket/11688

bgamari · 2016-03-08T10:02:50Z

I suspect the issue here is that (==) is being inlined too early, hiding the rule fire opportunity. By contrast, if you give (==) another name with a later inline pragma the break -> breakByte rule fires as expected.

main = do
  chunk <- DB.hGetSome stdin 32768
  let newline = c2w '\n'
  let (prefix,suffix) = DB.break (eq newline) chunk
  DB.hPut stdout prefix
  DB.hPut stdout suffix

eq :: Eq a => a -> a -> Bool
eq = (==)
{-# INLINE [1] eq #-}

{-# RULES
"ByteString specialise break (eq x)" forall x.
    DB.break (eq x) = DB.breakByte x
  #-}

Actually, even if one doesn't specify an explicit phase on eq the rule still fires! This is likely accidental, since GHC just happens to look at the breakByte rule before the eq inlining.

Previously these were matching on (==), which was rewritten by the class op rule before the breakByte rule had an opportunity to fire (haskell#70). Unfortunately fixing this requires that we change the Eq instances provided by GHC. This has been done in GHC 8.0.1 (base-4.9.0).

twhitehead · 2016-03-08T21:27:23Z

Thanks @alexbiehl for opening the GHC ticket and @bgamari for figuring out the rewrite issue.

This is just a quick note that, after writing this up and continuing to play with it, I eventually discovered GHC track ticket 2289 and friends (e.g., 1600, 2387, etc.) It seems the inability of the compiler to unbox multiple return values is a well know issue going back many years.

It's particularly annoying in this case though as there is really only one return value. The other is just the IO state token which we know the compiler will eventually throw away anyway. Until someone comes up with a compiler fix I don't think there is anything that can be done other than maybe using continuation passing to turn returning into calling when possible.

Previously these were matching on (==), which was rewritten by the class op rule before the breakByte rule had an opportunity to fire (haskell#70). Unfortunately fixing this requires that we change the Eq instances provided by GHC. This has been done in GHC 8.0.1 (base-4.9.0).

sjakobi · 2020-06-25T17:05:02Z

Was this issue fixed by #71 or is there more to do here?

twhitehead · 2020-06-28T15:37:21Z

There were two issues here

the rewrite rule was not translating the break (== c2w '\n') into breakByte (c2w '\n'), and
the fact GHC doesn't do nested CPR.

I think we can close this ticket though as the first looks like it was addressed by #71 and the second is a well known compiler issue that has it own ticket open.

bgamari mentioned this issue Mar 24, 2016

Fix break rewrite rules #71

Merged

Yuras mentioned this issue Apr 5, 2016

Eliminate boxing in takeWhile Yuras/scanner#4

Open

sjakobi added the performance label Jun 25, 2020

twhitehead closed this as completed Jun 28, 2020

Bodigrim mentioned this issue Sep 27, 2020

Optimize folds #273

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

break failing rewrite to breakByte and failing to eliminate boxing/unboxing #70

break failing rewrite to breakByte and failing to eliminate boxing/unboxing #70

twhitehead commented Feb 14, 2016

twhitehead commented Feb 17, 2016

dcoutts commented Mar 4, 2016

alexbiehl commented Mar 8, 2016

bgamari commented Mar 8, 2016

twhitehead commented Mar 8, 2016

sjakobi commented Jun 25, 2020

twhitehead commented Jun 28, 2020

break failing rewrite to breakByte and failing to eliminate boxing/unboxing #70

break failing rewrite to breakByte and failing to eliminate boxing/unboxing #70

Comments

twhitehead commented Feb 14, 2016

twhitehead commented Feb 17, 2016

dcoutts commented Mar 4, 2016

alexbiehl commented Mar 8, 2016

bgamari commented Mar 8, 2016

twhitehead commented Mar 8, 2016

sjakobi commented Jun 25, 2020

twhitehead commented Jun 28, 2020