Binary decoding is slow #1804
I've also tried looking at the Core, but it's fairly impenetrable to me. @Gabriel439 do you have any other suggestions for what we could do except ask for help on the […]

@sjakobi: I ran into the same issue and I also don't know how to make progress […]

I've opened well-typed/cborg#236.
While looking at `replicateM`, which is currently implemented as:

```haskell
replicateM :: Int -> Decoder s a -> Decoder s [a]
replicateM n d = Vector.toList <$> Vector.replicateM n d
```

I've also just tried this version:

```haskell
replicateM :: Int -> Decoder s a -> Decoder s [a]
replicateM n d = go n
  where
    go 0 = return []
    go n = do
      !x <- d
      (x :) <$> go (n - 1)
```

That seems to improve speed in the range of 5-10%. PR incoming. Given the sheer size of […]
@sjakobi: Another thing that might further improve performance is eliminating the intermediate list. In many cases the list returned by […]
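The idea of eliminating the intermediate list can be sketched as follows. This is a hedged illustration, not dhall's API: `decodeSetOf` is a hypothetical name, and IO again stands in for the `Decoder` monad. Instead of decoding n elements into a list and then calling `Set.fromList`, each element is inserted into the accumulator as it is decoded.

```haskell
{-# LANGUAGE BangPatterns #-}
import Data.IORef
import qualified Data.Set as Set

-- Hypothetical sketch: decode n elements straight into a Set, skipping
-- the intermediate list that Set.fromList would otherwise consume.
decodeSetOf :: Ord a => Int -> IO a -> IO (Set.Set a)
decodeSetOf n d = go n Set.empty
  where
    go 0 !acc = return acc
    go k !acc = do
      x <- d
      go (k - 1) (Set.insert x acc)

main :: IO ()
main = do
  ref <- newIORef [3, 1, 2, 2 :: Int]
  let next = atomicModifyIORef' ref (\(x:xs) -> (xs, x))
  s <- decodeSetOf 4 next
  print (Set.toList s)
```

Note this trades the list allocation for insertion order: `Set.insert` costs O(log n) per element regardless of input order, whereas `Set.fromList` can hit its O(n) fast path on sorted input.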
@Gabriel439 Yeah. For […] Regarding […]
@sjakobi: Well, if it is an expression protected by an integrity check, we could rely on the input being sorted. The […]
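If the sortedness guarantee holds, `containers` already exposes a constructor that exploits it: `Data.Map.fromDistinctAscList` builds the map in O(n) without comparing keys, versus `fromList`'s O(n log n) in the general case. A minimal sketch:

```haskell
import qualified Data.Map.Strict as Map

main :: IO ()
main = do
  let kvs = [("a", 1), ("b", 2), ("c", 3)] :: [(String, Int)]
      -- General case: no ordering assumption, O(n log n).
      general = Map.fromList kvs
      -- If an integrity check guarantees the keys arrive strictly
      -- ascending, the map can be built in O(n) with no key comparisons.
      sorted = Map.fromDistinctAscList kvs
  print (general == sorted)
```

The caveat is in the name: `fromDistinctAscList` requires strictly ascending, duplicate-free keys, and silently builds a broken map otherwise, so it is only safe where the input's order is actually guaranteed.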
This reduces timings for decoding the cache for https://raw.githubusercontent.com/vmchale/cpkg/0003ec5af2859008424aec06c2db0437a6a22ff9/pkgs/pkg-set.dhall with `dhall decode --quiet` by 8.7% to 12.3%. Context: #1804
I've tried this. It seems to actually reduce performance for my […]
@sjakobi: Also, if the profile is not very informative, you can fix that by adding extra cost annotations to the code using the […] Also, I highly recommend using the […]
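Manual cost annotations in GHC are written with the `SCC` pragma; the names given there show up in the `.prof` report when the program is built with profiling, which can break up an otherwise flat profile. A minimal sketch (the function is a stand-in, not dhall code):

```haskell
-- Manually placed cost centres: "sum_part" and "len_part" appear as
-- separate rows in the profiling report, attributing time and allocation
-- to each subexpression. Without -prof the pragmas are simply ignored.
costly :: [Int] -> Int
costly xs = {-# SCC "sum_part" #-} sum xs
          + {-# SCC "len_part" #-} length xs

main :: IO ()
main = print (costly [1, 2, 3])
```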
I've experimented a bit with the map decoding (the branch is here), but haven't seen any substantial speed-ups yet. I suspect that maybe benchmarking this stuff with […] I'll try some more profiling too, although I do wonder how the profiling might affect the structure of the generated code and its optimizations.
Profiling with the cost centres added in #1808 reveals what kind of expressions we spend most decoding time on:

* Prelude: […]
* cpkg: […]

Note that for […]

We could try optimizing this by […]
I also wonder why the builtins are encoded in this string format at all. It would be much cheaper if they were encoded with simple integer tags like the operators. Regarding the other strings, for variables, map keys, etc. we'd probably also profit from switching to UTF-8, for example by using […] For […]
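The integer-tag idea can be sketched as below. This is purely illustrative: the `Builtin` constructors are real dhall builtins, but the tag values are made up here, and the actual encoding uses strings like `"Natural/build"`.

```haskell
-- Hypothetical sketch: decoding builtins from small integer tags, the
-- way operators are encoded, instead of comparing decoded strings.
-- A tag lookup is a single bounds check and jump, with no string
-- allocation or byte-by-byte comparison.
data Builtin = NaturalBuild | NaturalFold | Sort
  deriving (Show, Eq)

decodeBuiltinTag :: Int -> Maybe Builtin
decodeBuiltinTag 0 = Just NaturalBuild
decodeBuiltinTag 1 = Just NaturalFold
decodeBuiltinTag 2 = Just Sort
decodeBuiltinTag _ = Nothing

main :: IO ()
main = print (map decodeBuiltinTag [0, 2, 99])
```

The trade-off is that integer tags would be a breaking change to the binary format, whereas the string encoding keeps builtins interchangeable with free variables.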
@sjakobi: The original rationale for encoding built-ins in that way was to treat them essentially the same as free variables. I agree that not decoding built-ins to UTF8 is probably a performance win […]
This seems to speed up decoding of Prelude and cpkg caches by 1-2%. Context: #1804.
* Use decodeUtf8ByteArray to avoid UTF16-encoding the scrutinee.
* Optimize the pattern matching by grouping the patterns by length.

GHC currently doesn't produce static length information for string literals. Consequently the pattern matching worked somewhat like this:

```haskell
s <- decodeString
let len_s = length s
if len_s == length "Natural/build" && sameBytes s "Natural/build"
    then return NaturalBuild
    else if len_s == length "Natural/fold" && sameBytes s "Natural/fold"
    ...
```

Decoding `Sort`, the most extreme case, would involve a total of 32 conditional jumps as a consequence of length comparisons alone. Judging by the Core, we can get that number down to 8 by grouping the patterns by length: one to check the length of the decoded string, and (unfortunately) still one each for the 7 candidate literals of length 4. The number of string content comparisons should be unchanged.

The result of these optimizations is that the time to decode the cache for cpkg is reduced by 7-9%. Decoding time for the Prelude goes down by 13-16%.

This also changes the builtin encoding to use encodeUtf8ByteArray in order to avoid UTF16-encoding and decoding the builtin strings. I didn't check the performance implications though.

Context: #1804.
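The length-grouping transformation described in that commit can be sketched with plain `String` and `case`/guards (the real code works on decoded byte arrays, and the `Builtin` type here is illustrative, though the builtin names are real):

```haskell
-- Sketch of grouping string patterns by length before comparing
-- contents: a single length dispatch replaces the per-candidate length
-- checks, and only candidates of the matching length are compared
-- character by character.
data Builtin = NaturalBuild | NaturalFold | Sort | Text'
  deriving (Show, Eq)

decodeBuiltin :: String -> Maybe Builtin
decodeBuiltin s = case length s of
  4  | s == "Sort"          -> Just Sort
     | s == "Text"          -> Just Text'
  12 | s == "Natural/fold"  -> Just NaturalFold
  13 | s == "Natural/build" -> Just NaturalBuild
  _                         -> Nothing

main :: IO ()
main = print (map decodeBuiltin ["Sort", "Natural/build", "Nope"])
```

When all guards on a length alternative fail, evaluation falls through to the final wildcard, so unknown names of a known length still decode to `Nothing`.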
In my test (which only involved […]
There have been a few reports of slow binary decoding performance on Discourse recently:
In general, people seem to get only about 10 MB/s.
I've tried profiling `dhall decode` with the new `--quiet` option, but the output only points at two functions from `cborg`: […]

Things to investigate: […]
Things to try:

* `Map` and `Set` deserialization in the style of haskell/containers#405 (Deserializing maps and sets)