Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Binary decoding is slow #1804

Open
4 tasks
sjakobi opened this issue May 21, 2020 · 13 comments
Open
4 tasks

Binary decoding is slow #1804

sjakobi opened this issue May 21, 2020 · 13 comments
Labels
caching performance All the slow and inefficient things

Comments

@sjakobi
Copy link
Collaborator

sjakobi commented May 21, 2020

There have been a few reports of slow binary decoding performance on Discourse recently:

In general, people seem to get only about 10 MB/s.

I've tried profiling dhall decode with the new --quiet option, but the output only points at two functions from cborg:

COST CENTRE            MODULE              SRC                                      %time %alloc

getDecodeAction        Codec.CBOR.Decoding src/Codec/CBOR/Decoding.hs:311:1-55       81.5   88.7
deserialiseIncremental Codec.CBOR.Read     src/Codec/CBOR/Read.hs:(165,1)-(167,46)   18.5   11.0

Things to investigate:

  • Might high rates of branch mispredictions slow us down?

Things to try:

@sjakobi sjakobi added performance All the slow and inefficient things caching labels May 21, 2020
@sjakobi
Copy link
Collaborator Author

sjakobi commented May 21, 2020

I've also tried looking at the core, but it's fairly impenetrable to me.

@Gabriel439 do you have any other suggestions what we could do except ask for help on the cborg issue tracker?

@Gabriella439
Copy link
Collaborator

@sjakobi: I ran into the same issue and I also don't know how to make progress

@sjakobi
Copy link
Collaborator Author

sjakobi commented May 21, 2020

I've opened well-typed/cborg#236.

@sjakobi
Copy link
Collaborator Author

sjakobi commented May 22, 2020

While looking at decodeExpressionInternal I noticed that we're frequently using Control.Monad.replicateM, which I suspected to be lazier than ideal. I tried replacing it with

replicateM :: Int -> Decoder s a -> Decoder s [a]
replicateM n d = Vector.toList <$> Vector.replicateM n d

Vector.replicateM has a Monad constraint instead of Applicative. But that didn't seem to impact performance at all.

I've also just tried this version:

replicateM :: Int -> Decoder s a -> Decoder s [a]
replicateM n d = go n
  where
    go 0 = return []
    go n = do
        !x <- d
        (x :) <$> go (n - 1)

That seems to improve speed in the range of 5-10%. PR incoming.


Given the sheer size of decodeExpressionInternal, I've also wondered whether it might be worth trying to move some things "off the hot path". I believe I've read some comments by Harendra Kumar regarding such an optimization. I don't really expect a lot of benefit from that, though.

@Gabriella439
Copy link
Collaborator

@sjakobi: Another thing that might further improve performance is eliminating the list intermediate. In many cases the list returned by replicateM is then being transformed into yet another data structure (such as nested Apps or a Dhall.Map). If we were to directly assemble the target data structure without going through the list intermediate it might be faster.

@sjakobi
Copy link
Collaborator Author

sjakobi commented May 22, 2020

@Gabriel439 Yeah. For ListLit we could try Data.Sequence.replicateM.

Regarding Map (and Set) we could look at haskell/containers#405, but I think we shouldn't rely on the input being in ascending order already.

@Gabriella439
Copy link
Collaborator

Gabriella439 commented May 22, 2020

@sjakobi: Well, if it is an expression protected by an integrity check we could rely on the input being sorted. The Serialise instance could make no such assumptions, but we could expose extra utilities outside that type-class that are faster for the case where the fields are sorted. It's probably worth checking to at least see if the sorted assumption leads to a performance improvement

@sjakobi
Copy link
Collaborator Author

sjakobi commented May 22, 2020

For ListLit we could try Data.Sequence.replicateM.

I've tried this. It seems to actually reduce performance for my cpkg benchmark, possibly due the Applicative constraint impacting strictness analysis.

@Gabriella439
Copy link
Collaborator

@sjakobi: Also, if the profile is not very informative, you can fix that by adding extra cost annotations to the code using the {-# SCC "myAnnotation" #-} pragma. That's what I do whenever I need to narrow things down. See also:

https://downloads.haskell.org/~ghc/latest/docs/html/users_guide/profiling.html#inserting-cost-centres-by-hand

Also, I highly recommend using the profiteur tool for better understanding a profile:

https://github.com/jaspervdj/profiteur

mergify bot pushed a commit that referenced this issue May 22, 2020
@sjakobi
Copy link
Collaborator Author

sjakobi commented May 23, 2020

I've experimented a bit with the map decoding (the branch is here), but haven't seen any substantial speed-ups yet. I suspect that maybe benchmarking this stuff with dhall decode --quiet isn't such a good idea after all, since that doesn't actually force the resulting expression. ;)

I'll try some more profiling too, although I do wonder how the profiling might affect the structure of the generated code and its optimizations.

sjakobi added a commit that referenced this issue May 23, 2020
…to aid profiling.

Context: #1804.
@sjakobi
Copy link
Collaborator Author

sjakobi commented May 23, 2020

Profiling with the cost centres added in #1808 reveals what kind of expressions we spend most decoding time on:

Prelude
COST CENTRE            MODULE                     SRC                                                 %time %alloc

deserialiseIncremental Codec.CBOR.Read            src/Codec/CBOR/Read.hs:(165,1)-(167,46)              22.2   17.9
dEI_builtin            Dhall.Binary               src/Dhall/Binary.hs:(159,51)-(197,91)                16.7   12.7
dEI_Merge              Dhall.Binary               src/Dhall/Binary.hs:(343,56)-(358,124)               16.7    2.5
dEI_Var-_-UInt         Dhall.Binary               src/Dhall/Binary.hs:(129,54)-(132,53)                11.1    7.2
dEI_App                Dhall.Binary               src/Dhall/Binary.hs:(236,54)-(245,56)                11.1   14.3
evalParser             Options.Applicative.Common src/Options/Applicative/Common.hs:(237,1)-(241,56)    5.6    0.0
dEI_RecordLit          Dhall.Binary               src/Dhall/Binary.hs:(372,60)-(382,75)                 5.6    2.2
dEI_Lam                Dhall.Binary               src/Dhall/Binary.hs:(247,54)-(270,118)                5.6    7.1
dEI_BoolLit            Dhall.Binary               src/Dhall/Binary.hs:(154,49)-(157,34)                 5.6    0.4
dEI_operator           Dhall.Binary               src/Dhall/Binary.hs:(297,59)-(321,47)                 0.0    1.5
dEI_Union              Dhall.Binary               src/Dhall/Binary.hs:(420,57)-(441,71)                 0.0    3.6
dEI_TextLit            Dhall.Binary               src/Dhall/Binary.hs:(503,59)-(513,63)                 0.0    1.3
dEI_Record             Dhall.Binary               src/Dhall/Binary.hs:(360,57)-(370,72)                 0.0   18.5
dEI_ListLit            Dhall.Binary               src/Dhall/Binary.hs:(323,58)-(334,92)                 0.0    2.9
dEI_Field              Dhall.Binary               src/Dhall/Binary.hs:(384,56)-(389,50)                 0.0    2.0
dEI_BoolIf             Dhall.Binary               src/Dhall/Binary.hs:(443,58)-(450,53)                 0.0    3.2
cpkg
COST CENTRE            MODULE          SRC                                      %time %alloc

deserialiseIncremental Codec.CBOR.Read src/Codec/CBOR/Read.hs:(165,1)-(167,46)   42.5   27.1
dEI_builtin            Dhall.Binary    src/Dhall/Binary.hs:(159,51)-(197,91)     15.4    8.3
dEI_TextLit            Dhall.Binary    src/Dhall/Binary.hs:(503,59)-(513,63)      9.6   13.7
dEI_Record             Dhall.Binary    src/Dhall/Binary.hs:(360,57)-(370,72)      7.1    7.6
dEI_Union              Dhall.Binary    src/Dhall/Binary.hs:(420,57)-(441,71)      5.0   16.5
dEI_Merge              Dhall.Binary    src/Dhall/Binary.hs:(343,56)-(358,124)     4.2    4.5
dEI_Field              Dhall.Binary    src/Dhall/Binary.hs:(384,56)-(389,50)      3.3    2.5
dEI_Lam                Dhall.Binary    src/Dhall/Binary.hs:(247,54)-(270,118)     2.9    5.5
dEI_Var-_-UInt         Dhall.Binary    src/Dhall/Binary.hs:(129,54)-(132,53)      2.5    3.0
dEI_RecordLit          Dhall.Binary    src/Dhall/Binary.hs:(372,60)-(382,75)      2.5    2.7
dEI_App                Dhall.Binary    src/Dhall/Binary.hs:(236,54)-(245,56)      2.1    4.2
dEI_BoolIf             Dhall.Binary    src/Dhall/Binary.hs:(443,58)-(450,53)      0.8    1.1
dEI_BoolLit            Dhall.Binary    src/Dhall/Binary.hs:(154,49)-(157,34)      0.4    1.7

Note that for dEI_Merge, dEI_Record, etc., much of the time and allocations are inherited from decoding the subexpressions.

dEI_builtin looks promising though. While looking at the Core, I had already noticed that this section of the code is absolutely massive. I think it's mostly the string (Text) matching that is so expensive.

We could try optimizing this by

  1. Not decoding to Text, thereby avoiding the conversion to UTF-16. Instead we could use decodeUtf8ByteArray.
  2. Optimize the string matching, for example by checking the length or the first character first.

I also wonder why the builtins are encoded in this string format at all. It would be much cheaper if they were encoded with simple integer tags like the operators.

Regarding the other strings, for variables, map keys, etc. we'd probably also profit to switching to UTF8, for example by using text-short. Of course we'd have to pay for conversion to UTF16 in other places then…


For App, I'll soon submit a patch that avoids the intermediate list representation.

@Gabriella439
Copy link
Collaborator

@sjakobi: The original rationale for encoding built-ins in that way was to treat them essentially the same as free variables

I agree that not decoding built-ins to UTF8 is probably a performance win

sjakobi added a commit that referenced this issue May 24, 2020
This seems to speed up decoding of Prelude and cpkg caches
by 1-2%.

Context: #1804.
sjakobi added a commit that referenced this issue May 24, 2020
This seems to speed up decoding of Prelude and cpkg caches
by 1-2%.

Context: #1804.
sjakobi added a commit that referenced this issue May 24, 2020
* Use decodeUtf8ByteArray to avoid UTF16-encoding the scrutinee.

* Optimize the pattern matching by grouping the patterns by length.

  GHC currently doesn't produce static length information for string
  literals. Consequently the pattern matching worked somewhat like this:

      s <- decodeString

      let len_s = length s

      if len_s == length "Natural/build" && sameBytes s "Natural/build"
          then return NaturalBuild
          else if len_s == length "Natural/fold" && sameBytes s "Natural/fold"
                   ...

  Decoding `Sort`, the most extreme case, would involve a total of 32
  conditional jumps as a consequence of length comparisons alone.

  Judging by the Core, we can get that number down to 8 by grouping
  the patterns by length: One to check the length of the decoded string,
  and (unfortunately) still one each for the 7 candidate literals of
  length 4.

  The number of string content comparisons should be unchanged.

The result of these optimizations is that the time to decode the cache for cpkg
is reduced by 7-9%. Decoding time for the Prelude goes down by 13-16%.

This also changes the builtin encoding to use encodeUtf8ByteArray in order
to avoid UTF16-encoding and decoding the builtins strings. I didn't check
the performance implications though.

Context: #1804.
mergify bot pushed a commit that referenced this issue May 24, 2020
This seems to speed up decoding of Prelude and cpkg caches
by 1-2%.

Context: #1804.
mergify bot added a commit that referenced this issue May 25, 2020
* Use decodeUtf8ByteArray to avoid UTF16-encoding the scrutinee.

* Optimize the pattern matching by grouping the patterns by length.

  GHC currently doesn't produce static length information for string
  literals. Consequently the pattern matching worked somewhat like this:

      s <- decodeString

      let len_s = length s

      if len_s == length "Natural/build" && sameBytes s "Natural/build"
          then return NaturalBuild
          else if len_s == length "Natural/fold" && sameBytes s "Natural/fold"
                   ...

  Decoding `Sort`, the most extreme case, would involve a total of 32
  conditional jumps as a consequence of length comparisons alone.

  Judging by the Core, we can get that number down to 8 by grouping
  the patterns by length: One to check the length of the decoded string,
  and (unfortunately) still one each for the 7 candidate literals of
  length 4.

  The number of string content comparisons should be unchanged.

The result of these optimizations is that the time to decode the cache for cpkg
is reduced by 7-9%. Decoding time for the Prelude goes down by 13-16%.

This also changes the builtin encoding to use encodeUtf8ByteArray in order
to avoid UTF16-encoding and decoding the builtins strings. I didn't check
the performance implications though.

Context: #1804.

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
sjakobi added a commit that referenced this issue May 25, 2020
…to aid profiling.

Context: #1804.
@bgamari
Copy link
Contributor

bgamari commented Jun 15, 2020

For ListLit we could try Data.Sequence.replicateM.

I've tried this. It seems to actually reduce performance for my cpkg benchmark, possibly due the Applicative constraint impacting strictness analysis.

In my test (which only involved --quiet) it was (IIRC) a very small, but still positive, effect.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
caching performance All the slow and inefficient things
Projects
None yet
Development

No branches or pull requests

3 participants