
Encoding allocates too many blocks in old generation #318

Open
adamgundry opened this issue May 12, 2023 · 3 comments

adamgundry commented May 12, 2023

I've been debugging an issue in a server application that stores a large (multi-gigabyte) state in memory and occasionally serialises it to disk using serialise. In principle this serialisation should be able to happen in constant space. However, when it does so, memory usage grows significantly and forces major GCs (which we want to avoid for performance reasons, as we're using the copying collector).

The following is an attempt to reproduce the issue (using only cborg). The script constructs some fully-evaluated data in memory, then serialises it to disk. During the serialisation stage (between the second and third vertical bars in the first graph) we see that live bytes remain constant (i.e. no unnecessary data is being retained on the heap), but the total blocks size grows significantly. Looking at the +RTS -Dg output, it appears that the growth is due to blocks going straight into generation 1, and hence not being freed until a major GC. I would expect instead that blocks allocated during serialisation rapidly become garbage, and hence stay in the nursery or generation 0 to be freed by a minor GC. Perhaps eager promotion is kicking in somewhere?

This reproduction applies to GHC 9.6.1, cborg 0.2.8.0 and bytestring 0.11.4.0, although I've also seen it on earlier versions. It's not obvious to me whether this is an issue in one of the libraries or GHC itself, but I found a related issue in serialise (#317) so this seemed like a reasonable place to report it.

Curiously this seems to depend on the structure of the state type. The problematic behaviour occurs if the state is a pair of lists of unit, but if the state is a single list of unit (using the definitions of mkState and encode guarded by #ifdef GOOD) then we get the expected behaviour, where the blocks size remains constant, as shown in the second graph. Exactly which state types work seems to vary somewhat with GHC/library versions and other details of the surrounding code.

Thanks to @bgamari and @mpickering for helping me get this far. Any further insights would be very welcome. 😁

Unexpected behaviour

[Graph: problematic block size growth]

Expected behaviour

With -DGOOD:

[Graph: expected behaviour]

Reproducer

CBORTest.hs:

{-# LANGUAGE CPP #-}

import Control.DeepSeq
import Debug.Trace
import System.Environment
import System.Mem

import qualified Codec.CBOR.Encoding as CBOR
import qualified Codec.CBOR.Write as CBOR

import qualified Data.ByteString.Builder as Builder

main :: IO ()
main = do
    [n] <- getArgs
    let st = mkState (read n)
    print (rnf st)
    traceMarkerIO "End rnf"
    performMajorGC
    traceMarkerIO "Start serialise"
    Builder.writeFile "test.cbor" $ CBOR.toBuilder $ encode st
    traceMarkerIO "End serialise"
    performMajorGC
    traceMarkerIO "Start second rnf"
    print (rnf st)


#ifdef GOOD

mkState :: Int -> [()]
mkState n = replicate (2*n) ()

encode :: [()] -> CBOR.Encoding
encode = encode'

#else

mkState :: Int -> ([()], [()])
mkState n = (replicate n (), replicate n ())

encode :: ([()], [()]) -> CBOR.Encoding
encode (xs, ys) = CBOR.encodeListLen 2 <> encode' xs <> encode' ys

#endif



encode' :: [()] -> CBOR.Encoding
encode' [] = CBOR.encodeListLen 0
encode' xs = CBOR.encodeListLenIndef <> Prelude.foldr (\_x r -> CBOR.encodeNull <> r) CBOR.encodeBreak xs
cbortest.cabal:

cabal-version:      2.4
name:               cbortest
version:            0.1.0.0

executable cbortest
    main-is:          CBORTest.hs

    build-depends:    base, cborg, deepseq, bytestring
    default-language: Haskell2010
    ghc-options: -debug -threaded -rtsopts -Wall

    if flag(good)
      cpp-options: -DGOOD

Flag good
  Description: Use the good version
  Default:     False
  Manual:      True
Run with:

cabal run cbortest -- 3000000 +RTS -s -l -N1 && eventlog2html cbortest.eventlog
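Since the executable is linked with -debug, the +RTS -Dg output mentioned above can be captured in much the same way (the debug output goes to stderr), for example:

cabal run cbortest -- 3000000 +RTS -Dg -N1 2> gc-debug.txt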
adamgundry (Member, Author) commented:

I think I now understand what is going on here.

To recap, cborg represents the encoded data as a Tokens structure, with efficient concatenation using the difference list trick:

data Tokens = TkEnd | TkNull Tokens | ...

newtype Encoding = Encoding (Tokens -> Tokens)
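
For context, concatenation of Encodings is essentially function composition of the Tokens continuations. The following is a paraphrase of the idea for illustration (assumed here, not necessarily cborg's exact code):

-- Concatenation composes functions, so <> is O(1); the Tokens stream only
-- materialises once a final continuation is supplied (a sketch using the
-- definitions above; cborg's real instances may differ in detail).
instance Semigroup Encoding where
  Encoding f <> Encoding g = Encoding (f . g)

instance Monoid Encoding where
  mempty = Encoding id

-- For example, with encodeNull = Encoding TkNull:
--   let Encoding f = encodeNull <> encodeNull in f TkEnd
--   ==> TkNull (TkNull TkEnd)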

Now consider this program representing a cut-down encoding (in reality we have a much bigger input type with multiple lists of data that will be encoded, but integers suffice to illustrate the problem):

encode :: Int -> CBOR.Encoding
encode n = CBOR.Encoding $ bar n . bar n

bar :: Int -> CBOR.Tokens -> CBOR.Tokens
bar 0 r = CBOR.TkNull r
bar m r = CBOR.TkNull (bar (m-1) r)

In order to convert the Encoding to actual binary data, CBOR.toBuilder will evaluate the thunk bar n t where t = bar n TkEnd is another thunk.

As this evaluates we will get lots of short-lived allocations of TkNull (bar m t) (a constructor containing a thunk storing the loop counter m). However, the thunk t, representing the "continuation" of the encoding once bar has finished, will be long-lived if n is large. Thus it will eventually be aged into the old generation.

When we finally reach the end of the first loop, we evaluate and immediately discard the thunk t. However, even though it is no longer retained, this thunk lives in the old generation, so the result of its evaluation cannot be collected until a major GC. Thus we start to expand the size of the old generation (via eager promotion) until a major GC is necessary.
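
Schematically, the behaviour described above plays out like this (a worked trace of the definitions in this comment, not actual RTS output):

-- toBuilder forces  (bar n . bar n) TkEnd,  i.e.  bar n t  where  t = bar n TkEnd:
--
--   bar n t  =  TkNull (bar (n-1) t)  =  TkNull (TkNull (bar (n-2) t))  =  ...
--
-- The TkNull cells and the inner (bar m t) thunks become garbage almost
-- immediately and are cleaned up by minor GCs, but t survives every minor GC
-- for the whole first loop and is aged into the old generation.  Once t is
-- finally forced, the second stream it expands into is reachable from an
-- old-generation object, so it gets promoted rather than freed.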

This is very reminiscent of the "conduit problem" (https://well-typed.com/blog/2016/09/sharing-conduit/). We do not (strictly) have a space leak, because major GC will reclaim the memory. But from the perspective of a minor GC, a thunk that has been aged into the old generation is effectively retained.

[Graph: memory behaviour of reproducer]

Simpler reproducer

import Debug.Trace
import System.Environment
import System.Mem

import qualified Codec.CBOR.Encoding as CBOR
import qualified Codec.CBOR.Write as CBOR

import qualified Data.ByteString.Builder as Builder
import System.IO

-- | Run the program with two independent parameters: size of the encoding and
-- size of the state, e.g.
--
-- > cabal run cbortest -- 4000000 1000000 +RTS -s -l -N1
--
main :: IO ()
main = do
    [m,n] <- getArgs
    let st = mkState (read n)
    print (last st)
    traceMarkerIO "End realising state"
    performMajorGC
    traceMarkerIO "Start serialise"
    withBinaryFile "test.cbor" WriteMode $
        flip Builder.hPutBuilder $
            CBOR.toBuilder $ encode (read m)
    traceMarkerIO "End serialise"
    performMajorGC
    traceMarkerIO "Done"
    print (head st)

-- | The only purpose of the state is to fill up the old generation with data
-- that is retained while encoding takes place.
mkState :: Int -> [()]
mkState n = replicate n ()

-- | Encode two streams of nulls of length n+1.
encode :: Int -> CBOR.Encoding
encode n = CBOR.Encoding $ bar n . bar n

bar :: Int -> CBOR.Tokens -> CBOR.Tokens
bar 0 r = CBOR.TkNull r
bar m r = CBOR.TkNull (bar (m-1) r)

adamgundry added a commit that referenced this issue May 18, 2023
This adds a new token type TkDup for use when encoding, that duplicates
the contained closure. This means that encodings can explicitly duplicate
thunks that may otherwise end up in the old generation.  It isn't obvious
to me whether this is enough or whether calling dupIO every time would be
better (and whether that might have an unreasonable performance cost).
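
For illustration, the shape of the change is roughly as follows (assumed names and a cut-down Tokens type, not the exact #319 code; dupIO :: a -> IO a comes from the well-typed/dupIO package, with the module name assumed here):

import Data.Dup (dupIO)  -- assumed module name for well-typed/dupIO

data Tokens
  = TkEnd
  | TkNull Tokens
  | TkDup Tokens   -- new: duplicate the rest of the stream before consuming it
                   -- (the real Tokens type has many more constructors)

-- A consumer that honours TkDup: it forces a *copy* of the remaining stream,
-- so the original thunk (which may already live in the old generation) is
-- never updated with the fully expanded token list.
countNulls :: Tokens -> IO Int
countNulls TkEnd         = pure 0
countNulls (TkNull rest) = succ <$> countNulls rest
countNulls (TkDup rest)  = dupIO rest >>= countNulls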
adamgundry (Member, Author) commented:

With #319, changing the encoder to duplicate the second bar thunk resolves the issue in my test case above:

encode :: Int -> CBOR.Encoding
encode n = CBOR.Encoding $ bar n . CBOR.TkDup . bar n

Open questions:

  • Where (if anywhere) should we insert TkDups in serialise? Or should we call dupIO for every loop iteration?
  • Can dupIO be fixed and released, and is it reasonable to depend upon it?

[Graph: nice flat blocks size]

adamgundry added a commit that referenced this issue May 24, 2023
This calls dupIO on the token stream each time the buffer is full.
adamgundry (Member, Author) commented:

A conversation with @edsko made me realise that we can potentially avoid the need for TkDup, as in #321. The main remaining question is whether it is reasonable to incur dupIO as a dependency. This is primarily blocked on well-typed/dupIO#9, although I think that in the case of cborg, well-typed/dupIO#20 should be an adequate workaround, since the token stream should not be a CAF as long as we are encoding dynamic data.

adamgundry added a commit that referenced this issue Aug 18, 2023
This calls dupIO before evaluating each token in the token stream.
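
For illustration, the idea is roughly the following (assumed names, reusing the cut-down Tokens/dupIO sketch above; not the actual #321 code):

-- Duplicate the unevaluated tail before forcing each token, so the shared
-- thunks are never updated in place and nothing stays pinned in the old
-- generation; an explicit TkDup marker then becomes unnecessary.
countNulls' :: Tokens -> IO Int
countNulls' ts = do
  ts' <- dupIO ts
  case ts' of
    TkEnd       -> pure 0
    TkNull rest -> succ <$> countNulls' rest
    TkDup rest  -> countNulls' rest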