Audit memory usage #739

Open
mrBliss opened this issue Dec 19, 2019 · 7 comments


mrBliss commented Dec 19, 2019

@CodiePP reports that the node uses 1.25 GB of memory after an hour. The growing memory usage also has an impact on chain sync speed: it slows down from the initial ~500 blocks/s to ~250 blocks/s.

It would surprise me if it were a space leak as in "we have unforced thunks in our state", because the NoUnexpectedThunks checks pass (unless there are other forms of state we are not checking). Maybe we're simply holding on to more and more memory? For example, the ledger will grow. We should investigate.
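For reference, the thunk checks mentioned above work along the lines of the sketch below, using what is nowadays the nothunks package (the standalone successor of the NoUnexpectedThunks machinery). The state type here is a made-up stand-in, not the actual consensus state.

```haskell
{-# LANGUAGE DeriveAnyClass #-}
{-# LANGUAGE DeriveGeneric  #-}

import GHC.Generics (Generic)
import NoThunks.Class (NoThunks, unsafeNoThunks)

-- Hypothetical strict state type, standing in for the real consensus state.
data SomeState = SomeState
  { counter :: !Int
  , label   :: !String
  }
  deriving (Generic, NoThunks)

-- Returns 'Nothing' when the value contains no unexpected thunks,
-- or 'Just info' describing the first thunk found.
checkState :: SomeState -> Maybe String
checkState st = show <$> unsafeNoThunks st
```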

@mrBliss mrBliss self-assigned this Dec 23, 2019

mrBliss commented Dec 27, 2019

I found and fixed one memory leak: IntersectMBO/ouroboros-network#1379. As I expected, it was indeed not an unforced thunk in our state.

I'm not closing this issue yet, as there is still a memory leak in the node: e.g., 1835 MB of memory in use after syncing from epoch 0 to 79.


mrBliss commented Dec 27, 2019

Experiment

As a reference point I performed the following experiment: add all the blocks from the first 112 epochs straight from disk to the ChainDB. No ChainSync or BlockFetch server/client involved, just streaming blocks from disk and calling ChainDB.addBlock.
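In code, the experiment boils down to a strict left fold that feeds every block to ChainDB.addBlock. A minimal sketch of that loop, with the add action passed in as a parameter (illustrative only; the real db-bench sets up a ChainDB and streams the epoch files from disk):

```haskell
{-# LANGUAGE BangPatterns #-}

import Data.Foldable (foldlM)

-- Feed every block to an 'addBlock'-like action (ChainDB.addBlock in the real
-- experiment) and count how many blocks were added. No ChainSync or BlockFetch
-- involved; the block list is assumed to be streamed from disk.
addAllBlocks :: Monad m => (blk -> m ()) -> [blk] -> m Int
addAllBlocks addBlock = foldlM step 0
  where
    step !n blk = do
      addBlock blk
      pure (n + 1)
```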

This took 46 minutes on my machine (with other experiments running in the background). This is roughly 50k blocks/min. At the end, the ChainDB was using 714 MB RAM.

[Heap profile of the db-bench run]

Looking at the heap profile above, you can see that the memory usage grows but stabilises towards the end. Note that the bands mentioning ouroboros-consensus remain stable. The growing bands correspond to the ledger, as the types in question come from the cardano-ledger package or are Data.Map, which is used a lot in the ledger (to store the UTxO, for example).

The uncropped, cleaned-up legend:

  • ARR_WORDS
  • Data.Map.Internal.Bin
  • Cardano.Chain.UTxO.Compact.CompactTxInUtxo
  • Ouroboros.Storage.ImmutableDB.Impl.Index.Secondary (*)
  • Cardano.Chain.UTxO.Compact.CompactTxOut
  • Ouroboros.Storage.ImmutableDB.Types.WithBlockSize (*)
  • Base.Block.Base.Block
  • Ouroboros.Storage.ImmutableDB.Types.Block (*)
  • MUT_ARR_PTRS_FROZEN_CLEAN
  • Data.ByteString.Internal.PS
  • Ouroboros.Storage.VolatileDB.Types.InternalBlockInfo
  • Data.Sequence.Internal.Node3
  • GHC.Word.W64#
  • Cardano.Crypto.Signing.Redeem.Compact.CompactRedeemVerificationKey
  • Data.Set.Internal.Bin

(*): These are the cached secondary indices, see IntersectMBO/ouroboros-network#1378.
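For context on why the other bands grow: the Byron UTxO is, at its core, a Data.Map from (compact) transaction inputs to (compact) outputs, so the UTxO alone accounts for most of the Data.Map and Cardano.Chain.UTxO.Compact closures above. A very rough sketch, with placeholder types named after the constructors in the legend (the real definitions live in cardano-ledger):

```haskell
import Data.Map.Strict (Map)

-- Opaque stand-ins for the types from Cardano.Chain.UTxO.Compact seen in the
-- legend; the real definitions pack hashes and addresses compactly (hence the
-- ARR_WORDS band).
data CompactTxInUtxo = CompactTxInUtxo   -- placeholder
data CompactTxOut    = CompactTxOut      -- placeholder

-- Simplified shape of the Byron UTxO: one Map entry per unspent output, so a
-- growing UTxO directly produces growing Data.Map.Internal.Bin,
-- CompactTxInUtxo and CompactTxOut bands in the profile.
newtype UTxO = UTxO (Map CompactTxInUtxo CompactTxOut)
```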

Explanation of the memory usage

The memory growth is due to the ledger growing. The stabilisation of the growth is due to the improved coin selection algorithm. This corresponds nicely to the graph produced by @edsko (black bar = switch to new algorithm):
[Ledger size graph: ledgerdb-full.svg]

My conclusion

There is no noticeable memory leak in the ChainDB. The memory usage of the ChainDB is proportional to the size of the ledger.

Next steps

While I'm pretty sure there is no memory leak in the ChainDB, it is still possible that other parts of consensus leak memory.

  • Mempool: only grows in size when transactions are submitted. No transactions were being submitted and I still saw the memory leak.
  • ChainSyncClient: we check for thunks already and I traced the candidate fragment length which hovered around k, as expected.
  • BlockFetchClient: ?

I'm now running more experiments, including disabling the logging and monitoring layer (as I did in this commit). After syncing from scratch for 1 minute:

  • with monitoring: 669 MB RAM used, at slot 54302
  • without monitoring: 194 MB RAM used, at slot 62447

I'm still running a longer experiment to confirm it, but I suspect that the remaining memory leak is in the logging & monitoring code.

Note that simply tracing the messages by printing them to stdout cannot cause a memory leak, as each message is garbage collected after printing it. So the leak must be in the processing of the data, maybe in the collection of the metrics?
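Concretely, the stdout-only tracing boils down to something like the following minimal sketch with contra-tracer's Tracer (the real node composes its tracers differently): each message is rendered and printed synchronously inside traceWith, so nothing is retained afterwards.

```haskell
import Control.Tracer (Tracer, stdoutTracer, traceWith)
import Data.Functor.Contravariant (contramap)

-- A tracer that renders each event with 'show' and prints it to stdout
-- immediately; the message becomes garbage as soon as the line is written,
-- so this kind of tracing cannot accumulate memory by itself.
showTracer :: Show a => Tracer IO a
showTracer = contramap show stdoutTracer

main :: IO ()
main = mapM_ (traceWith showTracer) [1 .. 5 :: Int]
```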


mrBliss commented Jan 6, 2020

@ArturWieczorek reported that, using IntersectMBO/cardano-node@085e7b7, syncing 163 epochs took 3 hours 38 minutes (the 164th epoch took an additional 9 minutes). The memory usage during that time: 717 MB.

This is with monitoring enabled.

In my recent experiments (different from the comment above), the memory usage after a sync of 112 epochs (https://github.com/input-output-hk/cardano-mainnet-mirror/tree/master/epochs) with monitoring disabled is also around 700 MB.

@dcoutts @CodiePP I'm now thinking that the monitoring code is not leaking memory, but is just slow to process the trace messages. So during syncing it uses more memory, as it keeps a bunch of unprocessed trace messages in memory. But at the end all messages get processed, so the memory usage drops back to normal (?). Disclaimer: I have not verified this.

@CodiePP Maybe you can verify this? Isn't it true that there is a limit on the message queue and that messages beyond that limit are discarded? If so, we could verify my theory by setting this limit very low and checking whether the memory usage is indeed lower.


CodiePP commented Jan 6, 2020

The bounded queues only keep a reference to a log item until the processing thread has decided how to handle it; then the item is released from the queue. Unless we observe an overflow of these queues, they probably have no influence on memory usage. I will run a test on this as you proposed.
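To make the proposed test concrete: the behaviour described is a bounded queue that drops items on overflow, so the queue itself can never retain more than its configured capacity. A minimal sketch of that behaviour using a TBQueue (illustrative only, not the actual iohk-monitoring implementation):

```haskell
import Control.Concurrent.STM
  (TBQueue, atomically, isFullTBQueue, newTBQueueIO, readTBQueue, writeTBQueue)

-- Enqueue a log item unless the bounded queue is full; returns whether the
-- item was accepted. With a small bound, the memory held by the queue is
-- capped, which is what the proposed experiment would verify.
tryEnqueue :: TBQueue a -> a -> IO Bool
tryEnqueue q x = atomically $ do
  full <- isFullTBQueue q
  if full
    then pure False             -- overflow: drop the message
    else do writeTBQueue q x
            pure True

main :: IO ()
main = do
  q <- newTBQueueIO 2                       -- very low limit, as suggested above
  accepted <- mapM (tryEnqueue q) ["a", "b", "c"]
  print accepted                            -- [True,True,False]
  drained <- atomically (readTBQueue q)
  putStrLn drained                          -- "a"
```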


mrBliss commented Jan 6, 2020

QA now reported that cardano-node-1.3.0 is repeatedly running out of memory with a 3 GB heap. It seems that there is still a leak in the node, but, as I have argued above, I doubt it is in consensus.


mrBliss commented Jan 7, 2020

Related: IntersectMBO/ouroboros-network#1386


mrBliss commented Jan 7, 2020

To confirm that there are no leaks in consensus and the network layer (I had already checked that the ChainDB was leak-free), I used cardano-node (-N2) with monitoring disabled (similar to how I did it in IntersectMBO/cardano-node@3406ffe) (*) to sync all blocks (last slot is 3609830) from mainnet using this topology.

(*) Note that I'm still tracing the same output the node is tracing, I'm just printing it to stdout instead of going through the logging & monitoring layer.

It took 5h 6m on my development machine (while actively being used) over a reasonable 4G connection (don't get me started on why my ISP needs +2 months to install cable at my house 😠). That's longer than I'd want, but it might be server- and/or network-bound (if it's not CPU-bound on the client).

I tracked the memory usage of the node using ps (I forgot about +RTS -h for a sec), and got the following graph:

[Memory usage graph: memusage]

The x-axis is linear; it shows the number of measurements, one every 10 seconds. Note that memory usage stabilises at 983 MB. It is likely that connecting to more nodes concurrently, both as a client and as a server, would require more memory.
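The ps sampling itself is straightforward; a rough sketch of such a sampler (illustrative, not necessarily the exact invocation used) polls the resident set size of the node's PID every 10 seconds:

```haskell
import Control.Concurrent (threadDelay)
import Control.Monad (forever)
import System.Environment (getArgs)
import System.Process (readProcess)

-- Print the resident set size (in kB) of the given PID every 10 seconds,
-- using `ps -o rss= -p <pid>`; one line per sample, suitable for plotting.
main :: IO ()
main = do
  [pid] <- getArgs
  forever $ do
    rss <- readProcess "ps" ["-o", "rss=", "-p", pid] ""
    putStrLn (takeWhile (/= '\n') (dropWhile (== ' ') rss))
    threadDelay (10 * 1000 * 1000)   -- 10 seconds
```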

The output of +RTS -s:


32,765,905,908,992 bytes allocated in the heap
1,742,564,124,968 bytes copied during GC
     375,216,064 bytes maximum residency (1332 sample(s))
       9,997,912 bytes maximum slop
             357 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     19933663 colls, 19933663 par   14689.603s  7931.848s     0.0004s    0.0516s
  Gen  1      1332 colls,  1331 par   264.312s  136.733s     0.1027s    0.2148s

  Parallel GC work balance: 26.53% (serial 0%, perfect 100%)

  TASKS: 1166 (1 bound, 16 peak workers (1165 total), using -N2)

  SPARKS: 0(0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.001s  (  0.001s elapsed)
  MUT     time  18356.205s  (10345.012s elapsed)
  GC      time  14953.915s  (8068.581s elapsed)
  EXIT    time    0.001s  (  0.007s elapsed)
  Total   time  33310.121s  (18413.600s elapsed)

  Alloc rate    1,785,004,383 bytes per MUT second

  Productivity  55.1% of total user, 56.2% of total elapsed

The same productivity as in my other measurements. Since we're constantly downloading, it does make sense that we need to GC a lot. But we have IntersectMBO/ouroboros-network#1392 to check that we are not allocating more than necessary.
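For reference, the productivity figure is simply mutator time divided by total time: 18356.205 s / 33310.121 s ≈ 55.1% of user time and 10345.012 s / 18413.600 s ≈ 56.2% of elapsed time; the remainder is spent in GC.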

My conclusion

There are no noticeable leaks in consensus and the network layer. The only remaining part of the node, which was disabled during this experiment, is the logging and monitoring layer. So if there is a memory leak in the node, it must be there.

@mrBliss mrBliss removed their assignment Jun 30, 2020
@dnadales dnadales transferred this issue from IntersectMBO/ouroboros-network Dec 1, 2023
@dnadales dnadales moved this to 🔖 Ready in Consensus Team Backlog Dec 1, 2023