Audit memory usage #739

Open
mrBliss opened this issue Dec 19, 2019 · 7 comments


mrBliss commented Dec 19, 2019

@CodiePP reports that the node uses 1.25 GB of memory after an hour. The growing memory usage also has an impact on chain sync speed: it slows down from the initial ~500 blocks/s to ~250 blocks/s.

It would surprise me if it were a space leak as in "we have unforced thunks in our state", because the NoUnexpectedThunks checks pass (unless there are other forms of state we are not checking). Maybe we're simply holding on to more and more memory? For example, the ledger will grow. We should investigate.
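For reference, the thunk checks mentioned above work along the lines of the sketch below, using what is nowadays the nothunks package (the standalone successor of the NoUnexpectedThunks machinery). The state type here is a made-up stand-in, not the actual consensus state.

```haskell
{-# LANGUAGE DeriveAnyClass #-}
{-# LANGUAGE DeriveGeneric  #-}

import GHC.Generics (Generic)
import NoThunks.Class (NoThunks, unsafeNoThunks)

-- Hypothetical strict state type, standing in for the real consensus state.
data SomeState = SomeState
  { counter :: !Int
  , label   :: !String
  }
  deriving (Generic, NoThunks)

-- Returns 'Nothing' when the value contains no unexpected thunks,
-- or 'Just info' describing the first thunk found.
checkState :: SomeState -> Maybe String
checkState st = show <$> unsafeNoThunks st
```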

@mrBliss mrBliss self-assigned this Dec 23, 2019

mrBliss commented Dec 27, 2019

I found and fixed one memory leak: IntersectMBO/ouroboros-network#1379. As I expected, it was indeed not an unforced thunk in our state.

I'm not closing this issue yet, as there is still a memory leak in the node: e.g., 1835 MB of memory in use after syncing from epoch 0 to 79.


mrBliss commented Dec 27, 2019

Experiment

As a reference point I performed the following experiment: add all the blocks from the first 112 epochs straight from disk to the ChainDB. No ChainSync or BlockFetch server/client involved, just streaming blocks from disk and calling ChainDB.addBlock.
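In code, the experiment boils down to a strict left fold that feeds every block to ChainDB.addBlock. A minimal sketch of that loop, with the add action passed in as a parameter (illustrative only; the real db-bench sets up a ChainDB and streams the epoch files from disk):

```haskell
{-# LANGUAGE BangPatterns #-}

import Data.Foldable (foldlM)

-- Feed every block to an 'addBlock'-like action (ChainDB.addBlock in the real
-- experiment) and count how many blocks were added. No ChainSync or BlockFetch
-- involved; the block list is assumed to be streamed from disk.
addAllBlocks :: Monad m => (blk -> m ()) -> [blk] -> m Int
addAllBlocks addBlock = foldlM step 0
  where
    step !n blk = do
      addBlock blk
      pure (n + 1)
```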

This took 46 minutes on my machine (with other experiments running in the background). This is roughly 50k blocks/min. At the end, the ChainDB was using 714 MB RAM.

[Heap profile of the db-bench run]

Looking at the heap profile above, you can see that the memory usage grows but stabilises towards the end. Note that the bands mentioning ouroboros-consensus remain stable. The growing bands correspond to the ledger, as the types in question come from the cardano-ledger package or are Data.Map, which is used a lot in the ledger (to store the UTxO, for example).

The uncropped, cleaned-up legend:

  • ARR_WORDS
  • Data.Map.Internal.Bin
  • Cardano.Chain.UTxO.Compact.CompactTxInUtxo
  • Ouroboros.Storage.ImmutableDB.Impl.Index.Secondary (*)
  • Cardano.Chain.UTxO.Compact.CompactTxOut
  • Ouroboros.Storage.ImmutableDB.Types.WithBlockSize (*)
  • Base.Block.Base.Block
  • Ouroboros.Storage.ImmutableDB.Types.Block (*)
  • MUT_ARR_PTRS_FROZEN_CLEAN
  • Data.ByteString.Internal.PS
  • Ouroboros.Storage.VolatileDB.Types.InternalBlockInfo
  • Data.Sequence.Internal.Node3
  • GHC.Word.W64#
  • Cardano.Crypto.Signing.Redeem.Compact.CompactRedeemVerificationKey
  • Data.Set.Internal.Bin

(*): These are the cached secondary indices, see IntersectMBO/ouroboros-network#1378.
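For context on why the other bands grow: the Byron UTxO is, at its core, a Data.Map from (compact) transaction inputs to (compact) outputs, so the UTxO alone accounts for most of the Data.Map and Cardano.Chain.UTxO.Compact closures above. A very rough sketch, with placeholder types named after the constructors in the legend (the real definitions live in cardano-ledger):

```haskell
import Data.Map.Strict (Map)

-- Opaque stand-ins for the types from Cardano.Chain.UTxO.Compact seen in the
-- legend; the real definitions pack hashes and addresses compactly (hence the
-- ARR_WORDS band).
data CompactTxInUtxo = CompactTxInUtxo   -- placeholder
data CompactTxOut    = CompactTxOut      -- placeholder

-- Simplified shape of the Byron UTxO: one Map entry per unspent output, so a
-- growing UTxO directly produces growing Data.Map.Internal.Bin,
-- CompactTxInUtxo and CompactTxOut bands in the profile.
newtype UTxO = UTxO (Map CompactTxInUtxo CompactTxOut)
```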

Explanation of the memory usage

The memory growth is due to the ledger growing. The stabilisation of the growth is due to the improved coin selection algorithm. This corresponds nicely to the graph produced by @edsko (black bar = switch to new algorithm):
[Ledger size graph: ledgerdb-full.svg]

My conclusion

There is no noticeable memory leak in the ChainDB. The memory usage of the ChainDB is proportional to the size of the ledger.

Next steps

While I'm pretty sure there is no memory leak in the ChainDB, it is still possible that other parts of consensus leak memory.

  • Mempool: only grows in size when transactions are submitted. No transactions were being submitted and I still saw the memory leak.
  • ChainSyncClient: we check for thunks already and I traced the candidate fragment length which hovered around k, as expected.
  • BlockFetchClient: ?

I'm now running more experiments, including disabling the logging and monitoring layer (as I did in this commit). After syncing from scratch for 1 minute:

  • with monitoring: 669 MB RAM used, at slot 54302
  • without monitoring: 194 MB RAM used, at slot 62447

I'm still running a longer experiment to confirm it, but I suspect that the remaining memory leak is in the logging & monitoring code.

Note that simply tracing the messages by printing them to stdout cannot cause a memory leak, as each message is garbage collected after printing it. So the leak must be in the processing of the data, maybe in the collection of the metrics?
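Concretely, the stdout-only tracing boils down to something like the following minimal sketch with contra-tracer's Tracer (the real node composes its tracers differently): each message is rendered and printed synchronously inside traceWith, so nothing is retained afterwards.

```haskell
import Control.Tracer (Tracer, stdoutTracer, traceWith)
import Data.Functor.Contravariant (contramap)

-- A tracer that renders each event with 'show' and prints it to stdout
-- immediately; the message becomes garbage as soon as the line is written,
-- so this kind of tracing cannot accumulate memory by itself.
showTracer :: Show a => Tracer IO a
showTracer = contramap show stdoutTracer

main :: IO ()
main = mapM_ (traceWith showTracer) [1 .. 5 :: Int]
```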


mrBliss commented Jan 6, 2020

@ArturWieczorek reported that, using IntersectMBO/cardano-node@085e7b7, syncing 163 epochs took 3 hours 38 minutes (the 164th epoch took an additional 9 minutes). The memory usage during that time: 717 MB.

This is with monitoring enabled.

In my recent experiments (different from the comment above), the memory usage after a sync of 112 epochs (https://github.com/input-output-hk/cardano-mainnet-mirror/tree/master/epochs) with monitoring disabled is also around 700 MB.

@dcoutts @CodiePP I'm now thinking that the monitoring code is not leaking memory, but is just slow to process the trace messages. So during syncing it uses more memory, as it keeps a bunch of unprocessed trace messages in memory. But at the end all messages get processed, so the memory usage drops back to normal (?). Disclaimer: I have not verified this.

@CodiePP Maybe you can verify this? Isn't it true that there is a limit on the message queue and that messages beyond that limit are discarded? If so, we could verify my theory by setting this limit very low and checking whether the memory usage is indeed lower.


CodiePP commented Jan 6, 2020

The bounded queues only keep a reference to a log item until the processing thread has decided how to handle it; then the item is released from the queue. Unless we observe an overflow of these queues, they probably have no influence on memory usage. I will run a test on this as you proposed.
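To make the proposed test concrete: the behaviour described is a bounded queue that drops items on overflow, so the queue itself can never retain more than its configured capacity. A minimal sketch of that behaviour using a TBQueue (illustrative only, not the actual iohk-monitoring implementation):

```haskell
import Control.Concurrent.STM
  (TBQueue, atomically, isFullTBQueue, newTBQueueIO, readTBQueue, writeTBQueue)

-- Enqueue a log item unless the bounded queue is full; returns whether the
-- item was accepted. With a small bound, the memory held by the queue is
-- capped, which is what the proposed experiment would verify.
tryEnqueue :: TBQueue a -> a -> IO Bool
tryEnqueue q x = atomically $ do
  full <- isFullTBQueue q
  if full
    then pure False             -- overflow: drop the message
    else do writeTBQueue q x
            pure True

main :: IO ()
main = do
  q <- newTBQueueIO 2                       -- very low limit, as suggested above
  accepted <- mapM (tryEnqueue q) ["a", "b", "c"]
  print accepted                            -- [True,True,False]
  drained <- atomically (readTBQueue q)
  putStrLn drained                          -- "a"
```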


mrBliss commented Jan 6, 2020

QA now reported that cardano-node-1.3.0 is repeatedly running out of memory with a 3 GB heap. It seems that there is still a leak in the node, but, as I have argued above, I doubt it is in consensus.


mrBliss commented Jan 7, 2020

Related: IntersectMBO/ouroboros-network#1386


mrBliss commented Jan 7, 2020

To confirm that there are no leaks in consensus and the network layer (I had already checked that the ChainDB was leak-free), I used cardano-node (-N2) with monitoring disabled (similar to how I did it in IntersectMBO/cardano-node@3406ffe) (*) to sync all blocks (last slot is 3609830) from mainnet using this topology.

(*) Note that I'm still tracing the same output the node is tracing, I'm just printing it to stdout instead of going through the logging & monitoring layer.

It took 5h 6m on my development machine (while actively being used) over a reasonable 4G connection (don't get me started on why my ISP needs +2 months to install cable at my house 😠). That's longer than I'd want, but it might be server- and/or network-bound (if it's not CPU-bound on the client).

I tracked the memory usage of the node using ps (I forgot about +RTS -h for a sec), and got the following graph:

[Memory usage graph: memusage]

The x-axis is linear; it shows the number of measurements, one every 10 seconds. Note that memory usage stabilises at 983 MB. It is likely that connecting to more nodes concurrently, both as a client and as a server, would require more memory.
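The ps sampling itself is straightforward; a rough sketch of such a sampler (illustrative, not necessarily the exact invocation used) polls the resident set size of the node's PID every 10 seconds:

```haskell
import Control.Concurrent (threadDelay)
import Control.Monad (forever)
import System.Environment (getArgs)
import System.Process (readProcess)

-- Print the resident set size (in kB) of the given PID every 10 seconds,
-- using `ps -o rss= -p <pid>`; one line per sample, suitable for plotting.
main :: IO ()
main = do
  [pid] <- getArgs
  forever $ do
    rss <- readProcess "ps" ["-o", "rss=", "-p", pid] ""
    putStrLn (takeWhile (/= '\n') (dropWhile (== ' ') rss))
    threadDelay (10 * 1000 * 1000)   -- 10 seconds
```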

The output of +RTS -s:


32,765,905,908,992 bytes allocated in the heap
1,742,564,124,968 bytes copied during GC
     375,216,064 bytes maximum residency (1332 sample(s))
       9,997,912 bytes maximum slop
             357 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     19933663 colls, 19933663 par   14689.603s  7931.848s     0.0004s    0.0516s
  Gen  1      1332 colls,  1331 par   264.312s  136.733s     0.1027s    0.2148s

  Parallel GC work balance: 26.53% (serial 0%, perfect 100%)

  TASKS: 1166 (1 bound, 16 peak workers (1165 total), using -N2)

  SPARKS: 0(0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.001s  (  0.001s elapsed)
  MUT     time  18356.205s  (10345.012s elapsed)
  GC      time  14953.915s  (8068.581s elapsed)
  EXIT    time    0.001s  (  0.007s elapsed)
  Total   time  33310.121s  (18413.600s elapsed)

  Alloc rate    1,785,004,383 bytes per MUT second

  Productivity  55.1% of total user, 56.2% of total elapsed

The same productivity as in my other measurements. Since we're constantly downloading, it does make sense that we need to GC a lot. But we have IntersectMBO/ouroboros-network#1392 to check that we are not allocating more than necessary.
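For reference, the productivity figure is simply mutator time divided by total time: 18356.205 s / 33310.121 s ≈ 55.1% of user time and 10345.012 s / 18413.600 s ≈ 56.2% of elapsed time; the remainder is spent in GC.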

My conclusion

There are no noticeable leaks in consensus and the network layer. The only remaining part of the node, which was disabled during this experiment, is the logging and monitoring layer. So if there is a memory leak in the node, it must be there.

@mrBliss mrBliss removed their assignment Jun 30, 2020
@dnadales dnadales transferred this issue from IntersectMBO/ouroboros-network Dec 1, 2023
@dnadales dnadales moved this to 🔖 Ready in Consensus Team Backlog Dec 1, 2023