Audit memory usage #739
I found and fixed one memory leak: IntersectMBO/ouroboros-network#1379. As I expected, it was indeed not an unforced thunk in our state. I'm not closing this issue yet, as there is still a memory leak in the node, e.g., 1835 MB of memory in use after syncing from epoch 0 to 79.
Experiment

As a reference point I performed the following experiment: add all the blocks from the first 112 epochs straight from disk to the ChainDB. No ChainSync or BlockFetch server/client involved, just streaming blocks from disk and adding them to the ChainDB (a sketch of this setup follows at the end of this comment). This took 46 minutes on my machine (with other experiments running in the background), which is roughly 50k blocks/min. At the end, the ChainDB was using 714 MB RAM.

Looking at the heap profile above, you can see that the memory usage grows but stabilises towards the end. Note the bands marked (*) in the legend. The uncropped, cleaned-up legend:
(*): These are the cached secondary indices, see IntersectMBO/ouroboros-network#1378.

Explanation of the memory usage

The memory growth is due to the ledger growing. The stabilisation of the growth is due to the improved coin selection algorithm. This corresponds nicely to the graph produced by @edsko (black bar = switch to the new algorithm):

My conclusion

There is no noticeable memory leak in the ChainDB. The memory usage of the ChainDB is proportional to the size of the ledger.

Next steps

While I'm pretty sure there is no memory leak in the ChainDB, it is still possible that other parts of consensus leak memory.
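For reference, a minimal sketch of what the experiment above amounts to. ChainDB, cdbAddBlock, and readEpoch are hypothetical stand-ins for the real ouroboros-consensus API, not the code that was actually run:

```haskell
module StreamBlocks (streamEpochsIntoChainDB) where

import Control.Monad (forM_)

-- Hypothetical stand-in for the real ChainDB handle from ouroboros-consensus;
-- only the one operation this experiment needs.
newtype ChainDB m blk = ChainDB
  { cdbAddBlock :: blk -> m ()   -- assumed shape of the "add block" operation
  }

-- | Read every block of the given epoch files from disk and feed them straight
-- into the ChainDB, bypassing ChainSync and BlockFetch entirely.
streamEpochsIntoChainDB
  :: ChainDB IO blk
  -> (FilePath -> IO [blk])  -- ^ how to decode all blocks in one epoch file
  -> [FilePath]              -- ^ epoch files, e.g. the first 112 epochs
  -> IO ()
streamEpochsIntoChainDB chainDB readEpoch epochFiles =
  forM_ epochFiles $ \epochFile -> do
    blocks <- readEpoch epochFile
    -- One block at a time: at most one epoch's worth of blocks is alive here;
    -- whatever else stays in memory is held by the ChainDB itself.
    mapM_ (cdbAddBlock chainDB) blocks
```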
I'm now running more experiments, including disabling the logging and monitoring layer (as I did in this commit). After syncing from scratch for 1 minute:
I'm still running a longer experiment to confirm it, but I suspect that the remaining memory leak is in the logging & monitoring code. Note that simply tracing the messages by printing them to stdout cannot cause a memory leak, as each message is garbage collected after it has been printed. So the leak must be in the processing of the data, maybe in the collection of the metrics?
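To make the "printing alone cannot leak" point concrete, here is a minimal sketch in the contra-tracer style used by the node; the real Tracer type is richer, but the idea is the same:

```haskell
module StdoutTracer (Tracer (..), stdoutTracer) where

-- A minimal stand-in for the contra-tracer style of tracing (assumption: the
-- real library's Tracer is more elaborate).
newtype Tracer m a = Tracer { traceWith :: a -> m () }

-- Each message is printed and then immediately becomes garbage: nothing holds
-- a reference to it afterwards, so printing by itself cannot leak memory.
stdoutTracer :: Show a => Tracer IO a
stdoutTracer = Tracer (putStrLn . show)

-- Example: traceWith stdoutTracer ("AddedBlockToQueue", 12345 :: Int)
```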
@ArturWieczorek reported that using IntersectMBO/cardano-node@085e7b7, syncing took 3 hours 38 minutes for 163 epochs (the 164th epoch took an additional 9 minutes). The memory usage during that time: 717 MB. This is with monitoring enabled. In my recent experiments (different from the comment above), the memory usage after a sync of 112 epochs (https://github.com/input-output-hk/cardano-mainnet-mirror/tree/master/epochs) with monitoring disabled is also around 700 MB.

@dcoutts @CodiePP I'm now thinking that the monitoring code is not leaking memory, but is just slow to process the trace messages. So during syncing it uses more memory, as it keeps a bunch of unprocessed trace messages in memory. But at the end all messages get processed, so the memory usage reduces to normal again (?). Disclaimer: I have not verified this.

@CodiePP Maybe you can verify this? Isn't it true that there is a limit on the message queue, and that messages are discarded when it is full? If so, then we could verify my theory by setting this limit very low and checking whether the memory usage is indeed lower.
The bounded queues will only keep a reference to a log item until the processing thread has decided how to handle it; then the item is released from the queue. Unless we observe an overflow of these queues, they probably have no influence on memory usage. I will run a test on this as you proposed.
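A minimal sketch of the behaviour described above, using stm's TBQueue as a stand-in for the real iohk-monitoring queue (the names, capacity, and timings are made up for illustration):

```haskell
module BoundedTraceQueue (main) where

import Control.Concurrent (forkIO, threadDelay)
import Control.Concurrent.STM
import Control.Monad (forM_, forever)

main :: IO ()
main = do
  queue <- newTBQueueIO 16            -- set this very low to test the theory
  -- Slow consumer: simulates the monitoring back-ends falling behind.
  _ <- forkIO $ forever $ do
    msg <- atomically (readTBQueue queue)
    threadDelay 10000                 -- 10 ms of "processing" per message
    putStrLn ("processed: " ++ msg)
  -- Fast producer: simulates the node emitting trace messages while syncing.
  forM_ [1 :: Int .. 1000] $ \i ->
    atomically $ do
      full <- isFullTBQueue queue
      if full
        then return ()                -- overflow: the message is dropped
        else writeTBQueue queue ("trace message " ++ show i)
  threadDelay 1000000                 -- give the consumer a moment to drain
```

With a tiny capacity, messages are dropped rather than accumulated, so if the theory is right the memory usage during syncing should be noticeably lower.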
QA now reported that …
Related: IntersectMBO/ouroboros-network#1386
To confirm that there are no leaks in consensus and the network layer (I had already checked that the ChainDB was leak-free), I ran the node with the logging and monitoring layer disabled (*).

(*) Note that I'm still tracing the same output the node is tracing, I'm just printing it to stdout.

It took 5h 6m on my development machine (while actively being used) over a reasonable 4G connection (don't get me started on why my ISP needs +2 months to install cable at my house 😠). That's longer than I'd want, but it might be server and/or network bound (if it's not CPU bound on the client).

I tracked the memory usage of the node. The x-axis is linear; it shows the number of measurements, one every 10 seconds. Note that memory usage stabilises at 983 MB. It is likely that connecting to more nodes concurrently, both as a client and as a server, requires more memory.

The output of the runtime statistics:
The same productivity as in my other measurements. Since we're constantly downloading, it does make sense that we need to GC a lot. But we have IntersectMBO/ouroboros-network#1392 to check whether we're not allocating too much.

My conclusion

There are no noticeable leaks in consensus and the network layer. The only remaining part of the node, which was disabled during this experiment, is the logging and monitoring layer. So if there is a memory leak in the node, it must be there.
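For completeness, one way (Linux-only, and not necessarily the tool used for the measurements above) to sample a running node's resident set size every 10 seconds:

```haskell
module SampleRss (main) where

import Control.Concurrent (threadDelay)
import Control.Monad (forever)
import System.Environment (getArgs)

-- Poll VmRSS from /proc/<pid>/status every 10 seconds and print it.
main :: IO ()
main = do
  [pid] <- getArgs
  forever $ do
    status <- readFile ("/proc/" ++ pid ++ "/status")
    mapM_ putStrLn [ l | l <- lines status, take 6 l == "VmRSS:" ]
    threadDelay (10 * 1000000)        -- one measurement every 10 seconds
```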
@CodiePP reports that the node uses 1.25 GB of memory after an hour. This also has an impact on chain sync speed: it slows down from the initial ~500 blocks/s to ~250 blocks/s.
It would surprise me if it were a space leak as in "we have unforced thunks in our state", because the NoUnexpectedThunks checks pass (unless we're not checking some other forms of state). Maybe we're simply holding on to more and more memory? For example, the ledger will grow. We should investigate.