Validator runs OOM if it cannot make roots #23061
Comments
One file per
In-memory representation seems fine here to me actually. Quick access, should the cluster decide to build off a disk-based bank, seems more important.
Thanks @mvines, I agree to both.
I would lean towards avoiding rocks in favor of one-bank-per-file for several reasons:
|
There's a lot of talk about banks, but what share of memory usage comes from banks vs. everything else? And is there one or a few things inside the bank that are expensive compared to everything else?
I have been trying to measure the size of the bank, and the number of banks created before running OOM. Something is not lining up. Number of banks: 700. There are a few |
I haven't had the chance to dig into this, but I have observed that nodes that are unable to catch up to the cluster balloon in memory usage. Somewhat recently, I saw a GCE node (24 vCPU (N2) / 128 GB RAM / separate SSDs for ledger / accounts / os+snapshot) that was losing ground against mainnet exhibit used-memory growth of 20-30 GB per day. The baseline memory usage was about 60 GB once it was ready to start replaying slots, so it ate through the other ~68 GB and OOM'd in several days. This node was still making roots / freezing, so to Stephen's point, there might be some other, larger offender(s) hogging memory.
I did some further calculations. To account for all the storage needed for the bank's sub-structs and hashmaps/vectors, I used
It still might be worth it to optimize |
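The tool used for that accounting isn't captured above, but the general point is that `std::mem::size_of` on a struct only counts its inline bytes; every hash map and vector hanging off the bank has to have its heap capacity added by hand. A minimal sketch of that kind of estimate (the value type and per-entry overhead constant are illustrative, not the bank's actual fields):

```rust
use std::collections::HashMap;
use std::mem;

// Rough heap estimate for a map: the inline struct plus capacity * per-entry
// bytes. The per-entry overhead below is a coarse guess for hashbrown's
// control bytes and padding, not an exact figure.
fn estimate_map_bytes<K, V>(map: &HashMap<K, V>) -> usize {
    let per_entry = mem::size_of::<K>() + mem::size_of::<V>() + 8;
    mem::size_of::<HashMap<K, V>>() + map.capacity() * per_entry
}

fn main() {
    let mut delegations: HashMap<u64, [u8; 64]> = HashMap::new();
    for i in 0..1_000_000u64 {
        delegations.insert(i, [0u8; 64]);
    }
    // size_of alone reports only a few dozen bytes; the heap estimate is
    // on the order of 100+ MiB for a million entries.
    println!("inline: {} bytes", mem::size_of::<HashMap<u64, [u8; 64]>>());
    println!("estimated heap: ~{} MiB", estimate_map_bytes(&delegations) / (1024 * 1024));
}
```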
I collected some heaptrack traces. Some memory leaks are being reported in RocksDB. The complete leak trace is at https://gist.github.com/pgarg66/f1b7379579bb3fa415515c47f83254b6#file-gistfile1-txt-L3330. The following seems like something we should look at:
A quick read of the RocksDB code seems to point at https://github.com/facebook/rocksdb/blob/c42d0cf862adeeeb22b311ef5a9851c2b06b711b/util/aligned_buffer.h#L160
This gets called via |
I am running with |
Looks like the memory leak traces reported above might be from panic stack traces. I think it's normal to see memory leaks when a program panics, since the program stops unexpectedly and leaves allocated memory unreleased.
And the majority of the leak traces point to BlockBasedTable::Open(), which makes sense when the program panics, as the memory allocated for SST files is left unreleased. This open call is one of the very first calls to RocksDB. It's a little hard to imagine that no one has reported a memory leak here, although there are still some possibilities. Maybe try to see why the validator panicked? Or, if we want to double-check whether any open calls from RocksDB cause the memory leak, the ledger tool is another quick option, as it also opens RocksDB.
As an experiment, I tried commenting out the creation of new banks in replay stage (in
The following capture has two data points. The first section in the graphs is when the banks were being created and added to bank_forks. The second is with the above change.
So it looks like offloading frozen banks to storage should help with this issue.
Please note that there are several places that are using copy-on-write semantics to save on runtime memory. In particular,
If we hit some degenerate scenario where banks are churned on and off the disk, then each bank will have its own copies, and that might exacerbate memory use.
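The specific copy-on-write fields aren't listed above, but the concern is easy to illustrate: `Arc`-shared state stays cheap only while banks keep pointing at the same allocation, and reloading a bank from disk would sever that sharing. A small sketch under those assumptions (the `Delegations` alias is hypothetical, not a real Bank field):

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Hypothetical stand-in for one large copy-on-write field on a bank.
type Delegations = Arc<HashMap<u64, u64>>;

fn new_child(parent: &Delegations) -> Delegations {
    Arc::clone(parent) // child bank shares the parent's map: O(1), no copy
}

fn record_change(delegations: &mut Delegations) {
    // Arc::make_mut clones the whole map if any other bank still holds a
    // reference, so the first write after a fork pays for a full copy.
    Arc::make_mut(delegations).insert(42, 1);
}

fn main() {
    let parent: Delegations = Arc::new((0..100_000).map(|i| (i, i)).collect());
    let mut child = new_child(&parent);
    record_change(&mut child); // parent is still alive, so this deep-copies

    // A bank reloaded from disk has no Arc link back to its parent, so every
    // reloaded bank owns a private copy of the same data.
    let reloaded: Delegations = Arc::new((*child).clone());
    assert_eq!(reloaded.len(), child.len());
}
```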
I am still experimenting with the bank to see if it is really causing the excessive RAM usage. The serialized bank (created using
Another thought was that accounts_db might be holding on to RAM. To rule that out, I hacked the code to not record the transactions. The system still ran OOM. So at the moment, it's not clear what the actual cause is. I have just ruled out a few parts of the code.
I collected another heaptrack log (https://gist.githubusercontent.com/pgarg66/f0e179ffa22b37922f69f78b8cd2c681/raw/89d68dd2ece3810ea1a0bd20170ee7a151aeefd0/heaptrack_print_memory_leaks). Leaks are reported at the following places. These may not be actual leaks, but rather memory that was still allocated when the validator exited. But that's what we are trying to analyze and potentially optimize.
|
Those last three seem very likely to be legit memory usage that we just don't clean up on exit. The first one is quite mysterious, though.
Yes, I totally agree about the last 3. Maybe something can be done to optimize the usage, or flush them to disk in certain scenarios (e.g. when the node is stuck in a partition and cannot make progress).
I have done some experiments with what seems to be false leak reporting of items related to the in-mem accounts index. In one, I padded how much we allocate for each of these entries and ran a mnb validator to see if memory usage leaked out of control faster than a normal validator. The results indicated we aren't really leaking anything. It is true that we aren't cleaning up the in-mem accounts index on exit. Using the disk-backed accounts index is a solution to this. We would then greatly reduce how much is being held in memory.
Thanks @jeffwashington, I'll rerun the test with the above configuration.
I collected a profiling trace using
Also, attaching the full PDF generated by the profiler.
StakeDelegations is using Arc to implement copy-on-write semantics: https://github.com/solana-labs/solana/blob/58c0db970/runtime/src/stake_delegations.rs#L14-L16
However, a single delegation change will still clone the entire hash-map, resulting in excessive memory use as observed in: solana-labs#23061 (comment)
This commit instead uses an immutable hash-map implementing structural sharing:
> which means that if two data structures are mostly copies of each other, most of the memory they take up will be shared between them.
https://docs.rs/im/latest/im/
With respect to
However, a single delegation change will still clone the entire hash-map, even though the hash-map will not change much. An alternative is to use immutable data structures, which implement structural sharing:
#23585 is testing such a change. The immutable hash-map is mostly a drop-in replacement, except that it does not implement rayon parallel iterators, so we need to copy into a vector here:
Besides that, with respect to performance, the documentation claims that:
but that also needs to be tested.
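A minimal sketch of the behavior being proposed, using the `im` crate (the key/value types here are placeholders, not the actual stake-delegation types): a single update produces a new map that shares most of its nodes with the old one, while parallel iteration has to go through an intermediate `Vec` because `im`'s iterators are not rayon parallel iterators.

```rust
use im::HashMap as ImHashMap; // structural-sharing map from the `im` crate
use rayon::prelude::*;

fn main() {
    let parent: ImHashMap<u64, u64> = (0..100_000u64).map(|i| (i, i)).collect();

    // `update` returns a new map; only the path to the changed key is copied,
    // the rest of the nodes are shared with `parent`.
    let child = parent.update(42, 7);
    assert_eq!(parent.get(&42), Some(&42));
    assert_eq!(child.get(&42), Some(&7));

    // No rayon ParallelIterator impl on im's iterators, so copy into a Vec
    // before fanning out, as described above.
    let entries: Vec<(u64, u64)> = child.iter().map(|(k, v)| (*k, *v)).collect();
    let total: u64 = entries.par_iter().map(|(_, v)| *v).sum();
    println!("sum of values: {total}");
}
```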
@behzadnouri thanks for the PR (#23585). I can cherry-pick it today in my workspace to see how it helps with the OOM condition.
@pgarg66 do you think adding some heap estimates for the stakes cache would be useful for memory profiling? I added some here: https://github.com/jstarry/solana/tree/trace/memory-usage
Agreed, but @pgarg66's trace looks orders of magnitude off from what I expect... unless I'm reading it wrong. Is the heap profile saying there are still many gigs of memory allocated for stake delegations?
Hmmm, that is very confusing. I don't see how that can theoretically happen. The worst case should be the same. The previous call graph from the jemalloc profile shows |
I'll let it run some more and then collect the trace.
@behzadnouri here is the PDF with the profiling trace.
Fwiw, all my changes related to this are making their way into master.
@behzadnouri I think something has regressed on the master branch compared to your PR branch (#23585). There are a bunch of commits since that PR. I'll try to re-run your PR branch tomorrow and see if it still has the expected performance. If it does, we can try to narrow down the commit that regressed it.
@behzadnouri I did a test run today with the master branch and #23692 (after rebasing on the same commit as master). Both yielded the same performance. About 8700 banks were created before the node ran out of RAM. The master branch was at commit a1a29b0. On that note, there has been some regression in the past week or so. Earlier we were able to create about 10K banks (with #23585). These runs were without profiling changes, just to make sure there's no impact due to profiling. I'll rerun master tomorrow after enabling profiling. That should give some information on what the next biggest consumer of RAM is and whether anything can be optimized.
This is from the latest profiling trace I have collected (based on commit a1a29b0):
Looks like the following code is causing cloning: solana/runtime/src/vote_account.rs, line 110 in 0c0db93.
It correlates with this part of the heap trace:
Reading
Likely that's what is causing the cloning here. As an experiment, I commented out the call to
A few questions:
|
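The exact call that was commented out and the heap-trace excerpt are not captured above. For context, one common way to avoid repeatedly cloning/deserializing account data on every read is to cache the deserialized state behind a once-initialized cell; the sketch below is a generic illustration of that pattern with made-up types, not the actual code in vote_account.rs.

```rust
use once_cell::sync::OnceCell;

// Hypothetical stand-ins; the real types live in runtime/src/vote_account.rs.
#[derive(Clone, Debug, PartialEq)]
struct VoteState {
    credits: u64,
}

struct CachedVoteAccount {
    data: Vec<u8>,                   // serialized account data
    vote_state: OnceCell<VoteState>, // deserialized at most once, then reused
}

impl CachedVoteAccount {
    fn vote_state(&self) -> &VoteState {
        // Without this cache, every caller would pay for a fresh
        // deserialization (or clone) of the account data.
        self.vote_state.get_or_init(|| VoteState {
            credits: self.data.len() as u64, // placeholder for real decoding
        })
    }
}

fn main() {
    let account = CachedVoteAccount {
        data: vec![0u8; 128],
        vote_state: OnceCell::new(),
    };
    assert_eq!(account.vote_state().credits, 128);
    assert_eq!(account.vote_state(), account.vote_state()); // second read hits the cache
}
```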
@behzadnouri any ideas on #23061 (comment)?
I think this is what #23692 was attempting to address?
Yes, I think so. Unfortunately, there was no change in RAM usage with that PR.
See this comment for reference: #23061 (comment)
Let me do some testing and I will get back to you. Meanwhile, it would be great if you could also test #23687 in your setup and compare with what I got in my test runs: #23687 (comment). Thank you.
Here's the heap profiling trace from the experiment (comment out
The heap was growing at a very low rate (about 1.5% per hour), and the node was still up after 12 hours. The usage was about 60%.
I have started a run with #23687. I have also cherry-picked the changes from #23585 to remove the cases that are already addressed on the master branch. Unfortunately, I could not do a clean rebase of the branch onto master.
So far, I am liking #23687. Here's the RAM usage graph with it:
I'll restart the node after rebasing this on top of master. The current run is on an old baseline, so it's not ideal for comparison.
After picking the changes onto master, #23687 is not helping. Not sure if the previous run (with an older baseline) was a one-off, or something has changed in |
#23687 is working well on the PR branch (based on commit 9b46f9b). I retested it last night. The RAM usage is the same as in this #23061 (comment). I tried to analyze the difference between master (commit a1a29b0) and the PR, and f54e746 was one of the changes in the related area. Reverting it didn't help, and the node ran out of RAM similarly to #23061 (comment). I'll start bisecting the differences between 9b46f9b and a1a29b0. @jeffwashington, @behzadnouri, I am wondering if you are aware of any recent change that could be causing this?
I believe 9b46f9b included a number of commits which had consensus issues and would fork off the validator: #23672 (comment)
I am also wondering when you stop the validator from making roots. Is it from the start, or after it has caught up with the cluster?
The test validator never makes roots in this case.
I don't know if that makes any difference, but what I have been doing in testing is to leave the validator running until it is in sync and has caught up with the cluster, and then stop it from making roots to see how that changes things.
I tried bisecting the difference between #23061 (comment) and #23061 (comment) over the weekend. I ended up rolling all the way back to the branch that #23687 is based on. The results are not conclusive. Just now, for the same commit in two different runs, it ran out of memory once and followed the graph of #23061 (comment) in the other run. At this point, I am not convinced that #23687 is actually solving the problem. The difference could be due to network load and other conditions. We need more thoughts on a potential solution for vote accounts (reference: #23061 (comment)).
Problem
If a validator is unable to make new roots, it runs out of memory. The validator connected to mainnet-beta ran out of memory in around 12 minutes.
Proposed Solution
The bank_forks table continues to grow as more banks are created while no new roots are generated. The proposed solution is to move the banks to disk.
The following design should be able to help.
A few considerations:
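The design diagram and the list of considerations were not captured in this thread. Purely as an illustration of what "move the banks to disk" could look like, here is a hedged sketch: serialize each frozen, non-rooted bank into a per-slot file and keep only a slot-to-path index in memory, reloading a bank on demand if the cluster builds off it (all names and the payload layout are hypothetical, not the actual proposal):

```rust
use serde::{Deserialize, Serialize};
use std::collections::HashMap;
use std::fs;
use std::path::PathBuf;

// Hypothetical minimal payload; a real Bank has many more fields and already
// supports serialization via the snapshot code.
#[derive(Serialize, Deserialize)]
struct FrozenBank {
    slot: u64,
    hash: [u8; 32],
}

struct BankStore {
    dir: PathBuf,
    index: HashMap<u64, PathBuf>, // slot -> on-disk location
}

impl BankStore {
    fn offload(&mut self, bank: &FrozenBank) -> std::io::Result<()> {
        let path = self.dir.join(format!("bank-{}.bin", bank.slot));
        fs::write(&path, bincode::serialize(bank).expect("serialize"))?;
        self.index.insert(bank.slot, path); // the in-memory copy can now be dropped
        Ok(())
    }

    fn load(&self, slot: u64) -> Option<FrozenBank> {
        let path = self.index.get(&slot)?;
        bincode::deserialize(&fs::read(path).ok()?).ok()
    }
}
```

As noted earlier in the thread, banks reloaded from disk would lose their copy-on-write sharing with their parents, so a churn-heavy workload could still use more memory than expected.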