
Validator runs OOM if it cannot make roots #23061

Closed
pgarg66 opened this issue Feb 10, 2022 · 59 comments
Labels
stale [bot only] Added to stale content; results in auto-close after a week.

Comments

@pgarg66
Contributor

pgarg66 commented Feb 10, 2022

Problem

If a validator is unable to make new roots, it runs out of memory. A validator connected to mainnet-beta ran out of memory in around 12 minutes.

Proposed Solution

The bank_forks table continues to grow as more banks are created while no new roots are generated. The proposed solution is to move banks to disk.

The following design should help:

  1. Calculate how many banks can reasonably stay in RAM for a given node (based on system memory size and the memory needed per bank object).
  2. When a new bank is inserted into bank_forks, check whether the table size would exceed the limit from step 1.
  3. If it would, move the N latest frozen banks to disk (modify the bank_forks data structure to track whether a bank lives in RAM or on disk).
  4. Update the bank-retrieval method to return the bank from RAM, or read it back into RAM from disk if it was moved there.

A few considerations:

  1. Should we use RocksDB to store and load banks? Alternatively, we could use one file per bank with the slot number embedded in the file name.
  2. Custom serialize/deserialize for Bank to optimize storage.
  3. Move banks to disk on a standalone thread to reduce the impact on existing use cases.
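The four steps above can be sketched as a small prototype. This is a minimal sketch with invented names (Bank here is a tiny stand-in, BankSlot, spill_one, and the "bank-<slot>.bin" layout are all assumptions); the real Bank is far larger and would need the custom serialization mentioned in consideration 2.

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};

// Hypothetical stand-in for the real Bank struct.
#[derive(Clone)]
struct Bank {
    slot: u64,
    data: Vec<u8>,
}

// Step 3: track whether a bank lives in RAM or was spilled to disk.
enum BankSlot {
    InMemory(Bank),
    OnDisk(PathBuf),
}

struct BankForks {
    banks: HashMap<u64, BankSlot>,
    max_in_memory: usize, // step 1: derived from system RAM / per-bank size
}

impl BankForks {
    fn in_memory_count(&self) -> usize {
        self.banks
            .values()
            .filter(|b| matches!(b, BankSlot::InMemory(_)))
            .count()
    }

    // Step 2: check the limit before inserting a new bank.
    fn insert(&mut self, bank: Bank, spill_dir: &Path) -> std::io::Result<()> {
        if self.in_memory_count() >= self.max_in_memory {
            self.spill_one(spill_dir)?;
        }
        self.banks.insert(bank.slot, BankSlot::InMemory(bank));
        Ok(())
    }

    // Step 3: write one resident bank out in a one-file-per-slot layout.
    // (A real implementation would pick the oldest frozen bank; this picks
    // an arbitrary resident one for brevity.)
    fn spill_one(&mut self, dir: &Path) -> std::io::Result<()> {
        let victim = self
            .banks
            .iter()
            .find(|(_, b)| matches!(b, BankSlot::InMemory(_)))
            .map(|(&slot, _)| slot);
        if let Some(slot) = victim {
            if let Some(BankSlot::InMemory(bank)) = self.banks.remove(&slot) {
                let path = dir.join(format!("bank-{}.bin", slot));
                std::fs::write(&path, &bank.data)?;
                self.banks.insert(slot, BankSlot::OnDisk(path));
            }
        }
        Ok(())
    }

    // Step 4: retrieval transparently reloads a spilled bank from disk.
    fn get(&self, slot: u64) -> std::io::Result<Option<Bank>> {
        match self.banks.get(&slot) {
            Some(BankSlot::InMemory(bank)) => Ok(Some(bank.clone())),
            Some(BankSlot::OnDisk(path)) => {
                let data = std::fs::read(path)?;
                Ok(Some(Bank { slot, data }))
            }
            None => Ok(None),
        }
    }
}
```

With max_in_memory set to 2, inserting a third bank spills one of the first two, yet all three remain retrievable through get().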
@pgarg66
Contributor Author

pgarg66 commented Feb 10, 2022

CC: @mvines, @sakridge

@mvines
Member

mvines commented Feb 10, 2022

Should we use RocksDB to store and load banks? Alternatively, we could use one file per bank with the slot number embedded in the file name.

One file per slot-bankhash seems nice. Maybe even an mmapped file that has been deleted on disk, so only the open fd keeps it around? The fd is then closed once the BankForks' finalized slot moves past it.

Custom serialize/deserialize for Bank to optimize storage

The in-memory representation actually seems fine to me here. Quick access, should the cluster decide to build off a disk-based bank, seems more important.
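The deleted-but-open-fd idea relies on POSIX semantics: unlinking a file removes its directory entry, but the data stays allocated until the last open descriptor is closed. A small std-only demonstration (function name and payload are invented; this works on Unix, not Windows):

```rust
use std::fs::OpenOptions;
use std::io::{Read, Seek, SeekFrom, Write};
use std::path::Path;

// Create a file, unlink it while the descriptor is still open, then read it
// back through that descriptor. POSIX keeps the data allocated until the
// last open fd is closed, so dropping `file` is what finally frees it.
fn read_after_unlink(path: &Path, payload: &[u8]) -> std::io::Result<Vec<u8>> {
    let mut file = OpenOptions::new()
        .read(true)
        .write(true)
        .create(true)
        .open(path)?;
    file.write_all(payload)?;
    std::fs::remove_file(path)?; // unlink: no longer visible in the directory
    file.seek(SeekFrom::Start(0))?;
    let mut buf = Vec::new();
    file.read_to_end(&mut buf)?; // still readable via the open fd
    Ok(buf)
}
```

Dropping the `File` (closing the fd) is what lets the kernel reclaim the space, which matches the "close the fd once the finalized slot moves past" cleanup described above.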

@pgarg66
Contributor Author

pgarg66 commented Feb 10, 2022

Thanks @mvines, I agree with both.

@steviez
Contributor

steviez commented Feb 10, 2022

  1. Should we use rocksdb to store and load bank. Alternatively, we can use one file per bank with slot number embedded in the file name.

I would lean toward avoiding Rocks in favor of one-bank-per-file for several reasons:

  • The Rocks flow is to write new data to memory (memtables) and then persist it to disk (SSTs) in the background at configurable thresholds (as memtables fill up / we hit a set number of memtables); using Rocks would create an extra in-memory copy and push back the point at which memory is freed (the goal here seems to be to drop memory usage immediately)
  • If the node recovers and starts making roots again, I'm guessing we could wipe all on-disk banks older than the newest root. Reclaiming the disk space immediately is easier outside of Rocks
  • The Rocks features that are nice for the general use case would add extra overhead for this fairly simple storage case
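The "wipe all on-disk banks older than the newest root" cleanup in the second bullet is simple in a one-file-per-slot layout. A sketch, assuming the hypothetical "bank-<slot>.bin" naming scheme (not an actual validator API):

```rust
use std::path::Path;

// Once the node is rooting again, delete every spilled bank file whose slot
// is older than the newest root, reclaiming disk space immediately.
fn purge_spilled_banks(dir: &Path, newest_root: u64) -> std::io::Result<usize> {
    let mut removed = 0;
    for entry in std::fs::read_dir(dir)? {
        let path = entry?.path();
        // Extract the slot number from a "bank-<slot>.bin" file name.
        let slot = path
            .file_name()
            .and_then(|n| n.to_str())
            .and_then(|n| n.strip_prefix("bank-"))
            .and_then(|n| n.strip_suffix(".bin"))
            .and_then(|n| n.parse::<u64>().ok());
        if let Some(slot) = slot {
            if slot < newest_root {
                std::fs::remove_file(&path)?;
                removed += 1;
            }
        }
    }
    Ok(removed)
}
```

Files that don't match the naming scheme are skipped, so the spill directory can be shared with other artifacts.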

@sakridge
Member

There's a lot of talk about banks, but what percentage of usage is banks vs. everything else? And is there one thing, or a few things, inside the bank that is expensive compared to everything else?

@pgarg66
Contributor Author

pgarg66 commented Feb 11, 2022

I have been trying to measure the size of a bank and the number of banks created before running OOM. Something is not lining up.

Number of banks: 700 banks
Size of bank: 1432 bytes

There are a few Arc fields in the Bank struct. The number above only includes the size of the Arc pointer, not the pointed-to object. I wonder whether all such objects should be included in the bank size (or could these be shared across banks?).
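The Arc accounting caveat is easy to verify with std alone: on a 64-bit target an Arc<T> field contributes only one pointer-sized word to the containing struct, no matter how big T is, so size-of-struct measurements miss everything behind the Arcs (the function name below is invented for illustration):

```rust
use std::mem::size_of;
use std::sync::Arc;

// Compare the in-struct footprint of an Arc to the payload it points at.
// The payload (plus two reference counts) lives in a separate, shareable
// heap allocation that per-struct size accounting never sees.
fn arc_vs_payload_size() -> (usize, usize) {
    (size_of::<Arc<[u8; 1_000_000]>>(), size_of::<[u8; 1_000_000]>())
}
```

Cloning the Arc bumps a reference count instead of duplicating the payload, which is also why the pointed-to data may legitimately be shared across banks rather than counted once per bank.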

@steviez
Contributor

steviez commented Feb 12, 2022

There's a lot of talk about banks, but what is the percentage of usage for banks vs. everything else?

I haven't had a chance to dig into this, but I have observed that nodes that are unable to catch up to the cluster balloon in memory usage.

Somewhat recently, I saw a GCE node (24 vCPU (N2) / 128 GB RAM / separate SSDs for ledger, accounts, and os+snapshot) that was losing ground against mainnet exhibit used-memory growth of 20-30 GB per day. The baseline memory usage was about ~60 GB once it was ready to start replaying slots, so it ate through the other ~68 GB and OOM'd within several days.

This node was still making roots / freezing banks, so to Stephen's point, there might be some other, larger offender(s) hogging memory.

@pgarg66
Contributor Author

pgarg66 commented Feb 12, 2022

I did some further calculations. To account for all the storage needed for the bank's sub-structs and hashmaps/vectors, I used bank_to_stream() to check how much total space a bank could consume: 7129 bytes per bank. The node runs OOM after about 700 banks, which accounts for only about 5 MB of memory on a 128 GB machine. Even setting aside the ~60 GB baseline usage, that does not come close to the remaining memory on the system, so there is probably some other offender consuming memory.

It still might be worthwhile to optimize bank_forks memory consumption.

@pgarg66
Contributor Author

pgarg66 commented Feb 16, 2022

I collected some heaptrack traces. Some memory leaks are being reported in RocksDB. The complete leak trace is at https://gist.github.com/pgarg66/f1b7379579bb3fa415515c47f83254b6#file-gistfile1-txt-L3330

The following seems worth looking at:

42.45MB leaked over 195 calls from:
rocksdb::BlockFetcher::ReadBlockContents()
38.62MB leaked over 54 calls from:
    rocksdb::Status rocksdb::(anonymous namespace)::ReadBlockFromFile<>(rocksdb::RandomAccessFileReader*, rocksdb::FilePrefetchBuffer*, rocksdb::Footer const&, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, std::unique_ptr<>*, rocksdb::ImmutableCFOptions const&, bool, bool, rocksdb::BlockType, rocksdb::UncompressionDict const&, rocksdb::PersistentCacheOptions const&, unsigned long, rocksdb::MemoryAllocator*, bool, bool, rocksdb::FilterPolicy const*)
    rocksdb::Status rocksdb::BlockBasedTable::RetrieveBlock<>(rocksdb::FilePrefetchBuffer*, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::UncompressionDict const&, rocksdb::CachableEntry<>*, rocksdb::BlockType, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, bool, bool) const
    rocksdb::BlockBasedTable::IndexReaderCommon::ReadIndexBlock(rocksdb::BlockBasedTable const*, rocksdb::FilePrefetchBuffer*, rocksdb::ReadOptions const&, bool, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, rocksdb::CachableEntry<>*)
    rocksdb::BinarySearchIndexReader::Create(rocksdb::BlockBasedTable const*, rocksdb::ReadOptions const&, rocksdb::FilePrefetchBuffer*, bool, bool, bool, rocksdb::BlockCacheLookupContext*, std::unique_ptr<>*)
    rocksdb::BlockBasedTable::CreateIndexReader(rocksdb::ReadOptions const&, rocksdb::FilePrefetchBuffer*, rocksdb::InternalIteratorBase<>*, bool, bool, bool, rocksdb::BlockCacheLookupContext*, std::unique_ptr<>*)
    rocksdb::BlockBasedTable::PrefetchIndexAndFilterBlocks(rocksdb::ReadOptions const&, rocksdb::FilePrefetchBuffer*, rocksdb::InternalIteratorBase<>*, rocksdb::BlockBasedTable*, bool, rocksdb::BlockBasedTableOptions const&, int, unsigned long, unsigned long, rocksdb::BlockCacheLookupContext*)
    rocksdb::BlockBasedTable::Open(rocksdb::ReadOptions const&, rocksdb::ImmutableCFOptions const&, rocksdb::EnvOptions const&, rocksdb::BlockBasedTableOptions const&, rocksdb::InternalKeyComparator const&, std::unique_ptr<>&&, unsigned long, std::unique_ptr<>*, rocksdb::SliceTransform const*, bool, bool, int, bool, unsigned long, bool, rocksdb::TailPrefetchStats*, rocksdb::BlockCacheTracer*, unsigned long)
    rocksdb::BlockBasedTableFactory::NewTableReader(rocksdb::ReadOptions const&, rocksdb::TableReaderOptions const&, std::unique_ptr<>&&, unsigned long, std::unique_ptr<>*, bool) const [clone .localalias]
    rocksdb::TableCache::GetTableReader(rocksdb::ReadOptions const&, rocksdb::FileOptions const&, rocksdb::InternalKeyComparator const&, rocksdb::FileDescriptor const&, bool, bool, rocksdb::HistogramImpl*, std::unique_ptr<>*, rocksdb::SliceTransform const*, bool, int, bool, unsigned long)
    rocksdb::TableCache::FindTable(rocksdb::ReadOptions const&, rocksdb::FileOptions const&, rocksdb::InternalKeyComparator const&, rocksdb::FileDescriptor const&, rocksdb::Cache::Handle**, rocksdb::SliceTransform const*, bool, bool, rocksdb::HistogramImpl*, bool, int, bool, unsigned long)
    std::_Function_handler<>::_M_invoke(std::_Any_data const&)
    0x7fc3ec15ede3
      in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
    start_thread
      in /lib/x86_64-linux-gnu/libpthread.so.0
    __clone
      in /lib/x86_64-linux-gnu/libc.so.6

A quick read of the RocksDB code seems to point at https://github.com/facebook/rocksdb/blob/c42d0cf862adeeeb22b311ef5a9851c2b06b711b/util/aligned_buffer.h#L160

    size_t new_capacity = Roundup(requested_capacity, alignment_);
    char* new_buf = new char[new_capacity + alignment_];

This gets called via the ReadBlockContents function in the call stack above. I tried to find where the buffer gets deleted and couldn't find it in the aligned_buffer class. Further analysis is needed to confirm whether this is really a leak.

@steviez
Contributor

steviez commented Feb 16, 2022

@pgarg66 - Which commit of our stack were you running against? We recently bumped our rocksdb dependency version in #23031; probably wishful thinking on my end to think an upgrade would resolve these.

@pgarg66
Contributor Author

pgarg66 commented Feb 16, 2022

I am running with 86d465c53137978b464e01c1e1155b9f58b166cb. I don't think it includes the rocksdb update. I'll rerun this test on the tip of master.

@yhchiang-sol
Contributor

yhchiang-sol commented Feb 16, 2022

https://gist.github.com/pgarg66/f1b7379579bb3fa415515c47f83254b6#file-gistfile1-txt-L3330

It looks like the memory leak traces reported above might be from panic stack traces. I think it's normal to see memory leaks when a program panics, since the program stops unexpectedly and leaves allocated memory unreleased.

      at /rustc/02072b482a8b5357f7fb5e5637444ae30e423c40/library/core/src/ops/function.rs:259
    std::panicking::try::do_call::hcba55cf6d5b5533e
      at library/std/src/panicking.rs:406
    std::panicking::try::h0b2a05128a4ee609
      at library/std/src/panicking.rs:370
    std::panic::catch_unwind::he1deef49e02fb06c
      at library/std/src/panic.rs:133
    std::rt::lang_start_internal::_$u7b$$u7b$closure$u7d$$u7d$::hf44e73ef18e45ffd
      at library/std/src/rt.rs:128
    std::panicking::try::do_call::h894daf8a782b48f4
      at library/std/src/panicking.rs:406
    std::panicking::try::hd3e4f8d86f3a7fb5
      at library/std/src/panicking.rs:370
    std::panic::catch_unwind::h2e69404746fb3d50
      at library/std/src/panic.rs:133
    std::rt::lang_start_internal::hec7f1b06f38d8409
      at library/std/src/rt.rs:128

And the majority of the leak traces point to BlockBasedTable::Open(), which makes sense when the program panics, as the memory allocated for SST files is left unreleased. This Open call is one of the very first calls into RocksDB, and it's a little hard to imagine nobody has reported a memory leak there, although there is still some possibility.

Maybe try to see why the validator panicked? Or, if we want to double-check whether any open calls into RocksDB cause the memory leak, the ledger tool might be another quick option, as it also opens RocksDB.

@pgarg66
Contributor Author

pgarg66 commented Feb 23, 2022

As an experiment, I tried commenting out the creation of new banks in the replay stage (in the generate_new_bank_forks function). That has a huge impact on memory consumption (in the scenario where no roots are being made).

The following capture has two data points. The first section in the graph is when banks were being created and added to bank_forks. The second is with the above change.

[Graph: memory usage before and after disabling new bank creation]

So it looks like offloading frozen banks to storage should help with this issue.

@behzadnouri
Contributor

Please note that there are several places that use copy-on-write semantics to save runtime memory. In particular:
vote-accounts: #19362
staked-nodes: #19090
cached-executors: #22343
and almost any other place that uses Arc::make_mut:
https://github.com/solana-labs/solana/search?q=make_mut

If we hit some degenerate scenario where banks are churned on and off disk, then each bank will have its own copies, and that might exacerbate memory use.
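The copy-on-write behavior described above can be demonstrated with std's Arc alone: while an Arc is shared, Arc::make_mut clones the entire pointee before handing out a mutable reference, so a one-entry update to a shared map duplicates the whole map. A minimal sketch (the helper name is invented):

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Copy-on-write via Arc::make_mut: if the Arc is shared (strong_count > 1),
// make_mut clones the entire map before the single-entry update, so every
// bank that mutates a shared map ends up with its own full copy.
fn update_one_entry(map: &mut Arc<HashMap<u64, u64>>, key: u64, value: u64) {
    Arc::make_mut(map).insert(key, value);
}
```

Once the two Arcs diverge, they no longer point at the same allocation, which is exactly the per-bank duplication that a churn scenario would multiply.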

@pgarg66
Contributor Author

pgarg66 commented Mar 1, 2022

I am still experimenting with the bank to see if it is really causing the excessive RAM usage. The serialized bank (created using the bank_to_stream function) only accounts for about 7 KB of RAM. There are a few fields that do not get serialized; over the last few days, I tried setting those fields to their defaults for frozen banks (when new banks are added to bank_forks and no roots are being made). The node still runs OOM at the same pace, so it's not conclusive whether the bank is really causing the problem.

Another thought was that accounts_db might be holding on to RAM. To rule that out, I hacked the code to not record transactions. The system still ran OOM.

So at the moment, the actual cause is not clear; I have only ruled out a few parts of the code.

@pgarg66
Contributor Author

pgarg66 commented Mar 4, 2022

I collected another heaptrack log (https://gist.githubusercontent.com/pgarg66/f0e179ffa22b37922f69f78b8cd2c681/raw/89d68dd2ece3810ea1a0bd20170ee7a151aeefd0/heaptrack_print_memory_leaks)

Leaks are reported at the following places. These may not be actual leaks, but rather memory that was still allocated when the validator exited. But that is exactly what we are trying to analyze and potentially optimize.

MEMORY LEAKS
42.33GB leaked over 1335174372 calls from
alloc::alloc::alloc::hcf45a56d3ada97da
  at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/alloc/src/alloc.rs:87
  in /home/pankaj_solana_com/solana/target/release/solana-validator
alloc::alloc::Global::alloc_impl::hf5f1c3e83349c3a6
  at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/alloc/src/alloc.rs:169
_$LT$alloc..alloc..Global$u20$as$u20$core..alloc..Allocator$GT$::allocate::h8b053bc7a921a375
  at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/alloc/src/alloc.rs:229
alloc::raw_vec::RawVec$LT$T$C$A$GT$::allocate_in::hcce5e3994213fbec
  at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/alloc/src/raw_vec.rs:185
alloc::raw_vec::RawVec$LT$T$C$A$GT$::with_capacity_in::h62ed5a3e428492c3
  at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/alloc/src/raw_vec.rs:132
alloc::vec::Vec$LT$T$C$A$GT$::with_capacity_in::hb2f6ef43d807d6da
  at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/alloc/src/vec/mod.rs:609
alloc::vec::Vec$LT$T$GT$::with_capacity::hb897f6019952c21a
  at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/alloc/src/vec/mod.rs:470
_$LT$$RF$str$u20$as$u20$std..ffi..c_str..CString..new..SpecIntoVec$GT$::into_vec::h369618018255151e
  at library/std/src/ffi/c_str.rs:394
std::ffi::c_str::CString::new::h1aa4366098673971
  at library/std/src/ffi/c_str.rs:400
std::rt::init::h2b5a1add53eba618
  at library/std/src/rt.rs:85
std::rt::lang_start_internal::_$u7b$$u7b$closure$u7d$$u7d$::h29d3603682064ee5
  at library/std/src/rt.rs:127
std::panicking::try::do_call::haf7681c0202b69fa
  at library/std/src/panicking.rs:406
std::panicking::try::hee0a51bdf00e1f7e
  at library/std/src/panicking.rs:370
std::panic::catch_unwind::hca5c20c52c17cf0f
  at library/std/src/panic.rs:133
std::rt::lang_start_internal::h52e73755f77c7dd9
  at library/std/src/rt.rs:127
4.72GB leaked over 49640 calls from:
    hashbrown::raw::RawTable$LT$T$C$A$GT$::reserve::h850e451376c2d7f1
      at /cargo/registry/src/github.com-1ecc6299db9ec823/hashbrown-0.11.0/src/raw/mod.rs:646
      in /home/pankaj_solana_com/solana/target/release/solana-validator
    hashbrown::map::HashMap$LT$K$C$V$C$S$C$A$GT$::reserve::h521fb7214ffd2faa
      at /cargo/registry/src/github.com-1ecc6299db9ec823/hashbrown-0.11.0/src/map.rs:803
    hashbrown::rustc_entry::_$LT$impl$u20$hashbrown..map..HashMap$LT$K$C$V$C$S$C$A$GT$$GT$::rustc_entry::h78a218b8c682b417
      at /cargo/registry/src/github.com-1ecc6299db9ec823/hashbrown-0.11.0/src/rustc_entry.rs:46
    std::collections::hash::map::HashMap$LT$K$C$V$C$S$GT$::entry::ha55106af5e566df0
      at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/collections/hash/map.rs:792
    solana_runtime::in_mem_accounts_index::InMemAccountsIndex$LT$T$GT$::insert_new_entry_if_missing_with_lock::h058a6310a046907b
      at runtime/src/in_mem_accounts_index.rs:574
    solana_runtime::accounts_index::AccountsIndex$LT$T$GT$::insert_new_if_missing_into_primary_index::_$u7b$$u7b$closure$u7d$$u7d$::_$u7b$$u7b$closure$u7d$$u7d$::hc2483671634ecd0f
      at runtime/src/accounts_index.rs:1745
      in /home/pankaj_solana_com/solana/target/release/solana-validator
    core::iter::traits::iterator::Iterator::for_each::call::_$u7b$$u7b$closure$u7d$$u7d$::h34233930f0b1310e
      at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/core/src/iter/traits/iterator.rs:733
    core::iter::traits::iterator::Iterator::fold::ha0ca8f495cfc5151
      at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/core/src/iter/traits/iterator.rs:2171
    core::iter::traits::iterator::Iterator::for_each::h579e9369bb594103
      at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/core/src/iter/traits/iterator.rs:736
    solana_runtime::accounts_index::AccountsIndex$LT$T$GT$::insert_new_if_missing_into_primary_index::_$u7b$$u7b$closure$u7d$$u7d$::h1dc18882f62e79cf
      at runtime/src/accounts_index.rs:1743
    core::iter::traits::iterator::Iterator::for_each::call::_$u7b$$u7b$closure$u7d$$u7d$::hca945a254bc2ffb5
      at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/core/src/iter/traits/iterator.rs:733
    core::iter::traits::iterator::Iterator::fold::hcfd0c3cc653e7548
      at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/core/src/iter/traits/iterator.rs:2171
    core::iter::traits::iterator::Iterator::for_each::h2f444638979e0037
      at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/core/src/iter/traits/iterator.rs:736
    solana_runtime::accounts_index::AccountsIndex$LT$T$GT$::insert_new_if_missing_into_primary_index::h8aec02bae7b739e1
      at runtime/src/accounts_index.rs:1740
    solana_runtime::accounts_db::AccountsDb::generate_index_for_slot::ha7ad0fe4098487f9
      at runtime/src/accounts_db.rs:6772
3.40GB leaked over 51141634 calls from:
    solana_runtime::accounts_index::PreAllocatedAccountMapEntry$LT$T$GT$::new::h9948c800ee753f6e
      at runtime/src/accounts_index.rs:365
      in /home/pankaj_solana_com/solana/target/release/solana-validator
    solana_runtime::accounts_index::AccountsIndex$LT$T$GT$::insert_new_if_missing_into_primary_index::_$u7b$$u7b$closure$u7d$$u7d$::h0171afeb7d8e540b
      at runtime/src/accounts_index.rs:1726
    core::ops::function::impls::_$LT$impl$u20$core..ops..function..FnMut$LT$A$GT$$u20$for$u20$$RF$mut$u20$F$GT$::call_mut::hd2d3f79149d96064
      at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/core/src/ops/function.rs:269
    core::iter::traits::iterator::Iterator::find_map::check::_$u7b$$u7b$closure$u7d$$u7d$::h2b58b93a04d1c07c
      at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/core/src/iter/traits/iterator.rs:2489
      in /home/pankaj_solana_com/solana/target/release/solana-validator
    core::iter::adapters::map::map_try_fold::_$u7b$$u7b$closure$u7d$$u7d$::h5e27ba51b5f160d8
      at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/core/src/iter/adapters/map.rs:91
    core::iter::traits::iterator::Iterator::try_fold::h574891bf2e86486e
      at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/core/src/iter/traits/iterator.rs:1995
    _$LT$core..iter..adapters..map..Map$LT$I$C$F$GT$$u20$as$u20$core..iter..traits..iterator..Iterator$GT$::try_fold::h17764f8c9376b2dd
      at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/core/src/iter/adapters/map.rs:117
    core::iter::traits::iterator::Iterator::find_map::h88a5118b8bc4c0f0
      at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/core/src/iter/traits/iterator.rs:2495
      in /home/pankaj_solana_com/solana/target/release/solana-validator
    _$LT$core..iter..adapters..filter_map..FilterMap$LT$I$C$F$GT$$u20$as$u20$core..iter..traits..iterator..Iterator$GT$::next::hefd28456dbc56907
      at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/core/src/iter/adapters/filter_map.rs:61
    alloc::vec::Vec$LT$T$C$A$GT$::extend_desugared::h1dc32bee41361748
      at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/alloc/src/vec/mod.rs:2638
    _$LT$alloc..vec..Vec$LT$T$C$A$GT$$u20$as$u20$alloc..vec..spec_extend..SpecExtend$LT$T$C$I$GT$$GT$::spec_extend::hbf27a83f65df506f
      at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/alloc/src/vec/spec_extend.rs:18
    _$LT$alloc..vec..Vec$LT$T$GT$$u20$as$u20$alloc..vec..spec_from_iter_nested..SpecFromIterNested$LT$T$C$I$GT$$GT$::from_iter::h942167ce439a556d
      at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/alloc/src/vec/spec_from_iter_nested.rs:37
    _$LT$alloc..vec..Vec$LT$T$GT$$u20$as$u20$alloc..vec..spec_from_iter..SpecFromIter$LT$T$C$I$GT$$GT$::from_iter::h6c536d63537dd46d
      at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/alloc/src/vec/spec_from_iter.rs:33
    _$LT$alloc..vec..Vec$LT$T$GT$$u20$as$u20$core..iter..traits..collect..FromIterator$LT$T$GT$$GT$::from_iter::hfbc55c9f2be9240c
      at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/alloc/src/vec/mod.rs:2541
      in /home/pankaj_solana_com/solana/target/release/solana-validator
    core::iter::traits::iterator::Iterator::collect::h316a1bf36ac383e1
3.40GB leaked over 51141634 calls from:
    std::sync::rwlock::RwLock$LT$T$GT$::new::hfe9085455b72f347
      at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/sync/rwlock.rs:155
      in /home/pankaj_solana_com/solana/target/release/solana-validator
    solana_runtime::accounts_index::AccountMapEntryInner$LT$T$GT$::new::h2ec659757c1abe67
      at runtime/src/accounts_index.rs:236
    solana_runtime::accounts_index::PreAllocatedAccountMapEntry$LT$T$GT$::allocate::hf6066a9973f0de7d
      at runtime/src/accounts_index.rs:376
    solana_runtime::accounts_index::PreAllocatedAccountMapEntry$LT$T$GT$::new::h9948c800ee753f6e
      at runtime/src/accounts_index.rs:365
      in /home/pankaj_solana_com/solana/target/release/solana-validator
    solana_runtime::accounts_index::AccountsIndex$LT$T$GT$::insert_new_if_missing_into_primary_index::_$u7b$$u7b$closure$u7d$$u7d$::h0171afeb7d8e540b
      at runtime/src/accounts_index.rs:1726
    core::ops::function::impls::_$LT$impl$u20$core..ops..function..FnMut$LT$A$GT$$u20$for$u20$$RF$mut$u20$F$GT$::call_mut::hd2d3f79149d96064
      at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/core/src/ops/function.rs:269
    core::iter::traits::iterator::Iterator::find_map::check::_$u7b$$u7b$closure$u7d$$u7d$::h2b58b93a04d1c07c
      at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/core/src/iter/traits/iterator.rs:2489
      in /home/pankaj_solana_com/solana/target/release/solana-validator
    core::iter::adapters::map::map_try_fold::_$u7b$$u7b$closure$u7d$$u7d$::h5e27ba51b5f160d8
      at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/core/src/iter/adapters/map.rs:91
    core::iter::traits::iterator::Iterator::try_fold::h574891bf2e86486e
      at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/core/src/iter/traits/iterator.rs:1995
    _$LT$core..iter..adapters..map..Map$LT$I$C$F$GT$$u20$as$u20$core..iter..traits..iterator..Iterator$GT$::try_fold::h17764f8c9376b2dd
      at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/core/src/iter/adapters/map.rs:117
    core::iter::traits::iterator::Iterator::find_map::h88a5118b8bc4c0f0
      at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/core/src/iter/traits/iterator.rs:2495
      in /home/pankaj_solana_com/solana/target/release/solana-validator

@t-nelson
Contributor

t-nelson commented Mar 4, 2022

Those last three seem very likely to be legitimate memory usage that we just don't clean up on exit. The first one is quite mysterious, though.

@pgarg66
Contributor Author

pgarg66 commented Mar 4, 2022

Yes, I totally agree about the last three. Maybe something can be done to optimize the usage, or to flush them to disk in certain scenarios (e.g., when the node is stuck in a partition and cannot make progress).

@jeffwashington
Contributor

I have done some experiments with what seems to be false leak reporting for items related to the in-memory accounts index. In one, I padded how much we allocate for each of these entries and ran a mainnet-beta validator to see if memory usage leaked out of control faster than on a normal validator. The results indicated we aren't really leaking anything.

It is true that we aren't cleaning up the in-memory accounts index on exit. Using the disk-backed accounts index is a solution to this, and it would greatly reduce how much is held in memory: --accounts-index-memory-limit-mb 10000

@pgarg66
Contributor Author

pgarg66 commented Mar 8, 2022

Thanks @jeffwashington, I'll rerun the test with the above configuration.

@pgarg66
Contributor Author

pgarg66 commented Mar 9, 2022

I collected a profiling trace using the jemalloc profiler. It points at Stakes::update_stake_delegation().

[Heap profile graph highlighting Stakes::update_stake_delegation]

Also, attaching the full PDF generated by the profiler.
heap.pdf

behzadnouri added a commit to behzadnouri/solana that referenced this issue Mar 10, 2022
StakeDelegations is using Arc to implement copy-on-write semantics:
https://github.com/solana-labs/solana/blob/58c0db970/runtime/src/stake_delegations.rs#L14-L16

However a single delegation change will still clone the entire hash-map,
resulting in excessive memory use as observed in:
solana-labs#23061 (comment)

This commit instead uses immutable hash-map implementing structural
sharing:
> which means that if two data structures are mostly copies of each
> other, most of the memory they take up will be shared between them.
https://docs.rs/im/latest/im/
@behzadnouri
Contributor

behzadnouri commented Mar 10, 2022

I collected a profiling trace using jemalloc profiler. It's pointing at Stakes::update_stake_delegation().

With respect to update_stake_delegation, I think the problem is that StakeDelegations uses Arc to implement copy-on-write semantics: #21542

However, a single delegation change will still clone the entire hash-map, even though the hash-map does not change much.

An alternative is to use immutable data structures, which implement structural sharing:

which means that if two data structures are mostly copies of each other, most of the memory they take up will be shared between them.
https://docs.rs/im/latest/im/

#23585 tests such a change. The immutable hashmap is mostly a drop-in replacement, except that it does not implement rayon parallel iterators, so we need to copy into a vector here:
https://github.com/solana-labs/solana/blob/58c0db970/runtime/src/bank.rs#L2529-L2530
which is a bad spot given that we already have latency issues there; but the added computation cost might be insignificant.

Besides that, with respect to performance, the documentation claims that:

HashMap is almost neck and neck with its counterpart,

but that also needs to be tested.
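The structural sharing that persistent collections like im::HashMap rely on can be illustrated with a hand-rolled persistent list built from std's Arc (List, push, and len are invented toy names; real persistent maps generalize the same idea to balanced trees so updates share all but the changed path):

```rust
use std::sync::Arc;

// A minimal persistent list: pushing a new head reuses the previous list
// as the tail, so the old and new versions share every existing node
// instead of cloning them.
enum List {
    Nil,
    Cons(u64, Arc<List>),
}

fn push(head: u64, tail: &Arc<List>) -> Arc<List> {
    Arc::new(List::Cons(head, Arc::clone(tail)))
}

fn len(list: &Arc<List>) -> usize {
    match &**list {
        List::Nil => 0,
        List::Cons(_, tail) => 1 + len(tail),
    }
}
```

After a push, both versions coexist while the memory for the shared suffix exists only once, which is the property that avoids the full-map clone Arc::make_mut performs.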

@pgarg66
Contributor Author

pgarg66 commented Mar 10, 2022

@behzadnouri thanks for the PR (#23585). I can cherry-pick it today in my workspace to see how it helps with the OOM condition.

@jstarry
Member

jstarry commented Mar 10, 2022

@pgarg66 do you think adding some heap estimates for the stakes cache would be useful for memory profiling? I added some here: https://github.com/jstarry/solana/tree/trace/memory-usage

@jstarry
Member

jstarry commented Mar 10, 2022

With respect to update_stake_delegation I think the problem is that StakeDelegations is using Arc to implement copy-on-write semantics: #21542

Agreed, but @pgarg66's trace looks orders of magnitude off from what I expect, unless I'm reading it wrong. Is the heap profile saying there are still many gigabytes of memory allocated for stake delegations?

@behzadnouri
Contributor

behzadnouri commented Mar 16, 2022

@behzadnouri at first glance, #23692 seems to increase the RAM usage. Usage is at about 60% with 3700 banks. Extrapolating it, I think it'll OOM at about 8000 banks (compared to around 10K banks with #23585).

Hmmm, that is very confusing; I don't see how that can theoretically happen. The worst case should be the same. Maybe jemalloc profiles can show what changed.

The previous call graph from the jemalloc profile shows Arc::make_mut just below StakesCache::check_and_store. The only Arc::make_mut calls I see on that path are for VoteAccounts and staked_nodes, and I believe the former is the bigger one. Or maybe I am missing something.

@pgarg66
Contributor Author

pgarg66 commented Mar 16, 2022

I'll let it run some more and then collect the trace.

@pgarg66
Contributor Author

pgarg66 commented Mar 16, 2022

@behzadnouri here is the PDF with the profiling trace.
heap_with_PR_23692.pdf

@jeffwashington
Contributor

Fwiw, all my changes related to this are making their way into master.

@pgarg66
Contributor Author

pgarg66 commented Mar 17, 2022

@behzadnouri I think something has regressed on the master branch compared to your PR branch (#23585). There are a bunch of commits since that PR. I'll try to re-run your PR branch tomorrow and see if it still has the expected performance. If it does, we can try to narrow down the commit that regressed it.

@pgarg66
Contributor Author

pgarg66 commented Mar 22, 2022

@behzadnouri I did a test run today with the master branch and #23692 (after rebasing on the same commit as master). Both yield the same performance. About 8700 banks were created before the node ran out of RAM.

Master branch was on a1a29b0 commit hash.

On that note, there has been some regression in the past week or so. Earlier, we were able to create about 10K banks (with #23585).

These runs were without profiling changes, just to make sure there's no impact due to profiling. I'll rerun master tomorrow after enabling profiling. That should give some information on what's the next biggest consumer of RAM and if anything can be optimized.

@pgarg66
Contributor Author

pgarg66 commented Mar 22, 2022

This is from the latest profiling trace I have collected (based on commit a1a29b0).
@jeffwashington, @behzadnouri FYI, in case anything obvious or interesting stands out.

heap_03_22_22.pdf

@pgarg66
Contributor Author

pgarg66 commented Mar 23, 2022

It looks like the following code is causing the cloning:

let vote_accounts = Arc::make_mut(&mut self.vote_accounts);

It correlates with this part of the heap trace:
(image: heap trace excerpt)

Reading the make_mut documentation, it says:

If there are other Arc pointers to the same allocation, then make_mut will clone the inner value to a new allocation to ensure unique ownership. This is also referred to as clone-on-write.

Likely that's what is causing the cloning here.

As an experiment, I commented out the call to update_vote_account() in stakes.rs (which in turn would have called insert()). RAM usage is much flatter with that change.

A few questions:

  1. Any ideas on which entity might be holding a reference to the vote_accounts Arc? Can this be redesigned to avoid the cloning?
  2. If not, can we offload any of this to disk when the node is stuck and not making any roots?

@pgarg66
Contributor Author

pgarg66 commented Mar 23, 2022

For comparison, here's the RAM usage for two runs.

Master branch
(image: RAM usage graph, master branch)

With the call to update_vote_account() in stakes.rs commented out
(image: RAM usage graph, call commented out)

It seems quite stable at 37% for now; on master, it had already crossed 60% by this point.

I'll let my node run for a few more hours, just to see how the RAM usage grows.

@pgarg66
Contributor Author

pgarg66 commented Mar 23, 2022

@behzadnouri any ideas on #23061 (comment)?

@t-nelson
Contributor

I think this is what #23692 was attempting to address?

@pgarg66
Contributor Author

pgarg66 commented Mar 23, 2022

I think this is what #23692 was attempting to address?

Yes, I think so. Unfortunately, there was no change in RAM usage with that PR.

@pgarg66
Contributor Author

pgarg66 commented Mar 23, 2022

See this comment for reference: #23061 (comment)

@behzadnouri
Contributor

@behzadnouri any ideas on #23061 (comment)?

Let me do some testing and I will get back to you.

Meanwhile, it will be great if you can also test #23687 in your setup and compare with what I got in my test runs: #23687 (comment). Thank you,

@pgarg66
Contributor Author

pgarg66 commented Mar 24, 2022

Here's the heap profiling trace from the experiment (with update_vote_account() commented out).
heap_comment_out_vote.pdf

The heap was growing at a very low rate (about 1.5% per hour), and the node was still up after 12 hours with usage at about 60%.

I have started a run with #23687. I have also cherry-picked the changes from #23585 to remove the cases that are already addressed on the master branch. Unfortunately, I could not do a clean rebase of the branch onto master.

@pgarg66
Contributor Author

pgarg66 commented Mar 24, 2022

So far, I am liking #23687. Here's the RAM usage graph with it:

(image: RAM usage graph with #23687)

I'll restart the node after rebasing this on top of master; the current run is on an old baseline, so it's not ideal for comparison.

@pgarg66
Contributor Author

pgarg66 commented Mar 24, 2022

After porting the changes to master, #23687 is not helping.

(image: RAM usage graph, #23687 on master)

Not sure if the previous run (with an older baseline) was a one-off, or if something has changed in AccountSharedData since then that's negating the benefits. I still have a branch with the old commit; I'll rerun it just to make sure.

@pgarg66
Contributor Author

pgarg66 commented Mar 25, 2022

#23687 is working well on the PR branch (based on commit 9b46f9b). I retested it last night. The RAM usage is the same as in this #23061 (comment).

I tried to analyze the difference between master (commit a1a29b0) and the PR, and f54e746 was one of the changes in the related area. Reverting it didn't help, and the node ran out of RAM similar to #23061 (comment).

I'll start bisecting the differences between 9b46f9b and a1a29b0. @jeffwashington, @behzadnouri, I am wondering if you are aware of any recent change that could be causing this?

@behzadnouri
Contributor

behzadnouri commented Mar 25, 2022

I'll start bisecting the differences between 9b46f9b and a1a29b0. @jeffwashington, @behzadnouri, I am wondering if you are aware of any recent change that could be causing this?

I believe 9b46f9b included a number of commits which had consensus issues and would fork off the validator: #23672 (comment)
I am suspecting that is impacting test results. Also #23682

$ git log 9b46f9b..a1a29b0 --author Meißner --pretty=oneline
dda3a463a2663c4642a70e403590a8a867002195 Fix duplicate account translation in CPI (#23701)
584ac80b1eae7b673cd014e5f6bfb70f01d66aea Revert: `KeyedAccount` refactoings in builtin programs (#23649)

I am also wondering when you stop the validator from making roots. Is it from the start, or after it has caught up with the cluster?

@pgarg66
Contributor Author

pgarg66 commented Mar 25, 2022

I am also wondering when you stop the validator from making roots. Is it from the start, or after it has caught up with the cluster?

The test validator never makes root in this case.

@behzadnouri
Contributor

I am also wondering when you stop the validator from making roots. Is it from the start, or after it has caught up with the cluster?

The test validator never makes root in this case.

I don't know if that makes any difference, but what I have been doing in testing is to leave the validator running until it is in sync and has caught up with the cluster, and then stop it from making roots to see how things change.

@pgarg66
Contributor Author

pgarg66 commented Mar 28, 2022

I tried bisecting the difference between #23061 (comment) and #23061 (comment) over the weekend. I ended up rolling all the way back to the branch that #23687 is based on. The results are inconclusive: just now, for the same commit in two different runs, it ran out of memory once and followed the graph of #23061 (comment) in the other run.

At this point, I am not convinced that #23687 is actually solving the problem; the difference could be due to network load and other conditions. We need more thoughts on a potential solution for the vote accounts cloning (reference: #23061 (comment)).

@github-actions github-actions bot added the stale [bot only] Added to stale content; results in auto-close after a week. label Mar 29, 2023
@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Apr 5, 2023

9 participants