Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

segmentation fault : rocksdb : Too many open files #3098

Closed
skutcher opened this issue Oct 16, 2024 · 8 comments
Closed

segmentation fault : rocksdb : Too many open files #3098

skutcher opened this issue Oct 16, 2024 · 8 comments
Labels

Comments

@skutcher
Copy link

Summary

  • After my DB got destroyed by heavy load again, i tried to start with a snapshot as always. But without success. Every time got the same error (see below)

Steps to Reproduce

Environment

  • Node Version: 4.2.0
  • Block height on bug if applicable:
  • TX or TX type on bug if applicable:
  • OS with version: archlinux with 6.11.3-arch1-1 on btrfs

RUST_BACKTRACE=full

2024-10-16T18:59:04Z [ain_grpc] Starting JSON Eth RPC server at 127.0.0.1:8551
2024-10-16T18:59:04Z [ain_grpc] Init rs ws services
2024-10-16T18:59:04Z [ain_grpc] Starting ws server at 127.0.0.1:8553
2024-10-16T18:59:04Z init message: Done loading
2024-10-16T18:59:04Z 0 addresses found from DNS seeds
2024-10-16T18:59:04Z dnsseed thread exit
2024-10-16T18:59:04Z New outbound peer connected: version: 70038, blocks=4426259, peer=0
2024-10-16T18:59:05Z New outbound peer connected: version: 70038, blocks=4426259, peer=1
2024-10-16T18:59:05Z Synchronizing blockheaders, height: 4419682 (~99.85%)
2024-10-16T18:59:05Z New outbound peer connected: version: 70038, blocks=4426259, peer=2
2024-10-16T18:59:06Z Synchronizing blockheaders, height: 4421682 (~99.90%)
2024-10-16T18:59:06Z [vsdb_core::common::engines::rocks_db] rocksdb write_buffer_size per column family = 512MB
thread '<unnamed>' panicked at /github/home/.cargo/git/checkouts/vsdb-80d7017a79ae04c0/020aaf6/core/src/common/engines/rocks_db.rs:31:67:
called `Result::unwrap()` on an `Err` value: SimpleError { msg: SimpleMsg { err: "", file: "/github/home/.cargo/git/checkouts/vsdb-80d7017a79ae04c0/020aaf6/core/src/common/engines/rocks_db.rs", line: 436, column: 57 }, cause: Some(SimpleError { msg: SimpleMsg { err: "IO error: While open a file for random read: /home/smo/.defi/evm/.vsdb/002338.sst: Too many open files", file: "/github/home/.cargo/git/checkouts/vsdb-80d7017a79ae04c0/020aaf6/core/src/common/engines/rocks_db.rs", line: 436, column: 57 }, cause: None }) }
stack backtrace:
   0:     0x57906bc108af - std::backtrace_rs::backtrace::libunwind::trace::h1a07e5dba0da0cd2
                               at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/../../backtrace/src/backtrace/libunwind.rs:105:5
   1:     0x57906bc108af - std::backtrace_rs::backtrace::trace_unsynchronized::h61b9b8394328c0bc
                               at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:     0x57906bc108af - std::sys_common::backtrace::_print_fmt::h1c5e18b460934cff
                               at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/sys_common/backtrace.rs:68:5
   3:     0x57906bc108af - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h1e1a1972118942ad
                               at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/sys_common/backtrace.rs:44:22
   4:     0x57906b9e20bb - core::fmt::rt::Argument::fmt::h07af2b4071d536cd
                               at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/fmt/rt.rs:165:63
   5:     0x57906b9e20bb - core::fmt::write::hc090a2ffd6b28c4a
                               at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/fmt/mod.rs:1157:21
   6:     0x57906bbe3b52 - std::io::Write::write_fmt::h8898bac6ff039a23
                               at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/io/mod.rs:1832:15
   7:     0x57906bc12399 - std::sys_common::backtrace::_print::h4e80c5803d4ee35b
                               at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/sys_common/backtrace.rs:47:5
   8:     0x57906bc12399 - std::sys_common::backtrace::print::ha96650907276675e
                               at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/sys_common/backtrace.rs:34:9
   9:     0x57906bc11bbe - std::panicking::default_hook::{{closure}}::h215c2a0a8346e0e0
                               at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:271:22
  10:     0x57906bc12e59 - std::panicking::default_hook::h207342be97478370
                               at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:298:9
  11:     0x57906bc12e59 - std::panicking::rust_panic_with_hook::hac8bdceee1e4fe2c
                               at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:795:13
  12:     0x57906bc12702 - std::panicking::begin_panic_handler::{{closure}}::h00d785e82757ce3c
                               at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:664:13
  13:     0x57906bc12659 - std::sys_common::backtrace::__rust_end_short_backtrace::h1628d957bcd06996
                               at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/sys_common/backtrace.rs:171:18
  14:     0x57906bc12646 - rust_begin_unwind
                               at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:652:5
  15:     0x57906ab74b72 - core::panicking::panic_fmt::hdc63834ffaaefae5
                               at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/panicking.rs:72:14
  16:     0x57906ab74f85 - core::result::unwrap_failed::h82b551e0ff2b2176
                               at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/result.rs:1654:5
  17:     0x57906bc5c398 - core::ops::function::FnOnce::call_once::h9be83c72fcb34d3b
  18:     0x57906bc634e8 - once_cell::imp::OnceCell<T>::initialize::{{closure}}::h16f939f811f483c4
  19:     0x57906bb29f3f - once_cell::imp::initialize_or_wait::h63f27d139639c331
  20:     0x57906ab9287b - once_cell::imp::OnceCell<T>::initialize::h15637d85abd0c950
  21:     0x57906bc66130 - core::ops::function::FnOnce::call_once::h3b47c4b74e2b2ad2
  22:     0x57906bc638db - once_cell::imp::OnceCell<T>::initialize::{{closure}}::ha9f41780e052a633
  23:     0x57906bb29f3f - once_cell::imp::initialize_or_wait::h63f27d139639c331
  24:     0x57906ab976ab - once_cell::imp::OnceCell<T>::initialize::h8e0d2283277274f6
  25:     0x57906bc8fe24 - vsdb_trie_db::MptStore::trie_restore::h3ca9b979e2e376ac
  26:     0x57906b50e865 - ain_evm::backend::EVMBackend::from_root::h171eaddb7be2fd68
  27:     0x57906b41dbd5 - cxxbridge1$evm_try_unsafe_create_template
  28:     0x57906b259f20 - _Z30evm_try_unsafe_create_templateR19CrossBoundaryResultmN4rust10cxxbridge13StrEjmm
  29:     0x57906ac81036 - _ZN15CScopedTemplate6CreateEmNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEjmm
  30:     0x57906b1e8ec4 - _ZN11CChainState12ConnectBlockERK6CBlockR16CValidationStateP11CBlockIndexR15CCoinsViewCacheR13CCustomCSViewRK12CChainParamsRbb
  31:     0x57906b1f81cb - _ZN11CChainState10ConnectTipER16CValidationStateRK12CChainParamsP11CBlockIndexRKSt10shared_ptrIK6CBlockER12ConnectTraceR29DisconnectedBlockTransactions
  32:     0x57906b1f9b99 - _ZN11CChainState21ActivateBestChainStepER16CValidationStateRK12CChainParamsP11CBlockIndexRKSt10shared_ptrIK6CBlockERbR12ConnectTrace
  33:     0x57906b1fc4d1 - _ZN11CChainState17ActivateBestChainER16CValidationStateRK12CChainParamsSt10shared_ptrIK6CBlockE
  34:     0x57906b2068ea - _Z15ProcessNewBlockRK12CChainParamsSt10shared_ptrIK6CBlockEbPb
  35:     0x57906b03eafd - _ZL14ProcessMessageP5CNodeRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEER11CDataStreamlRK12CChainParamsP8CConnmanRKSt6atomicIbE
  36:     0x57906b033bdc - _ZN19PeerLogicValidation15ProcessMessagesEP5CNodeRSt6atomicIbE
  37:     0x57906b042397 - _ZThn8_N19PeerLogicValidation15ProcessMessagesEP5CNodeRSt6atomicIbE
  38:     0x57906b0198a9 - _ZN8CConnman20ThreadMessageHandlerEv
  39:     0x57906b029c3a - _ZNSt17_Function_handlerIFvvESt5_BindIFM8CConnmanFvvEPS2_EEE9_M_invokeERKSt9_Any_data
  40:     0x57906b001136 - _Z11TraceThreadISt8functionIFvvEEEvPKcT_
  41:     0x57906b029dfc - _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJPFvPKcSt8functionIFvvEEES4_S7_EEEEE6_M_runEv
  42:     0x57906c543194 - execute_native_thread_routine
  43:     0x78b6928db39d - <unknown>
  44:     0x78b69296049c - <unknown>
  45:                0x0 - <unknown>
thread '<unnamed>' panicked at /github/home/.cargo/registry/src/index.crates.io-6f17d22bba15001f/cxx-1.0.128/src/unwind.rs:37:9:
panic in ffi function ain_rs_exports::ffi::evm_try_unsafe_create_template, aborting.
stack backtrace:
   0:     0x57906bc108af - std::backtrace_rs::backtrace::libunwind::trace::h1a07e5dba0da0cd2
                               at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/../../backtrace/src/backtrace/libunwind.rs:105:5
   1:     0x57906bc108af - std::backtrace_rs::backtrace::trace_unsynchronized::h61b9b8394328c0bc
                               at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:     0x57906bc108af - std::sys_common::backtrace::_print_fmt::h1c5e18b460934cff
                               at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/sys_common/backtrace.rs:68:5
   3:     0x57906bc108af - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h1e1a1972118942ad
                               at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/sys_common/backtrace.rs:44:22
   4:     0x57906b9e20bb - core::fmt::rt::Argument::fmt::h07af2b4071d536cd
                               at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/fmt/rt.rs:165:63
   5:     0x57906b9e20bb - core::fmt::write::hc090a2ffd6b28c4a
                               at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/fmt/mod.rs:1157:21
   6:     0x57906bbe3b52 - std::io::Write::write_fmt::h8898bac6ff039a23
                               at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/io/mod.rs:1832:15
   7:     0x57906bc12399 - std::sys_common::backtrace::_print::h4e80c5803d4ee35b
                               at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/sys_common/backtrace.rs:47:5
   8:     0x57906bc12399 - std::sys_common::backtrace::print::ha96650907276675e
                               at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/sys_common/backtrace.rs:34:9
   9:     0x57906bc11bbe - std::panicking::default_hook::{{closure}}::h215c2a0a8346e0e0
                               at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:271:22
  10:     0x57906bc12e59 - std::panicking::default_hook::h207342be97478370
                               at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:298:9
  11:     0x57906bc12e59 - std::panicking::rust_panic_with_hook::hac8bdceee1e4fe2c
                               at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:795:13
  12:     0x57906bc12702 - std::panicking::begin_panic_handler::{{closure}}::h00d785e82757ce3c
                               at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:664:13
  13:     0x57906bc12659 - std::sys_common::backtrace::__rust_end_short_backtrace::h1628d957bcd06996
                               at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/sys_common/backtrace.rs:171:18
  14:     0x57906bc12646 - rust_begin_unwind
                               at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:652:5
  15:     0x57906ab74b72 - core::panicking::panic_fmt::hdc63834ffaaefae5
                               at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/panicking.rs:72:14
  16:     0x57906ab75be7 - <cxx::unwind::Guard as core::ops::drop::Drop>::drop::h3c636553f50760c4
  17:     0x57906b41e635 - cxxbridge1$evm_try_unsafe_create_template
  18:     0x57906b259f20 - _Z30evm_try_unsafe_create_templateR19CrossBoundaryResultmN4rust10cxxbridge13StrEjmm
  19:     0x57906ac81036 - _ZN15CScopedTemplate6CreateEmNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEjmm
  20:     0x57906b1e8ec4 - _ZN11CChainState12ConnectBlockERK6CBlockR16CValidationStateP11CBlockIndexR15CCoinsViewCacheR13CCustomCSViewRK12CChainParamsRbb
  21:     0x57906b1f81cb - _ZN11CChainState10ConnectTipER16CValidationStateRK12CChainParamsP11CBlockIndexRKSt10shared_ptrIK6CBlockER12ConnectTraceR29DisconnectedBlockTransactions
  22:     0x57906b1f9b99 - _ZN11CChainState21ActivateBestChainStepER16CValidationStateRK12CChainParamsP11CBlockIndexRKSt10shared_ptrIK6CBlockERbR12ConnectTrace
  23:     0x57906b1fc4d1 - _ZN11CChainState17ActivateBestChainER16CValidationStateRK12CChainParamsSt10shared_ptrIK6CBlockE
  24:     0x57906b2068ea - _Z15ProcessNewBlockRK12CChainParamsSt10shared_ptrIK6CBlockEbPb
  25:     0x57906b03eafd - _ZL14ProcessMessageP5CNodeRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEER11CDataStreamlRK12CChainParamsP8CConnmanRKSt6atomicIbE
  26:     0x57906b033bdc - _ZN19PeerLogicValidation15ProcessMessagesEP5CNodeRSt6atomicIbE
  27:     0x57906b042397 - _ZThn8_N19PeerLogicValidation15ProcessMessagesEP5CNodeRSt6atomicIbE
  28:     0x57906b0198a9 - _ZN8CConnman20ThreadMessageHandlerEv
  29:     0x57906b029c3a - _ZNSt17_Function_handlerIFvvESt5_BindIFM8CConnmanFvvEPS2_EEE9_M_invokeERKSt9_Any_data
  30:     0x57906b001136 - _Z11TraceThreadISt8functionIFvvEEEvPKcT_
  31:     0x57906b029dfc - _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJPFvPKcSt8functionIFvvEEES4_S7_EEEEE6_M_runEv
  32:     0x57906c543194 - execute_native_thread_routine
  33:     0x78b6928db39d - <unknown>
  34:     0x78b69296049c - <unknown>
  35:                0x0 - <unknown>
thread '<unnamed>' panicked at library/core/src/panicking.rs:228:5:
panic in a destructor during cleanup
thread caused non-unwinding panic. aborting.
Aborted (core dumped)
@skutcher skutcher added the bug label Oct 16, 2024
@skutcher
Copy link
Author

it seems btw to work with https://defi-snapshots-europe.s3.eu-central-1.amazonaws.com/snapshot-mainnet-4268284.zip

but this seems not to be the problem, cause the fd count already drops to around 1600 before the error occurs

  • increasing the soft limit by a factor of 4 (ulimit -n 4096) we see an increased discriptor count of 2298 but than results in
2024-10-16T20:52:58Z [vsdb_core::common::engines::rocks_db] rocksdb write_buffer_size per column family = 512MB
2024-10-16T20:52:58Z UpdateTip: height=4423469 hash=e41b0298011b8e76244eebe13db5bdd6e6c3a23bfe561fb8112114f98174698e date='2024-10-15T20:00:42Z' tx=34709360 log2_work=88.520461 progress=0.999525
*** buffer overflow detected ***: terminated
*** buffer overflow detected ***: terminated

script to measure open file descriptors

ps axco pid,comm | grep defid | xargs -n 2 -- bash -c 'echo -n "$1 $2 ";lsof -p $1 2>/dev/null | wc -l' argv0
(Please note that lsof | wc -l sums up a lot of duplicated entries (forked processes can share file handles etc).)

@prasannavl
Copy link
Member

Related: facebook/rocksdb#10367

@prasannavl
Copy link
Member

Thanks for the detailed info @skutcher. We're looking at options to workaround this.

For now, please push the kernel limits and ulimits to higher. We've observed anything 16k+ usually works well. We'll look at a more refined fix soon.

@skutcher
Copy link
Author

yfyi: my default system limit:

limit -Hn
524288
ulimit -Sn
1024

but as mentioned above if i increase the softlimit i got just buffer overflows

@prasannavl
Copy link
Member

but as mentioned above if i increase the softlimit i got just buffer overflows

I assume this after it already failed and corrupted the data due to open files? Once the data is corrupt, all bets are off I'm afraid. Even if it doesn't error with something else, the integrity checks will fail to ensure the node doesn't start with bad data.

This is expected behavior. If you have issues with a clean data / snapshot, please let us know. We do run our machines with 1024 * 1024 or 16k as the file open limits depending on the system.

We do intend to workaround the rocksdb bug, but however, meanwhile the above should be a viable workaround. I've also expanded a bit more on #3101 (comment)

@skutcher
Copy link
Author

but as mentioned above if i increase the softlimit i got just buffer overflows

I assume this after it already failed and corrupted the data due to open files? Once the data is corrupt, all bets are off I'm afraid. Even if it doesn't error with something else, the integrity checks will fail to ensure the node doesn't start with bad data.

I am pretty sure it was with an fresh snapshot of https://defi-snapshots-europe.s3.eu-central-1.amazonaws.com/snapshot-mainnet-4423468.zip . I will check again to be 100% sure

@prasannavl
Copy link
Member

prasannavl commented Oct 28, 2024

We were able to test with stricts ulimit < 1024, and now the default should work well with the default of 1024 even though it's recommended to have higher limits.

Believe this issue should be resolved with the latest master.

@prasannavl
Copy link
Member

Closing, as this is now resolved. Please feel free to reopen if you continue to have issues with >= 4.2.1.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants