Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

litep2p: Kusama validator crashes repeatedly #5595

Open
sandreim opened this issue Sep 4, 2024 · 6 comments
Open

litep2p: Kusama validator crashes repeatedly #5595

sandreim opened this issue Sep 4, 2024 · 6 comments
Labels
I2-bug The node fails to follow expected behavior. I10-unconfirmed Issue might be valid, but it's not yet known.

Comments

@sandreim
Copy link
Contributor

sandreim commented Sep 4, 2024


Version: 1.15.1-afab63e5a7b

Thread 'tokio-runtime-worker' panicked at 'assertion failed: false', /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/litep2p-0.6.2/src/protocol/transport_service.rs:220
This is a bug. Please report it at:
	https://github.com/paritytech/polkadot-sdk/issues/new
====================
   0: sp_panic_handler::panic_hook
             at /builds/parity/mirrors/polkadot-sdk/substrate/primitives/panic-handler/src/lib.rs:166:18
      sp_panic_handler::set::{{closure}}
             at /builds/parity/mirrors/polkadot-sdk/substrate/primitives/panic-handler/src/lib.rs:62:12
   1: <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call
             at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/alloc/src/boxed.rs:2029:9
      std::panicking::rust_panic_with_hook
             at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/panicking.rs:785:13
   2: std::panicking::begin_panic_handler::{{closure}}
             at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/panicking.rs:651:13
   3: std::sys_common::backtrace::__rust_end_short_backtrace
             at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/sys_common/backtrace.rs:171:18
   4: rust_begin_unwind
             at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/panicking.rs:647:5
   5: core::panicking::panic_fmt
             at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/core/src/panicking.rs:72:14
   6: core::panicking::panic
             at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/core/src/panicking.rs:144:5
   7: litep2p::protocol::transport_service::TransportService::on_connection_closed
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/litep2p-0.6.2/src/protocol/transport_service.rs:220:13
      <litep2p::protocol::transport_service::TransportService as futures_core::stream::Stream>::poll_next
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/litep2p-0.6.2/src/protocol/transport_service.rs:381:42
   8: futures_util::stream::stream::StreamExt::poll_next_unpin
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/futures-util-0.3.30/src/stream/stream/mod.rs:1638:9
      <futures_util::stream::stream::next::Next<St> as core::future::future::Future>::poll
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/futures-util-0.3.30/src/stream/stream/next.rs:32:21
      litep2p::protocol::request_response::RequestResponseProtocol::run::{{closure}}::{{closure}}
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/macros/select.rs:524:49
      <tokio::future::poll_fn::PollFn<F> as core::future::future::Future>::poll

Full Logs: https://grafana.teleport.parity.io/goto/nlUPrk6SR?orgId=1

@sandreim sandreim added I2-bug The node fails to follow expected behavior. I10-unconfirmed Issue might be valid, but it's not yet known. labels Sep 4, 2024
@sandreim
Copy link
Contributor Author

sandreim commented Sep 4, 2024

CC @paritytech/networking

@alexggh
Copy link
Contributor

alexggh commented Sep 4, 2024

It seems a few debug_asserts that are thrown: https://github.com/paritytech/litep2p/blob/master/src/protocol/transport_service.rs#L237C13-L237C25

@alexggh
Copy link
Contributor

alexggh commented Sep 4, 2024

I think the deployed image from here did not have them stripped out: https://github.com/paritytech/devops/issues/3519#issuecomment-2329696367

@dmitry-markin
Copy link
Contributor

For the reference, this is where it panics:
https://github.com/paritytech/litep2p/blob/v0.6.2/src/protocol/transport_service.rs#L212-L222

let Some(context) = self.connections.get_mut(&peer) else {
    tracing::warn!(
        target: LOG_TARGET,
        ?peer,
        ?connection_id,
        "connection closed to a non-existent peer",
    );

    debug_assert!(false);
    return None;
};

@lexnv
Copy link
Contributor

lexnv commented Sep 5, 2024

  • Substrate version afab63e5a7b is build from branch origin/AndreiEres/pvf-execution-priority
  • litep2p version 0.6.2

Investigating the panics, it seems indeed we are triggering debug_aserts.
Indeed, we need to double-check our images are build with --release.
These defensive failures are put in place to help us debug issues sooner, in this case, race-conditions and code expectations.

Issues

Connection closed for a nonexistent peer

Thread 'tokio-runtime-worker' panicked at 'assertion failed: false', /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/litep2p-0.6.2/src/protocol/transport_service.rs:220

Code path:
https://github.com/paritytech/litep2p/blob/a27d0072913be67cce904e2e8c465cc37ec5223a/src/protocol/transport_service.rs#L220

Expecting different state for a dial failure report

Thread 'tokio-runtime-worker' panicked at 'assertion failed: false', /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/litep2p-0.6.2/src/transport/manager/mod.rs:748

Code path:
https://github.com/paritytech/litep2p/blob/a27d0072913be67cce904e2e8c465cc37ec5223a/src/transport/manager/mod.rs#L748

Connection closed without primary and secondary connections

Thread 'tokio-runtime-worker' panicked at 'assertion failed: false', /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/litep2p-0.6.2/src/transport/manager/mod.rs:858

Code path:

https://github.com/paritytech/litep2p/blob/a27d0072913be67cce904e2e8c465cc37ec5223a/src/transport/manager/mod.rs#L858

Next steps

  • Investigate the above code paths
  • Release 0.7 (soon) and test the latest version
  • Ensure we always strip debug symbols for prod / validators

@dmitry-markin
Copy link
Contributor

dmitry-markin commented Sep 5, 2024

We can also use "release with debug info" builds to simplify crash reports analysis. I.e., no debug assertions, but still debugging symbols. It doesn't look polkadot-sdk Cargo.toml has such build profile:

polkadot-sdk/Cargo.toml

Lines 1371 to 1389 in 49a6813

[profile.release]
# Polkadot runtime requires unwinding.
opt-level = 3
panic = "unwind"
# make sure dev builds with backtrace do not slow us down
[profile.dev.package.backtrace]
inherits = "release"
[profile.production]
codegen-units = 1
inherits = "release"
lto = true
[profile.testnet]
debug = 1 # debug symbols are useful for profilers
debug-assertions = true
inherits = "release"
overflow-checks = true

EDIT: we don't strip the debug symbols in the release profile (no strip = true), so the debug symbols are included, of course.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
I2-bug The node fails to follow expected behavior. I10-unconfirmed Issue might be valid, but it's not yet known.
Projects
None yet
Development

No branches or pull requests

4 participants