Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

req-resp: Connection closed with peer that does not exist #290

Closed
lexnv opened this issue Nov 26, 2024 · 4 comments
Closed

req-resp: Connection closed with peer that does not exist #290

lexnv opened this issue Nov 26, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@lexnv
Copy link
Collaborator

lexnv commented Nov 26, 2024

Grafana link: https://grafana.teleport.parity.io/goto/JgobqPnHg?orgId=1

2024-11-26 12:04:36.842 ERROR tokio-runtime-worker litep2p::request-response::protocol: state mismatch: peer doesn't exist peer=PeerId("12D3KooWLXZ2UWMy8QBpGEY1u4x4TdXpa5TVdBhaxkujKU9RvRub")


Thread 'tokio-runtime-worker' panicked at 'assertion failed: false', /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/litep2p-0.8.1/src/protocol/request_response/mod.rs:301
This is a bug. Please report it at:
	https://github.com/paritytech/polkadot-sdk/issues/new
	
	

	
2024-11-26 12:10:34.786 ERROR tokio-runtime-worker litep2p::request-response::protocol: state mismatch: peer doesn't exist peer=PeerId("12D3KooWRZih24K9HWT6Qg5FckD4F14jRnG4nqNqijtx78uEPNaa")
====================
Version: 1.16.1-a5c037996d5
   0: sp_panic_handler::panic_hook
             at /builds/parity/mirrors/polkadot-sdk/substrate/primitives/panic-handler/src/lib.rs:170:18
      sp_panic_handler::set::{{closure}}
             at /builds/parity/mirrors/polkadot-sdk/substrate/primitives/panic-handler/src/lib.rs:63:12
   1: <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call
             at /rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/alloc/src/boxed.rs:2084:9
      std::panicking::rust_panic_with_hook
             at /rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/std/src/panicking.rs:808:13
   2: std::panicking::begin_panic_handler::{{closure}}
             at /rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/std/src/panicking.rs:667:13
   3: std::sys::backtrace::__rust_end_short_backtrace
             at /rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/std/src/sys/backtrace.rs:168:18
   4: rust_begin_unwind
             at /rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/std/src/panicking.rs:665:5
   5: core::panicking::panic_fmt
             at /rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/core/src/panicking.rs:74:14
   6: core::panicking::panic
             at /rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/core/src/panicking.rs:148:5
   7: litep2p::protocol::request_response::RequestResponseProtocol::on_connection_closed::{{closure}}
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/litep2p-0.8.1/src/protocol/request_response/mod.rs:301:13
      litep2p::protocol::request_response::RequestResponseProtocol::handle_service_event::{{closure}}
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/litep2p-0.8.1/src/protocol/request_response/mod.rs:897:49
   8: litep2p::protocol::request_response::RequestResponseProtocol::run::{{closure}}
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/litep2p-0.8.1/src/protocol/request_response/mod.rs:1029:69
   9: litep2p::Litep2p::new::{{closure}}
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/litep2p-0.8.1/src/lib.rs:211:69
  10: <core::pin::Pin<P> as core::future::future::Future>::poll
             at /rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/core/src/future/future.rs:123:9
      <sc_service::task_manager::prometheus_future::PrometheusFuture<T> as core::future::future::Future>::poll
             at /builds/parity/mirrors/polkadot-sdk/substrate/client/service/src/task_manager/prometheus_future.rs:61:3
  11: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::future::future::Future>::poll
             at /rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/core/src/panic/unwind_safe.rs:297:9
      <futures_util::future::future::catch_unwind::CatchUnwind<Fut> as core::future::future::Future>::poll::{{closure}}
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/futures-util-0.3.31/src/future/future/catch_unwind.rs:37:42
      <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
             at /rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/core/src/panic/unwind_safe.rs:272:9
      std::panicking::try::do_call
             at /rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/std/src/panicking.rs:557:40
      std::panicking::try
             at /rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/std/src/panicking.rs:521:19
      std::panic::catch_unwind
             at /rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/std/src/panic.rs:350:14
      <futures_util::future::future::catch_unwind::CatchUnwind<Fut> as core::future::future::Future>::poll
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/futures-util-0.3.31/src/future/future/catch_unwind.rs:37:9
      <core::pin::Pin<P> as core::future::future::Future>::poll
             at /rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/core/src/future/future.rs:123:9
      futures_util::future::future::FutureExt::poll_unpin
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/futures-util-0.3.31/src/future/future/mod.rs:558:9
      <futures_util::future::select::Select<A,B> as core::future::future::Future>::poll
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/futures-util-0.3.31/src/future/select.rs:118:35
  12: sc_service::task_manager::SpawnTaskHandle::spawn_inner::{{closure}}
             at /builds/parity/mirrors/polkadot-sdk/substrate/client/service/src/task_manager/mod.rs:168:33
      <tracing_futures::Instrumented<T> as core::future::future::Future>::poll
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tracing-futures-0.2.5/src/lib.rs:283:9
  13: tokio::runtime::task::core::Core<T,S>::poll::{{closure}}
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/task/core.rs:331:17
      tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/loom/std/unsafe_cell.rs:16:9
      tokio::runtime::task::core::Core<T,S>::poll
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/task/core.rs:320:30
  14: tokio::runtime::task::harness::poll_future::{{closure}}
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/task/harness.rs:500:19
      <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
             at /rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/core/src/panic/unwind_safe.rs:272:9
      std::panicking::try::do_call
             at /rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/std/src/panicking.rs:557:40
      std::panicking::try
             at /rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/std/src/panicking.rs:521:19
      std::panic::catch_unwind
             at /rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/std/src/panic.rs:350:14
      tokio::runtime::task::harness::poll_future
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/task/harness.rs:488:18
      tokio::runtime::task::harness::Harness<T,S>::poll_inner
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/task/harness.rs:209:27
      tokio::runtime::task::harness::Harness<T,S>::poll
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/task/harness.rs:154:15
  15: tokio::runtime::task::raw::RawTask::poll
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/task/raw.rs:201:18
      tokio::runtime::task::LocalNotified<S>::run
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/task/mod.rs:436:9
      tokio::runtime::scheduler::multi_thread::worker::Context::run_task::{{closure}}
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/scheduler/multi_thread/worker.rs:598:18
      tokio::runtime::coop::with_budget
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/coop.rs:107:5
      tokio::runtime::coop::budget
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/coop.rs:73:5
      tokio::runtime::scheduler::multi_thread::worker::Context::run_task
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/scheduler/multi_thread/worker.rs:597:9
  16: tokio::runtime::scheduler::multi_thread::worker::Context::run
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/scheduler/multi_thread/worker.rs:548:24
  17: tokio::runtime::scheduler::multi_thread::worker::run::{{closure}}::{{closure}}
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/scheduler/multi_thread/worker.rs:513:21
      tokio::runtime::context::scoped::Scoped<T>::set
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/context/scoped.rs:40:9
  18: tokio::runtime::context::set_scheduler::{{closure}}
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/context.rs:180:26
      std::thread::local::LocalKey<T>::try_with
             at /rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/std/src/thread/local.rs:283:12
      std::thread::local::LocalKey<T>::with
             at /rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/std/src/thread/local.rs:260:9
      tokio::runtime::context::set_scheduler
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/context.rs:180:17
      tokio::runtime::scheduler::multi_thread::worker::run::{{closure}}
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/scheduler/multi_thread/worker.rs:508:9
      tokio::runtime::context::runtime::enter_runtime
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/context/runtime.rs:65:16
  19: tokio::runtime::scheduler::multi_thread::worker::run
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/scheduler/multi_thread/worker.rs:500:5
  20: tokio::runtime::scheduler::multi_thread::worker::Launch::launch::{{closure}}
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/scheduler/multi_thread/worker.rs:466:45
      <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/blocking/task.rs:42:21
  21: tokio::runtime::task::core::Core<T,S>::poll::{{closure}}
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/task/core.rs:331:17
      tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/loom/std/unsafe_cell.rs:16:9
      tokio::runtime::task::core::Core<T,S>::poll
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/task/core.rs:320:30
  22: tokio::runtime::task::harness::poll_future::{{closure}}
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/task/harness.rs:500:19
      <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
             at /rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/core/src/panic/unwind_safe.rs:272:9
      std::panicking::try::do_call
             at /rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/std/src/panicking.rs:557:40
      std::panicking::try
             at /rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/std/src/panicking.rs:521:19
      std::panic::catch_unwind
             at /rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/std/src/panic.rs:350:14
      tokio::runtime::task::harness::poll_future
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/task/harness.rs:488:18
      tokio::runtime::task::harness::Harness<T,S>::poll_inner
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/task/harness.rs:209:27
      tokio::runtime::task::harness::Harness<T,S>::poll
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/task/harness.rs:154:15
  23: tokio::runtime::task::raw::RawTask::poll
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/task/raw.rs:201:18
      tokio::runtime::task::UnownedTask<S>::run
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/task/mod.rs:473:9
      tokio::runtime::blocking::pool::Task::run
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/blocking/pool.rs:160:9
      tokio::runtime::blocking::pool::Inner::run
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/blocking/pool.rs:518:17
      tokio::runtime::blocking::pool::Spawner::spawn_thread::{{closure}}
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/blocking/pool.rs:476:13
      std::sys::backtrace::__rust_begin_short_backtrace
             at /rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/std/src/sys/backtrace.rs:152:18
  24: std::thread::Builder::spawn_unchecked_::{{closure}}::{{closure}}
             at /rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/std/src/thread/mod.rs:538:17
      <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
             at /rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/core/src/panic/unwind_safe.rs:272:9
      std::panicking::try::do_call
             at /rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/std/src/panicking.rs:557:40
      std::panicking::try
             at /rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/std/src/panicking.rs:521:19
      std::panic::catch_unwind
             at /rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/std/src/panic.rs:350:14
      std::thread::Builder::spawn_unchecked_::{{closure}}
             at /rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/std/src/thread/mod.rs:537:30
      core::ops::function::FnOnce::call_once{{vtable.shim}}
             at /rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/core/src/ops/function.rs:250:5
  25: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
             at /rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/alloc/src/boxed.rs:2070:9
      <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
             at /rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/alloc/src/boxed.rs:2070:9
      std::sys::pal::unix::thread::Thread::new::thread_start
             at /rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/std/src/sys/pal/unix/thread.rs:108:17
  26: <unknown>
  27: <unknown>
Thread 'tokio-runtime-worker' panicked at 'assertion failed: false', /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/litep2p-0.8.1/src/protocol/request_response/mod.rs:301
This is a bug. Please report it at:
	https://github.com/paritytech/polkadot-sdk/issues/new
@lexnv
Copy link
Collaborator Author

lexnv commented Nov 26, 2024

Code Assumptions

The issue manifests in request-response protocol when receiving TransportEvent::ConnectionClosed for a connection that was not previously reported by a TransportEvent::ConnectionEstablished event:

TransportEvent::ConnectionClosed { peer } => {
self.on_connection_closed(peer).await;
}

The events originate from the transport service component:

Some(InnerTransportEvent::ConnectionEstablished {
peer,
endpoint,
sender,
connection,
}) => {
if let Some(event) =
self.on_connection_established(peer, endpoint, connection, sender)
{
return Poll::Ready(Some(event));
}
}
Some(InnerTransportEvent::ConnectionClosed { peer, connection }) => {
if let Some(event) = self.on_connection_closed(peer, connection) {
return Poll::Ready(Some(event));
}
}

It is clear that the transport service contains the appropriate information of connections, otherwise it would have paniced before the request-response protocol here:

let Some(context) = self.connections.get_mut(&peer) else {
tracing::warn!(
target: LOG_TARGET,
?peer,
?connection_id,
"connection closed to a non-existent peer",
);
debug_assert!(false);
return None;
};

Root Cause

TransportService reports only primary connections via TransportEvent::ConnectionEstablished and TransportEvent::ConnectionClosed.

T0: TransportService accepts connection A for peer X as primary, event reported
T1: TransportService accepts connection B for peer X as secondary, event not reported
T2: TransportService connection A closed for peer X, B is moved from secondary to primary, A is not reported
T3: TransportService connection B closed for peer X, B is now reported

Request Response protocol received events:

  • [peer X, connection Primary A] established
  • [peer X, connection Secondary B] closed

Secondary not reported as ConnectionEstablished:

self.keep_alive_tracker.on_connection_established(peer, connection_id);
context.secondary = Some(handle);
None

Secondary promoted to Primary without events:

Some(handle) => {
tracing::debug!(
target: LOG_TARGET,
?peer,
?connection_id,
"switch to secondary connection",
);
context.primary = handle;
return None;

@lexnv
Copy link
Collaborator Author

lexnv commented Nov 26, 2024

Since we report only peerIDs, and not (peerIDs++conectionIDs) it may also be possible that req-resp skipped / did not received ConnectionEstablished events

@lexnv
Copy link
Collaborator Author

lexnv commented Nov 26, 2024

The issue looks like a state mismatch in request-response handler:

2024-11-26 12:16:24.410  WARN tokio-runtime-worker litep2p::request-response::protocol: failed to open substream peer=PeerId("12D3KooWRZih24K9HWT6Qg5FckD4F14jRnG4nqNqijtx78uEPNaa") protocol=/b0a8d493285c2df73290dfb7e61f870f17b41801197a149ca93654499ea3dafe/req_chunk/2 request_id=RequestId(23232) error=ConnectionClosed
		
2024-11-26 12:16:24.410 ERROR tokio-runtime-worker litep2p::request-response::protocol: state mismatch: peer doesn't exist peer=PeerId("12D3KooWRZih24K9HWT6Qg5FckD4F14jRnG4nqNqijtx78uEPNaa")

Grafana link: https://grafana.teleport.parity.io/goto/YI3Ois7Ng?orgId=1

On connection established, we try to open a substream to the remote peer. However, the substream open request can fail with ConectionClosed error.
In this case, we do not track the peer in our internal mappings. The transportService gives us later a ConnectionClosed event for a connection that we do not tack.

Err(error) => {
tracing::warn!(
target: LOG_TARGET,
?peer,
protocol = %self.protocol,
request_id = ?context.request_id,
?error,
"failed to open substream",
);
return self
.report_request_failure(
peer,
context.request_id,
RequestResponseError::Rejected(error.into()),
)
.await;
}

This is simply an edge case where we discovered the connection is closed by relying on substream activity instead of events from TransportManager

lexnv added a commit that referenced this issue Nov 27, 2024
…291)

When a new connection is reported to the request-response protocol, the
protocol tracks the state of the connection if and only if it could open
a substream with the remote peer.

We do not track the state of the connection in cases where the substream
cannot be opened because the connection is closed.

A further `ConnectionClosed` event triggered a debug_assert assumption
for this case. The assumption is wrong since opening substreams can fail
which informs us that the connection is closed.


Connection details not tracked:

https://github.com/paritytech/litep2p/blob/eed0f755d591eb49e1b715a78aceb5a653ddf529/src/protocol/request_response/mod.rs#L267-L284

Which later leads to panics:

https://github.com/paritytech/litep2p/blob/d6fae559fbe2aaf8cb3fff021fb5608ca7b43766/src/protocol/request_response/mod.rs#L295-L301



Related to: #290

---------

Signed-off-by: Alexandru Vasile <[email protected]>
@lexnv
Copy link
Collaborator Author

lexnv commented Nov 27, 2024

Closed by: #291

@lexnv lexnv closed this as completed Nov 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

1 participant