swarm: Remove ConnectionHandler::Error #3591

Closed · 9 tasks done · Tracked by #2863
thomaseizinger opened this issue Mar 11, 2023 · 17 comments · Fixed by #4755

Comments

@thomaseizinger (Contributor) commented Mar 11, 2023

Based on #3353 (comment), I am opening this issue to track the work around removing ConnectionHandlerEvent::Close. Closing a connection is the only usage of ConnectionHandler::Error, but we've agreed that ConnectionHandlers themselves should not be allowed to close connections because they don't fully own them and might disrupt other protocols.

Tasks: 9 tracked sub-tasks, all completed.

@mxinden (Member) commented Mar 13, 2023

Audit the codebase for usages of Close and port all protocols to set KeepAlive::No instead of closing the connection directly.

Based on intuition, with no particular instance in mind, I am not sure this is possible, i.e. I expect us to find a use-case for ConnectionHandlerEvent::Close. That said, I would like to be proven wrong and thus I am in favor of proceeding on this GitHub issue.

@thomaseizinger (Contributor, Author)

Audit the codebase for usages of Close and port all protocols to set KeepAlive::No instead of closing the connection directly.

Based on intuition, with no particular instance in mind, I am not sure this is possible, i.e. I expect us to find a use-case for ConnectionHandlerEvent::Close. That said, I would like to be proven wrong and thus I am in favor of proceeding on this GitHub issue.

Currently, we tend to close connections upon certain errors. That is wrong in my opinion. Just because one or more streams on one handler fail doesn't mean the connection is faulty or that the other peer is necessarily misbehaving. Do we agree that in those instances, setting KeepAlive::No is more appropriate than closing the connection?

@mxinden (Member) commented Mar 13, 2023

Do we agree that in those instances, setting KeepAlive::No is more appropriate than closing the connection?

Agreed.

@vnermolaev (Contributor)

I think I can pick up this issue as "a spiritual follow-up" of #3577.

I scouted the code, and it seems that ConnectionHandlerEvent::Close is only used in two cases

  • as a result of polling the handler, i.e., return Poll::Ready(ConnectionHandlerEvent::Close(err));
    The inner error can contain different values, e.g.,

    return Poll::Ready(ConnectionHandlerEvent::Close(
        HandlerError::MaxInboundSubstreams,
    ));
    return Poll::Ready(ConnectionHandlerEvent::Close(
        ConnectionHandlerUpgrErr::Upgrade(UpgradeError::Apply(Either::Left(e))),
    ))

    This information will be lost (at best printed out). Is this a desired behaviour? In such cases, I can set self.keep_alive = KeepAlive::No. Should Poll::Pending be returned (it seems the only choice, though)?

  • in rewrapping actions, e.g., ConnectionHandlerEvent::Close(val) => ConnectionHandlerEvent::Close(map(val)). This can be deprecated for subsequent removal.

@thomaseizinger (Contributor, Author)

Sure!

Let me give you some more context so you can work through this properly. It is a slightly more advanced issue!

First, I'd recommend you to read through #3353 (comment).

Then have a look at this section in case you haven't: https://github.com/libp2p/rust-libp2p/blob/master/docs/coding-guidelines.md#hierarchical-state-machines

Each NetworkBehaviour is given one ConnectionHandler per physical connection. All ConnectionHandlers of a single connection are composed together. What this issue is essentially about is the fact that one of these ConnectionHandlers can close the underlying physical connection, which obviously affects all other protocols running in parallel.

Currently, we often do that upon what are termed "fatal" errors. However, those errors are usually only fatal for this one ConnectionHandler and not for all handlers that are running on the same connection. As you said correctly, if we stop returning the errors, this information is "lost". I think a log message at info level is good enough here. info is typically used for state changes, and the state change we are making here is that we "disable" the handler by setting KeepAlive::No.

Poll::Pending should only be returned if there is no work to do within the handler, i.e. all queues are empty etc. Most likely, the pattern you will see is either that we check for an Option<Error> to be Some, in which case the statement can be removed entirely, or that a VecDeque of events is collected and we push a ConnectionHandlerEvent::Close into it, in which case we only need to remove the code that pushes the Close event into it.
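
To illustrate the second pattern, here is a minimal, self-contained sketch (the types are simplified stand-ins rather than the real libp2p-swarm API, and it assumes the log crate for the info-level message): the arm that used to push a Close event now logs the error and disables the handler via keep-alive instead.

    use std::collections::VecDeque;
    use std::task::Poll;

    // Simplified stand-ins for the real libp2p-swarm types; illustrative only.
    enum OutEvent {
        Notify(String),
        // Note: there is no `Close` variant any more.
    }

    struct Handler {
        events: VecDeque<OutEvent>,
        keep_alive: bool, // stands in for `KeepAlive::Yes` / `KeepAlive::No`
    }

    impl Handler {
        fn on_stream_error(&mut self, error: std::io::Error) {
            // Previously: `self.events.push_back(OutEvent::Close(error))`.
            // Now: log the error and disable the handler instead of closing the connection.
            log::info!("stream failed, disabling handler: {error}");
            self.keep_alive = false;
        }

        fn poll(&mut self) -> Poll<OutEvent> {
            if let Some(event) = self.events.pop_front() {
                return Poll::Ready(event);
            }

            // No queued work left: return Pending and let the keep-alive mechanism
            // decide when the connection as a whole is closed.
            Poll::Pending
        }
    }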

Does this make sense?

In regards to implementing this, I'd prefer if we can have very small PRs, ideally one per protocol / crate.

@vnermolaev (Contributor)

Then have a look at this section in case you haven't: https://github.com/libp2p/rust-libp2p/blob/master/docs/coding-guidelines.md#hierarchical-state-machines

OMG, where has this gem been all my Rust life? I've googled a lot on SMs, but this was never among the results.

In regards to implementing this, I'd prefer if we can have very small PRs, ideally one per protocol / crate.

Agreed, let me draft a small PR for some protocol, work out all the quirks and then attempt to apply the same procedure to the rest.

@thomaseizinger (Contributor, Author)

Then have a look at this section in case you haven't: master/docs/coding-guidelines.md#hierarchical-state-machines

OMG, where has this gem been all my Rust life? I've googled a lot on SMs, but this was never among the results.

Our connection state machine is one of my favourites:

pub struct Connection<THandler>

In regards to implementing this, I'd prefer if we can have very small PRs, ideally one per protocol / crate.

Agreed, let me draft a small PR for some protocol, work out all the quirks and then attempt to apply the same procedure to the rest.

Looking forward to it! Thank you for your help :)

@vnermolaev (Contributor)

Look, there is an approach I found in gossipsub:

    NegotiationError::Failed => {
        // The protocol is not supported
        self.protocol_unsupported = true;
        if !self.peer_kind_sent {
            self.peer_kind_sent = true;
            // clear all substreams so the keep alive returns false
            self.inbound_substream = None;
            self.outbound_substream = None;
            self.keep_alive = KeepAlive::No;
            return Poll::Ready(ConnectionHandlerEvent::Custom(
                HandlerEvent::PeerKind(PeerKind::NotSupported),
            ));

compared to the following piece

    if let Some(error) = reported_error {
        return Poll::Ready(ConnectionHandlerEvent::Close(error));
    }

In the former case, this particular handler will be closed, while the error is still reported to the parent.
In the latter case, the full connection will be closed (and such behaviour is undesirable).
It is worth noting that both inner types of ::Custom and ::Close are the same HandlerEvent.

Would the former approach be acceptable for the latter case? It seems to reach the set goals: the handler will be closed, but also the error will be propagated to the parent, where it can be appropriately reported.

@thomaseizinger (Contributor, Author)

Would the former approach be acceptable for the latter case? It seems to reach the set goals: the handler will be closed, but also the error will be propagated to the parent, where it can be appropriately reported.

I don't think it is worth reporting these errors to the parent. They might be transient, i.e. only a single stream might have failed but otherwise the connection is intact. This actually makes me think, it could be that we have other streams that are still operating (i.e. inbound_stream or outbound_stream might still be Some). Thus I am inclined to say:

  • Upon errors on a single stream, all we really should be doing is log the error
  • KeepAlive should only depend on whether we are currently doing work

I think the above applies to all protocols. This is a slight deviation from what I previously thought we should do! Setting KeepAlive::No on any error seems to be too aggressive. As @mxinden noted in #3482 (comment), a remote might just drop one particular stream for whatever reason. We should be resilient to that!

The combination of the above means that in case all our streams fail, we will eventually close the connection because they are set to None. Also note that gossipsub in particular has a limit of creating a maximum of 5 streams. Once we hit that limit, we should set KeepAlive::No instead of closing the connection. That applies to these lines here:

    if self.inbound_substreams_created > MAX_SUBSTREAM_CREATION {
        // Too many inbound substreams have been created, end the connection.
        return Poll::Ready(ConnectionHandlerEvent::Close(
            HandlerError::MaxInboundSubstreams,
        ));
    }
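
A rough sketch of what that block could turn into (reusing the field and constant names from the snippet above; a complete change would also need to stop accepting new inbound substreams, this only shows the shape of the keep-alive change):

    if self.inbound_substreams_created > MAX_SUBSTREAM_CREATION {
        // Too many inbound substreams have been created: stop keeping the
        // connection alive on behalf of this handler, but leave the connection
        // itself to the other protocols sharing it.
        self.keep_alive = KeepAlive::No;
    }
    // ... and continue polling the already-established substreams as before.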

@AgeManning (Contributor)

I just read through this. I'm one of the people that have been closing connections on errors (in gossipsub) :p.

I'm interested in the negotiation errors, and I have a few questions about how some of this logic should be handled.

I see these often. I think a good portion of the time it is when trying to connect to a go-libp2p client and, for one reason or another (usually they have too many peers), they drop the initial connection. From our side, it can look like a negotiation error. In the earlier days, I think they were not responding but not dropping the connection, and it was up to us to drop the connection when negotiation failed. In the new version, we may keep these stale connections alive until the KeepAlive timeout ends?

There were also attacks where an attacker would consume all of the local file descriptors by constantly opening many new connections and keeping them alive. We mitigated this somewhat by capping the number of inbound connections, but also by dropping connections pretty fast when negotiations or connections failed. Could we be at risk if we now leave these connections alive for longer?

When a node goes offline, we can identify it via Pings or various other negotiation timeouts. In the past I've found it useful to drop connections faster so that we can run discovery and search for new peers, rather than keeping stale old connections alive.

I think all of these can be mitigated by the keep-alive timeout, I guess? Curious about your thoughts on these.

@thomaseizinger (Contributor, Author)

In the new version, we may keep these stale connections alive until the KeepAlive timeout ends?

Unless a ConnectionHandler sets KeepAlive::Until, terminating the connection is instant with KeepAlive::No. My approach would be that any protocol needs to disable its handler instantly if it receives a negotiation error. If we go forward with an approach such as #3651 for #2680, then a ConnectionHandler could even eagerly disable itself once we receive the "supported protocols" (via identify, for example) and the current one is not present. A quick sketch could be (Disclaimer: I only came up with this idea yesterday, there might still be flaws):

  • Connection is established
  • Some protocols, like identify, run upon connection establishment
  • They receive a response and emit ConnectionHandlerEvent::ReportRemoteProtocols
  • The list of protocols is immediately handed back to the ConnectionHandler via ConnectionEvent::RemoteProtocolsChange
  • Each ConnectionHandler could instantly set itself to KeepAlive::No if its own protocol is not present in the list
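
A self-contained toy version of the last step (all names here are illustrative stand-ins, since ReportRemoteProtocols / RemoteProtocolsChange are only a proposal at this point):

    // Toy model of the proposed flow; none of these names are the libp2p-swarm API.
    #[derive(PartialEq)]
    struct Protocol(&'static str);

    struct ToyHandler {
        my_protocol: Protocol,
        keep_alive: bool, // stands in for KeepAlive::Yes / KeepAlive::No
    }

    impl ToyHandler {
        // Would correspond to receiving the proposed ConnectionEvent::RemoteProtocolsChange.
        fn on_remote_protocols_change(&mut self, remote: &[Protocol]) {
            // Eagerly disable this handler if the remote does not speak our protocol.
            if !remote.contains(&self.my_protocol) {
                self.keep_alive = false;
            }
        }
    }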

Curious about your thoughts on these.

The longer discussion around this is in #3353 (comment). I think eagerly closing connections on unsupported protocols is a reasonable policy. I also think that policy does not belong in rust-libp2p: it is something each application has to decide for itself. My recommendation would be that protocols should emit an event if they disable themselves (perhaps we can even provide something standard for this?). This event can be captured by the behaviour and emitted to the Swarm. At the application level, you can then close connections to peers that e.g. don't support gossipsub, in case your application relies on that.
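
As a rough sketch of that split between mechanism and policy (the HandlerDisabled event and the behaviour around it are assumptions, not an existing API; Swarm::disconnect_peer_id is the existing Swarm-level mechanism):

    use libp2p::{swarm::NetworkBehaviour, PeerId, Swarm};

    // Hypothetical event a behaviour could emit when its handler disabled itself
    // because the remote does not support the protocol.
    enum AppEvent {
        HandlerDisabled { peer: PeerId },
    }

    // Application-level policy: if the application depends on this protocol,
    // drop the whole connection to peers that do not support it.
    fn apply_policy<B: NetworkBehaviour>(swarm: &mut Swarm<B>, event: AppEvent) {
        match event {
            AppEvent::HandlerDisabled { peer } => {
                let _ = swarm.disconnect_peer_id(peer);
            }
        }
    }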

I don't want to derail this too much but the entire KeepAlive mechanism may also need some more thought. For example, should a protocol like identify keep the connection alive? It is more of an "auxiliary" protocol in my eyes but if no other protocols need the connection, it could be closed.

@AgeManning (Contributor)

Yeah I agree.

We probably need a standardized way of letting users decide which protocols should keep the connection alive. Maybe in the derive macro, when composing behaviours, we could tell the macro which sub-behaviours we want to keep the connection alive for.

@rkuhn (Contributor) commented Mar 23, 2023

I’d think that it depends on the protocol if KeepAlive::No makes sense. If the user (of the derived behaviour) wants to keep connections around then that should probably be done by offering a NetworkBehaviour that does just that — previously this was allowed by the ping protocol, but the functionality has — correctly, I think — been decoupled from ping.

While I do agree with @mxinden’s proposal above, I’m not sure whether it addresses the issue @AgeManning raised: if we need to interact with a broken peer (like go-ipfs sounds to be) who “signals” the desire that we should close the connection by reporting a nondescript stream negotiation failure, then we might need to keep the ability for a stream protocol handler to close the connection. Alternatively, go-ipfs could be fixed …


Regarding the overall topic: it might well be reasonable to run a protocol on a stream that performs admin coordination. Such a handler may then want to shut down the connection when agreed with the peer, independent of other stream handlers setting KeepAlive::Yes. If the connection handler event goes away, then this will need to be done by emitting an event into the corresponding admin NetworkBehaviour and closing the connection from there, right?

@thomaseizinger (Contributor, Author)

We probably need a standardized way of letting users decide which protocols should keep the connection alive. Maybe in the derive macro, when composing behaviours, we could tell the macro which sub-behaviours we want to keep the connection alive for.

Configuration is one option (pun intended), but adding configuration options is a fight we are not going to win as more use cases come up. I am a fan of thinking in policies and mechanisms. rust-libp2p should provide a mechanism for closing connections, and users should bring the policy on when to use it.

If the connection handler event goes away, then this will need to be done by emitting an event into the corresponding admin NetworkBehaviour and closing the connection from there, right?

Yes, that is the idea.


I am starting to wonder whether the entire connection_keep_alive functionality is a good idea. Wouldn't everyone be better served if we didn't close connections automatically?

  1. ConnectionHandlers wouldn't need to use KeepAlive::Until as the default value to avoid connections being shut down instantly after they are established
  2. Our examples and tests wouldn't need to add keep_alive::Behaviour to work

Even determining whether a remote supports any of our protocols is difficult. We can only learn that through identify and if that works, a naive intersection of the protocols will at least yield that one.

@AgeManning (Contributor)

From a selfish perspective (gossipsub), it would seem to make sense that each behaviour has the ability to "disable" itself independently of other behaviours (such that for all intents and purposes a "DisableBehaviour" event looks the same as a "CloseConnection" at the behaviour level). I guess this is what the KeepAlive is attempting to do.

If a peer does something the behaviour doesn't like, it has the ability to drop all substreams and prevent new ones from being created. Other behaviours work independently until all become disabled and the connection gets closed altogether. It's been a while since I looked at KeepAlive, but from memory it does this, just without the ability to disable re-established substreams?

mergify bot pushed a commit that referenced this issue Apr 14, 2023
Previously, we closed the entire connection upon receiving too many upgrade errors. This is unnecessarily aggressive. For example, an upgrade error may be caused by the remote dropping a stream during the initial handshake which is completely isolated from other protocols running on the same connection.

Instead of closing the connection, set `KeepAlive::No`.

Related: #3591.
Resolves: #3690.

Pull-Request: #3625.
tcoratger pushed a commit to tcoratger/rust-libp2p that referenced this issue Apr 14, 2023
@thomaseizinger (Contributor, Author)

We are going to land #3605 soon, which already simplifies error handling a fair bit.

Plus, @mxinden had a good idea regarding keep-alive: Deprecate the Until mechanism. I've captured this here: #3844

mergify bot pushed a commit that referenced this issue May 24, 2023
Previously, the `libp2p-ping` module came with a policy to close a connection after X failed pings. This is only one of many possible policies on how users would want to do connection management.

We remove this policy without a replacement. If users wish to restore this functionality, they can easily implement such a policy themselves: the default value of `max_failures` was 1. To restore the previous functionality, users can simply close the connection upon the first received ping error.

In this same patch, we also simplify the API of `ping::Event` by removing the layer of `ping::Success` and instead reporting the RTT to the peer directly.

Related: #3591.

Pull-Request: #3947.
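
For users who relied on the old behaviour, a minimal sketch of restoring it at the application level (assuming the post-change `ping::Event` carries the peer and a `Result` with the RTT or a `ping::Failure`, and using the existing `Swarm::disconnect_peer_id`):

    use std::time::Duration;

    use libp2p::{ping, swarm::NetworkBehaviour, PeerId, Swarm};

    // Close the connection to a peer on the first failed ping, mirroring the
    // removed `max_failures = 1` default, but now as an application-level choice.
    fn on_ping_result<B: NetworkBehaviour>(
        swarm: &mut Swarm<B>,
        peer: PeerId,
        result: Result<Duration, ping::Failure>,
    ) {
        if result.is_err() {
            let _ = swarm.disconnect_peer_id(peer);
        }
    }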
@thomaseizinger thomaseizinger changed the title swarm: Deprecate ConnectionHandlerEvent::Close swarm: Remove ConnectionHandlerEvent::Close Jun 27, 2023
@thomaseizinger thomaseizinger changed the title swarm: Remove ConnectionHandlerEvent::Close swarm: Remove ConnectionHandler::Error Sep 20, 2023
@thomaseizinger thomaseizinger added this to the v0.53.0 milestone Oct 24, 2023
@thomaseizinger (Contributor, Author)

@mxinden I added this to the milestone but I'd also like to make another patch-release of libp2p-swarm where we deprecate ConnectionHandlerEvent::Close to give users a heads-up that we are going to remove this.

mergify bot pushed a commit that referenced this issue Oct 24, 2023
mergify bot pushed a commit that referenced this issue Oct 31, 2023
To make a reservation with a relay, a user calls `Swarm::listen_on` with an address of the relay, suffixed with a `/p2p-circuit` protocol. Similarly, to establish a circuit to another peer, a user needs to call `Swarm::dial` with such an address. Upon success, the `Swarm` then issues a `SwarmEvent::NewListenAddr` event in case of a successful reservation, or a `SwarmEvent::ConnectionEstablished` in case of a successful connect.

The story is different for errors. Somewhat counterintuitively, the actual reason for an error during these operations is only reported as a `relay::Event`, without a direct correlation to the user's `Swarm::listen_on` or `Swarm::dial` calls.

With this PR, we send these errors back "into" the `Transport` and report them as `SwarmEvent::ListenerClosed` or `SwarmEvent::OutgoingConnectionError`. This is conceptually more correct. Additionally, by sending these errors back to the transport, we no longer use `ConnectionHandlerEvent::Close` which entirely closes the underlying relay connection. In case the connection is not used for something else, it will be closed by the keep-alive algorithm.

Resolves: #4717.
Related: #3591.
Related: #4718.

Pull-Request: #4745.
mergify bot pushed a commit that referenced this issue Nov 1, 2023
We refactor the `libp2p-dcutr` API to only emit a single event: whether the hole-punch was successful or not. All other intermediate events are removed. Hole-punching is something that we try to do automatically as soon as we are connected to a peer over a relayed connection. The lack of explicit user intent means any event we emit is at best informational and not a "response" that the user would wait for.

Thus, I chose to not expose the details of why the hole-punch failed but return an opaque error.

Lastly, this PR also removes the usage of `ConnectionHandlerEvent::Close`. Just because something went wrong during the DCUtR handshake, doesn't mean we should close the relayed connection.

Related: #3591.

Pull-Request: #4749.
mergify bot pushed a commit that referenced this issue Nov 1, 2023
To remove the usages of `ConnectionHandlerEvent::Close` from the relay-server, we unify what used to be called `CircuitFailedReason` and `FatalUpgradeError`. Whilst the errors may be fatal for the particular circuit, they are not necessarily fatal for the entire connection.

Related: #3591.
Resolves: #4716.

Pull-Request: #4718.
@thomaseizinger thomaseizinger removed this from the v0.53.0 milestone Nov 2, 2023
@mergify mergify bot closed this as completed in #4755 Nov 2, 2023
@mergify mergify bot closed this as completed in 0ef6feb Nov 2, 2023
alindima pushed a commit to alindima/rust-libp2p that referenced this issue Jan 9, 2024