Skip to content

Commit

Permalink
fix(iroh-net): Keep the relay connection alive on read errors (#2782)
Browse files Browse the repository at this point in the history
## Description

When the connection to the relay server fails the read channel will
return a read error. At this point the ActiveRelay actor will passively
wait until it has been asked to send something again before it will
re-establish a connection.

However if the local node has no reason to send anything to the relay
server, the connection is never re-established. This is problematic when
the relay has remote nodes trying to send to this node. This doubly
problematic when the connection is to the home relay: the node just sits
there thinking everything is healty and quiet, but no traffic is
reaching it.

In a node with active traffic this doesn't really show up, since a send
will be triggered quickly for an active connection and the connection
with the relay server would be re-established.

The start of the ActiveRelay run loop is the right place for this. A
read error triggers the loop to go round, logs a read error already and
then re-estagblishes the connection.

This does not keep the relay connection open forever. The mechanism that
is cleans up
unused connections to relay servers will still function correctly since
this only takes
the time something was last sent to a relay server into account. As long
as a connection
with a remote node exists there will be a DISCO ping between the two
nodes over the relay
path, so the connection is correctly kept alive. The home relay is
exempted from the
relay connection cleanup so is also kept connected, leaving this node
available to be
contacted via the relay server. Which is the entire point of this
bugfix.

The relay_client.is_connected() call sends a message to the relay Client
actor, and relay_client.connect() does that again. Taking the shortcut
to only call .connect() however is not better because the logging
becomes messier. In the common case there is one roundrip-message to the
relay Client actor and this would not improve anyway. The two messages
for the case where a reconnect is needed does not occur commonly.

## Breaking Changes

None

## Notes & open questions

Fixes fishfolk/bones#428

It is rather difficult to test though.

This targets #2781 as base.

## Change checklist

- [x] Self-review.
- ~~[ ] Documentation updates following the [style
guide](https://rust-lang.github.io/rfcs/1574-more-api-documentation-conventions.html#appendix-a-full-conventions-text),
if relevant.~~
- ~~[ ] Tests if relevant.~~
- ~~[ ] All breaking changes documented.~~
  • Loading branch information
flub authored Oct 4, 2024
1 parent c7ac982 commit 383f1f9
Showing 1 changed file with 11 additions and 3 deletions.
14 changes: 11 additions & 3 deletions iroh-net/src/magicsock/relay_actor.rs
Original file line number Diff line number Diff line change
Expand Up @@ -107,6 +107,13 @@ impl ActiveRelay {
.context("initial connection")?;

loop {
// If a read error occurred on the connection it might have been lost. But we
// need this connection to stay alive so we can receive more messages sent by
// peers via the relay even if we don't start sending again first.
if !self.relay_client.is_connected().await? {
debug!("relay re-connecting");
self.relay_client.connect().await.context("keepalive")?;
}
tokio::select! {
Some(msg) = inbox.recv() => {
trace!("tick: inbox: {:?}", msg);
Expand Down Expand Up @@ -136,7 +143,7 @@ impl ActiveRelay {
r.send(client).ok();
}
ActiveRelayMessage::Shutdown => {
self.relay_client.close().await.ok();
debug!("shutdown");
break;
}
}
Expand All @@ -146,17 +153,18 @@ impl ActiveRelay {
if let Some(msg) = msg {
if self.handle_relay_msg(msg).await == ReadResult::Break {
// fatal error
self.relay_client.close().await.ok();
break;
}
}
}
else => {
debug!("all clients closed");
break;
}
}
}

debug!("exiting");
self.relay_client.close().await?;
Ok(())
}

Expand Down

0 comments on commit 383f1f9

Please sign in to comment.