refactor(iroh-net): Optimise present nodes in ActiveRelay #2781

Merged
merged 9 commits into from
Oct 3, 2024

Conversation

flub
Contributor

@flub flub commented Oct 3, 2024

Description

The ActiveRelay actor keeps track of which remote nodes are present on the relay connection so that we can optimise relay connections to remote nodes. This PR makes two main optimisations, plus two bonus cleanups:

  • Two sets of these nodes were kept, which could easily be unified.

  • The set is best stored as a BTreeSet, since its elements are simple NodeIds.

  • Bonus: rename peer to node to match our naming convention.

  • Bonus: identify nodes by NodeId since this is a routing key here.
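
As a rough illustration of the unified set described above, here is a minimal sketch. The `NodeId` alias and the `ActiveRelayNodes` struct with its methods are hypothetical stand-ins for illustration only; they are not the actual types or field names in `relay_actor.rs`.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for iroh's `NodeId`, assumed here to be a
/// plain 32-byte key that is cheap to copy and totally ordered.
type NodeId = [u8; 32];

/// Simplified, hypothetical view of the per-relay state: one set of
/// remote nodes present on the relay instead of two separate sets.
#[derive(Debug, Default)]
struct ActiveRelayNodes {
    node_present: BTreeSet<NodeId>,
}

impl ActiveRelayNodes {
    /// Records that `node` was seen on this relay.
    /// Returns `true` if the node was not already known.
    fn note_present(&mut self, node: NodeId) -> bool {
        self.node_present.insert(node)
    }

    /// Records that `node` is gone from this relay.
    /// Returns `true` if the node had been present.
    fn note_gone(&mut self, node: &NodeId) -> bool {
        self.node_present.remove(node)
    }

    /// Whether `node` is currently known to be on this relay.
    fn is_present(&self, node: &NodeId) -> bool {
        self.node_present.contains(node)
    }
}
```

Keying everything by `NodeId` means the same value serves as both the routing key and the set element, so no second lookup structure is needed.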

Breaking Changes

Still none if all is well.

Notes & open questions

This targets #2779 as base.

Change checklist

  • Self-review.
  • Documentation updates following the style guide, if relevant.
  • [ ] Tests if relevant.
  • [ ] All breaking changes documented.

These are two cleanups in the relay client:

- The `relay::Client` hands out a connection object when asked to
  connect.  This `Conn` was imported with rename to `RelayClient`
  which was a bit confusing as this was already the relay client.  It
  is now renamed to `RelayConn` which makes a lot more sense.  The
  related builder struct etc are renamed to match.

- The `relay::Client` had a counter for the number of connections made
  to the relay.  That seems fun, but was entirely unused.  If this is
  a useful thing to have it should probably be a counter metric
  instead but let's not add anything that no one is using.  Removing
  this makes a lot of APIs a bit simpler and removes some state
  tracking.
@flub flub requested a review from ramfox October 3, 2024 12:00

github-actions bot commented Oct 3, 2024

Documentation for this PR has been generated and is available at: https://n0-computer.github.io/iroh/pr/2781/docs/iroh/

Last updated: 2024-10-03T17:26:38Z


github-actions bot commented Oct 3, 2024

Netsim report & logs for this PR have been generated and are available at: LOGS
This report will remain available for 3 days.

Last updated for commit: aa7fc95

@flub flub force-pushed the flub/relay-actor-active-nodes branch from fd3545c to 8c594f5 Compare October 3, 2024 16:45
Contributor

@divagant-martian divagant-martian left a comment

I checked the original code and while the two sets were not exactly the same, I see no reason why this should work differently to the previous one. From my perspective this lgtm!

flub and others added 3 commits October 3, 2024 19:21
@flub flub force-pushed the flub/relay-actor-active-nodes branch from 8c594f5 to 275b399 Compare October 3, 2024 17:24
Base automatically changed from flub/relay-client-cleanup-1 to main October 3, 2024 17:57
@flub flub added this pull request to the merge queue Oct 3, 2024
Merged via the queue into main with commit c7ac982 Oct 3, 2024
27 checks passed
@flub flub deleted the flub/relay-actor-active-nodes branch October 4, 2024 07:08
github-merge-queue bot pushed a commit that referenced this pull request Oct 4, 2024
## Description

When the connection to the relay server fails the read channel will
return a read error. At this point the ActiveRelay actor will passively
wait until it has been asked to send something again before it will
re-establish a connection.

However if the local node has no reason to send anything to the relay
server, the connection is never re-established. This is problematic when
the relay has remote nodes trying to send to this node. This is doubly
problematic when the connection is to the home relay: the node just sits
there thinking everything is healthy and quiet, but no traffic is
reaching it.

In a node with active traffic this doesn't really show up, since a send
will be triggered quickly for an active connection and the connection
with the relay server would be re-established.

The start of the ActiveRelay run loop is the right place to re-establish
the connection. A read error already triggers the loop to go round and
is already logged, so the next iteration re-establishes the connection.
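
A minimal sketch of this check at the top of the loop is shown below. It is synchronous and uses a hypothetical `MockRelayClient`; the real relay client is an async actor, and only the `is_connected()` and `connect()` names come from this description. Everything else, including `recv`, `fail_next_read`, and `run_loop_iteration`, is assumed for illustration.

```rust
/// Hypothetical, synchronous stand-in for the relay client handle.
#[derive(Debug, Default)]
struct MockRelayClient {
    connected: bool,
    /// Number of times `connect` was called, for illustration.
    connect_calls: u32,
    /// When set, the next `recv` fails and drops the connection.
    fail_next_read: bool,
}

impl MockRelayClient {
    fn is_connected(&self) -> bool {
        self.connected
    }

    fn connect(&mut self) {
        self.connect_calls += 1;
        self.connected = true;
    }

    /// Simulates a read; a failed read leaves the client disconnected.
    fn recv(&mut self) -> Result<Vec<u8>, String> {
        if self.fail_next_read {
            self.fail_next_read = false;
            self.connected = false;
            return Err("read error".to_string());
        }
        Ok(Vec::new())
    }
}

/// One iteration of a simplified run loop: reconnect first if needed,
/// then read. A read error only gets logged here; the reconnect
/// happens at the start of the next iteration.
fn run_loop_iteration(client: &mut MockRelayClient) {
    if !client.is_connected() {
        eprintln!("relay not connected, re-establishing connection");
        client.connect();
    }
    if let Err(err) = client.recv() {
        eprintln!("relay read failed: {err}");
    }
}
```

With this shape, a node that never sends anything still reconnects, because the check no longer depends on an outgoing message.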

This does not keep the relay connection open forever. The mechanism that
cleans up unused connections to relay servers will still function
correctly, since it only takes into account the time something was last
sent to a relay server. As long as a connection with a remote node
exists there will be a DISCO ping between the two nodes over the relay
path, so the connection is correctly kept alive. The home relay is
exempted from the relay connection cleanup, so it is also kept
connected, leaving this node available to be contacted via the relay
server, which is the entire point of this bugfix.
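
The cleanup rule described above can be sketched roughly as follows. This is an assumption-laden illustration, not the actual cleanup code: the `relays_to_close` function, the use of `String` relay URLs, and the `idle_timeout` parameter are all hypothetical.

```rust
use std::collections::BTreeMap;
use std::time::{Duration, Instant};

/// Returns the relays whose connections may be closed: those where
/// nothing was sent for longer than `idle_timeout`, except the home
/// relay, which is always kept connected.
///
/// `last_sent` maps a relay URL to the time of the last send; the map
/// layout and the timeout value are assumptions for this sketch.
fn relays_to_close(
    last_sent: &BTreeMap<String, Instant>,
    home_relay: Option<&str>,
    now: Instant,
    idle_timeout: Duration,
) -> Vec<String> {
    last_sent
        .iter()
        // The home relay is exempt from cleanup.
        .filter(|(url, _)| Some(url.as_str()) != home_relay)
        // Only the last send time counts, not reads or reconnects.
        .filter(|(_, sent)| now.saturating_duration_since(**sent) > idle_timeout)
        .map(|(url, _)| url.clone())
        .collect()
}
```

Because reconnecting does not update the last-sent time in this model, a reconnect alone never extends the lifetime of an otherwise unused relay connection.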

The relay_client.is_connected() call sends a message to the relay Client
actor, and relay_client.connect() does that again. Taking the shortcut
of only calling .connect() is not better, however, because the logging
becomes messier. In the common case there is one round-trip message to
the relay Client actor, and the shortcut would not improve on that. The
two-message case, where a reconnect is needed, does not occur commonly.

## Breaking Changes

None

## Notes & open questions

Fixes fishfolk/bones#428

It is rather difficult to test though.

This targets #2781 as base.

## Change checklist

- [x] Self-review.
- ~~[ ] Documentation updates following the [style
guide](https://rust-lang.github.io/rfcs/1574-more-api-documentation-conventions.html#appendix-a-full-conventions-text),
if relevant.~~
- ~~[ ] Tests if relevant.~~
- ~~[ ] All breaking changes documented.~~