Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

separates out routing shreds from establishing connections #33599

Merged
merged 1 commit into from
Oct 19, 2023

Conversation

behzadnouri
Copy link
Contributor

Problem

Currently each outgoing shred will attempt to establish a connection if one does not already exist. This is very wasteful and consumes many tokio tasks if the remote node is down or unresponsive.

Summary of Changes

The commit decouples routing packets from establishing connections by adding a buffering channel for each remote address. Outgoing packets are always sent down this channel to be processed once the connection is established. If connecting attempt fails, all packets already pushed to the channel are dropped at once, reducing the number of attempts to make a connection if the remote node is down or unresponsive.

@behzadnouri behzadnouri force-pushed the turbine-quic-router branch 2 times, most recently from f3572dc to c0ffb78 Compare October 9, 2023 18:33
@codecov
Copy link

codecov bot commented Oct 9, 2023

Codecov Report

Merging #33599 (e3d6faa) into master (ac788ab) will increase coverage by 0.0%.
The diff coverage is 89.1%.

@@           Coverage Diff           @@
##           master   #33599   +/-   ##
=======================================
  Coverage    81.8%    81.8%           
=======================================
  Files         806      806           
  Lines      217588   217612   +24     
=======================================
+ Hits       178106   178133   +27     
+ Misses      39482    39479    -3     

Copy link
Contributor

@steviez steviez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of the extra routing layer, would it be possible to leave a "tombstone" in the cache that would indicate that we previously tried & failed to establish a connection and shouldn't try again?

};
let receiver = {
let mut router = router.write().await;
let bytes = match router.get(&remote_address) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Was going to recommend trying to de-duplicate this block as it is nearly identical to the block above, but with the continue statement to control loop flow, I don't see a great way to do so unfortunately

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added a helper function to reduce the amount of duplicate code.

turbine/src/quic_endpoint.rs Show resolved Hide resolved
Currently each outgoing shred will attempt to establish a connection if
one does not already exist. This is very wasteful and consumes many
tokio tasks if the remote node is down or unresponsive.

The commit decouples routing packets from establishing connections by
adding a buffering channel for each remote address. Outgoing packets are
always sent down this channel to be processed once the connection is
established. If connecting attempt fails, all packets already pushed to
the channel are dropped at once, reducing the number of attempts to make
a connection if the remote node is down or unresponsive.
@behzadnouri
Copy link
Contributor Author

Instead of the extra routing layer, would it be possible to leave a "tombstone" in the cache that would indicate that we previously tried & failed to establish a connection and shouldn't try again?

How do we decide to retry connection in that case? i.e. when and how the tombstone gets cleared?

An advantage of this routing layer is that the connection cache also simplifies from

HashMap<(SocketAddr, Option<Pubkey>), Arc<RwLock<Option<Connection>>>>

to

HashMap<Pubkey, Connection>

which makes the follow up patch for cache eviction much simpler.

@steviez
Copy link
Contributor

steviez commented Oct 16, 2023

How do we decide to retry connection in that case? i.e. when and how the tombstone gets cleared?

Hypothetically, the tombstone could contain a timestamp and we retry if the tombstone has reached some predefined age.

An advantage of this routing layer is that the connection cache also simplifies from
...
which makes the follow up patch for cache eviction much simpler.

Fair enough. I'll take another another pass at this tomorrow

@behzadnouri behzadnouri merged commit 8becb72 into solana-labs:master Oct 19, 2023
31 checks passed
@behzadnouri behzadnouri deleted the turbine-quic-router branch October 19, 2023 15:44
@behzadnouri behzadnouri added the v1.17 PRs that should be backported to v1.17 label Oct 19, 2023
mergify bot pushed a commit that referenced this pull request Oct 19, 2023
Currently each outgoing shred will attempt to establish a connection if
one does not already exist. This is very wasteful and consumes many
tokio tasks if the remote node is down or unresponsive.

The commit decouples routing packets from establishing connections by
adding a buffering channel for each remote address. Outgoing packets are
always sent down this channel to be processed once the connection is
established. If connecting attempt fails, all packets already pushed to
the channel are dropped at once, reducing the number of attempts to make
a connection if the remote node is down or unresponsive.

(cherry picked from commit 8becb72)
Copy link
Contributor

@joncinque joncinque left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late review, but just one question

Comment on lines +255 to +259
let receiver = {
let (sender, receiver) = tokio::sync::mpsc::channel(ROUTER_CHANNEL_BUFFER);
router.write().await.insert(remote_address, sender);
receiver
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the server task always creating a new channel, while the client task reuses one if it exists? Is it possible for the client side to have already tried to initiate a connection to that remote address? If that's the case, it looks like the server side would be clobbering the previous channel

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The server side does not initiate a connection, it only accepts incoming connections from remote nodes.

If there is already a connection and for whatever reason the remote node initiates a new connection, then yes, it will drop the previous connection and replace it with the new one. This happens both in the router hash-map here, and the cache:
https://github.com/solana-labs/solana/blob/dc3c82729/turbine/src/quic_endpoint.rs#L407-L409

We can possibly allow multiple connections per pubkey by having a Vec<Connection> instead of a single Connection, but for now I think a single Connection per pubkey would be simpler.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok great, that explains it, thanks! No need to have a Vec<Connection> -- resetting on a new remote connection makes sense.

behzadnouri added a commit that referenced this pull request Oct 20, 2023
…ckport of #33599) (#33772)

separates out routing shreds from establishing connections (#33599)

Currently each outgoing shred will attempt to establish a connection if
one does not already exist. This is very wasteful and consumes many
tokio tasks if the remote node is down or unresponsive.

The commit decouples routing packets from establishing connections by
adding a buffering channel for each remote address. Outgoing packets are
always sent down this channel to be processed once the connection is
established. If connecting attempt fails, all packets already pushed to
the channel are dropped at once, reducing the number of attempts to make
a connection if the remote node is down or unresponsive.

(cherry picked from commit 8becb72)

Co-authored-by: behzad nouri <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
v1.17 PRs that should be backported to v1.17
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants