separates out routing shreds from establishing connections #33599

behzadnouri · 2023-10-09T18:26:59Z

Problem

Currently, each outgoing shred attempts to establish a connection if one does not already exist. This is wasteful and consumes many tokio tasks when the remote node is down or unresponsive.

Summary of Changes

This commit decouples routing packets from establishing connections by adding a buffering channel for each remote address. Outgoing packets are always sent down this channel and are processed once the connection is established. If a connection attempt fails, all packets already pushed to the channel are dropped at once, which reduces the number of connection attempts made when the remote node is down or unresponsive.
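A minimal sketch of the per-address buffering described above, not the code from this PR. It uses `std::sync::mpsc::sync_channel` as a stand-in for `tokio::sync::mpsc::channel` so that it runs without dependencies. The `Router` struct, its `route` method, and the value of `ROUTER_CHANNEL_BUFFER` are assumptions made for illustration.

```rust
use std::collections::HashMap;
use std::net::SocketAddr;
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Hypothetical buffer size; the value used in the PR is not shown here.
const ROUTER_CHANNEL_BUFFER: usize = 4;

/// Maps each remote address to the sending half of its buffering channel.
#[derive(Default)]
pub struct Router {
    senders: HashMap<SocketAddr, SyncSender<Vec<u8>>>,
}

impl Router {
    /// Queues `bytes` for `addr`. Returns the receiving half only when a new
    /// channel had to be created, i.e. when the caller should start one
    /// connection attempt that will drain the channel on success.
    pub fn route(&mut self, addr: SocketAddr, bytes: Vec<u8>) -> Option<Receiver<Vec<u8>>> {
        let bytes = match self.senders.get(&addr) {
            None => bytes,
            Some(sender) => match sender.try_send(bytes) {
                Ok(()) => return None,
                // Buffer is full: drop this packet but keep the channel.
                Err(TrySendError::Full(_)) => return None,
                // The receiver was dropped (e.g. connecting failed): start over.
                Err(TrySendError::Disconnected(bytes)) => bytes,
            },
        };
        let (sender, receiver) = sync_channel(ROUTER_CHANNEL_BUFFER);
        // Cannot fail: the channel is new, non-zero capacity, receiver alive.
        sender.try_send(bytes).expect("fresh channel has capacity");
        self.senders.insert(addr, sender);
        Some(receiver)
    }
}
```

In this sketch a failed connection attempt is modeled by dropping the returned `Receiver`, which discards every buffered packet at once. The next call to `route` for that address sees the disconnected channel and hands back a fresh receiver, so only one connection attempt is in flight per address at a time.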

codecov · 2023-10-09T19:35:27Z

Codecov Report

Merging #33599 (e3d6faa) into master (ac788ab) will increase coverage by 0.0%.
The diff coverage is 89.1%.

@@           Coverage Diff           @@
##           master   #33599   +/-   ##
=======================================
  Coverage    81.8%    81.8%           
=======================================
  Files         806      806           
  Lines      217588   217612   +24     
=======================================
+ Hits       178106   178133   +27     
+ Misses      39482    39479    -3

steviez

Instead of the extra routing layer, would it be possible to leave a "tombstone" in the cache that would indicate that we previously tried & failed to establish a connection and shouldn't try again?

steviez · 2023-10-12T06:25:47Z

turbine/src/quic_endpoint.rs

+        };
+        let receiver = {
+            let mut router = router.write().await;
+            let bytes = match router.get(&remote_address) {


nit: Was going to recommend trying to de-duplicate this block as it is nearly identical to the block above, but with the continue statement to control loop flow, I don't see a great way to do so unfortunately

Added a helper function to reduce the amount of duplicate code.


behzadnouri · 2023-10-12T16:14:45Z

Instead of the extra routing layer, would it be possible to leave a "tombstone" in the cache that would indicate that we previously tried & failed to establish a connection and shouldn't try again?

How do we decide to retry the connection in that case? That is, when and how does the tombstone get cleared?

An advantage of this routing layer is that the connection cache also simplifies from

HashMap<(SocketAddr, Option<Pubkey>), Arc<RwLock<Option<Connection>>>>

to

HashMap<Pubkey, Connection>

which makes the follow-up patch for cache eviction much simpler.
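To make the difference between the two cache shapes concrete, here is a rough sketch of a lookup against each. `Pubkey` and `Connection` are placeholder types standing in for the real ones, `std::sync::RwLock` stands in for whichever lock the code actually uses, and `old_lookup` and `new_lookup` are hypothetical helpers that do not come from the patch.

```rust
use std::collections::HashMap;
use std::net::SocketAddr;
use std::sync::{Arc, RwLock};

// Placeholder stand-ins for the real `Pubkey` and `Connection` types.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct Pubkey(pub [u8; 32]);
#[derive(Clone, Debug, PartialEq)]
pub struct Connection(pub u64);

pub type OldCache = HashMap<(SocketAddr, Option<Pubkey>), Arc<RwLock<Option<Connection>>>>;
pub type NewCache = HashMap<Pubkey, Connection>;

/// Old shape: a composite key, then a lock, then an `Option` that may
/// still be empty while the connection is being established.
pub fn old_lookup(
    cache: &OldCache,
    addr: SocketAddr,
    pubkey: Option<Pubkey>,
) -> Option<Connection> {
    let entry = cache.get(&(addr, pubkey))?;
    let guard = entry.read().unwrap();
    let connection = guard.clone();
    connection
}

/// New shape: an entry exists only for an established connection.
pub fn new_lookup(cache: &NewCache, pubkey: &Pubkey) -> Option<Connection> {
    cache.get(pubkey).cloned()
}
```

With the simpler shape, every entry is a live connection keyed by identity, so an eviction pass would not need to take per-entry locks or skip placeholder entries.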

steviez · 2023-10-16T05:16:49Z

How do we decide to retry connection in that case? i.e. when and how the tombstone gets cleared?

Hypothetically, the tombstone could contain a timestamp and we retry if the tombstone has reached some predefined age.

An advantage of this routing layer is that the connection cache also simplifies from
...
which makes the follow up patch for cache eviction much simpler.

Fair enough. I'll take another pass at this tomorrow.
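For reference, the timestamped tombstone idea from this thread could look roughly like the sketch below. This alternative was discussed but not adopted in the PR. The `Tombstones` type, its methods, and the 30-second `TOMBSTONE_TTL` are all hypothetical.

```rust
use std::collections::HashMap;
use std::net::SocketAddr;
use std::time::{Duration, Instant};

/// Hypothetical age after which a failed address may be retried.
pub const TOMBSTONE_TTL: Duration = Duration::from_secs(30);

/// Records when the last connection attempt to each address failed.
#[derive(Default)]
pub struct Tombstones {
    failed_at: HashMap<SocketAddr, Instant>,
}

impl Tombstones {
    /// Leaves a tombstone for `addr` stamped with the time of the failure.
    pub fn record_failure(&mut self, addr: SocketAddr, now: Instant) {
        self.failed_at.insert(addr, now);
    }

    /// Returns true if a connection attempt to `addr` is allowed at `now`.
    /// A tombstone that has reached `TOMBSTONE_TTL` is cleared as a side effect.
    pub fn should_connect(&mut self, addr: SocketAddr, now: Instant) -> bool {
        let allowed = self
            .failed_at
            .get(&addr)
            .map_or(true, |&failed| now.saturating_duration_since(failed) >= TOMBSTONE_TTL);
        if allowed {
            self.failed_at.remove(&addr);
        }
        allowed
    }
}
```

Passing `now` in explicitly keeps the retry policy deterministic in tests. The sketch answers the question of when the tombstone gets cleared, but it leaves the nested cache type unchanged, which is the trade-off raised above.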


joncinque

Sorry for the late review, but just one question

joncinque · 2023-10-20T12:09:21Z

turbine/src/quic_endpoint.rs

+    let receiver = {
+        let (sender, receiver) = tokio::sync::mpsc::channel(ROUTER_CHANNEL_BUFFER);
+        router.write().await.insert(remote_address, sender);
+        receiver
+    };


Why is the server task always creating a new channel, while the client task reuses one if it exists? Is it possible for the client side to have already tried to initiate a connection to that remote address? If that's the case, it looks like the server side would be clobbering the previous channel

The server side does not initiate a connection, it only accepts incoming connections from remote nodes.

If there is already a connection and for whatever reason the remote node initiates a new connection, then yes, it will drop the previous connection and replace it with the new one. This happens both in the router hash-map here, and the cache:
https://github.com/solana-labs/solana/blob/dc3c82729/turbine/src/quic_endpoint.rs#L407-L409

We can possibly allow multiple connections per pubkey by having a Vec<Connection> instead of a single Connection, but for now I think a single Connection per pubkey would be simpler.

Ok great, that explains it, thanks! No need to have a Vec<Connection> -- resetting on a new remote connection makes sense.
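The replace-on-new-connection behavior discussed in this thread can be illustrated with a small sketch. It again uses `std::sync::mpsc::sync_channel` in place of the tokio channel. `RouterMap`, `register_incoming`, and the buffer size are assumed names and values, not the PR's code.

```rust
use std::collections::HashMap;
use std::net::SocketAddr;
use std::sync::mpsc::{sync_channel, Receiver, SyncSender};

/// Hypothetical buffer size; the value used in the PR is not shown here.
pub const ROUTER_CHANNEL_BUFFER: usize = 4;

pub type RouterMap = HashMap<SocketAddr, SyncSender<Vec<u8>>>;

/// Registers a fresh channel for an accepted incoming connection,
/// replacing any channel already stored for the same remote address.
pub fn register_incoming(router: &mut RouterMap, addr: SocketAddr) -> Receiver<Vec<u8>> {
    let (sender, receiver) = sync_channel(ROUTER_CHANNEL_BUFFER);
    // `insert` returns the previous sender, if any. Dropping it here closes
    // the old channel: its receiver drains what is buffered, then errors.
    router.insert(addr, sender);
    receiver
}
```

Because the map holds a single sender per address, the task serving the older connection sees its channel close after draining any buffered packets, while new packets flow only to the latest connection.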


behzadnouri force-pushed the turbine-quic-router branch 2 times, most recently from f3572dc to c0ffb78 Compare October 9, 2023 18:33

behzadnouri force-pushed the turbine-quic-router branch from c0ffb78 to 8910c50 Compare October 9, 2023 19:37

behzadnouri requested review from joncinque and steviez October 9, 2023 19:38

behzadnouri force-pushed the turbine-quic-router branch 3 times, most recently from 2cd44ce to e7fd2c2 Compare October 11, 2023 16:48

steviez reviewed Oct 12, 2023


behzadnouri force-pushed the turbine-quic-router branch from e7fd2c2 to e3d6faa Compare October 12, 2023 15:54

behzadnouri requested a review from steviez October 12, 2023 16:14

steviez approved these changes Oct 19, 2023


behzadnouri merged commit 8becb72 into solana-labs:master Oct 19, 2023
31 checks passed

behzadnouri deleted the turbine-quic-router branch October 19, 2023 15:44

behzadnouri added the v1.17 label (PRs that should be backported to v1.17) Oct 19, 2023

mergify bot mentioned this pull request Oct 19, 2023

v1.17: separates out routing shreds from establishing connections (backport of #33599) #33772

Merged

joncinque reviewed Oct 20, 2023


HaoranYi mentioned this pull request Apr 8, 2024

pr634 1.17 new filter anza-xyz/agave#657

Closed
