Report RPC Errors to the application on peer disconnections #5680

dapplion · 2024-05-01T08:23:11Z

Issue Addressed

Extends

Report RPC Errors to the application on peer disconnections #5658

As we are overhauling some internal RPC infrastructure, a desired feature is to report peer disconnects on RPC requests.

This PR should report an RPCError(Disconnected) if a connection is terminated whilst an RPC request is underway.

Proposed Changes

Emit RPCError::Disconnect to any outbound streams with a disconnecting peer
Remove sync code that fails download attempts of disconnected peers, and expect inject_error to handle it
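To illustrate the first change, here is a minimal sketch of failing every in-flight outbound request when a peer disconnects. It is not the code from this PR: `OutboundTracker`, `RpcFailed`, the simplified `RpcError`, and the integer peer and request IDs are all hypothetical stand-ins for the real handler state.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified error type; the real RPCError has more variants.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum RpcError {
    Disconnected,
}

/// Hypothetical event surfaced to the application layer.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RpcFailed {
    pub peer: u64,
    pub request_id: u32,
    pub error: RpcError,
}

/// Tracks in-flight outbound requests per peer (peers are plain integers here).
#[derive(Default)]
pub struct OutboundTracker {
    active: HashMap<u64, Vec<u32>>,
}

impl OutboundTracker {
    /// Record that a request was sent to `peer`.
    pub fn request_sent(&mut self, peer: u64, request_id: u32) {
        self.active.entry(peer).or_default().push(request_id);
    }

    /// Forget a request that completed normally.
    pub fn request_completed(&mut self, peer: u64, request_id: u32) {
        let now_empty = match self.active.get_mut(&peer) {
            Some(ids) => {
                ids.retain(|id| *id != request_id);
                ids.is_empty()
            }
            None => false,
        };
        if now_empty {
            self.active.remove(&peer);
        }
    }

    /// On disconnect, emit one `Disconnected` failure per request still in flight.
    pub fn peer_disconnected(&mut self, peer: u64) -> Vec<RpcFailed> {
        self.active
            .remove(&peer)
            .unwrap_or_default()
            .into_iter()
            .map(|request_id| RpcFailed {
                peer,
                request_id,
                error: RpcError::Disconnected,
            })
            .collect()
    }
}
```

With this shape, the application only needs to handle request failures; it does not have to separately track which requests belonged to a disconnected peer.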

Co-authored-by: Age Manning <[email protected]>

Squashed commit of the following:

commit f5dc1a3
Author: dapplion <[email protected]>
Date: Wed May 1 17:14:50 2024 +0900

    Expect RPCError::Disconnect to fail ongoing requests

commit 888f129
Author: dapplion <[email protected]>
Date: Wed May 1 14:14:22 2024 +0900

    Report RPC Errors to the application on peer disconnections

    Co-authored-by: Age Manning <[email protected]>

pawanjay176

This is a great catch, I just have a question

pawanjay176 · 2024-05-01T22:00:54Z

beacon_node/network/src/sync/backfill_sync/mod.rs

        if matches!(
            self.state(),
            BackFillState::Failed | BackFillState::NotRequired
        ) {
            return Ok(());
        }

-        if let Some(batch_ids) = self.active_requests.remove(peer_id) {


This is a great simplification. The repeated logic was a source of many seen/unseen bugs earlier. Kudos for having the big picture and spotting that this can be removed 🙌

My only concern is that inject_error calls batch.download_failed(true) rather than download_failed(false), which we probably shouldn't be doing for disconnections. Repeated peer disconnections for the same batch might end up marking the entire chain as invalid and redoing a bunch of work.

Good point. It should also be noted that these disconnects are ungraceful.

i.e. when Lighthouse disconnects from a peer, it waits and tries to fulfill all its requests. It won't just drop the connection. In fact, in a graceful disconnect a stream timeout will occur before the disconnection.

The peer-disconnect error should only happen when a peer drops the connection without fulfilling a request (which Lighthouse doesn't do unless there is a network error).

Right, current stable assumes that peer disconnection == failed download. But this RPCError::Disconnect error will only fire if there is an active outgoing request that gets terminated ungracefully.

Repeated peer disconnections for the same batch might end up marking the entire chain as invalid and redoing a bunch of stuff.

Should only apply if:

we initiate request to peer A

peer A disconnects ungracefully before completing request

we initiate a retry request to peer B

peer B disconnects ungracefully before completing request

This should not happen frequently
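The scenario above can be sketched as a batch that counts failed attempts. This is only an illustration of the concern, under assumptions: the meaning of the boolean, the `Batch` and `Outcome` types, and the attempt limit are hypothetical, and the real download_failed returns a Result carrying a BatchOperationOutcome.

```rust
/// Hypothetical outcome of reporting a failed download attempt.
#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    /// The batch can be retried with another peer.
    Continue,
    /// Too many counted failures; the batch (and its chain) is given up on.
    Failed,
}

/// Hypothetical batch that gives up after `max_attempts` counted failures.
pub struct Batch {
    failed_attempts: u8,
    max_attempts: u8,
}

impl Batch {
    pub fn new(max_attempts: u8) -> Self {
        Batch { failed_attempts: 0, max_attempts }
    }

    /// Assumed semantics: `mark_failed == true` counts this attempt against
    /// the batch, `false` records the failure without counting it.
    pub fn download_failed(&mut self, mark_failed: bool) -> Outcome {
        if mark_failed {
            self.failed_attempts = self.failed_attempts.saturating_add(1);
        }
        if self.failed_attempts >= self.max_attempts {
            Outcome::Failed
        } else {
            Outcome::Continue
        }
    }
}
```

With a limit of 2, peer A and then peer B disconnecting ungracefully is enough to fail the batch when each disconnect is counted, which is exactly the sequence listed above.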

fair enough

pawanjay176 · 2024-05-01T22:49:43Z

beacon_node/network/src/sync/range_sync/chain.rs

-            for id in batch_ids {
-                if let Some(batch) = self.batches.get_mut(&id) {
-                    if let BatchOperationOutcome::Failed { blacklist } =
-                        batch.download_failed(true)?


Similar comment as above regarding marking the batch as failed or not failed.

In forward sync, this could mean not retrying a valid chain because the peers on the good chain are disconnecting.

pawanjay176 · 2024-05-01T22:52:03Z

beacon_node/lighthouse_network/tests/rpc_tests.rs

+                        sender.send_request(peer_id, 42, rpc_request.clone());
+                    }
+                    NetworkEvent::RPCFailed { error, id: 42, .. } => match error {
+                        RPCError::Disconnected => return,


the test should make sure we only get to this branch

Since this branch is the only way to break out of the loop, not hitting this branch will time out the test. I considered adding something explicit but it feels redundant.
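The pattern being described (the matching branch is the only exit, anything else ends in a timeout) can be sketched with a plain channel. The `Event` type and `wait_for_disconnect_error` are hypothetical, and the timeout here applies per received event rather than to the whole test as in the real test harness.

```rust
use std::sync::mpsc::Receiver;
use std::time::Duration;

/// Hypothetical, simplified network events.
#[derive(Debug, PartialEq)]
pub enum Event {
    PeerConnected,
    RpcFailed { id: u32, disconnected: bool },
}

/// Loops until the expected failure for request 42 arrives.
/// Any other event is ignored, so the only successful exit is the
/// matching branch; otherwise the wait ends with an error.
pub fn wait_for_disconnect_error(
    rx: &Receiver<Event>,
    timeout: Duration,
) -> Result<(), String> {
    loop {
        match rx.recv_timeout(timeout) {
            Ok(Event::RpcFailed { id: 42, disconnected: true }) => return Ok(()),
            Ok(_) => continue,
            Err(e) => return Err(format!("expected event never arrived: {}", e)),
        }
    }
}
```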

AgeManning

I did a quick look. This looks good to me. I like the simplification.

The errors that we now see are from ungraceful disconnects, which probably should be punished and treated like an RPC error, imo

AgeManning · 2024-05-02T01:03:29Z

beacon_node/network/src/sync/backfill_sync/mod.rs

        if matches!(
            self.state(),
            BackFillState::Failed | BackFillState::NotRequired
        ) {
            return Ok(());
        }

-        if let Some(batch_ids) = self.active_requests.remove(peer_id) {


Good point. It should also be noted that these disconnects are ungraceful.

i.e. when Lighthouse disconnects from a peer, it waits and tries to fulfill all its requests. It won't just drop the connection. In fact, in a graceful disconnect a stream timeout will occur before the disconnection.

The peer-disconnect error should only happen when a peer drops the connection without fulfilling a request (which Lighthouse doesn't do unless there is a network error).

dapplion · 2024-05-03T01:19:04Z

@realbigsean noted that a lookup can get stuck if it has no available peers and is awaiting a download. This case should never happen with current code as a lookup is never left in AwaitingDownload state. However, for completeness I have added a check to drop lookups in that case in 02f1b2d
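As a rough sketch of that check, assuming a hypothetical `Lookup` type: on disconnect the peer is removed from every lookup, and a lookup is dropped only if it has no peers left and is not waiting on any event. Only the AwaitingDownload name comes from the discussion; the other states, the field names, and `peer_disconnected` are illustrative.

```rust
use std::collections::HashSet;

/// Hypothetical lookup states; only `AwaitingDownload` is named in the discussion.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LookupState {
    AwaitingDownload,
    Downloading,
    AwaitingProcessing,
}

/// Hypothetical lookup with the set of peers that can serve it.
pub struct Lookup {
    pub id: u32,
    pub peers: HashSet<u64>,
    pub state: LookupState,
}

impl Lookup {
    /// A lookup that is downloading or processing will receive a future
    /// event (a response, an error, or a processing result).
    fn is_awaiting_event(&self) -> bool {
        !matches!(self.state, LookupState::AwaitingDownload)
    }
}

/// Remove `peer` from all lookups and drop those that can no longer progress.
pub fn peer_disconnected(lookups: &mut Vec<Lookup>, peer: u64) {
    for lookup in lookups.iter_mut() {
        lookup.peers.remove(&peer);
    }
    lookups.retain(|l| !l.peers.is_empty() || l.is_awaiting_event());
}
```

A lookup that is mid-download from the disconnected peer is kept here on purpose: it is expected to be failed by the RPCError::Disconnect event rather than by this check.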

Squashed commit of the following:

commit 02f1b2d
Author: dapplion <[email protected]>
Date: Fri May 3 10:17:42 2024 +0900

    Drop lookups after peer disconnect and not awaiting events

commit f5dc1a3
Author: dapplion <[email protected]>
Date: Wed May 1 17:14:50 2024 +0900

    Expect RPCError::Disconnect to fail ongoing requests

commit 888f129
Author: dapplion <[email protected]>
Date: Wed May 1 14:14:22 2024 +0900

    Report RPC Errors to the application on peer disconnections

    Co-authored-by: Age Manning <[email protected]>

dapplion · 2024-05-03T08:48:10Z

The RPCError events are never received by sync due to this condition here

lighthouse/beacon_node/lighthouse_network/src/service/mod.rs

Lines 1370 to 1377 in 3058b96

    
        if !self.peer_manager().is_connected(&peer_id) {
            debug!(
                self.log,
                "Ignoring rpc message of disconnecting peer";
                event
            );
            return None;
        }

The tests in network/sync/block_lookups and in lighthouse_network each pass because neither tests the full integration.

@AgeManning there's a lot of code in this function that is currently not expecting events for disconnected peers. Would it be best to just allow events for disconnected peers if the event type is RPCError::Disconnect?
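A minimal sketch of that idea, assuming hypothetical `RpcEvent` and `RpcError` types and a free function `should_forward` in place of the real service method: events from connected peers pass as before, while events from already-disconnected peers pass only when they carry the disconnect error.

```rust
/// Hypothetical, simplified error type.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum RpcError {
    Disconnected,
    Timeout,
}

/// Hypothetical, simplified RPC events seen by the network service.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum RpcEvent {
    Response { id: u32 },
    Failed { id: u32, error: RpcError },
}

/// Decide whether an RPC event should be forwarded to the application.
/// Events for disconnected peers are ignored, except the disconnect error
/// itself, which is the signal sync needs to fail the ongoing request.
pub fn should_forward(peer_is_connected: bool, event: &RpcEvent) -> bool {
    peer_is_connected
        || matches!(
            event,
            RpcEvent::Failed { error: RpcError::Disconnected, .. }
        )
}
```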

AgeManning

The latest changes, which allow RpcError::Disconnected to propagate, look good to me.

To be clear, this edge case happens where:

We make a request
The peer disconnects ungracefully
There is a race between receiving the RpcError::Disconnected from the RPC handler and the Swarm peer-disconnected message. Potentially we always lose this race, and the peer manager considers the peer disconnected before we can read the final handler message.

The only issue I see here is that it's going to break an intuitive construct we previously had, which was that the last log/message we ever see from a peer is "Peer Disconnected".

After this change, we can see logs and messages after the peer has disconnected.

i.e.
"Peer Disconnected"
"RPC Error::Disconnected"

I don't immediately see a solution to this, because the ordering is coming from the swarm which we don't have much control over here.

beacon_node/lighthouse_network/src/service/mod.rs

Co-authored-by: Age Manning <[email protected]>

realbigsean · 2024-05-06T17:16:19Z

@mergify queue

mergify · 2024-05-06T17:16:31Z

queue

The pull request has been merged automatically at b87c36a

dapplion and others added 2 commits May 1, 2024 17:20

Report RPC Errors to the application on peer disconnections

888f129

Co-authored-by: Age Manning <[email protected]>

Expect RPCError::Disconnect to fail ongoing requests

f5dc1a3

dapplion mentioned this pull request May 1, 2024

Release v5.2.0 #5664

Merged

dapplion requested a review from AgeManning May 1, 2024 12:24

realbigsean added ready-for-review The code is ready for review v5.2.0 Q2 2024 labels May 1, 2024

pawanjay176 reviewed May 1, 2024

pawanjay176 added waiting-on-author The reviewer has suggested changes and awaits thier implementation. and removed ready-for-review The code is ready for review labels May 1, 2024

AgeManning approved these changes May 2, 2024

dapplion mentioned this pull request May 2, 2024

Report RPC Errors to the application on peer disconnections #5658

Closed

Drop lookups after peer disconnect and not awaiting events

02f1b2d

pawanjay176 approved these changes May 3, 2024

pawanjay176 removed the waiting-on-author The reviewer has suggested changes and awaits thier implementation. label May 3, 2024

pawanjay176 mentioned this pull request May 5, 2024

single_block_lookups leak #5694

Open

Allow RPCError disconnect through network service

a8d21e1

AgeManning approved these changes May 6, 2024

AgeManning reviewed May 6, 2024

beacon_node/lighthouse_network/src/service/mod.rs (outdated, resolved)

realbigsean and others added 2 commits May 6, 2024 10:01

Update beacon_node/lighthouse_network/src/service/mod.rs

b0fe7bc

Co-authored-by: Age Manning <[email protected]>

Merge branch 'unstable' into rpc-error-on-disconnect

568db57

mergify bot merged commit b87c36a into sigp:unstable May 6, 2024
27 checks passed

dapplion deleted the rpc-error-on-disconnect branch May 7, 2024 02:05

pawanjay176 mentioned this pull request Jun 20, 2024

Remove all batches related to a peer on disconnect #5969

Merged
