Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[base node] Header sync gets stuck on banned peer #5030

Closed
hansieodendaal opened this issue Dec 9, 2022 · 10 comments
Closed

[base node] Header sync gets stuck on banned peer #5030

hansieodendaal opened this issue Dec 9, 2022 · 10 comments
Assignees

Comments

@hansieodendaal
Copy link
Contributor

hansieodendaal commented Dec 9, 2022

This base node gets stuck trying to sync headers from banned peer cdee5a37e4478755e532a32d09 :

The ban incident:

2022-12-09 04:41:26.368536000 [c::val::helpers] WARN  Validation failed on block:4d0a304eed563a82389dda2b27318d138fda3c131a69b218859a726d64d9d6de:InvalidSignatureError("Metadata signature not valid!")
2022-12-09 04:41:26.368560500 [c::cs::database] WARN  Block #5287 (4d0a304eed563a82389dda2b27318d138fda3c131a69b218859a726d64d9d6de) failed validation - The transaction is invalid: Signature is invalid: Metadata signature not valid!
2022-12-09 04:41:26.368624400 [c::bn::comms_interface::inbound_handler] WARN  Peer cdee5a37e4478755e532a32d09 sent an invalid block: Validation error: The transaction is invalid: Signature is invalid: Metadata signature not valid!

Consecutive header sync retries to cdee5a37e4478755e532a32d09

   36537  2022-12-09 04:41:56.424062900 [c::bn::header_sync] DEBUG Starting header sync.
   36538  2022-12-09 04:41:56.424070100 [c::bn::header_sync] INFO  Synchronizing headers (1 candidate peers selected)
   36539: 2022-12-09 04:41:56.424076700 [c::bn::header_sync] INFO  Attempting to sync headers (1 sync peers)
   36540  2022-12-09 04:41:56.424085400 [c::bn::header_sync] DEBUG Dialing cdee5a37e4478755e532a32d09 sync peer
   36541  2022-12-09 04:41:56.424157600 [c::bn::header_sync] DEBUG Header sync failed: Connectivity Error: ConnectionFailed: Peer is banned, denying connection

...
...
...

  116090  2022-12-09 09:22:11.320246000 [c::bn::header_sync] DEBUG Starting header sync.
  116091  2022-12-09 09:22:11.320252400 [c::bn::header_sync] INFO  Synchronizing headers (1 candidate peers selected)
  116092: 2022-12-09 09:22:11.320258800 [c::bn::header_sync] INFO  Attempting to sync headers (1 sync peers)
  116093  2022-12-09 09:22:11.320266300 [c::bn::header_sync] DEBUG Dialing cdee5a37e4478755e532a32d09 sync peer
  116094  2022-12-09 09:22:11.320336400 [c::bn::header_sync] DEBUG Header sync failed: Connectivity Error: ConnectionFailed: Peer is banned, denying connection

  116127  2022-12-09 09:22:41.350432500 [c::bn::header_sync] DEBUG Starting header sync.
  116128  2022-12-09 09:22:41.350439900 [c::bn::header_sync] INFO  Synchronizing headers (1 candidate peers selected)
  116129: 2022-12-09 09:22:41.350446300 [c::bn::header_sync] INFO  Attempting to sync headers (1 sync peers)
  116130  2022-12-09 09:22:41.350453900 [c::bn::header_sync] DEBUG Dialing cdee5a37e4478755e532a32d09 sync peer
  116131  2022-12-09 09:22:41.350527400 [c::bn::header_sync] DEBUG Header sync failed: Connectivity Error: ConnectionFailed: Peer is banned, denying connection

  116166  2022-12-09 09:23:11.381229500 [c::bn::header_sync] DEBUG Starting header sync.
  116167  2022-12-09 09:23:11.381235600 [c::bn::header_sync] INFO  Synchronizing headers (1 candidate peers selected)
  116168: 2022-12-09 09:23:11.381241900 [c::bn::header_sync] INFO  Attempting to sync headers (1 sync peers)
  116169  2022-12-09 09:23:11.381249600 [c::bn::header_sync] DEBUG Dialing cdee5a37e4478755e532a32d09 sync peer
  116170  2022-12-09 09:23:11.381320400 [c::bn::header_sync] DEBUG Header sync failed: Connectivity Error: ConnectionFailed: Peer is banned, denying connection
@hansieodendaal hansieodendaal converted this from a draft issue Dec 9, 2022
@stringhandler stringhandler moved this from Bugs to Selected for development in Tari Esme Testnet Dec 9, 2022
@stringhandler stringhandler moved this from Selected for development to In Progress in Tari Esme Testnet Dec 9, 2022
@sdbondi
Copy link
Member

sdbondi commented Dec 9, 2022

Ah ok, so it's not that we still are connected to the banned peer but that the sync peer set provided by the listening state/chain metadata service is not clearing the banned peer from the list.

It should probably clear all peer metadatas when transitioning to header sync in every case so that listening has to do another round of metadata pingpongs and refreshes the list (just a suggestion to keep things simple, may not be the best way)

@sdbondi
Copy link
Member

sdbondi commented Dec 9, 2022

An interesting case here, i have a base node 99999 that is giving a higher chain (old genesis), but keeps on timing out on RPC connections. This isn't a bannable offense, so it continues to try. An RPC timeout can occur innocently or maliciously. Perhaps banning if it happens a few times is best, though that does mean keeping track of attempts.

@SWvheerden
Copy link
Collaborator

yeah I am very much in favor of having 3 ban levels.
1: very short only for us if a node times out.
2. Medium-term, like for the case described above, perhaps like a day or so.
3: Perma ban, for invalid consensus.

@hansieodendaal
Copy link
Contributor Author

Better story line

2022-12-09 04:05:48.682740200 [c::bn::chain_state_sync_service] TRACE Received ping from neighbouring node 'cdee5a37e4478755e532a32d09'.
2022-12-09 04:05:48.682760200 [c::bn::chain_state_sync_service] DEBUG Received chain metadata from NodeId 'cdee5a37e4478755e532a32d09' #5281, Acc_diff 67,363,806,366,009,233,370,378,187
2022-12-09 04:05:48.725431200 [c::bn::header_sync] DEBUG Dialing cdee5a37e4478755e532a32d09 sync peer
2022-12-09 04:05:48.725503900 [c::bn::header_sync] INFO  Successfully dialed sync peer cdee5a37e4478755e532a32d09 in 72.70µs
2022-12-09 04:05:48.725512100 [c::bn::header_sync] DEBUG Attempting to synchronize headers with `cdee5a37e4478755e532a32d09`
2022-12-09 04:05:51.469416000 [c::bn::chain_state_sync_service] TRACE Received ping from neighbouring node 'cdee5a37e4478755e532a32d09'.
2022-12-09 04:05:51.469434200 [c::bn::chain_state_sync_service] DEBUG Received chain metadata from NodeId 'cdee5a37e4478755e532a32d09' #5281, Acc_diff 67,363,806,366,009,233,370,378,187
2022-12-09 04:06:38.050929400 [c::bn::header_sync] DEBUG Header sync failed: Peer cdee5a37e4478755e532a32d09 exceeded maximum permitted sync latency. latency: 49.33ss, max: 30.00ss
2022-12-09 04:06:38.050952100 [c::bn::base_node] TRACE Base Node event in State [Synchronizing block headers]:  Header Synchronization Failed
2022-12-09 04:06:38.050964800 [c::bn::state_machine_service::states::waiting] INFO  The base node has started a WAITING state for 30 seconds
2022-12-09 04:06:38.052599200 [c::bn::chain_state_sync_service] TRACE Received pong from neighbouring node 'cdee5a37e4478755e532a32d09'.
2022-12-09 04:06:38.052611600 [c::bn::chain_state_sync_service] DEBUG Received chain metadata from NodeId 'cdee5a37e4478755e532a32d09' #5282, Acc_diff 67,364,252,593,290,559,176,378,768
2022-12-09 04:07:08.069426000 [c::bn::header_sync] DEBUG Dialing cdee5a37e4478755e532a32d09 sync peer
2022-12-09 04:07:08.069510200 [c::bn::header_sync] INFO  Successfully dialed sync peer cdee5a37e4478755e532a32d09 in 84.00µs
2022-12-09 04:07:08.069519000 [c::bn::header_sync] DEBUG Attempting to synchronize headers with `cdee5a37e4478755e532a32d09`
2022-12-09 04:07:14.802560400 [c::bn::header_sync] DEBUG Sync peer latency is 6.73s
2022-12-09 04:07:14.802577100 [tari_core::base_node::sync::header_sync::synchronizer] INFO  attempt_sync; sync_peer=SyncPeer { peer_metadata: PeerChainMetadata { node_id: NodeId(cdee5a37e4478755e532a32d09), chain_metadata: ChainMetadata { height_of_longest_chain: 5282, best_block: FixedHash([25, 102, 228, 155, 225, 135, 50, 65, 179, 51, 40, 18, 107, 24, 37, 199, 105, 177, 193, 203, 100, 122, 181, 128, 165, 164, 162, 83, 23, 255, 190, 189]), pruning_horizon: 0, pruned_height: 0, accumulated_difficulty: 67364252593290559176378768, timestamp: 1670551358 }, latency: Some(6.7328396s) }, avg_latency: RollingAverageTime { samples: RollingVec([]) } } max_latency=30s
2022-12-09 04:07:14.802595900 [c::bn::header_sync] DEBUG Initiating header sync with peer `cdee5a37e4478755e532a32d09` (sync latency = 6732ms)
2022-12-09 04:07:14.802931200 [c::bn::header_sync] DEBUG Determining if chain splits between 0 and 500 headers back from the tip (peer: `cdee5a37e4478755e532a32d09`, 54 hashes sent)
2022-12-09 04:07:23.774659000 [tari_core::base_node::sync::header_sync::synchronizer] ERROR error=Unable to find chain split from peer `cdee5a37e4478755e532a32d09`
2022-12-09 04:07:23.774677700 [c::bn::header_sync] WARN  Chain split not found for peer cdee5a37e4478755e532a32d09.
2022-12-09 04:07:23.774686000 [c::bn::header_sync] WARN  Banned sync peer because Peer could not find the location of a chain split
2022-12-09 04:07:23.774703600 [c::bn::header_sync] ERROR Header sync failed with all peers. Error: Sync failed for all peers
2022-12-09 04:07:23.774713900 [c::bn::header_sync] WARN  Sync failed for all peers. Continuing...
2022-12-09 04:07:23.774729100 [c::bn::base_node] TRACE Base Node event in State [Synchronizing block headers]:  Continuing
2022-12-09 04:07:23.774741300 [c::bn::state_machine_service::states::listening] INFO  Listening for chain metadata updates
2022-12-09 04:07:23.797852600 [c::bn::state_machine_service::states::listening] INFO  Our local blockchain accumulated difficulty is a little behind that of the network. We're at block #53 with an accumulated difficulty of 5,798,411,185,375,912,684, and the network chain tip is at #5282 with an accumulated difficulty of 67,364,252,593,290,559,176,378,768
2022-12-09 04:07:23.797875000 [c::bn::state_machine_service::states::listening] DEBUG Lagging (local height = 53, network height = 5282)
2022-12-09 04:07:23.797889000 [c::bn::base_node] TRACE Base Node event in State [Listening]:  Fallen behind main chain - Lagging behind 1 peers (#5282, Difficulty: 67364252593290559176378768)
2022-12-09 04:07:23.797902100 [c::bn::header_sync] DEBUG Starting header sync.
2022-12-09 04:07:23.797908800 [c::bn::header_sync] INFO  Synchronizing headers (1 candidate peers selected)
2022-12-09 04:07:23.797915700 [c::bn::header_sync] INFO  Attempting to sync headers (1 sync peers)
2022-12-09 04:07:23.797923300 [c::bn::header_sync] DEBUG Dialing cdee5a37e4478755e532a32d09 sync peer
2022-12-09 04:07:23.801827000 [c::bn::chain_state_sync_service] DEBUG Removing disconnected/banned peer `cdee5a37e4478755e532a32d09` from chain metadata list 
2022-12-09 04:07:23.802097900 [c::bn::header_sync] DEBUG Header sync failed: Connectivity Error: ConnectionFailed: Peer is banned, denying connection
2022-12-09 04:07:23.802113900 [c::bn::base_node] TRACE Base Node event in State [Synchronizing block headers]:  Header Synchronization Failed
2022-12-09 04:07:23.802123900 [c::bn::state_machine_service::states::waiting] INFO  The base node has started a WAITING state for 30 seconds
2022-12-09 04:41:26.365429300 [c::bn::comms_interface::inbound_handler] DEBUG Block with hash `4d0a304eed563a82389dda2b27318d138fda3c131a69b218859a726d64d9d6de` is unknown. Constructing block from known mempool transactions / requesting missing transactions from peer 'cdee5a37e4478755e532a32d09'.
2022-12-09 04:41:26.365450500 [c::bn::comms_interface::inbound_handler] INFO  Block #5287 (4d0a304eed563a82389dda2b27318d138fda3c131a69b218859a726d64d9d6de) received from remote peer: cdee5a37e4478755e532a32d09
2022-12-09 04:41:26.365457200 [c::bn::comms_interface::inbound_handler] DEBUG Incoming block: ----------------- Block -----------------
2022-12-09 04:41:26.366190300 [c::bn::chain_state_sync_service] TRACE Received ping from neighbouring node 'cdee5a37e4478755e532a32d09'.
2022-12-09 04:41:26.366216300 [c::bn::chain_state_sync_service] DEBUG Received chain metadata from NodeId 'cdee5a37e4478755e532a32d09' #5287, Acc_diff 67,379,859,023,864,670,431,623,680
2022-12-09 04:41:26.366229100 [c::bn::chain_state_sync_service] TRACE Received pong from neighbouring node 'cdee5a37e4478755e532a32d09'.
2022-12-09 04:41:26.366238300 [c::bn::chain_state_sync_service] DEBUG Received chain metadata from NodeId 'cdee5a37e4478755e532a32d09' #5287, Acc_diff 67,379,859,023,864,670,431,623,680
2022-12-09 04:41:26.368218800 [c::tx::aggregated_body] TRACE Checking sender signatures
2022-12-09 04:41:26.368511900 [dht::network_discovery::peer_validator] DEBUG Peer `92befafc5bdfc9f9e4ecf7d7e4` already exists or is up to date and will not be updated
2022-12-09 04:41:26.368536000 [c::val::helpers] WARN  Validation failed on block:4d0a304eed563a82389dda2b27318d138fda3c131a69b218859a726d64d9d6de:InvalidSignatureError("Metadata signature not valid!")
2022-12-09 04:41:26.368560500 [c::cs::database] WARN  Block #5287 (4d0a304eed563a82389dda2b27318d138fda3c131a69b218859a726d64d9d6de) failed validation - The transaction is invalid: Signature is invalid: Metadata signature not valid!
2022-12-09 04:41:26.368624400 [c::bn::comms_interface::inbound_handler] WARN  Peer cdee5a37e4478755e532a32d09 sent an invalid block: Validation error: The transaction is invalid: Signature is invalid: Metadata signature not valid!
2022-12-09 04:41:26.368669400 [dht::network_discovery::peer_validator] TRACE Adding peer `111111f21206d10c5063ef1501`
2022-12-09 04:41:26.368694100 [c::bn::base_node_service::service] ERROR Failed to handle incoming block message: Comms interface error: `Chain storage error: Validation error: The transaction is invalid: Signature is invalid: Metadata signature not valid!`
2022-12-09 04:41:26.368716200 [c::bn::chain_state_sync_service] TRACE Received ping from neighbouring node 'cdee5a37e4478755e532a32d09'.
2022-12-09 04:41:26.368737300 [c::bn::chain_state_sync_service] DEBUG Received chain metadata from NodeId 'cdee5a37e4478755e532a32d09' #5287, Acc_diff 67,379,859,023,864,670,431,623,680
2022-12-09 04:41:26.373708100 [c::bn::chain_state_sync_service] DEBUG Removing disconnected/banned peer `cdee5a37e4478755e532a32d09` from chain metadata list 
2022-12-09 04:41:26.376451800 [dht::network_discovery::peer_validator] DEBUG Peer `b0f4af30a3998bc362772a8eda` already exists or is up to date and will not be updated
2022-12-09 04:41:26.376543600 [c::bn::chain_state_sync_service] TRACE Received ping from neighbouring node 'cdee5a37e4478755e532a32d09'.
2022-12-09 04:41:26.376561600 [c::bn::chain_state_sync_service] DEBUG Received chain metadata from NodeId 'cdee5a37e4478755e532a32d09' #5287, Acc_diff 67,379,859,023,864,670,431,623,680
2022-12-09 04:41:26.395014000 [c::bn::header_sync] DEBUG Dialing cdee5a37e4478755e532a32d09 sync peer
2022-12-09 04:41:56.424062900 [c::bn::header_sync] DEBUG Starting header sync.
2022-12-09 04:41:56.424070100 [c::bn::header_sync] INFO  Synchronizing headers (1 candidate peers selected)
2022-12-09 04:41:56.424076700 [c::bn::header_sync] INFO  Attempting to sync headers (1 sync peers)
2022-12-09 04:41:56.424085400 [c::bn::header_sync] DEBUG Dialing cdee5a37e4478755e532a32d09 sync peer
2022-12-09 04:41:56.424157600 [c::bn::header_sync] DEBUG Header sync failed: Connectivity Error: ConnectionFailed: Peer is banned, denying connection
2022-12-09 04:42:26.460843400 [c::bn::header_sync] DEBUG Starting header sync.
2022-12-09 04:42:26.460849700 [c::bn::header_sync] INFO  Synchronizing headers (1 candidate peers selected)
2022-12-09 04:42:26.460856000 [c::bn::header_sync] INFO  Attempting to sync headers (1 sync peers)
2022-12-09 04:42:26.460864400 [c::bn::header_sync] DEBUG Dialing cdee5a37e4478755e532a32d09 sync peer
2022-12-09 04:42:26.460935300 [c::bn::header_sync] DEBUG Header sync failed: Connectivity Error: ConnectionFailed: Peer is banned, denying connection
2022-12-09 04:42:26.460946900 [c::bn::base_node] TRACE Base Node event in State [Synchronizing block headers]:  Header Synchronization Failed

etc.

@hansieodendaal
Copy link
Contributor Author

Unzip each file
Open core.7.zip.001 with 7-zip

core.7.zip.001.zip
core.7.zip.002.zip
core.7.zip.003.zip
core.7.zip.004.zip

@sdbondi
Copy link
Member

sdbondi commented Dec 9, 2022

Please provide the comms logs if you can (in zip or tar.gz please :))

@hansieodendaal
Copy link
Contributor Author

Here you are @sdbondi

network.36.log.zip
network.37.log.zip

@SWvheerden
Copy link
Collaborator

So the core issue here looks like the BN bans the peer correctly, then removes the peer from the peer_chain_metadata list correctly.
But after this happens the peer still sends messages to the BN, and it gets added to peer_chain_metadata again, which makes the header sync try to sync to it again. But it can't send any request out because the peer is banned.

This is in a sense related to: #5037
but not entirely

So the root cause here is that the peer got banned, but its messages where still propagated up stream to the domain layer after the banned_event was sent to the domain layer.

@sdbondi
Copy link
Member

sdbondi commented Dec 12, 2022

Thanks for the logs,

What I've observed is that the peer is banned twice. The first time, for 30 minutes (between 04:07:23.802123900 and 2022-12-09 04:41:26.365429300) then again permanently (bad metadata sig). It's clear in the logs that the pong is received before being banned, the node is banned (chain metadata service clears the peer's metadata) but the message is already in the domain pipeline so another one is received. The chain metadata service now keeps the peer's chain metadata and sends it to the listening state every time because it is not cleared.

TL;DR classic race condition.

Order of events:

  • peer is banned for 30 minutes
  • peer ban expires
  • we receive a ping from the peer
  • at almost the same time peer is banned
  • chain metadata service clears the peer from peer_chain_metadata
  • the ping/pong is received (already in the pipeline from before the ban)
  • the chain metadata is added to the vec, and is never cleared (because the peer is not banned again so the event never re-occurs)
  • the peer stays in the list despite being banned and NOT connected, so the header sync continues to try to connect to it

I've made a PR that removes the Vec from the chain metadata service.

#5039

stringhandler pushed a commit that referenced this issue Dec 12, 2022
…5039)

Description
---
- Removes "caching" of peer metadata
- Send only one peer chain metadata at a time to the listening state
- increase chain metadata event channel to 20 (so that 20 of the last chain metadata received can be read by listening)

Motivation and Context
---
Fixes #5037 and #5030 (needs to be confirmed since this case is not easy to reproduce)

What I've observed is that the peer is banned twice. The first time, for 30 minutes (between 04:07:23.802123900 and 2022-12-09 04:41:26.365429300) then again permanently (bad metadata sig). It's clear in the logs that the pong is received before being banned, the node is banned (chain metadata service clears the peer's metadata) but the message is already in the domain pipeline so another one is received. The chain metadata service now keeps the peer's chain metadata and sends it to the listening state every time because it is not cleared.

TL;DR classic race condition.

Order of events:

- peer is banned for 30 minutes
- peer ban expires
- we receive a ping from the peer
- at almost the same time peer is banned (see logs in [this comment](#5030 (comment)))
- chain metadata service clears the peer from peer_chain_metadata
- the ping/pong is received (already in the pipeline from before the ban)
- the chain metadata is added to the vec, and is never cleared (because the peer is not banned again)
- the peer stays in the list despite being banned and NOT connected, so the header sync continues to try to connect to it


How Has This Been Tested?
---
Manually: rewind-blockchain and sync, enters sync mode timeously
Removed some tests that test removed functions
Added a new unit test for determine_sync_state
@stringhandler stringhandler moved this from In Progress to In Review in Tari Esme Testnet Dec 12, 2022
@SWvheerden
Copy link
Collaborator

SWvheerden commented Dec 12, 2022

mmm, It should not be a 30-minute ban, that's a bug as well

                Err(BlockHeaderSyncError::ChainSplitNotFound(peer)) => {
                    warn!(target: LOG_TARGET, "Chain split not found for peer {}.", peer);
                    self.ban_peer_long(&peer, BanReason::ChainSplitNotFound).await?;
                },

wait that's way too short:


            ban_period: Duration::from_secs(30 * 60),

Repository owner moved this from In Review to Done in Tari Esme Testnet Dec 12, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

No branches or pull requests

4 participants