This repository has been archived by the owner on Nov 6, 2020. It is now read-only.

Issues during block sync #11265

Merged
merged 25 commits into master from HandleImportOfForks2 on Jun 30, 2020

Conversation

grbIzl
Collaborator

@grbIzl grbIzl commented Nov 15, 2019

Purpose
This PR aims to fix issues in the following scenario during block sync:

  1. There are two connected nodes, A and F.

  2. Both nodes are synced up to block N.

  3. Node A receives a notification from F about a new block with an unknown parent. This notification can have several causes:
    -- A fork happened and a new subchain started from block N; in this case F sends a notification about block N'+1 (whose parent is N')
    -- There were several "light" (light in terms of difficulty and import speed) blocks N+1, N+2, .. and one "heavy" block N+x after them in F's block import queue; in this case F may send a notification only about the "heavy" block N+x
    -- Notifications about the previous blocks were simply lost somehow
    -- Some combination of all three
    NOTE: The current block sync was designed with the assumption that a node sends the NewBlock message only when the import of that block is completed. This assumption was changed by @tomusdrw in his Improve block and transaction propagation #9954; now the node sends this notification before importing the block.

  4. Node A starts retraction in order to find the beginning of the possible fork, requesting headers for blocks N, N-1, N-2, N-3, etc. from F.

  5. Meanwhile node F has not yet completed the import of the block whose notification it sent to A, so it responds with old headers that do not account for the new blocks still being imported.

  6. Node A quickly retracts down to block N - 128 (128 is the limit on the number of requested headers).

  7. Node F finishes importing the new blocks.

  8. Unfortunately, retraction on node A has already gone too far: A requests 128 headers starting from block N - M, where M > 128. As a result, F returns 128 old headers without the new blocks N' and N'+1.
    Also, A doesn't receive notifications about the new blocks from F (which could break the cycle), because A constantly has requests in flight to F, therefore considers F "busy" and accordingly doesn't start a new sync with it.

  9. Node A continues retraction down to block 0, and sync stops.

Addressed issues
This PR fixes several problems in the handling of this scenario:

  1. Node F should not reply to a GetBlockHeaders request until the fork import is finished, because such an answer would no longer be valid. The reply is postponed in that case (see the sketch after this list).
  2. Node A should not immediately request the new block, because most probably its import is not finished yet. Instead, node A uses the parent of the announced new block as a further reference point. Please note that the same logic and assumption exist in geth: https://github.com/ethereum/go-ethereum/blob/master/eth/handler.go#L718
  3. There was an infinite loop in the retraction procedure when it reached block number 0 (it's no surprise that this issue was not easily discovered before, because it only matters for archive nodes). Because of this loop the retraction got stuck around block 0 (and sync stopped as a result).
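
For illustration, here is a minimal sketch of how postponing a reply could look. The types and names below are simplified stand-ins, not the actual OpenEthereum API; the real change routes a ClientBusy-style error into a delayed-requests queue, as discussed in the review below.

```rust
// Hedged sketch: postpone GetBlockHeaders replies while a possible fork is being
// imported, and answer them once the import completes. Simplified stand-in types.
use std::collections::VecDeque;

type PeerId = usize;

#[derive(Debug)]
enum PacketProcessError {
    /// The client is still importing a possible fork; the reply must wait.
    ClientBusy,
}

struct HeaderSupplier {
    importing_fork: bool,
    /// Postponed GetBlockHeaders requests: (peer, raw request payload).
    delayed_requests: VecDeque<(PeerId, Vec<u8>)>,
}

impl HeaderSupplier {
    /// Try to answer a GetBlockHeaders request; postpone it if a fork import is in flight.
    fn handle_get_block_headers(&mut self, peer: PeerId, data: Vec<u8>) -> Result<Vec<u8>, PacketProcessError> {
        if self.importing_fork {
            // Answering now would return stale headers; queue the request instead.
            self.delayed_requests.push_back((peer, data));
            return Err(PacketProcessError::ClientBusy);
        }
        Ok(Vec::new()) // placeholder for the real header response
    }

    /// Called once the fork import completes: hand back everything we postponed.
    fn dispatch_delayed_requests(&mut self) -> Vec<(PeerId, Vec<u8>)> {
        self.importing_fork = false;
        self.delayed_requests.drain(..).collect()
    }
}
```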

Achieved results
With all the fixes in place, the likelihood of the described scenario is significantly decreased and node A no longer requests blocks prematurely. Almost. Unfortunately, during testing we still observed situations where retraction on node A started and went beyond the point of no return (block N - 128). This happened in fairly unusual cases of several forks in a row, which the current block sync algorithm simply cannot handle. In these cases I considered the retractions legitimate, in spite of their predetermined result and the time they take (several minutes). I only fixed them so that the sync is properly restarted at the end, when no fork is found (item 3 in the list of addressed issues). As a result, sync was paused for several minutes, but it automatically recovered by itself and continued.

Comments
This problem is reproduced with the configuration from #10724, where A and F run locally on one machine and A is connected only to F (F is connected to the whole mainnet). As a result, network latency between A and F is low and A can quickly descend down to block N - M and get stuck. Also, node A has no source of blocks other than node F. For regular configurations such a stall is less likely, but the scenario still causes overhead through the sequences of GetBlockHeaders requests emitted by nodes trying to find the common parent.

PS: Huge shoutout to @iFA88 for the invaluable help with reproducing the scenarios and debugging.

@grbIzl grbIzl added A0-pleasereview 🤓 Pull request needs code review. M4-core ⛓ Core client code / Rust. labels Nov 15, 2019
ethcore/verification/src/queue/mod.rs (outdated review thread, resolved)
ethcore/verification/src/queue/mod.rs (outdated review thread, resolved)
ethcore/verification/src/queue/mod.rs (outdated review thread, resolved)
ethcore/src/client/client.rs (outdated review thread, resolved)
ethcore/sync/src/chain/supplier.rs (outdated review thread, resolved)
/// May return false negative for longer queues
pub fn processing_fork(&self, best_block_hash: &H256) -> bool {
    let processing = self.processing.read();
    if processing.is_empty() || processing.len() > MAX_QUEUE_WITH_FORK {
Collaborator

So this optimization is introduced to avoid spending too much time traversing the entire processing map, right?
If we introduce a different mapping of ParentHash -> count_of_blocks_in_processing_that_build_on_top we could do it in constant time for any parent hash.

Collaborator Author

I implemented this approach, tested it, and found one drawback: it doesn't work well when the node is catching up on blocks. In that case blocks are inserted into the queue one by one, so the suggested data structure ends up containing N parent hashes, each with count 1, and N can be pretty large. I considered that a significant overhead. Since I didn't find a good solution that addresses both situations, I've decided to revert it and keep the current approach for now.
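
For illustration, a rough sketch of the counting-map idea discussed above (stand-in types, not the actual queue code). It makes the lookup constant-time per parent hash, but during catch-up it degenerates to one entry per queued block, which is the overhead described in the previous comment.

```rust
// Illustrative sketch of the ParentHash -> count mapping suggested above;
// `Hash` stands in for the real H256 and the verification-queue wiring is omitted.
use std::collections::HashMap;

type Hash = [u8; 32];

#[derive(Default)]
struct ProcessingIndex {
    /// How many blocks currently being processed build on top of each parent hash.
    children_in_queue: HashMap<Hash, usize>,
}

impl ProcessingIndex {
    /// Called when a block enters the queue.
    fn block_queued(&mut self, parent: Hash) {
        *self.children_in_queue.entry(parent).or_insert(0) += 1;
    }

    /// Called when a block leaves the queue.
    fn block_done(&mut self, parent: &Hash) {
        if let Some(count) = self.children_in_queue.get_mut(parent) {
            *count -= 1;
            if *count == 0 {
                self.children_in_queue.remove(parent);
            }
        }
    }

    /// Constant-time lookup of how many queued blocks build on `parent`.
    /// While catching up, the queue holds a long run of consecutive blocks,
    /// so this map holds one entry (count 1) per queued block.
    fn children_of(&self, parent: &Hash) -> usize {
        self.children_in_queue.get(parent).copied().unwrap_or(0)
    }
}
```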

@grbIzl grbIzl added A4-gotissues 💥 Pull request is reviewed and has significant issues which must be addressed. and removed A0-pleasereview 🤓 Pull request needs code review. labels Nov 20, 2019
@ordian
Collaborator

ordian commented Feb 11, 2020

any updates on this?

@grbIzl
Collaborator Author

grbIzl commented Feb 19, 2020

any updates on this?

I prepared the changes requested by @tomusdrw, but I need to test them first before returning to the review phase.

@ordian ordian added the A3-inprogress ⏳ Pull request is in progress. No review needed at this stage. label Feb 21, 2020
@grbIzl grbIzl changed the title Pause GetBlockHeaders requests while importing blocks with possible fork Issues during block sync Mar 31, 2020
self.reset();
if n == 0 {
    debug_sync!(self, "Header not found, bottom line reached, resetting, last imported: {}", self.last_imported_hash);
    self.reset_to_block(&best_hash, best);
Collaborator Author
@grbIzl grbIzl Mar 31, 2020

Unfortunately, calling the simple reset here is not enough, because it only resets the counters (last_imported_block) to the current round. In this case the rounds have gone too far and a simple rollback inside the round doesn't help. reset_to_block here actually resets block sync to its initial state.

let block = Unverified::from_rlp(r.at(0)?.as_raw().to_vec())?;
let hash = block.header.hash();
let number = block.header.number();
trace!(target: "sync", "{} -> NewBlock ({})", peer_id, hash);
if number > sync.highest_block.unwrap_or(0) {
    sync.highest_block = Some(number);
}
let parent_hash = block.header.parent_hash();
let difficulty: U256 = r.val_at(1)?;
// Most probably the sent block is being imported by peer right now
Collaborator Author
@grbIzl grbIzl Mar 31, 2020

See item 2 in the addressed issues in the PR description.

@grbIzl grbIzl added A0-pleasereview 🤓 Pull request needs code review. and removed A3-inprogress ⏳ Pull request is in progress. No review needed at this stage. A4-gotissues 💥 Pull request is reviewed and has significant issues which must be addressed. labels Apr 2, 2020
@ordian ordian requested a review from tomusdrw May 25, 2020 08:45
@grbIzl
Collaborator Author

grbIzl commented May 27, 2020

Rebased onto the new master, cc @vorot93

ethcore/sync/src/api.rs (outdated review thread, resolved)
self.imported_this_round = None;
self.round_parents = VecDeque::new();
self.target_hash = None;
self.retract_step = 1;
Collaborator

What about self.blocks? It seems that reset_to does reset them too.

Collaborator Author

It's reset inside the self.reset() call.

ethcore/sync/src/block_sync.rs (outdated review thread, resolved)
let block = Unverified::from_rlp(r.at(0)?.as_raw().to_vec())?;
let hash = block.header.hash();
let number = block.header.number();
trace!(target: "sync", "{} -> NewBlock ({})", peer_id, hash);
if number > sync.highest_block.unwrap_or(0) {
    sync.highest_block = Some(number);
}
let parent_hash = block.header.parent_hash();
let difficulty: U256 = r.val_at(1)?;
Collaborator

Why read that from RLP again if we have the decoded block already? Isn't that simply part of block.header?
EDIT: Ah, it seems it's expected to be the total difficulty; isn't that part of the decoded struct anyway?

Collaborator Author
@grbIzl grbIzl Jun 2, 2020

No, it's not. There is only the usual Header there. Also see some details in the next question.

let difficulty: U256 = r.val_at(1)?;
// Most probably the sent block is being imported by peer right now
// Use td and hash, that peer must have for now
let parent_td = difficulty - block.header.difficulty();
Collaborator

  1. Would be good to add overflow protection
  2. How is that non zero? Does RLP really contain total difficulty and the header just this block's difficulty? I thought that we don't really send TD over the wire.

Collaborator Author

  1. Added (see the sketch below).
  2. Please refer to your code for sending the NewBlock packet: https://github.com/openethereum/openethereum/pull/9954/files#diff-4cbf12eb53ed0891acc41ffd155a8930R478
    As I've written in the description, I've looked through geth's logic for this as well, and they do exactly the same (send the TD and extract it).
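
For illustration, a hedged sketch of the parent-TD derivation with an overflow guard; u128 stands in for the real U256 type, and the exact guard used in the PR may differ.

```rust
// Sketch only: derive the parent's total difficulty from a NewBlock announcement.
// The NewBlock packet carries the announced block's total difficulty, while the
// header carries only the block's own difficulty.
fn parent_total_difficulty(announced_td: u128, block_difficulty: u128) -> Option<u128> {
    // checked_sub rejects malformed announcements where td < difficulty
    // instead of wrapping around (or panicking in debug builds).
    announced_td.checked_sub(block_difficulty)
}

fn main() {
    // Announced block: td = 1_000_500, own difficulty = 500 -> parent td = 1_000_000.
    assert_eq!(parent_total_difficulty(1_000_500, 500), Some(1_000_000));
    // Malformed announcement: td smaller than the block's own difficulty.
    assert_eq!(parent_total_difficulty(100, 500), None);
}
```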

ethcore/sync/src/chain/mod.rs (outdated review thread, resolved)
@@ -672,6 +703,8 @@ pub struct ChainSync {
/// Connected peers pending Status message.
/// Value is request timestamp.
handshaking_peers: HashMap<PeerId, Instant>,
/// Requests, that can not be processed at the moment
delayed_requests: HashSet<(PeerId, u8, Vec<u8>)>,
Collaborator

Shouldn't we maintain ordering of the requests? With HashSet they will get randomized.

Collaborator Author

Reworked to a Vec. The standard option, linked_hash_map, doesn't fit because of the MallocSizeOf bound.

Comment on lines 434 to 437
match response {
    Err(e) => Err(e),
    Ok(Some((packet_id, rlp_stream))) => {
        io.respond(packet_id.id(), rlp_stream.out()).unwrap_or_else(
            |e| debug!(target: "sync", "{:?}", error_func(e)));
        Ok(())
    }
    _ => Ok(())
}
Collaborator

Suggested change
match response {
    Err(e) => Err(e),
    Ok(Some((packet_id, rlp_stream))) => {
        io.respond(packet_id.id(), rlp_stream.out()).unwrap_or_else(
            |e| debug!(target: "sync", "{:?}", error_func(e)));
        Ok(())
    }
    _ => Ok(())
}
if let Some((packet_id, rlp_stream)) = response? {
    io.respond(packet_id.id(), rlp_stream.out()).unwrap_or_else(
        |e| debug!(target: "sync", "{:?}", error_func(e)));
}
Ok(())

nit: That's a tiny bit shorter.

Collaborator Author

No )) It's shorter because you omitted the Err processing.

Collaborator

The error is returned with ? after response, but TBH it makes the code harder to read, so I would be in favor of keeping it as is.

Collaborator Author

Aaa, sorry I missed this ?

ethcore/verification/src/queue/mod.rs (outdated review thread, resolved)
ethcore/verification/src/queue/mod.rs (outdated review thread, resolved)
ethcore/verification/src/queue/mod.rs (outdated review thread, resolved)
ethcore/verification/src/queue/mod.rs (outdated review thread, resolved)
ethcore/verification/src/queue/mod.rs (outdated review thread, resolved)
}

/// Dispatch delayed requests
/// The main difference from dispatch packet is that the responses are sent directly to the peer
Collaborator

Should we check if the peer is still connected? Not sure if it's worth it but it seems possible they went away?

Collaborator Author

Added cleanup for disconnected peers.
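
A rough sketch of such a cleanup, assuming the Vec-of-tuples shape that delayed_requests ends up with later in the review; the names here are illustrative, not the actual ones.

```rust
// Illustrative sketch: drop delayed requests belonging to a peer that disconnected
// before its postponed reply could be sent. PeerId is a stand-in alias.
type PeerId = usize;

struct SyncState {
    /// Requests that could not be processed at the moment: (peer, packet_id, payload).
    delayed_requests: Vec<(PeerId, u8, Vec<u8>)>,
}

impl SyncState {
    /// Forget everything postponed for a peer that has gone away.
    fn clear_delayed_requests_for(&mut self, peer: PeerId) {
        self.delayed_requests.retain(|(p, _, _)| *p != peer);
    }
}

fn main() {
    let mut sync = SyncState {
        delayed_requests: vec![(1, 0x03, vec![]), (2, 0x03, vec![]), (1, 0x05, vec![])],
    };
    sync.clear_delayed_requests_for(1);
    assert_eq!(sync.delayed_requests.len(), 1);
}
```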

Comment on lines +168 to +171
_ => {
    debug!(target:"sync", "Unexpected packet {} was dispatched for delayed processing", packet_id);
    Ok(())
}
Collaborator

Is this case really possible?

Collaborator Author

Right now, no, indeed. But I've left this in for future use, in case some other packet is dispatched in a postponed manner.

})
match result {
    Err(PacketProcessError::Decoder(e)) => debug!(target:"sync", "{} -> Malformed packet {} : {}", peer, packet_id, e),
    Err(PacketProcessError::ClientBusy) => sync.write().add_delayed_request(peer, packet_id, data),
Collaborator

Maybe we can ensure that only GetBlockHeadersPackets get added here instead of performing that check when processing delayed requests?

Collaborator Author

I've tried to create a generic mechanism for packets with a delayed response. IMO it's pretty useful and may be used in the future.

@@ -672,6 +700,8 @@ pub struct ChainSync {
/// Connected peers pending Status message.
/// Value is request timestamp.
handshaking_peers: HashMap<PeerId, Instant>,
/// Requests, that can not be processed at the moment
delayed_requests: Vec<(PeerId, u8, Vec<u8>)>,
Collaborator

I am a bit worried that this is unbounded. How do we know it doesn't grow too big?

Collaborator Author

I addressed this by cleaning up the requests of a disconnected peer (as part of your other question).

@@ -821,6 +852,21 @@ impl ChainSync {
self.active_peers = self.peers.keys().cloned().collect();
}

/// Add a request for later processing
pub fn add_delayed_request(&mut self, peer: PeerId, packet_id: u8, data: &[u8]) {
if let Some(position) = self.delayed_requests.iter().position(|request| request.0 == peer && request.1 == packet_id) {
Collaborator

Would it be more efficient to use binary_search_by() here, to keep the vector sorted at all times?

Collaborator Author

I've made it a vector due to @tomusdrw's very good suggestion to keep the order of insertions. So sorting won't help (if I understand your suggestion correctly).

Collaborator

Why is the deduplication needed anyway? Is data expected to be different? I'd just push unconditionally and rather restrict the size of delayed_requests to make sure we don't use up too many resources.
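
For context, a minimal sketch of the deduplication being questioned here, matching the position-based check in the diff above. It keeps at most one delayed request per (peer, packet_id); whether the stored payload is refreshed or the old one kept is a detail that may differ from the actual PR.

```rust
// Sketch only: deduplicating insert into the delayed-requests queue. Simplified types.
type PeerId = usize;

struct SyncState {
    delayed_requests: Vec<(PeerId, u8, Vec<u8>)>,
}

impl SyncState {
    /// Store a request for later processing, keeping at most one entry per
    /// (peer, packet_id) while preserving the original insertion order.
    fn add_delayed_request(&mut self, peer: PeerId, packet_id: u8, data: &[u8]) {
        match self.delayed_requests.iter().position(|r| r.0 == peer && r.1 == packet_id) {
            // The peer re-sent the same packet type: keep its slot, refresh the payload.
            Some(i) => self.delayed_requests[i].2 = data.to_vec(),
            // First delayed request of this kind from this peer: append it.
            None => self.delayed_requests.push((peer, packet_id, data.to_vec())),
        }
    }
}
```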

@grbIzl grbIzl requested review from dvdplm and tomusdrw June 8, 2020 08:24
ethcore/sync/src/chain/handler.rs (outdated review thread, resolved)

@grbIzl
Collaborator Author

grbIzl commented Jun 8, 2020

Why is the deduplication needed anyway? Is data expected to be different? I'd just push unconditionally and rather restrict the size of delayed_requests to make sure we don't use up too many resources.

@tomusdrw I was thinking about spam prevention here. If a node receives too many requests while in this mode (processing a fork), some requests would be lost if we simply limited the size of this container. It would be a weird attack, of course, but nevertheless.

@grbIzl grbIzl requested a review from tomusdrw June 17, 2020 11:31
@tomusdrw
Collaborator

I was thinking about spam prevention here. If a node receives too many requests while in this mode (processing a fork), some requests would be lost if we simply limited the size of this container. It would be a weird attack, of course, but nevertheless.

Right, with your solution you can still spam the node, but it's limited by the number of peers * 256 (the packet_id space), which seems reasonable.
Can we add an extra HashSet<(PeerId, PacketId)> to avoid doing a linear check here? I feel the Vec might indeed have excessive CPU cost.

Apart from that looks okay to me.
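
A hedged sketch of the extra HashSet<(PeerId, PacketId)> index suggested here, layered over the same Vec so insertion order is preserved while the duplicate check becomes O(1); names and types are illustrative, not necessarily what was merged.

```rust
// Illustrative sketch: ordered Vec for dispatch plus a HashSet index for O(1)
// duplicate checks. Simplified stand-in types.
use std::collections::HashSet;

type PeerId = usize;
type PacketId = u8;

#[derive(Default)]
struct DelayedRequests {
    /// Preserves arrival order for dispatch.
    ordered: Vec<(PeerId, PacketId, Vec<u8>)>,
    /// Constant-time "is this (peer, packet) already queued?" lookups.
    index: HashSet<(PeerId, PacketId)>,
}

impl DelayedRequests {
    fn add(&mut self, peer: PeerId, packet_id: PacketId, data: &[u8]) {
        if self.index.insert((peer, packet_id)) {
            // Not queued yet: append in arrival order.
            self.ordered.push((peer, packet_id, data.to_vec()));
        } else if let Some(entry) = self
            .ordered
            .iter_mut()
            .find(|r| r.0 == peer && r.1 == packet_id)
        {
            // Duplicate: only now do we touch the Vec, to refresh the payload.
            entry.2 = data.to_vec();
        }
    }

    /// Drain everything in arrival order once the client is no longer busy.
    fn drain_all(&mut self) -> Vec<(PeerId, PacketId, Vec<u8>)> {
        self.index.clear();
        std::mem::take(&mut self.ordered)
    }
}
```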

Collaborator
@tomusdrw tomusdrw left a comment

looks good!

@vorot93 vorot93 merged commit 63e2781 into master Jun 30, 2020
@vorot93 vorot93 deleted the HandleImportOfForks2 branch June 30, 2020 14:50