Changes to handling partial responses #1686

rahulksnv · 2023-09-24T19:58:34Z

Problem: When downloading from a fork, the peer may send a partial response that extends the current chain. Even though the response starts at a block number after the current best number, the parent block from the fork is not yet downloaded/imported. We currently try to import the partial response in this scenario, which causes import to fail as parent is not found in the backend. This causes sync to be aborted/restarted which may not converge.

Fix: when we receive a partial response in this scenario, hold on to the partial blocks and try to download the missing blocks. And import when the full range is downloaded.

Context: paritytech#493 (comment) Problem: When downloading from a fork, the peer may send a partial response that extends the current chain. Even though the response starts at a block number after the current best number, the parent block from the fork is not yet downloaded/imported. We currently try to import the partial response in this scenario, which causes import to fail as parent is not found in the backend. This causes sync to be aborted/restarted which may not converge. Fix: when we receive a partial response in this scenario, hold on to the partial blocks and try to download the missing blocks. And import when the full range is downloaded.

altonen · 2023-09-26T13:53:57Z

bot fmt

command-bot · 2023-09-26T13:54:02Z

@altonen https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/3799758 was started for your command "$PIPELINE_SCRIPTS_DIR/commands/fmt/fmt.sh". Check out https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/pipelines?page=1&scope=all&username=group_605_bot to know what else is being executed currently.

Comment bot cancel 1-42f79efd-39db-4927-b68e-3f722f8806f4 to cancel this command or bot cancel to cancel all commands in this pull request.

command-bot · 2023-09-26T13:55:02Z

@altonen Command "$PIPELINE_SCRIPTS_DIR/commands/fmt/fmt.sh" has finished. Result: https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/3799758 has finished. If any artifacts were generated, you can download them from https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/3799758/artifacts/download.

rahulksnv · 2023-09-26T13:56:59Z

bot fmt

I don't understand how the fmt works with these changes. I try running cargo fmt locally (or with the command from the failing fmt job), but it formats several files not part of the change, or most of code in same file that the PR does not touch. The contributor guidelines also does not talk about this much.

altonen · 2023-09-26T13:58:57Z

@rahulksnv did you run cargo +nightly fmt?

altonen · 2023-09-26T14:25:02Z

substrate/client/network/sync/src/lib.rs

+			warn!(
+				target: LOG_TARGET,
+				"Download state: unable to get block number of the downloaded blocks: \
+				common = {}, range = {:?}, downloaded = {}",
+				self.common_number, self.range, self.downloaded.len()
+			);
+			return None


Why would the block not be available?

It looks like right now the peer is still left in DownloadingNew state so I assume it will attempt to retry with the same failed result. It may be safer to disconnect the peer in case this function returns None unless syncing can somehow recover from this.

This is the existing behavior when blocks are validated: https://github.com/paritytech/polkadot-sdk/blob/master/substrate/client/network/sync/src/lib.rs#L2796

Blocks are validated before adding to the self.downloaded list, so this should never happen (why this is a warn message). Initially thought of caching the block number when it is added to the list, but that would be redundant fields, and we need to worry about keeping them in sync

altonen · 2023-09-26T14:28:23Z

substrate/client/network/sync/src/lib.rs

+		let from = if peer_best_number == last {
+			FromBlock::Hash(*peer_best_hash)
+		} else {
+			FromBlock::Number(last)
+		};


Please add comment explaining this logic

This is existing behavior, just retained as is: https://github.com/paritytech/polkadot-sdk/blob/master/substrate/client/network/sync/src/lib.rs#L2561

I think in case of downloading a fork we always need to use FromBlock::Hash, looking it up from the first block in the already downloaded range.

The request.from is the hash of the first block we want to download, which we don't know until the block is downloaded (now I understand why current code sets it hash if block_number == peer.best_block_number, in this case we happen to know the corresponding peer.best_block_hash)

Unless we want to set it to parent hash of the first block already downloaded. I would prefer to keep this uniform with existing code, don't know why current code does not set it to parent hash.

Yes, I meant setting it to the parent hash of the already downloaded first block in the range. Otherwise we can't be sure that we are downloading the right fork and can fail when importing the range.

Changed it to pass parent hash

altonen · 2023-09-26T14:37:56Z

substrate/client/network/sync/src/lib.rs

+		new_blocks.append(&mut self.downloaded);
+		self.downloaded.append(&mut new_blocks);


What is the purpose of this? Since it's dealing with the block bodies which can be large, it'd be better if we'd tried to handle them efficiently. Looking at this code alone I don't know whether it's making full copies of them or not but this code looks quite dubious in the first place. Why does it first have to append blocks from self.downloaded to new_blocks and then append them back to self.downloaded?

This prepends the new blocks to the existing blocks (updated the comment to reflect it). Prepending is needed because the blocks are returned in the reverse order (i.e) the request sets request.from = FromBlock::Number(x), and request.direction = Direction::Descending.

The Vec::append() is a move only operation, no copies/clones involved. The first append appends/empties self.downloaded, second one moves it back

altonen · 2023-09-26T14:40:38Z

substrate/client/network/sync/src/lib.rs

+		if self.range.start == (self.common_number + One::one()) &&
+			self.downloaded.len() < range_len


Comment for this check please. Otherwise it'll be hard to maintain this code because looking at this alone, I don't know what self.range.start == (self.common_number + One::one()) implies

I had comments below, moved it up and expanded it

This check self.range.start == (self.common_number + One::one()) looks like a hackish way to determine if PeerDownloadState is responsible for accumulating blocks, or BlockCollection is responsible for this on the upper level in ChainSync. May be it's possible to incorporate this code into BlockCollection so that not distribute the logic over multiple components?

If for some reasons it's not possible to incorporate this code into BlockCollection, I would introduce a clear separation of the code needed to pass only one chunk to BlockCollection, and the code responsible for holding up the blocks. For example, this can be done by introducing a separate enum variant into PeerSyncState for the latter (DownloadingRange) with comments that it's used specifically for fork downloading in multiple chunks.

This check self.range.start == (self.common_number + One::one()) looks like a hackish way to determine if PeerDownloadState is responsible for accumulating blocks, or BlockCollection is responsible for this on the upper level in ChainSync. May be it's possible to incorporate this code into BlockCollection so that not distribute the logic over multiple components?

BlockCollection currently does something slightly different: manages the downloaded range (with holes + merging holes) across several peers, which the PeerDownloadState manages a range with a single peer. It would be cleaner/less complex to keep this separation. The PeerDownloadState is just a staging area before adding the blocks to the collection.

If for some reasons it's not possible to incorporate this code into BlockCollection, I would introduce a clear separation of the code needed to pass only one chunk to BlockCollection, and the code responsible for holding up the blocks. For example, this can be done by introducing a separate enum variant into PeerSyncState for the latter (DownloadingRange) with comments that it's used specifically for fork downloading in multiple chunks.

This change does the same thing, except it enhances the state maintained by PeerSyncState::DownloadingNew, to manage download of new blocks from a single/unified path. I feel that adding separate variant only adds more complexity/code duplication

The check self.range.start == (self.common_number + One::one()) looks confusing and can be replaced with a single flag download_full_range, the value of which we know from the creation of PeerDownloadState. Otherwise we don't need self.common_number for anything. This is mostly equivalent to having a dedicated enum type for the case when we need to download the full range.

Having said that, IMO updating BlockCollection / importing logic to trigger import only when the parent hash corresponds to the already imported block (instead of comparison by number) would be less hacky than introducing a special case for downloading a range starting from common_number.

The check self.range.start == (self.common_number + One::one()) looks confusing and can be replaced with a single flag download_full_range, the value of which we know from the creation of PeerDownloadState. Otherwise we don't need self.common_number for anything. This is mostly equivalent to having a dedicated enum type for the case when we need to download the full range.

Yeah this looks cleaner, changed it to have a separate enum. The current DownloadingNew remains unchanged

Having said that, IMO updating BlockCollection / importing logic to trigger import only when the parent hash corresponds to the already imported block (instead of comparison by number) would be less hacky than introducing a special case for downloading a range starting from common_number.

Agreed. I looked at it, this looks like a much bigger change to use block hash instead of block number (this potentially touches several other paths). I feel this is better done as a separate change

altonen · 2023-09-26T14:41:49Z

substrate/client/network/sync/src/lib.rs

+		let mut downloaded = Vec::new();
+		downloaded.append(&mut self.downloaded);
+		collection.insert(start_block, downloaded, *who);


Same comment here, why is this append() necessary?

This is similar, it moves the completed blocks out and passes it to the BlockCollection

nazar-pc · 2023-09-26T17:26:23Z

@altonen this repo should seriously consider using Nightly in rust-toolchain.toml and run tests on stable with override in CI. I see every team member on our side is confused with how things are organized right now.

rahulksnv · 2023-09-26T17:30:25Z

@rahulksnv did you run cargo +nightly fmt?

Yes, I also ran the cmd from the failing job, but same behavior

nazar-pc · 2023-09-26T17:32:28Z

@rahulksnv update nightly to latest and cargo +nightly fmt will do the job

dmitry-markin

Looks very promising, left some comments.

substrate/client/network/sync/src/lib.rs

dmitry-markin · 2023-09-27T08:38:05Z

substrate/client/network/sync/src/lib.rs

+			return Err(BadPeer(*who, rep::BAD_RESPONSE))
+		}
+		let new_blocks_len = new_blocks.len();
+		new_blocks.append(&mut self.downloaded);


I would check first that start_block + new_blocks.len() == self.downloaded.first().number() (pseudocode) or start_block + new_blocks.len() == self.range.end() (if self.downloaded is empty). I.e., we can correctly prepend the range.

Refactored this part to break into smaller functions, added unit tests for the new paths

dmitry-markin · 2023-09-27T08:40:40Z

substrate/client/network/sync/src/lib.rs

+		Self { common_number, range, downloaded: Vec::new() }
+	}
+
+	/// Handles the new blocks received from the peer.


I would add a comment stating that it only works when downloading the range backwards, may be even rename it to reflect this (prepend_blocks?)

Changed name and expanded the comments

dmitry-markin · 2023-09-27T08:42:50Z

substrate/client/network/sync/src/lib.rs

+		if self.range.start == (self.common_number + One::one()) &&
+			self.downloaded.len() < range_len


This check self.range.start == (self.common_number + One::one()) looks like a hackish way to determine if PeerDownloadState is responsible for accumulating blocks, or BlockCollection is responsible for this on the upper level in ChainSync. May be it's possible to incorporate this code into BlockCollection so that not distribute the logic over multiple components?

If for some reasons it's not possible to incorporate this code into BlockCollection, I would introduce a clear separation of the code needed to pass only one chunk to BlockCollection, and the code responsible for holding up the blocks. For example, this can be done by introducing a separate enum variant into PeerSyncState for the latter (DownloadingRange) with comments that it's used specifically for fork downloading in multiple chunks.

dmitry-markin · 2023-09-27T08:52:43Z

substrate/client/network/sync/src/lib.rs

+		let from = if peer_best_number == last {
+			FromBlock::Hash(*peer_best_hash)
+		} else {
+			FromBlock::Number(last)
+		};


I think in case of downloading a fork we always need to use FromBlock::Hash, looking it up from the first block in the already downloaded range.

dmitry-markin · 2023-10-06T09:58:21Z

@rahulksnv sorry for not getting back with the review of PR for quite some time. We have been having internal discussions regarding the bug and the PR, and, to be honest, have not decided yet if we should merge this PR or properly fix the underlying issue (import triggered by number instead of parent hash) instead. We are currently looking if it's feasible to properly fix the import.

I'm going to publish the review comments I've already written, but it may be a good idea to hold back and wait until we have a clear understanding of what way of fixing the issue to prefer before addressing them. The review is not complete either, so this is another reason why I'd ask you to wait. We'll try to get back with the decision next week.

dmitry-markin · 2023-10-03T10:22:26Z

substrate/client/network/sync/src/lib.rs

+	}
+}
+
+/// `PeerDownloadStateState` tracks the download from a fork. It acts a staging


Nitpick:

Suggested change

/// `PeerDownloadStateState` tracks the download from a fork. It acts a staging

/// `PeerDownloadState` tracks the download from a fork. It acts as a staging

dmitry-markin · 2023-10-03T10:46:53Z

substrate/client/network/sync/src/lib.rs

+			downloaded.blocks.append(&mut new_blocks);
+			downloaded.start_block = start_block;
+			downloaded.start_block_parent_hash = start_block_parent_hash;


Nitpick:

Suggested change

downloaded.blocks.append(&mut new_blocks);

downloaded.start_block = start_block;

downloaded.start_block_parent_hash = start_block_parent_hash;

downloaded = PeerDownloaded { blocks: new_blocks, start_block, start_block_parent_hash };

This way we will catch possible errors if the new fields are added to PeerDownloaded in the future.

dmitry-markin · 2023-10-03T11:02:32Z

substrate/client/network/sync/src/lib.rs

+					let first_different = peer.common_number + One::one();
+					if range.start == first_different {
+						peer.state = PeerSyncState::DownloadingFork(PeerDownloadState::new(range));
+					} else {
+						peer.state = PeerSyncState::DownloadingNew(range.start);
+					}


Could you also add a comment here why we are handling DownloadFork case differently? And also add a TODO with a reference to the issue of triggering import by hash instead of number.

dmitry-markin · 2023-10-03T11:34:44Z

substrate/client/network/sync/src/lib.rs

+///    So instead, hold on to the partial response in this scenario, issue more requests to
+///    download the missing [916, 925). And submit to the block collection after the full
+///    range is downloaded.


Could you add a TODO to this description referencing the issue of triggering block import by hash instead of number (#1778)? Something like:

Suggested change

/// So instead, hold on to the partial response in this scenario, issue more requests to

/// download the missing [916, 925). And submit to the block collection after the full

/// range is downloaded.

/// So instead, hold on to the partial response in this scenario, issue more requests to

/// download the missing [916, 925). And submit to the block collection after the full

/// range is downloaded.

///

/// TODO: trigger block import by parent hash comparison instead of block number

/// comparison. See https://github.com/paritytech/polkadot-sdk/issues/1778.

rahulksnv · 2023-10-06T15:11:02Z

@rahulksnv sorry for not getting back with the review of PR for quite some time. We have been having internal discussions regarding the bug and the PR, and, to be honest, have not decided yet if we should merge this PR or properly fix the underlying issue (import triggered by number instead of parent hash) instead. We are currently looking if it's feasible to properly fix the import.

I'm going to publish the review comments I've already written, but it may be a good idea to hold back and wait until we have a clear understanding of what way of fixing the issue to prefer before addressing them. The review is not complete either, so this is another reason why I'd ask you to wait. We'll try to get back with the decision next week.

Thanks, will address the comments and hold on to it for now. I would also take a look at moving to a hash based scheme, to get an idea

dmitry-markin · 2023-10-07T15:12:23Z

CC @arkpar

dmitry-markin · 2023-10-07T17:19:46Z

@rahulksnv We can probably close this PR now as superseded by #1812.

rahulksnv · 2023-10-07T18:30:36Z

@rahulksnv We can probably close this PR now as superseded by #1812.

yeah, closing this in favor of #1812

rahulksnv mentioned this pull request Sep 26, 2023

Node can't sync with the network #493

Closed

2 tasks

".git/.scripts/commands/fmt/fmt.sh"

5b7c98a

altonen added R0-silent Changes should not be mentioned in any release notes T0-node This PR/Issue is related to the topic “node”. labels Sep 26, 2023

altonen reviewed Sep 26, 2023

View reviewed changes

altonen requested a review from dmitry-markin September 26, 2023 14:43

dmitry-markin reviewed Sep 27, 2023

View reviewed changes

Refactor, address comments, more unit tests

4477868

rahulksnv force-pushed the issue-493 branch from 41626e8 to 4477868 Compare September 28, 2023 18:42

dmitry-markin had a problem deploying to master September 29, 2023 08:10 — with GitHub Actions Failure

dmitry-markin had a problem deploying to master September 29, 2023 08:43 — with GitHub Actions Failure

rahulksnv added 2 commits September 29, 2023 10:45

Separate PeerSyncState for download from fork

d32cdbf

Pass parent block hash in request.from

327f1fc

dmitry-markin mentioned this pull request Oct 3, 2023

[syncing] Trigger block import by hash instead of number #1778

Open

rahulksnv requested review from dmitry-markin and altonen October 5, 2023 07:56

dmitry-markin reviewed Oct 6, 2023

View reviewed changes

rahulksnv closed this Oct 7, 2023

		new_blocks.append(&mut self.downloaded);
		self.downloaded.append(&mut new_blocks);

		if self.range.start == (self.common_number + One::one()) &&
		self.downloaded.len() < range_len

	/// `PeerDownloadStateState` tracks the download from a fork. It acts a staging
	/// `PeerDownloadState` tracks the download from a fork. It acts as a staging

Changes to handling partial responses #1686

Changes to handling partial responses #1686

Conversation

rahulksnv commented Sep 24, 2023

altonen commented Sep 26, 2023

command-bot bot commented Sep 26, 2023 • edited Loading

command-bot bot commented Sep 26, 2023

rahulksnv commented Sep 26, 2023

altonen commented Sep 26, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

nazar-pc commented Sep 26, 2023

rahulksnv commented Sep 26, 2023

nazar-pc commented Sep 26, 2023

dmitry-markin left a comment

Choose a reason for hiding this comment

dmitry-markin Sep 27, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dmitry-markin commented Oct 6, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rahulksnv commented Oct 6, 2023

dmitry-markin commented Oct 7, 2023

dmitry-markin commented Oct 7, 2023

rahulksnv commented Oct 7, 2023

command-bot bot commented Sep 26, 2023 •

edited

Loading

dmitry-markin Sep 27, 2023 •

edited

Loading