Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changes to handling partial responses #1686

Closed
wants to merge 5 commits into from

Conversation

rahulksnv
Copy link
Contributor

Context: #493 (comment)

Problem: When downloading from a fork, the peer may send a partial response that extends the current chain. Even though the response starts at a block number after the current best number, the parent block from the fork is not yet downloaded/imported. We currently try to import the partial response in this scenario, which causes import to fail as parent is not found in the backend. This causes sync to be aborted/restarted which may not converge.

Fix: when we receive a partial response in this scenario, hold on to the partial blocks and try to download the missing blocks. And import when the full range is downloaded.

Context: paritytech#493 (comment)

Problem: When downloading from a fork, the peer may send a partial response
that extends the current chain. Even though the response starts at a block
number after the current best number, the parent block from the fork is not
yet downloaded/imported. We currently try to import the partial response
in this scenario, which causes import to fail as parent is not found in the
backend. This causes sync to be aborted/restarted which may not converge.

Fix: when we receive a partial response in this scenario, hold on to the
partial blocks and try to download the missing blocks. And import when the
full range is downloaded.
@rahulksnv rahulksnv mentioned this pull request Sep 26, 2023
2 tasks
@altonen
Copy link
Contributor

altonen commented Sep 26, 2023

bot fmt

@command-bot
Copy link

command-bot bot commented Sep 26, 2023

@altonen https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/3799758 was started for your command "$PIPELINE_SCRIPTS_DIR/commands/fmt/fmt.sh". Check out https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/pipelines?page=1&scope=all&username=group_605_bot to know what else is being executed currently.

Comment bot cancel 1-42f79efd-39db-4927-b68e-3f722f8806f4 to cancel this command or bot cancel to cancel all commands in this pull request.

@command-bot
Copy link

command-bot bot commented Sep 26, 2023

@altonen Command "$PIPELINE_SCRIPTS_DIR/commands/fmt/fmt.sh" has finished. Result: https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/3799758 has finished. If any artifacts were generated, you can download them from https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/3799758/artifacts/download.

@altonen altonen added R0-silent Changes should not be mentioned in any release notes T0-node This PR/Issue is related to the topic “node”. labels Sep 26, 2023
@rahulksnv
Copy link
Contributor Author

bot fmt

I don't understand how the fmt works with these changes. I try running cargo fmt locally (or with the command from the failing fmt job), but it formats several files not part of the change, or most of code in same file that the PR does not touch. The contributor guidelines also does not talk about this much.

@altonen
Copy link
Contributor

altonen commented Sep 26, 2023

@rahulksnv did you run cargo +nightly fmt?

Comment on lines 499 to 505
warn!(
target: LOG_TARGET,
"Download state: unable to get block number of the downloaded blocks: \
common = {}, range = {:?}, downloaded = {}",
self.common_number, self.range, self.downloaded.len()
);
return None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would the block not be available?

It looks like right now the peer is still left in DownloadingNew state so I assume it will attempt to retry with the same failed result. It may be safer to disconnect the peer in case this function returns None unless syncing can somehow recover from this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the existing behavior when blocks are validated: https://github.com/paritytech/polkadot-sdk/blob/master/substrate/client/network/sync/src/lib.rs#L2796

Blocks are validated before adding to the self.downloaded list, so this should never happen (why this is a warn message). Initially thought of caching the block number when it is added to the list, but that would be redundant fields, and we need to worry about keeping them in sync

Comment on lines 511 to 515
let from = if peer_best_number == last {
FromBlock::Hash(*peer_best_hash)
} else {
FromBlock::Number(last)
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add comment explaining this logic

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in case of downloading a fork we always need to use FromBlock::Hash, looking it up from the first block in the already downloaded range.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The request.from is the hash of the first block we want to download, which we don't know until the block is downloaded (now I understand why current code sets it hash if block_number == peer.best_block_number, in this case we happen to know the corresponding peer.best_block_hash)

Unless we want to set it to parent hash of the first block already downloaded. I would prefer to keep this uniform with existing code, don't know why current code does not set it to parent hash.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I meant setting it to the parent hash of the already downloaded first block in the range. Otherwise we can't be sure that we are downloading the right fork and can fail when importing the range.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed it to pass parent hash

Comment on lines 462 to 463
new_blocks.append(&mut self.downloaded);
self.downloaded.append(&mut new_blocks);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the purpose of this? Since it's dealing with the block bodies which can be large, it'd be better if we'd tried to handle them efficiently. Looking at this code alone I don't know whether it's making full copies of them or not but this code looks quite dubious in the first place. Why does it first have to append blocks from self.downloaded to new_blocks and then append them back to self.downloaded?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This prepends the new blocks to the existing blocks (updated the comment to reflect it). Prepending is needed because the blocks are returned in the reverse order (i.e) the request sets request.from = FromBlock::Number(x), and request.direction = Direction::Descending.

The Vec::append() is a move only operation, no copies/clones involved. The first append appends/empties self.downloaded, second one moves it back

Comment on lines 465 to 466
if self.range.start == (self.common_number + One::one()) &&
self.downloaded.len() < range_len
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment for this check please. Otherwise it'll be hard to maintain this code because looking at this alone, I don't know what self.range.start == (self.common_number + One::one()) implies

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had comments below, moved it up and expanded it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check self.range.start == (self.common_number + One::one()) looks like a hackish way to determine if PeerDownloadState is responsible for accumulating blocks, or BlockCollection is responsible for this on the upper level in ChainSync. May be it's possible to incorporate this code into BlockCollection so that not distribute the logic over multiple components?

If for some reasons it's not possible to incorporate this code into BlockCollection, I would introduce a clear separation of the code needed to pass only one chunk to BlockCollection, and the code responsible for holding up the blocks. For example, this can be done by introducing a separate enum variant into PeerSyncState for the latter (DownloadingRange) with comments that it's used specifically for fork downloading in multiple chunks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check self.range.start == (self.common_number + One::one()) looks like a hackish way to determine if PeerDownloadState is responsible for accumulating blocks, or BlockCollection is responsible for this on the upper level in ChainSync. May be it's possible to incorporate this code into BlockCollection so that not distribute the logic over multiple components?

BlockCollection currently does something slightly different: manages the downloaded range (with holes + merging holes) across several peers, which the PeerDownloadState manages a range with a single peer. It would be cleaner/less complex to keep this separation. The PeerDownloadState is just a staging area before adding the blocks to the collection.

If for some reasons it's not possible to incorporate this code into BlockCollection, I would introduce a clear separation of the code needed to pass only one chunk to BlockCollection, and the code responsible for holding up the blocks. For example, this can be done by introducing a separate enum variant into PeerSyncState for the latter (DownloadingRange) with comments that it's used specifically for fork downloading in multiple chunks.

This change does the same thing, except it enhances the state maintained by PeerSyncState::DownloadingNew, to manage download of new blocks from a single/unified path. I feel that adding separate variant only adds more complexity/code duplication

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The check self.range.start == (self.common_number + One::one()) looks confusing and can be replaced with a single flag download_full_range, the value of which we know from the creation of PeerDownloadState. Otherwise we don't need self.common_number for anything. This is mostly equivalent to having a dedicated enum type for the case when we need to download the full range.

Having said that, IMO updating BlockCollection / importing logic to trigger import only when the parent hash corresponds to the already imported block (instead of comparison by number) would be less hacky than introducing a special case for downloading a range starting from common_number.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The check self.range.start == (self.common_number + One::one()) looks confusing and can be replaced with a single flag download_full_range, the value of which we know from the creation of PeerDownloadState. Otherwise we don't need self.common_number for anything. This is mostly equivalent to having a dedicated enum type for the case when we need to download the full range.

Yeah this looks cleaner, changed it to have a separate enum. The current DownloadingNew remains unchanged

Having said that, IMO updating BlockCollection / importing logic to trigger import only when the parent hash corresponds to the already imported block (instead of comparison by number) would be less hacky than introducing a special case for downloading a range starting from common_number.

Agreed. I looked at it, this looks like a much bigger change to use block hash instead of block number (this potentially touches several other paths). I feel this is better done as a separate change

Comment on lines 481 to 483
let mut downloaded = Vec::new();
downloaded.append(&mut self.downloaded);
collection.insert(start_block, downloaded, *who);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here, why is this append() necessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is similar, it moves the completed blocks out and passes it to the BlockCollection

@nazar-pc
Copy link
Contributor

@altonen this repo should seriously consider using Nightly in rust-toolchain.toml and run tests on stable with override in CI. I see every team member on our side is confused with how things are organized right now.

@rahulksnv
Copy link
Contributor Author

@rahulksnv did you run cargo +nightly fmt?

Yes, I also ran the cmd from the failing job, but same behavior

@nazar-pc
Copy link
Contributor

@rahulksnv update nightly to latest and cargo +nightly fmt will do the job

Copy link
Contributor

@dmitry-markin dmitry-markin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks very promising, left some comments.

substrate/client/network/sync/src/lib.rs Outdated Show resolved Hide resolved
return Err(BadPeer(*who, rep::BAD_RESPONSE))
}
let new_blocks_len = new_blocks.len();
new_blocks.append(&mut self.downloaded);
Copy link
Contributor

@dmitry-markin dmitry-markin Sep 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would check first that start_block + new_blocks.len() == self.downloaded.first().number() (pseudocode) or start_block + new_blocks.len() == self.range.end() (if self.downloaded is empty). I.e., we can correctly prepend the range.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactored this part to break into smaller functions, added unit tests for the new paths

Self { common_number, range, downloaded: Vec::new() }
}

/// Handles the new blocks received from the peer.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add a comment stating that it only works when downloading the range backwards, may be even rename it to reflect this (prepend_blocks?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed name and expanded the comments

Comment on lines 465 to 466
if self.range.start == (self.common_number + One::one()) &&
self.downloaded.len() < range_len
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check self.range.start == (self.common_number + One::one()) looks like a hackish way to determine if PeerDownloadState is responsible for accumulating blocks, or BlockCollection is responsible for this on the upper level in ChainSync. May be it's possible to incorporate this code into BlockCollection so that not distribute the logic over multiple components?

If for some reasons it's not possible to incorporate this code into BlockCollection, I would introduce a clear separation of the code needed to pass only one chunk to BlockCollection, and the code responsible for holding up the blocks. For example, this can be done by introducing a separate enum variant into PeerSyncState for the latter (DownloadingRange) with comments that it's used specifically for fork downloading in multiple chunks.

Comment on lines 511 to 515
let from = if peer_best_number == last {
FromBlock::Hash(*peer_best_hash)
} else {
FromBlock::Number(last)
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in case of downloading a fork we always need to use FromBlock::Hash, looking it up from the first block in the already downloaded range.

@dmitry-markin
Copy link
Contributor

@rahulksnv sorry for not getting back with the review of PR for quite some time. We have been having internal discussions regarding the bug and the PR, and, to be honest, have not decided yet if we should merge this PR or properly fix the underlying issue (import triggered by number instead of parent hash) instead. We are currently looking if it's feasible to properly fix the import.

I'm going to publish the review comments I've already written, but it may be a good idea to hold back and wait until we have a clear understanding of what way of fixing the issue to prefer before addressing them. The review is not complete either, so this is another reason why I'd ask you to wait. We'll try to get back with the decision next week.

}
}

/// `PeerDownloadStateState` tracks the download from a fork. It acts a staging
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick:

Suggested change
/// `PeerDownloadStateState` tracks the download from a fork. It acts a staging
/// `PeerDownloadState` tracks the download from a fork. It acts as a staging

Comment on lines +499 to +501
downloaded.blocks.append(&mut new_blocks);
downloaded.start_block = start_block;
downloaded.start_block_parent_hash = start_block_parent_hash;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick:

Suggested change
downloaded.blocks.append(&mut new_blocks);
downloaded.start_block = start_block;
downloaded.start_block_parent_hash = start_block_parent_hash;
downloaded = PeerDownloaded { blocks: new_blocks, start_block, start_block_parent_hash };

This way we will catch possible errors if the new fields are added to PeerDownloaded in the future.

Comment on lines +2216 to +2221
let first_different = peer.common_number + One::one();
if range.start == first_different {
peer.state = PeerSyncState::DownloadingFork(PeerDownloadState::new(range));
} else {
peer.state = PeerSyncState::DownloadingNew(range.start);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also add a comment here why we are handling DownloadFork case differently? And also add a TODO with a reference to the issue of triggering import by hash instead of number.

Comment on lines +433 to +435
/// So instead, hold on to the partial response in this scenario, issue more requests to
/// download the missing [916, 925). And submit to the block collection after the full
/// range is downloaded.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a TODO to this description referencing the issue of triggering block import by hash instead of number (#1778)? Something like:

Suggested change
/// So instead, hold on to the partial response in this scenario, issue more requests to
/// download the missing [916, 925). And submit to the block collection after the full
/// range is downloaded.
/// So instead, hold on to the partial response in this scenario, issue more requests to
/// download the missing [916, 925). And submit to the block collection after the full
/// range is downloaded.
///
/// TODO: trigger block import by parent hash comparison instead of block number
/// comparison. See https://github.com/paritytech/polkadot-sdk/issues/1778.

@rahulksnv
Copy link
Contributor Author

@rahulksnv sorry for not getting back with the review of PR for quite some time. We have been having internal discussions regarding the bug and the PR, and, to be honest, have not decided yet if we should merge this PR or properly fix the underlying issue (import triggered by number instead of parent hash) instead. We are currently looking if it's feasible to properly fix the import.

I'm going to publish the review comments I've already written, but it may be a good idea to hold back and wait until we have a clear understanding of what way of fixing the issue to prefer before addressing them. The review is not complete either, so this is another reason why I'd ask you to wait. We'll try to get back with the decision next week.

Thanks, will address the comments and hold on to it for now. I would also take a look at moving to a hash based scheme, to get an idea

@dmitry-markin
Copy link
Contributor

CC @arkpar

@dmitry-markin
Copy link
Contributor

@rahulksnv We can probably close this PR now as superseded by #1812.

@rahulksnv
Copy link
Contributor Author

@rahulksnv We can probably close this PR now as superseded by #1812.

yeah, closing this in favor of #1812

@rahulksnv rahulksnv closed this Oct 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
R0-silent Changes should not be mentioned in any release notes T0-node This PR/Issue is related to the topic “node”.
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

4 participants