Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improvements to deep re-org logic #16457

Closed
wants to merge 12 commits into from
Closed

Conversation

arvidn
Copy link
Contributor

@arvidn arvidn commented Sep 29, 2023

This PR is best reviewed one commit at a time.

Purpose:

The node is currently having a hard time with deep reorgs. Some parts of block validation require some previous blocks to be loaded in the block record cache in the blockchain object. When performing a reorg deeper than the cache size, those block records may not be in the cache and fail.

Furthermore, when validating blocks on the fork of the chain (before accepting it and updating our peak) we need to be able to pick up block records (and generators) from blocks that are not in the main chain. This requires looking them up from the database. One case of interest, that has previously not been covered by a test, is when a block reference points into the forked chain (i.e. a block not in the main chain).

The important change in this PR are the last 3 commits, (8, 9 and 10 in the list below).

The other commits are tests and minor optimizations.

Scope

This PR introduces tests covering cases we've seen fail on testnet recently. Additionally it fixes the logic to work as intended. Now, the way it was intended to work was very inefficient, so there are still serious efficiency problems with deep reorgs. Since this patch is already pretty large, I think it makes sense to split up the work and address the performance issues in separate PRs.

Test chains

Since this patch alters the test chains (specifically, default_10000_blocks and test_long_reorg_blocks) CI will fail some tests that use them. CI requires test chains to have been committed to the test-cache repository and been published in a release for it to be able to pick them up.

If the general approach in this PR is approved, I will push the updated test chains and update this PR to use them, to get CI green. Here's the PR for the new test chains: Chia-Network/test-cache#8

Summary of changes:

In order to test long reorgs with block references, BlockTools need to be updated to support creating chains with block references. To avoid this change clashing with main later, the first commit is pulled from main, which also updates BlockTools. For details on this change, see #16326 .

commits:

  1. back-port of BlockTools change from main.

  2. Introduces a feature dummy_block_references and updates the two test blockchains default_10000_blocks and test_long_reorg_blocks to use them. These two chains will form the foundation of the new tests exercising the long reorg logic (with block references). This commit also introduces a second long_reorg test chain with the same weight as the main chain it's forking from (as opposed to higher weigh).

  3. Updates test_long_reorg in test_blockchain.py to make the reorg longer than the size of the block record cache. It also makes it use the new test chains which include block references. It also clears the block reference cache to more accurately match a real-life scenario.

  4. Add a new test to test_full_node.py to do the same thing but at the full_node level, making for a more authentic test (and slower).

  5. Add a new test to test_full_node.py which sets up two full nodes with different forks, connect them and wait for them to have synchronized. This test exposed an issue where we end up requesting a block the other peer does not have. This issue need more investigation.

  6. There was a test in test_full_node.py that used to be parametrized, and would only run its reorg test in one of the test runs. Since then, it appears changes to the parameters have caused the reorg test to never run. The next commit runs it unconditionally.

  7. Some minor optimizations to avoid duplicate dictionary lookups and multiple computations of the header hash of blocks.

  8. This makes the parts of block validation that currently only look in the cache, also able to fall back to pull blocks from the database. Using async functions instead of immediate ones.

  9. This makes the find_fork_poin_in_chain() async and also able to find blocks in the database that aren't in the cache.

  10. Update the version of block-cache blocks we pull on CI, to use the new test chains.

  11. Optimize find_fork_point_in_chain() to mitigate some of the issue of CI taking too ling.

  12. Before initiating a sync, we collect peaks from at least 3 peers, unless that takes more than 30 seconds, in which case we start syncing anyway. This commit makes this timeout configurable, and configures it to 1 second in the tests. This shaves off a 30 second wait in the tests introduced in (5).

Current Behavior:

In a long/deep reorg, the node may fail with KeyError while looking up block records that are not in the cache.

New Behavior:

Long/deep reorgs work as expected.

Testing Notes:

There are two new relevant tests added, one in test_blockchain.py and one in test_full_node.py. These fail without the fix.

@arvidn arvidn force-pushed the reorg-improvements branch 2 times, most recently from c03bced to 2d33f3e Compare October 1, 2023 11:53
@arvidn arvidn added the Fixed Required label for PR that categorizes merge commit message as "Fixed" for changelog label Oct 1, 2023
@arvidn arvidn changed the title Reorg improvements improvements to deep re-org logic Oct 1, 2023
# Reorg longer than a difficulty adjustment
# Also tests higher weight chain but lower height
b = empty_blockchain
num_blocks_chain_1 = 3 * bt.constants.EPOCH_BLOCKS + bt.constants.MAX_SUB_SLOT_BLOCKS + 10
num_blocks_chain_2_start = bt.constants.EPOCH_BLOCKS - 20
num_blocks_chain_1 = 2000
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like this could change the test since the relation to the constants is not enforced anymore

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that the test wants to make sure there are at least 3 epochs before the fork. I don't think the exact block height matters so much other than that. Do you think there's something subtle being tested by this exact fork height?


def block_record(self, header_hash: bytes32) -> BlockRecord:
return self._block_records[header_hash]

async def get_block_record_from_db(self, header_hash: bytes32) -> Optional[BlockRecord]:
return self._block_records.get(header_hash)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure i get the naming, where is the db access ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add a comment. This is implementing the BlockchainInterface which is kind-of like an abstract base class

@@ -1253,7 +1253,8 @@ async def add_block_batch(
if error is not None:
self.log.error(f"Error: {error}, Invalid block from peer: {peer.get_peer_logging()} ")
return False, agg_state_change_summary, error
block_record = self.blockchain.block_record(block.header_hash)
block_record = await self.blockchain.get_block_record_from_db(block.header_hash)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the record should be in the cache, im afraid this can hide/open some problems

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean that we should add code to populate the cache before getting here, or do you mean that there is already code to do this? (the tests would fail without being able to fall back to the DB here, so if there is logic that's meant to populate the cache, I don't think it's working right).

I haven't found code that looks like it attempts to make sure this block is in the cache.

if prev_b is not None and not block_records.contains_block(block.prev_header_hash):
# the call to block_to_block_record() requires the previous
# block is in the cache
block_records.add_block_record(prev_b)
block_rec = block_to_block_record(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we traverse back to the latest tx block inside block_to_block_record so not only the prev block we need all the blocks up to the latest tx block, it wont fail but it will pass prev_transaction_block_height = 0 to header_block_to_sub_block_record at the end

# get_next_sub_slot_iters_and_difficulty() requires prev_b.prev_hash to
# be in the block record cache.
if prev_b is not None and prev_b.height != 0 and not block_records.contains_block(prev_b.prev_hash):
return [PreValidationResult(uint16(Err.INVALID_PREV_BLOCK_HASH.value), None, None, False)]
sub_slot_iters, difficulty = get_next_sub_slot_iters_and_difficulty(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we use the cache here as well and we traverse to different points that i think this breaks, for instance we need the previous previous block and we need the last transaction block before this block's signage point

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe it passes the tests by "accident". This is definitely the part of the patch that I'm the least confident in.
We have a test that ensures this function fail with this error under certain circumstances, but the new reorg tests also need to be able to validate blocks whose immediate history is not in the cache. It was tempting to make all of this logic fall back to the DB if a block isn't in the cache, but that breaks another, existing, test.

@@ -41,6 +41,9 @@ def contains_block(self, header_hash: bytes32) -> bool:
# ignoring hinting error until we handle our interfaces more formally
return # type: ignore[return-value]

async def contains_block_from_db(self, header_hash: bytes32) -> bool:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to update BlockCache as well so it will still implement the interface

…ious generators in test chains. Use a longer chain, with block generators, in the long_reorg test

add new test block chain fixture for reorgs with lighter weight blocks, this covers reorgs to greater block heights than the current peak
@arvidn arvidn force-pushed the reorg-improvements branch 2 times, most recently from c74bd26 to 4faa332 Compare October 4, 2023 13:20
…e database, avoiding parsing and constructing a whole BlockRecord object for block traversed
…onds in the simulation tests to avoid waiting on CI
@arvidn arvidn closed this Oct 11, 2023
@arvidn arvidn deleted the reorg-improvements branch October 11, 2023 17:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Fixed Required label for PR that categorizes merge commit message as "Fixed" for changelog
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants