improvements to deep re-org logic #16457

arvidn · 2023-09-29T15:08:13Z

This PR is best reviewed one commit at a time.

Purpose:

The node is currently having a hard time with deep reorgs. Some parts of block validation require some previous blocks to be loaded in the block record cache in the blockchain object. When performing a reorg deeper than the cache size, those block records may not be in the cache and fail.

Furthermore, when validating blocks on the fork of the chain (before accepting it and updating our peak) we need to be able to pick up block records (and generators) from blocks that are not in the main chain. This requires looking them up from the database. One case of interest, that has previously not been covered by a test, is when a block reference points into the forked chain (i.e. a block not in the main chain).

The important change in this PR are the last 3 commits, (8, 9 and 10 in the list below).

The other commits are tests and minor optimizations.

Scope

This PR introduces tests covering cases we've seen fail on testnet recently. Additionally it fixes the logic to work as intended. Now, the way it was intended to work was very inefficient, so there are still serious efficiency problems with deep reorgs. Since this patch is already pretty large, I think it makes sense to split up the work and address the performance issues in separate PRs.

Test chains

Since this patch alters the test chains (specifically, default_10000_blocks and test_long_reorg_blocks) CI will fail some tests that use them. CI requires test chains to have been committed to the test-cache repository and been published in a release for it to be able to pick them up.

If the general approach in this PR is approved, I will push the updated test chains and update this PR to use them, to get CI green. Here's the PR for the new test chains: Chia-Network/test-cache#8

Summary of changes:

In order to test long reorgs with block references, BlockTools need to be updated to support creating chains with block references. To avoid this change clashing with main later, the first commit is pulled from main, which also updates BlockTools. For details on this change, see #16326 .

commits:

back-port of BlockTools change from main.
Introduces a feature dummy_block_references and updates the two test blockchains default_10000_blocks and test_long_reorg_blocks to use them. These two chains will form the foundation of the new tests exercising the long reorg logic (with block references). This commit also introduces a second long_reorg test chain with the same weight as the main chain it's forking from (as opposed to higher weigh).
Updates test_long_reorg in test_blockchain.py to make the reorg longer than the size of the block record cache. It also makes it use the new test chains which include block references. It also clears the block reference cache to more accurately match a real-life scenario.
Add a new test to test_full_node.py to do the same thing but at the full_node level, making for a more authentic test (and slower).
Add a new test to test_full_node.py which sets up two full nodes with different forks, connect them and wait for them to have synchronized. This test exposed an issue where we end up requesting a block the other peer does not have. This issue need more investigation.
There was a test in test_full_node.py that used to be parametrized, and would only run its reorg test in one of the test runs. Since then, it appears changes to the parameters have caused the reorg test to never run. The next commit runs it unconditionally.
Some minor optimizations to avoid duplicate dictionary lookups and multiple computations of the header hash of blocks.
This makes the parts of block validation that currently only look in the cache, also able to fall back to pull blocks from the database. Using async functions instead of immediate ones.
This makes the find_fork_poin_in_chain() async and also able to find blocks in the database that aren't in the cache.
Update the version of block-cache blocks we pull on CI, to use the new test chains.
Optimize find_fork_point_in_chain() to mitigate some of the issue of CI taking too ling.
Before initiating a sync, we collect peaks from at least 3 peers, unless that takes more than 30 seconds, in which case we start syncing anyway. This commit makes this timeout configurable, and configures it to 1 second in the tests. This shaves off a 30 second wait in the tests introduced in (5).

Current Behavior:

In a long/deep reorg, the node may fail with KeyError while looking up block records that are not in the cache.

New Behavior:

Long/deep reorgs work as expected.

Testing Notes:

There are two new relevant tests added, one in test_blockchain.py and one in test_full_node.py. These fail without the fix.

almogdepaz · 2023-10-02T09:36:05Z

tests/blockchain/test_blockchain.py

        # Reorg longer than a difficulty adjustment
        # Also tests higher weight chain but lower height
        b = empty_blockchain
-        num_blocks_chain_1 = 3 * bt.constants.EPOCH_BLOCKS + bt.constants.MAX_SUB_SLOT_BLOCKS + 10
-        num_blocks_chain_2_start = bt.constants.EPOCH_BLOCKS - 20
+        num_blocks_chain_1 = 2000


looks like this could change the test since the relation to the constants is not enforced anymore

My understanding is that the test wants to make sure there are at least 3 epochs before the fork. I don't think the exact block height matters so much other than that. Do you think there's something subtle being tested by this exact fork height?

almogdepaz · 2023-10-02T09:41:54Z

chia/wallet/wallet_blockchain.py


    def block_record(self, header_hash: bytes32) -> BlockRecord:
        return self._block_records[header_hash]

+    async def get_block_record_from_db(self, header_hash: bytes32) -> Optional[BlockRecord]:
+        return self._block_records.get(header_hash)
+


not sure i get the naming, where is the db access ?

I'll add a comment. This is implementing the BlockchainInterface which is kind-of like an abstract base class

almogdepaz · 2023-10-02T09:50:33Z

chia/full_node/full_node.py

@@ -1253,7 +1253,8 @@ async def add_block_batch(
                if error is not None:
                    self.log.error(f"Error: {error}, Invalid block from peer: {peer.get_peer_logging()} ")
                return False, agg_state_change_summary, error
-            block_record = self.blockchain.block_record(block.header_hash)
+            block_record = await self.blockchain.get_block_record_from_db(block.header_hash)


the record should be in the cache, im afraid this can hide/open some problems

Do you mean that we should add code to populate the cache before getting here, or do you mean that there is already code to do this? (the tests would fail without being able to fall back to the DB here, so if there is logic that's meant to populate the cache, I don't think it's working right).

I haven't found code that looks like it attempts to make sure this block is in the cache.

almogdepaz · 2023-10-02T12:36:33Z

chia/consensus/multiprocess_validation.py

+            if prev_b is not None and not block_records.contains_block(block.prev_header_hash):
+                # the call to block_to_block_record() requires the previous
+                # block is in the cache
+                block_records.add_block_record(prev_b)
            block_rec = block_to_block_record(


we traverse back to the latest tx block inside block_to_block_record so not only the prev block we need all the blocks up to the latest tx block, it wont fail but it will pass prev_transaction_block_height = 0 to header_block_to_sub_block_record at the end

almogdepaz · 2023-10-02T12:41:26Z

chia/consensus/multiprocess_validation.py

+        # get_next_sub_slot_iters_and_difficulty() requires prev_b.prev_hash to
+        # be in the block record cache.
+        if prev_b is not None and prev_b.height != 0 and not block_records.contains_block(prev_b.prev_hash):
+            return [PreValidationResult(uint16(Err.INVALID_PREV_BLOCK_HASH.value), None, None, False)]
        sub_slot_iters, difficulty = get_next_sub_slot_iters_and_difficulty(


we use the cache here as well and we traverse to different points that i think this breaks, for instance we need the previous previous block and we need the last transaction block before this block's signage point

maybe it passes the tests by "accident". This is definitely the part of the patch that I'm the least confident in.
We have a test that ensures this function fail with this error under certain circumstances, but the new reorg tests also need to be able to validate blocks whose immediate history is not in the cache. It was tempting to make all of this logic fall back to the DB if a block isn't in the cache, but that breaks another, existing, test.

almogdepaz · 2023-10-02T12:44:59Z

chia/consensus/blockchain_interface.py

@@ -41,6 +41,9 @@ def contains_block(self, header_hash: bytes32) -> bool:
        # ignoring hinting error until we handle our interfaces more formally
        return  # type: ignore[return-value]

+    async def contains_block_from_db(self, header_hash: bytes32) -> bool:


need to update BlockCache as well so it will still implement the interface

…ious generators in test chains. Use a longer chain, with block generators, in the long_reorg test add new test block chain fixture for reorgs with lighter weight blocks, this covers reorgs to greater block heights than the current peak

… generators and block references

…imes. Avoid multiple lookups into the BlockRecord cache

…m the database

…e database, avoiding parsing and constructing a whole BlockRecord object for block traversed

…onds in the simulation tests to avoid waiting on CI

Better timestamps on simulated chains (#16326)

ef0ca47

arvidn force-pushed the reorg-improvements branch 2 times, most recently from c03bced to 2d33f3e Compare October 1, 2023 11:53

arvidn added the Fixed Required label for PR that categorizes merge commit message as "Fixed" for changelog label Oct 1, 2023

arvidn changed the title ~~Reorg improvements~~ improvements to deep re-org logic Oct 1, 2023

arvidn requested review from emlowe, wjblanke and almogdepaz October 1, 2023 13:52

arvidn force-pushed the reorg-improvements branch from 2d33f3e to e0f4cd4 Compare October 1, 2023 15:10

This was referenced Oct 2, 2023

update test chains to include block-references Chia-Network/test-cache#8

Merged

improve performance of find-fork-point #16471

Closed

almogdepaz reviewed Oct 2, 2023

View reviewed changes

arvidn force-pushed the reorg-improvements branch from e0f4cd4 to d7ce721 Compare October 3, 2023 22:16

arvidn force-pushed the reorg-improvements branch from d7ce721 to 71edffa Compare October 4, 2023 00:04

arvidn added 2 commits October 4, 2023 08:19

update test_long_reorg to use the new, longer, test chains with block…

ac2a85f

… generators and block references

add new test_long_reorg operating on full node

ef1a6c2

arvidn force-pushed the reorg-improvements branch 2 times, most recently from c74bd26 to 4faa332 Compare October 4, 2023 13:20

arvidn added 6 commits October 5, 2023 12:16

add test with two nodes with a deep fork, where one syncs to the other

c1943b1

re-enable reorg test in FullNodeBlockCompression

7f8af45

minor optimizations. Avoid computing FullBlock.header_hash multiple t…

d57f82b

…imes. Avoid multiple lookups into the BlockRecord cache

update block validation to pull block dependencies from the database

f572f48

transition find_fork_point_in_chain() to be async and pull blocks fro…

0abc390

…m the database

bump test chains used on CI

c98f1db

arvidn added 2 commits October 5, 2023 12:16

optimize find_fork_point_in_chain() to just look up prev_hash from th…

618e83f

…e database, avoiding parsing and constructing a whole BlockRecord object for block traversed

make the timeout for collecting 3 peaks configurable. Set it to 1 sec…

650c5e3

…onds in the simulation tests to avoid waiting on CI

arvidn force-pushed the reorg-improvements branch from 4faa332 to 650c5e3 Compare October 5, 2023 10:16

arvidn closed this Oct 11, 2023

arvidn deleted the reorg-improvements branch October 11, 2023 17:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

improvements to deep re-org logic #16457

improvements to deep re-org logic #16457

arvidn commented Sep 29, 2023 •

edited

Loading

almogdepaz Oct 2, 2023

arvidn Oct 2, 2023

almogdepaz Oct 2, 2023

arvidn Oct 2, 2023

almogdepaz Oct 2, 2023

arvidn Oct 2, 2023

almogdepaz Oct 2, 2023

almogdepaz Oct 2, 2023

arvidn Oct 2, 2023

almogdepaz Oct 2, 2023

improvements to deep re-org logic #16457

improvements to deep re-org logic #16457

Conversation

arvidn commented Sep 29, 2023 • edited Loading

Purpose:

Scope

Test chains

Summary of changes:

Current Behavior:

New Behavior:

Testing Notes:

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

arvidn commented Sep 29, 2023 •

edited

Loading