-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
improvements to deep re-org logic #16457
Conversation
c03bced
to
2d33f3e
Compare
2d33f3e
to
e0f4cd4
Compare
tests/blockchain/test_blockchain.py
Outdated
# Reorg longer than a difficulty adjustment | ||
# Also tests higher weight chain but lower height | ||
b = empty_blockchain | ||
num_blocks_chain_1 = 3 * bt.constants.EPOCH_BLOCKS + bt.constants.MAX_SUB_SLOT_BLOCKS + 10 | ||
num_blocks_chain_2_start = bt.constants.EPOCH_BLOCKS - 20 | ||
num_blocks_chain_1 = 2000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like this could change the test since the relation to the constants is not enforced anymore
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is that the test wants to make sure there are at least 3 epochs before the fork. I don't think the exact block height matters so much other than that. Do you think there's something subtle being tested by this exact fork height?
|
||
def block_record(self, header_hash: bytes32) -> BlockRecord: | ||
return self._block_records[header_hash] | ||
|
||
async def get_block_record_from_db(self, header_hash: bytes32) -> Optional[BlockRecord]: | ||
return self._block_records.get(header_hash) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure i get the naming, where is the db access ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add a comment. This is implementing the BlockchainInterface
which is kind-of like an abstract base class
@@ -1253,7 +1253,8 @@ async def add_block_batch( | |||
if error is not None: | |||
self.log.error(f"Error: {error}, Invalid block from peer: {peer.get_peer_logging()} ") | |||
return False, agg_state_change_summary, error | |||
block_record = self.blockchain.block_record(block.header_hash) | |||
block_record = await self.blockchain.get_block_record_from_db(block.header_hash) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the record should be in the cache, im afraid this can hide/open some problems
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean that we should add code to populate the cache before getting here, or do you mean that there is already code to do this? (the tests would fail without being able to fall back to the DB here, so if there is logic that's meant to populate the cache, I don't think it's working right).
I haven't found code that looks like it attempts to make sure this block is in the cache.
if prev_b is not None and not block_records.contains_block(block.prev_header_hash): | ||
# the call to block_to_block_record() requires the previous | ||
# block is in the cache | ||
block_records.add_block_record(prev_b) | ||
block_rec = block_to_block_record( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we traverse back to the latest tx block inside block_to_block_record so not only the prev block we need all the blocks up to the latest tx block, it wont fail but it will pass prev_transaction_block_height = 0 to header_block_to_sub_block_record at the end
# get_next_sub_slot_iters_and_difficulty() requires prev_b.prev_hash to | ||
# be in the block record cache. | ||
if prev_b is not None and prev_b.height != 0 and not block_records.contains_block(prev_b.prev_hash): | ||
return [PreValidationResult(uint16(Err.INVALID_PREV_BLOCK_HASH.value), None, None, False)] | ||
sub_slot_iters, difficulty = get_next_sub_slot_iters_and_difficulty( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we use the cache here as well and we traverse to different points that i think this breaks, for instance we need the previous previous block and we need the last transaction block before this block's signage point
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe it passes the tests by "accident". This is definitely the part of the patch that I'm the least confident in.
We have a test that ensures this function fail with this error under certain circumstances, but the new reorg tests also need to be able to validate blocks whose immediate history is not in the cache. It was tempting to make all of this logic fall back to the DB if a block isn't in the cache, but that breaks another, existing, test.
@@ -41,6 +41,9 @@ def contains_block(self, header_hash: bytes32) -> bool: | |||
# ignoring hinting error until we handle our interfaces more formally | |||
return # type: ignore[return-value] | |||
|
|||
async def contains_block_from_db(self, header_hash: bytes32) -> bool: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need to update BlockCache as well so it will still implement the interface
e0f4cd4
to
d7ce721
Compare
…ious generators in test chains. Use a longer chain, with block generators, in the long_reorg test add new test block chain fixture for reorgs with lighter weight blocks, this covers reorgs to greater block heights than the current peak
d7ce721
to
71edffa
Compare
… generators and block references
c74bd26
to
4faa332
Compare
…imes. Avoid multiple lookups into the BlockRecord cache
…e database, avoiding parsing and constructing a whole BlockRecord object for block traversed
…onds in the simulation tests to avoid waiting on CI
4faa332
to
650c5e3
Compare
This PR is best reviewed one commit at a time.
Purpose:
The node is currently having a hard time with deep reorgs. Some parts of block validation require some previous blocks to be loaded in the block record cache in the blockchain object. When performing a reorg deeper than the cache size, those block records may not be in the cache and fail.
Furthermore, when validating blocks on the fork of the chain (before accepting it and updating our peak) we need to be able to pick up block records (and generators) from blocks that are not in the main chain. This requires looking them up from the database. One case of interest, that has previously not been covered by a test, is when a block reference points into the forked chain (i.e. a block not in the main chain).
The important change in this PR are the last 3 commits, (8, 9 and 10 in the list below).
The other commits are tests and minor optimizations.
Scope
This PR introduces tests covering cases we've seen fail on testnet recently. Additionally it fixes the logic to work as intended. Now, the way it was intended to work was very inefficient, so there are still serious efficiency problems with deep reorgs. Since this patch is already pretty large, I think it makes sense to split up the work and address the performance issues in separate PRs.
Test chains
Since this patch alters the test chains (specifically,
default_10000_blocks
andtest_long_reorg_blocks
) CI will fail some tests that use them. CI requires test chains to have been committed to thetest-cache
repository and been published in a release for it to be able to pick them up.If the general approach in this PR is approved, I will push the updated test chains and update this PR to use them, to get CI green. Here's the PR for the new test chains: Chia-Network/test-cache#8
Summary of changes:
In order to test long reorgs with block references,
BlockTools
need to be updated to support creating chains with block references. To avoid this change clashing withmain
later, the first commit is pulled frommain
, which also updatesBlockTools
. For details on this change, see #16326 .commits:
back-port of
BlockTools
change frommain
.Introduces a feature
dummy_block_references
and updates the two test blockchainsdefault_10000_blocks
andtest_long_reorg_blocks
to use them. These two chains will form the foundation of the new tests exercising the long reorg logic (with block references). This commit also introduces a secondlong_reorg
test chain with the same weight as the main chain it's forking from (as opposed to higher weigh).Updates
test_long_reorg
intest_blockchain.py
to make the reorg longer than the size of the block record cache. It also makes it use the new test chains which include block references. It also clears the block reference cache to more accurately match a real-life scenario.Add a new test to
test_full_node.py
to do the same thing but at thefull_node
level, making for a more authentic test (and slower).Add a new test to
test_full_node.py
which sets up two full nodes with different forks, connect them and wait for them to have synchronized. This test exposed an issue where we end up requesting a block the other peer does not have. This issue need more investigation.There was a test in
test_full_node.py
that used to be parametrized, and would only run its reorg test in one of the test runs. Since then, it appears changes to the parameters have caused the reorg test to never run. The next commit runs it unconditionally.Some minor optimizations to avoid duplicate dictionary lookups and multiple computations of the header hash of blocks.
This makes the parts of block validation that currently only look in the cache, also able to fall back to pull blocks from the database. Using
async
functions instead of immediate ones.This makes the
find_fork_poin_in_chain()
async
and also able to find blocks in the database that aren't in the cache.Update the version of
block-cache
blocks we pull on CI, to use the new test chains.Optimize
find_fork_point_in_chain()
to mitigate some of the issue of CI taking too ling.Before initiating a sync, we collect peaks from at least 3 peers, unless that takes more than 30 seconds, in which case we start syncing anyway. This commit makes this timeout configurable, and configures it to 1 second in the tests. This shaves off a 30 second wait in the tests introduced in (5).
Current Behavior:
In a long/deep reorg, the node may fail with
KeyError
while looking up block records that are not in the cache.New Behavior:
Long/deep reorgs work as expected.
Testing Notes:
There are two new relevant tests added, one in
test_blockchain.py
and one intest_full_node.py
. These fail without the fix.