Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix deep reorgs #16594

Merged
merged 14 commits into from
Nov 15, 2023
Merged

Fix deep reorgs #16594

merged 14 commits into from
Nov 15, 2023

Conversation

arvidn
Copy link
Contributor

@arvidn arvidn commented Oct 11, 2023

This PR should be reviewed one commit at a time.

Purpose:

The node is currently having a hard time with deep reorgs. The issue can be split in 2 parts:

  1. Some parts of block validation require some previous blocks to be loaded in the block record cache in the blockchain object. When performing a reorg deeper than the cache size, those block records may not be in the cache and fail.

  2. When validating coin spends on the forked chain, we re-run blocks in order to compute the additions and removals for the fork of the chain. This is done very inefficiently, causing the same blocks to be run many times.

When validating blocks on the fork of the chain (before accepting it and updating our peak) we need to be able to pick up block records (and generators) from blocks that are not in the main chain. This requires looking them up from the database. One case of interest, that has previously not been covered by a test, is when a block reference points into the forked chain (i.e. a block not in the main chain).

Current Behavior:

In a long/deep reorg, the node may fail with KeyError while looking up block records that are not in the cache.

New Behavior:

Long/deep reorgs work as expected.

Changes

These are all the commits in this PR.

test block chains, BlockTools updates

Update BlockTools to make reorg test blockchains use block references and transactions. These updated test chains have been put in the test-cache repo and released as 0.38.0.

  • add feature to BlockTools to allow including block references to previous generators in test chains. Use a longer chain, with block generators, in the long_reorg test add new test block chain fixture for reorgs with lighter weight blocks, this covers reorgs to greater block heights than the current peak

  • add feature to BlockTools to include transactions in every transaction block, to exercise more parts of our validation logic in simulations

fall back to pulling block records from DB

Allow block validation to pull blocks from the database, that currently only pull from the cache.

  • update block validation to pull block dependencies from the database
  • transition find_fork_point_in_chain() to be async and pull blocks from the database

Preserve fork state across blocks, when validating

Preserve the state of the fork across block validations. This is done in ForkInfo, by preserving all additions and removals of the fork. The "fix reorg performance" commit is the big one that passes around information about the fork that's currently being validated (if any).

  • introduce new function, lookup_fork_chain(), similar to find_fork_point_in_chain(), but return the whole sequence of block hashes
  • optimize find_fork_point_in_chain() to just look up prev_hash from the database, avoiding parsing and constructing a whole BlockRecord object for block traversed
  • fix reorg performance issue by preserving additions and removals across block validations (instead of re-computing them from scratch for every block)
  • update get_block_generator() to use new lookup_fork_chain()

Update and add new tests

Extend the existing long reorg test to be even longer (to be deeper than the block cache size).
Also add tests on the FullNode object including one with two nodes syncing via the network protocol.

  • update test_long_reorg to use the new, longer, test chains with block generators and block references
  • add new test_long_reorg operating on full node
  • update test with two nodes with a deep fork, where one syncs to the other
  • make the timeout for collecting 3 peaks configurable. Set it to 1 seconds in the simulation tests to avoid waiting on CI (already addressed by Sync no farmer #16698 and make sure we set max_sync_wait to 0 in all tests #16771 )
  • bump test chains used on CI

Benchmark

I profiled the new reorg logic via test_long_reorg_nodes() and the Blockchain.py test_long_reorg() tests.

test_long_reorg_nodes()

full-node-test-long-reorg

test_long_reorg()

blockchain-test-long-reorg

future improvements

get_block_generator()

There are still some significant performance improvements to be made to the get_block_generator() function. If the fork chain's height-to-hash map would be recorded in the ForkInfo object, it could avoid the (expensive) call to lookup_fork_chain() entirely.

_reconsider_peak()

This function pulls in all full blocks of the fork chain in order to apply them one at a time. At this point we have access to the ForkInfo object, so we could already know the block hashes to apply. Then we could load them one at a time from the database, and apply them one at a time.

It appears we mainly build the blocks_to_add up front to be able to reverse the order of blocks. However, since we also have all the additions and removals, we don't technically need to re-run the blocks either at this point. But we would need to record the height for removals in that case.

The loop to fetch the full blocks and block records from the DB is bottlenecked on the parsing, and DB access.

reorg-fetch-blocks-to-add

There's some work started to address this:
#16787
#16793

protocol error when syncing

This test: tests/core/full_node/test_full_node.py::test_long_reorg_nodes exposes an issue where a node attempts to fetch a block at a height that doesn't exist. This happens when a node is exposed to a heavier chain with a lower peak height. It seems like we try to request blocks at our peak height, rather than the peer's peak height.

This is visible in the log as:

22:22:05 chia.full_node.full_node: INFO 🌱 Updated peak to height 1499, weight 1957496, hh 132e925cd2b6a639e67c1afefedb307a796bcb67fed780a294e2e2ab8ba5feeb, forked at 1498, rh: 7440aff5006463202dbebaa0f6a07710b8601717ba6cb7983532c38fae99a6fb, total iters: 232005, overflow: True, deficit: 12, difficulty: 2400, sub slot iters: 8000, Generator size: 1287, Generator ref list size: 3
...
22:22:05 chia.full_node.full_node: INFO Starting batch short sync from 1593 to height 1499
22:22:05 chia.full_node.full_node: ERROR Error short batch syncing, could not fetch block at height 1593

This is addressed by: #16779

overlapping block requests

When syncing, it appears we make requests for overlapping block ranges. When this is printed to the log:

22:22:07 chia.full_node.full_node: INFO Added blocks 0 to 32
22:22:07 chia.full_node.full_node: INFO Added blocks 32 to 64
22:22:07 chia.full_node.full_node: INFO Added blocks 64 to 96
22:22:07 chia.full_node.full_node: INFO Added blocks 96 to 128

Blocks 32, 64, 96 are requested and added twice from the peer. This seems to be a mistake.

This is addressed by: #16792

throwing get_peak()

In the test we wait for all nodes to have the same peak, indicating they're synced. Sometimes get_peak() throws:

Traceback (most recent call last):
  File "/home/arvid/dev/3/chia-blockchain/tests/core/full_node/test_full_node.py", line 2195, in check_nodes_in_sync
    p2 = full_node_1.full_node.blockchain.get_peak()
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/arvid/dev/3/chia-blockchain/chia/consensus/blockchain.py", line 191, in get_peak
    return self.height_to_block_record(self._peak_height)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/arvid/dev/3/chia-blockchain/chia/consensus/blockchain.py", line 697, in height_to_block_record
    if header_hash is None:
        ^^^^^^^^^^^^^^^^^^^^
ValueError: Height is not in blockchain: 1599

This is addressed by: #16776

invalid weight proof in test

When enabling the shorter chains in tests/core/full_node/test_full_node.py::test_long_reorg_nodes, some combinations fail with the following error log:

22:41:49 chia.types.blockchain_format.proof_of_space: ERROR Calculated pos challenge doesn't match the provided one
22:41:49 chia.full_node.weight_proof: ERROR could not verify proof of space
22:41:49 chia.full_node.weight_proof: ERROR failed to validate sub_epoch 2 segment 0 slots
22:41:49 full_node_server: WARNING Trying to ban localhost for 600, but will not ban

followed by:

22:41:49 chia.full_node.full_node: ERROR Error with syncing: <class 'ValueError'>Traceback (most recent call last):
  File "/Users/arvid/Documents/dev/2/chia-blockchain/chia/full_node/full_node.py", line 975, in _sync
    fork_point, summaries = await self.request_validate_wp(
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/arvid/Documents/dev/2/chia-blockchain/chia/full_node/full_node.py", line 1040, in request_validate_wp
    raise ValueError("Weight proof validation failed")
ValueError: Weight proof validation failed

It would be good to understand why.

@arvidn arvidn added the Fixed Required label for PR that categorizes merge commit message as "Fixed" for changelog label Oct 11, 2023
@arvidn arvidn force-pushed the fix-deep-reorgs branch 3 times, most recently from ef40f13 to 1386e48 Compare October 12, 2023 00:30
@coveralls-official
Copy link

coveralls-official bot commented Oct 12, 2023

Pull Request Test Coverage Report for Build 6866521658

Warning: This coverage report may be inaccurate.

We've detected an issue with your CI configuration that might affect the accuracy of this pull request's coverage report.
To ensure accuracy in future PRs, please see these guidelines.
A quick fix for this PR: rebase it; your next report should be accurate.

  • 614 of 693 (88.6%) changed or added relevant lines in 19 files are covered.
  • 51 unchanged lines in 11 files lost coverage.
  • Overall coverage decreased (-0.05%) to 90.22%

Changes Missing Coverage Covered Lines Changed/Added Lines %
tests/plot_sync/util.py 1 2 50.0%
chia/consensus/blockchain_interface.py 2 4 50.0%
chia/wallet/wallet_blockchain.py 9 11 81.82%
chia/consensus/blockchain.py 120 124 96.77%
chia/full_node/full_node.py 28 32 87.5%
chia/util/block_cache.py 2 7 28.57%
chia/simulator/block_tools.py 17 78 21.79%
Files with Coverage Reduction New Missed Lines %
chia/data_layer/dl_wallet_store.py 1 95.74%
chia/full_node/weight_proof.py 1 91.04%
chia/types/header_block.py 1 96.15%
tests/plotting/test_plot_manager.py 1 98.06%
chia/timelord/timelord.py 2 73.64%
chia/plotters/madmax.py 3 48.77%
tests/core/util/test_lockfile.py 5 88.66%
chia/full_node/full_node.py 7 84.54%
chia/server/address_manager.py 7 90.36%
chia/wallet/wallet_node.py 8 86.83%
Totals Coverage Status
Change from base Build 6865805050: -0.05%
Covered Lines: 93389
Relevant Lines: 103455

💛 - Coveralls

@arvidn
Copy link
Contributor Author

arvidn commented Oct 23, 2023

We have completed a sync-from-scratch test on mainnet with this patch

Copy link
Contributor

This pull request has conflicts, please resolve those before we can evaluate the pull request.

wjblanke
wjblanke previously approved these changes Nov 14, 2023
Copy link
Contributor

@wjblanke wjblanke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aok

…ious generators in test chains. Use a longer chain, with block generators, in the long_reorg test

add new test block chain fixture for reorgs with lighter weight blocks, this covers reorgs to greater block heights than the current peak
…n block, to exercise more parts of our validation logic in simulations
…nt_in_chain(), but return the whole sequence of block hashes
…e database, avoiding parsing and constructing a whole BlockRecord object for block traversed
…ss block validations (instead of re-computing them from scratch for every block)
@github-actions github-actions bot removed the merge_conflict Branch has conflicts that prevent merge to main label Nov 14, 2023
Copy link
Contributor

Conflicts have been resolved. A maintainer will review the pull request shortly.

Copy link
Contributor

Copy link
Contributor

@wjblanke wjblanke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aok

@arvidn arvidn added ready_to_merge Submitter and reviewers think this is ready and removed coverage-diff labels Nov 15, 2023
@cmmarslender cmmarslender merged commit 038f5e8 into main Nov 15, 2023
254 of 255 checks passed
@cmmarslender cmmarslender deleted the fix-deep-reorgs branch November 15, 2023 16:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Fixed Required label for PR that categorizes merge commit message as "Fixed" for changelog ready_to_merge Submitter and reviewers think this is ready
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants