Fix deep reorgs #16594

arvidn · 2023-10-11T19:29:30Z

This PR should be reviewed one commit at a time.

Purpose:

The node is currently having a hard time with deep reorgs. The issue can be split in 2 parts:

Some parts of block validation require some previous blocks to be loaded in the block record cache in the blockchain object. When performing a reorg deeper than the cache size, those block records may not be in the cache and fail.
When validating coin spends on the forked chain, we re-run blocks in order to compute the additions and removals for the fork of the chain. This is done very inefficiently, causing the same blocks to be run many times.

When validating blocks on the fork of the chain (before accepting it and updating our peak) we need to be able to pick up block records (and generators) from blocks that are not in the main chain. This requires looking them up from the database. One case of interest, that has previously not been covered by a test, is when a block reference points into the forked chain (i.e. a block not in the main chain).

Current Behavior:

In a long/deep reorg, the node may fail with KeyError while looking up block records that are not in the cache.

New Behavior:

Long/deep reorgs work as expected.

Changes

These are all the commits in this PR.

test block chains, BlockTools updates

Update BlockTools to make reorg test blockchains use block references and transactions. These updated test chains have been put in the test-cache repo and released as 0.38.0.

add feature to BlockTools to allow including block references to previous generators in test chains. Use a longer chain, with block generators, in the long_reorg test add new test block chain fixture for reorgs with lighter weight blocks, this covers reorgs to greater block heights than the current peak
add feature to BlockTools to include transactions in every transaction block, to exercise more parts of our validation logic in simulations

fall back to pulling block records from DB

Allow block validation to pull blocks from the database, that currently only pull from the cache.

update block validation to pull block dependencies from the database
transition find_fork_point_in_chain() to be async and pull blocks from the database

Preserve fork state across blocks, when validating

Preserve the state of the fork across block validations. This is done in ForkInfo, by preserving all additions and removals of the fork. The "fix reorg performance" commit is the big one that passes around information about the fork that's currently being validated (if any).

introduce new function, lookup_fork_chain(), similar to find_fork_point_in_chain(), but return the whole sequence of block hashes
optimize find_fork_point_in_chain() to just look up prev_hash from the database, avoiding parsing and constructing a whole BlockRecord object for block traversed
fix reorg performance issue by preserving additions and removals across block validations (instead of re-computing them from scratch for every block)
update get_block_generator() to use new lookup_fork_chain()

Update and add new tests

Extend the existing long reorg test to be even longer (to be deeper than the block cache size).
Also add tests on the FullNode object including one with two nodes syncing via the network protocol.

update test_long_reorg to use the new, longer, test chains with block generators and block references
add new test_long_reorg operating on full node
update test with two nodes with a deep fork, where one syncs to the other
~~make the timeout for collecting 3 peaks configurable. Set it to 1 seconds in the simulation tests to avoid waiting on CI~~ (already addressed by Sync no farmer #16698 and make sure we set max_sync_wait to 0 in all tests #16771 )
bump test chains used on CI

Benchmark

I profiled the new reorg logic via test_long_reorg_nodes() and the Blockchain.py test_long_reorg() tests.

test_long_reorg_nodes()

test_long_reorg()

future improvements

get_block_generator()

There are still some significant performance improvements to be made to the get_block_generator() function. If the fork chain's height-to-hash map would be recorded in the ForkInfo object, it could avoid the (expensive) call to lookup_fork_chain() entirely.

_reconsider_peak()

This function pulls in all full blocks of the fork chain in order to apply them one at a time. At this point we have access to the ForkInfo object, so we could already know the block hashes to apply. Then we could load them one at a time from the database, and apply them one at a time.

It appears we mainly build the blocks_to_add up front to be able to reverse the order of blocks. However, since we also have all the additions and removals, we don't technically need to re-run the blocks either at this point. But we would need to record the height for removals in that case.

The loop to fetch the full blocks and block records from the DB is bottlenecked on the parsing, and DB access.

There's some work started to address this:
#16787
#16793

protocol error when syncing

This test: tests/core/full_node/test_full_node.py::test_long_reorg_nodes exposes an issue where a node attempts to fetch a block at a height that doesn't exist. This happens when a node is exposed to a heavier chain with a lower peak height. It seems like we try to request blocks at our peak height, rather than the peer's peak height.

This is visible in the log as:

22:22:05 chia.full_node.full_node: INFO 🌱 Updated peak to height 1499, weight 1957496, hh 132e925cd2b6a639e67c1afefedb307a796bcb67fed780a294e2e2ab8ba5feeb, forked at 1498, rh: 7440aff5006463202dbebaa0f6a07710b8601717ba6cb7983532c38fae99a6fb, total iters: 232005, overflow: True, deficit: 12, difficulty: 2400, sub slot iters: 8000, Generator size: 1287, Generator ref list size: 3
...
22:22:05 chia.full_node.full_node: INFO Starting batch short sync from 1593 to height 1499
22:22:05 chia.full_node.full_node: ERROR Error short batch syncing, could not fetch block at height 1593

This is addressed by: #16779

overlapping block requests

When syncing, it appears we make requests for overlapping block ranges. When this is printed to the log:

22:22:07 chia.full_node.full_node: INFO Added blocks 0 to 32
22:22:07 chia.full_node.full_node: INFO Added blocks 32 to 64
22:22:07 chia.full_node.full_node: INFO Added blocks 64 to 96
22:22:07 chia.full_node.full_node: INFO Added blocks 96 to 128

Blocks 32, 64, 96 are requested and added twice from the peer. This seems to be a mistake.

This is addressed by: #16792

throwing `get_peak()`

In the test we wait for all nodes to have the same peak, indicating they're synced. Sometimes get_peak() throws:

Traceback (most recent call last):
  File "/home/arvid/dev/3/chia-blockchain/tests/core/full_node/test_full_node.py", line 2195, in check_nodes_in_sync
    p2 = full_node_1.full_node.blockchain.get_peak()
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/arvid/dev/3/chia-blockchain/chia/consensus/blockchain.py", line 191, in get_peak
    return self.height_to_block_record(self._peak_height)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/arvid/dev/3/chia-blockchain/chia/consensus/blockchain.py", line 697, in height_to_block_record
    if header_hash is None:
        ^^^^^^^^^^^^^^^^^^^^
ValueError: Height is not in blockchain: 1599

This is addressed by: #16776

invalid weight proof in test

~~When enabling the shorter chains in tests/core/full_node/test_full_node.py::test_long_reorg_nodes, some combinations fail with the following error log:~~

22:41:49 chia.types.blockchain_format.proof_of_space: ERROR Calculated pos challenge doesn't match the provided one
22:41:49 chia.full_node.weight_proof: ERROR could not verify proof of space
22:41:49 chia.full_node.weight_proof: ERROR failed to validate sub_epoch 2 segment 0 slots
22:41:49 full_node_server: WARNING Trying to ban localhost for 600, but will not ban

~~followed by:~~

22:41:49 chia.full_node.full_node: ERROR Error with syncing: <class 'ValueError'>Traceback (most recent call last):
  File "/Users/arvid/Documents/dev/2/chia-blockchain/chia/full_node/full_node.py", line 975, in _sync
    fork_point, summaries = await self.request_validate_wp(
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/arvid/Documents/dev/2/chia-blockchain/chia/full_node/full_node.py", line 1040, in request_validate_wp
    raise ValueError("Weight proof validation failed")
ValueError: Weight proof validation failed

~~It would be good to understand why.~~

coveralls-official · 2023-10-12T10:10:43Z

Pull Request Test Coverage Report for Build 6866521658

Warning: This coverage report may be inaccurate.

We've detected an issue with your CI configuration that might affect the accuracy of this pull request's coverage report.
To ensure accuracy in future PRs, please see these guidelines.
A quick fix for this PR: rebase it; your next report should be accurate.

614 of 693 (88.6%) changed or added relevant lines in 19 files are covered.
51 unchanged lines in 11 files lost coverage.
Overall coverage decreased (-0.05%) to 90.22%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
tests/plot_sync/util.py	1	2	50.0%
chia/consensus/blockchain_interface.py	2	4	50.0%
chia/wallet/wallet_blockchain.py	9	11	81.82%
chia/consensus/blockchain.py	120	124	96.77%
chia/full_node/full_node.py	28	32	87.5%
chia/util/block_cache.py	2	7	28.57%
chia/simulator/block_tools.py	17	78	21.79%

Files with Coverage Reduction	New Missed Lines	%
chia/data_layer/dl_wallet_store.py	1	95.74%
chia/full_node/weight_proof.py	1	91.04%
chia/types/header_block.py	1	96.15%
tests/plotting/test_plot_manager.py	1	98.06%
chia/timelord/timelord.py	2	73.64%
chia/plotters/madmax.py	3	48.77%
tests/core/util/test_lockfile.py	5	88.66%
chia/full_node/full_node.py	7	84.54%
chia/server/address_manager.py	7	90.36%
chia/wallet/wallet_node.py	8	86.83%

Totals
Change from base Build 6865805050:	-0.05%
Covered Lines:	93389
Relevant Lines:	103455

💛 - Coveralls

arvidn · 2023-10-23T08:49:26Z

We have completed a sync-from-scratch test on mainnet with this patch

tests/blockchain/test_blockchain.py

tests/core/full_node/test_full_node.py

github-actions · 2023-11-14T14:01:40Z

This pull request has conflicts, please resolve those before we can evaluate the pull request.

wjblanke

aok

…ious generators in test chains. Use a longer chain, with block generators, in the long_reorg test add new test block chain fixture for reorgs with lighter weight blocks, this covers reorgs to greater block heights than the current peak

…n block, to exercise more parts of our validation logic in simulations

…m the database

…nt_in_chain(), but return the whole sequence of block hashes

…e database, avoiding parsing and constructing a whole BlockRecord object for block traversed

…ss block validations (instead of re-computing them from scratch for every block)

… generators and block references

…y, we don't get here

github-actions · 2023-11-14T16:24:39Z

Conflicts have been resolved. A maintainer will review the pull request shortly.

github-actions · 2023-11-14T19:52:17Z

File	Coverage	Missing Lines
`chia/consensus/blockchain.py`	96.8%	lines 300, 314, 317, 1143
`chia/consensus/blockchain_interface.py`	50.0%	lines 45, 68
`chia/full_node/full_node.py`	87.5%	lines 1126-1128, 1256
`chia/simulator/block_tools.py`	21.8%	lines 197-205, 207, 209, 622-624, 638-640, 642-646, 650-652, 787-791, 829-831, 833-834, 840-841, 896-901, 1069-1073, 1153-1155, 1157-1158, 1164-1165, 1224-1229
`chia/util/block_cache.py`	28.6%	lines 63, 81-84
`chia/wallet/wallet_blockchain.py`	81.8%	lines 199, 216
`tests/plot_sync/util.py`	50.0%	lines 32

Total	Missing	Coverage
693 lines	79 lines	88%

wjblanke

aok

arvidn added the Fixed Required label for PR that categorizes merge commit message as "Fixed" for changelog label Oct 11, 2023

arvidn requested review from wjblanke, almogdepaz and emlowe October 11, 2023 19:30

arvidn force-pushed the fix-deep-reorgs branch 3 times, most recently from ef40f13 to 1386e48 Compare October 12, 2023 00:30

github-actions bot added the coverage-diff label Oct 12, 2023

arvidn force-pushed the fix-deep-reorgs branch from 1386e48 to 4153352 Compare October 12, 2023 11:17

arvidn marked this pull request as ready for review October 12, 2023 14:04

arvidn requested a review from a team as a code owner October 12, 2023 14:04

github-actions bot added coverage-diff and removed coverage-diff labels Oct 12, 2023

arvidn force-pushed the fix-deep-reorgs branch from 92cc7c8 to 124c396 Compare October 15, 2023 19:45

github-actions bot added coverage-diff and removed coverage-diff labels Oct 15, 2023

arvidn force-pushed the fix-deep-reorgs branch from 124c396 to db8a294 Compare October 17, 2023 11:13

github-actions bot added coverage-diff and removed coverage-diff labels Oct 18, 2023

arvidn force-pushed the fix-deep-reorgs branch from 8925f25 to 20ca793 Compare October 18, 2023 20:01

arvidn changed the base branch from release/2.1.2 to main October 18, 2023 20:02

arvidn force-pushed the fix-deep-reorgs branch from 20ca793 to de1d389 Compare October 19, 2023 15:07

github-actions bot added coverage-diff and removed coverage-diff labels Oct 19, 2023

arvidn commented Oct 24, 2023

View reviewed changes

tests/blockchain/test_blockchain.py Show resolved Hide resolved

arvidn commented Oct 24, 2023

View reviewed changes

tests/blockchain/test_blockchain.py Show resolved Hide resolved

arvidn commented Oct 24, 2023

View reviewed changes

tests/core/full_node/test_full_node.py Show resolved Hide resolved

arvidn commented Oct 24, 2023

View reviewed changes

tests/core/full_node/test_full_node.py Show resolved Hide resolved

wjblanke previously approved these changes Nov 14, 2023

View reviewed changes

arvidn added 14 commits November 14, 2023 17:23

add feature to BlockTools to include transactions in every transactio…

9713563

…n block, to exercise more parts of our validation logic in simulations

update block validation to pull block dependencies from the database

27f1ef9

transition find_fork_point_in_chain() to be async and pull blocks fro…

0c299c1

…m the database

introduce new function, lookup_fork_chain(), similar to find_fork_poi…

43b5fff

…nt_in_chain(), but return the whole sequence of block hashes

optimize find_fork_point_in_chain() to just look up prev_hash from th…

e42f77c

…e database, avoiding parsing and constructing a whole BlockRecord object for block traversed

fix reorg performance issue by preserving additions and removals acro…

d9f0a0f

…ss block validations (instead of re-computing them from scratch for every block)

update get_block_generator() to use new lookup_fork_chain()

482d1ba

update test_long_reorg to use the new, longer, test chains with block…

d72c030

… generators and block references

add new test_long_reorg operating on full node

9ec2252

add test with two nodes with a deep fork, where one syncs to the other

e44ab03

log when entering the slow-path of validating orphaned blocks. Ideall…

10fc08e

…y, we don't get here

add comments

8c5a374

bump test chains used on CI

e277a4f

arvidn dismissed wjblanke’s stale review via e277a4f November 14, 2023 16:24

arvidn force-pushed the fix-deep-reorgs branch from 8dae620 to e277a4f Compare November 14, 2023 16:24

github-actions bot removed the merge_conflict Branch has conflicts that prevent merge to main label Nov 14, 2023

github-actions bot removed the coverage-diff label Nov 14, 2023

github-actions bot added the coverage-diff label Nov 14, 2023

arvidn requested a review from wjblanke November 14, 2023 20:48

wjblanke approved these changes Nov 15, 2023

View reviewed changes

arvidn added ready_to_merge Submitter and reviewers think this is ready and removed coverage-diff labels Nov 15, 2023

cmmarslender merged commit 038f5e8 into main Nov 15, 2023
254 of 255 checks passed

cmmarslender deleted the fix-deep-reorgs branch November 15, 2023 16:51

arvidn mentioned this pull request Oct 18, 2024

[CHIA-1566] pipeline block validation in sync_from_fork_point() #18703

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix deep reorgs #16594

Fix deep reorgs #16594

arvidn commented Oct 11, 2023 •

edited

Loading

coveralls-official bot commented Oct 12, 2023 •

edited

Loading

arvidn commented Oct 23, 2023

github-actions bot commented Nov 14, 2023

wjblanke left a comment

github-actions bot commented Nov 14, 2023

github-actions bot commented Nov 14, 2023

wjblanke left a comment

Fix deep reorgs #16594

Fix deep reorgs #16594

Conversation

arvidn commented Oct 11, 2023 • edited Loading

Purpose:

Current Behavior:

New Behavior:

Changes

test block chains, BlockTools updates

fall back to pulling block records from DB

Preserve fork state across blocks, when validating

Update and add new tests

Benchmark

test_long_reorg_nodes()

test_long_reorg()

future improvements

get_block_generator()

_reconsider_peak()

protocol error when syncing

overlapping block requests

throwing get_peak()

invalid weight proof in test

coveralls-official bot commented Oct 12, 2023 • edited Loading

Pull Request Test Coverage Report for Build 6866521658

Warning: This coverage report may be inaccurate.

💛 - Coveralls

arvidn commented Oct 23, 2023

github-actions bot commented Nov 14, 2023

wjblanke left a comment

Choose a reason for hiding this comment

github-actions bot commented Nov 14, 2023

github-actions bot commented Nov 14, 2023

wjblanke left a comment

Choose a reason for hiding this comment

arvidn commented Oct 11, 2023 •

edited

Loading

throwing `get_peak()`

coveralls-official bot commented Oct 12, 2023 •

edited

Loading