Failed block sync might leave some sync data in the main chain database #4866

SWvheerden · 2022-10-28T12:09:35Z

Currently, syncing is handled by the BN state machine. If the node thinks it's behind, it will try header_sync to download headers.
If it has enough VALID headers, it will commit them as headers on the main chain, and continue to do so for as long as it is provided VALID headers.
If this fails, the node goes to the waiting state.
It should not; it should either try to sync the VALID blocks from a node, or delete the headers and move on.
But in most cases this will return Ok.

The block sync has the same problem: after it fails, we need to check whether we need to reset back to our previous state.
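To make the proposal concrete, here is a minimal sketch of the decision the state machine could take after a sync attempt. All names (`SyncOutcome`, `NextAction`, `ChainView`, `after_sync`) are hypothetical and not taken from the Tari codebase, and "headers ahead of full blocks" is only an assumed stand-in for "partial sync data was left behind".

```rust
/// Hypothetical outcome of a header or block sync attempt.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SyncOutcome {
    Completed,
    Failed,
}

/// Hypothetical follow-up action for the state machine.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum NextAction {
    /// Nothing to clean up; go back to listening for new blocks.
    Listen,
    /// Partial sync data is present; restore a consistent chain first.
    ResetThenListen,
}

/// Simplified view of the local chain (an assumption for this sketch):
/// the height of the last stored header and of the last full block.
#[derive(Debug, Clone, Copy)]
struct ChainView {
    header_tip_height: u64,
    block_tip_height: u64,
}

/// Decide what to do once a sync attempt has finished.
fn after_sync(outcome: SyncOutcome, chain: ChainView) -> NextAction {
    match outcome {
        SyncOutcome::Completed => NextAction::Listen,
        // A failed sync that left headers without blocks needs a reset.
        SyncOutcome::Failed if chain.header_tip_height > chain.block_tip_height => {
            NextAction::ResetThenListen
        }
        SyncOutcome::Failed => NextAction::Listen,
    }
}
```

The point of the sketch is only that a failed sync does not go straight back to listening when the database still holds partial sync data.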


SWvheerden · 2022-10-28T12:10:41Z

Rewind will store all the re-orged blocks in the orphan database.
We should just choose the highest VALID tip and make sure the main chain is that one.
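As an illustration of "choose the highest VALID tip", the sketch below picks the valid candidate with the most accumulated proof of work. `TipCandidate` and `best_valid_tip` are hypothetical names, and ranking by a single `accumulated_pow` number is an assumption made for the example, not a description of how the base node compares chains.

```rust
/// Hypothetical summary of a chain tip the node has data for.
#[derive(Debug, Clone, PartialEq, Eq)]
struct TipCandidate {
    hash: String,
    height: u64,
    /// Assumed single-number measure of accumulated proof of work.
    accumulated_pow: u128,
    /// Whether the full chain up to this tip has been validated.
    is_valid: bool,
}

/// Return the valid candidate with the highest accumulated PoW, if any.
fn best_valid_tip(candidates: &[TipCandidate]) -> Option<&TipCandidate> {
    candidates
        .iter()
        .filter(|c| c.is_valid)
        .max_by_key(|c| c.accumulated_pow)
}
```

An invalid candidate is never selected, even if it claims more work than every valid one.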

sdbondi · 2022-10-28T12:30:01Z

Initial reaction is to disagree. The chain metadata keeps track of the valid full block height. We already handle having a chain of valid headers but no/partial blocks. If we reset on an invalid header, we'll have to download valid headers all over again. Same for blocks. If a synced header/block is invalid after syncing others, the blocks that led up to them are still valid so why should we throw them away?

~~I don't think the rewinding puts the removed blocks back in the orphan database, only updates the orphan tip hash after a reorg, which doesn't seem right.~~ Ah, I was mistaken; this happens inside rewind_to_height

tari/base_layer/core/src/chain_storage/blockchain_database.rs

Line 2004 in 09eda1b

if let Some(block) = removed_blocks.first() {

SWvheerden · 2022-10-28T12:34:32Z

> I don't think the rewinding puts the removed blocks back in the orphan database, only updates the orphan tip hash after a reorg, which doesn't seem right.

It happens in rewind, not reorganise

tari/base_layer/core/src/chain_storage/blockchain_database.rs

Line 1725 in 09eda1b

txn.insert_chained_orphan(block.clone());
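The behaviour described here (blocks removed by a rewind are kept as orphans rather than discarded) can be sketched with toy in-memory maps. This is not the code at the referenced line; `ToyBlock` and `toy_rewind_to_height` are hypothetical, and plain maps stand in for the real database transaction.

```rust
use std::collections::{BTreeMap, HashMap};

/// Hypothetical, heavily simplified block.
#[derive(Debug, Clone, PartialEq, Eq)]
struct ToyBlock {
    height: u64,
    hash: String,
}

/// Remove every main-chain block above `target` and store it in the
/// orphan pool, keyed by hash. Returns the number of blocks moved.
fn toy_rewind_to_height(
    main_chain: &mut BTreeMap<u64, ToyBlock>,
    orphans: &mut HashMap<String, ToyBlock>,
    target: u64,
) -> usize {
    // `split_off` returns all entries with a key >= the given key.
    let removed = match target.checked_add(1) {
        Some(first_removed) => main_chain.split_off(&first_removed),
        None => BTreeMap::new(),
    };
    let moved = removed.len();
    for (_, block) in removed {
        orphans.insert(block.hash.clone(), block);
    }
    moved
}
```

Because the removed blocks stay available in the orphan pool, the node can later re-select whichever tip is best without downloading them again.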

SWvheerden · 2022-10-28T12:39:29Z

> Initial reaction is to disagree. The chain metadata keeps track of the valid full block height. We already handle having a chain of valid headers but no/partial blocks. If we reset on an invalid header, we'll have to download valid headers all over again. Same for blocks. If a synced header/block is invalid after syncing others, the blocks that led up to them are still valid so why should we throw them away?

The header sync does this correctly: it only "swaps" if the syncing chain has higher PoW and is valid. It will even do that for a partially synced chain, which is correct.

But going into a waiting state can cause issues, because the database now has incomplete state while the node is accepting new blocks.
In LMDB, orphan blocks are keyed by hash, but in the main chain, headers are keyed by height.

So what can happen? A header sync attempts to sync a few blocks, but fails on a connection error. Now the BN goes back to the waiting state, meaning it can accept blocks again. It then accepts new blocks for those same heights...
And we have a broken state.
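The scenario can be reproduced with a toy model of the two indexes. Everything here is hypothetical (`ToyDb`, `InsertError`, the method names); the maps only mimic "orphans keyed by hash" and "main-chain headers keyed by height", and the error variant is made up for the example.

```rust
use std::collections::{BTreeMap, HashMap};

/// Toy error for inserting at a height that is already occupied.
#[derive(Debug, PartialEq, Eq)]
enum InsertError {
    KeyExists { height: u64 },
}

/// Toy database with the two indexing schemes described above.
#[derive(Debug, Default)]
struct ToyDb {
    /// Main chain: height -> header hash (one entry per height).
    main_headers: BTreeMap<u64, String>,
    /// Orphan pool: block hash -> height (many hashes may share a height).
    orphans: HashMap<String, u64>,
}

impl ToyDb {
    /// Insert a main-chain header; fails if the height is already taken.
    fn insert_main_header(&mut self, height: u64, hash: &str) -> Result<(), InsertError> {
        if self.main_headers.contains_key(&height) {
            return Err(InsertError::KeyExists { height });
        }
        self.main_headers.insert(height, hash.to_string());
        Ok(())
    }

    /// Insert an orphan; distinct hashes never collide, whatever the height.
    fn insert_orphan(&mut self, height: u64, hash: &str) {
        self.orphans.insert(hash.to_string(), height);
    }

    fn orphan_count(&self) -> usize {
        self.orphans.len()
    }
}
```

If a failed header sync leaves a header at height 101 and a propagated block for height 101 then arrives, the orphan pool accepts it, but adding it to the main chain collides with the leftover header.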

SWvheerden · 2022-10-28T12:40:31Z

The same problem applies to the block sync. We might download 1000 block headers, and all are valid.
But if the first block for those headers is invalid, that chain will always be invalid, and we need to revert back to our old chain.

sdbondi · 2022-10-28T12:45:48Z

When block propagation is active again and receives a block, it should recognise the block as an orphan, and not base anything on the headers. This does fit with the error we were seeing, so there could definitely be a bug there.

Agree that for block sync we need to remove invalid headers up to and after a failed block (MMRs are invalid). Maybe we need to restore the old state, but since we only change for a higher PoW I don't think that is critical.
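A minimal sketch of that cleanup, assuming headers are held in a height-ordered map and that the height of the failed block is known. `drop_headers_from` is a hypothetical helper, not an existing function.

```rust
use std::collections::BTreeMap;

/// Drop every stored header at or above `failed_height`, keeping the
/// headers whose blocks were already validated. Returns how many
/// headers were removed.
fn drop_headers_from(headers: &mut BTreeMap<u64, String>, failed_height: u64) -> usize {
    let before = headers.len();
    headers.retain(|height, _| *height < failed_height);
    before - headers.len()
}
```

Headers below the failed block are left alone, which matches the earlier objection that data already validated should not be thrown away.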

stringhandler · 2022-10-31T12:50:12Z

I don't really understand the problem or solution, but happy if you want to leave this issue open and make a PR against it, showing the problem

Description
---
If sync fails, resets the chain to the highest-PoW chain for which the node locally has the data.

Motivation and Context
---
See: #4866

How Has This Been Tested?
---
Unit tests

SWvheerden added this to Tari Esme Testnet Oct 28, 2022

SWvheerden moved this to Must Do in Tari Esme Testnet Oct 28, 2022

SWvheerden mentioned this issue Oct 28, 2022

KeyExists error "deleted_txo_mmr_position_to_height_index" #4833

Closed

SWvheerden mentioned this issue Oct 28, 2022

Base_node sync, should remove blocks from orphan db it syncs and not leave them in the orphan db #4867

Closed

stringhandler added the P-controversial Process - This PR or Issue is controversial and/or requires more attention that simpler issues label Oct 31, 2022

stringhandler added C-bug Category - fixes a bug, typically associated with an issue. A-base_node Area - The Tari base node executable and libraries labels Oct 31, 2022

stringhandler added this to the Stagenet Freeze milestone Oct 31, 2022

SWvheerden self-assigned this Nov 14, 2022

SWvheerden moved this from Must Do to In Progress in Tari Esme Testnet Nov 14, 2022

SWvheerden moved this from In Progress to In Review in Tari Esme Testnet Nov 17, 2022

SWvheerden mentioned this issue Nov 24, 2022

feat: reset broken sync #4955

Merged

stringhandler pushed a commit that referenced this issue Nov 28, 2022

feat: reset broken sync (#4955)

01e9e7e


stringhandler moved this from In Review to Done in Tari Esme Testnet Nov 30, 2022

stringhandler closed this as completed Nov 30, 2022
