
Add Continuity Check Script to celo-migrate #282

Open
wants to merge 12 commits into celo10

Conversation

alecps commented Dec 9, 2024

Adds a standalone command for checking db continuity without performing a migration.

Unit tests are added for the CheckContinuity function.

The continuity script can be run in "fail-fast" mode to quickly check whether the db is corrupted, or run normally to print out every gap detected in the data.
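
A minimal sketch of the fail-fast vs. full-scan behaviour described above (illustrative only; the element and gap types and the function names are hypothetical stand-ins, not the actual celo-migrate API):

package continuity

// element and gap are hypothetical stand-ins for the block data and
// gap-report types used by the script.
type element struct{ number uint64 }

type gap struct{ prev, next uint64 }

// follows reports whether next comes exactly one block after prev.
func follows(prev, next element) bool { return next.number == prev.number+1 }

// scan walks the sequence in order. In fail-fast mode it stops at the first
// gap; otherwise it records every gap and reports them all at the end.
func scan(elems []element, failFast bool) []gap {
	var gaps []gap
	for i := 1; i < len(elems); i++ {
		if !follows(elems[i-1], elems[i]) {
			g := gap{prev: elems[i-1].number, next: elems[i].number}
			if failFast {
				return []gap{g}
			}
			gaps = append(gaps, g)
		}
	}
	return gaps
}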

Drive by changes:

  • Adds fixes for a couple minor indexing errors
  • Typos
  • Fixes a couple error checks (fixing these types of error checks was part of the audit feedback, but I found a couple that were missed)

Fixes https://github.com/celo-org/celo-blockchain-planning/issues/832

Note: This PR originally added the continuity check logic to the actual migration code path (in addition to having a standalone script), but in an effort to reduce unnecessary changes to the migration logic those changes were reverted and moved to #297. This means that the migration script itself will not detect any gaps in the data being migrated, so the standalone continuity script should be run on the source db before running the migration script.

palango left a comment

Looks good, but I wonder if we should have a separate command to check a datadir before the migration, independent of the pre-migration. What do you think?

op-chain-ops/cmd/celo-migrate/ancients.go (outdated, resolved)
op-chain-ops/cmd/celo-migrate/ancients.go (outdated, resolved)
op-chain-ops/cmd/celo-migrate/non-ancients.go (outdated, resolved)
@alecps changed the title from "check for gaps in block numbers and throw if found during migration" to "Check for gaps in block numbers and throw if found during migration" Dec 10, 2024
@alecps force-pushed the alecps/migrationBlockGaps branch from 6842c02 to 8c18e28 on December 10, 2024 20:07
alecps (author) commented Dec 10, 2024

@palango Interesting idea. I need to think about that a bit more. My first reaction is that it would add an extra step + time to the migration process. I'm also trying to work out if it's even possible to have gaps in ancient blocks, as the ancient db is append only. Having a separate script could be a quick way to bring extra peace of mind. Will give this some more thought tomorrow.

palango commented Dec 11, 2024

> @palango Interesting idea. I need to think about that a bit more. My first reaction is that it would add an extra step + time to the migration process. I'm also trying to work out if it's even possible to have gaps in ancient blocks, as the ancient db is append only. Having a separate script could be a quick way to bring extra peace of mind. Will give this some more thought tomorrow.

Not a requirement, but it might be too late for a full resync if people run the pre-migration shortly before the scheduled block.

piersy commented Dec 11, 2024

My feeling is that a separate command for this is unnecessary; if we want people to check this earlier, then we should just message them to run the pre-migration earlier.

@alecps marked this pull request as draft December 12, 2024 20:22
alecps (author) commented Dec 12, 2024

I've observed some strange behavior that I want to look into more. It seems there may be an earlier gap that this branch does not detect, which is preventing blocks from freezing beyond block 5,002,000. This branch only throws on a later gap that starts at block 5,835,918. It's possible this is a partial gap, where only receipt data is missing. Will re-open this for review when I've gotten to the bottom of what's going on.

@alecps force-pushed the alecps/migrationBlockGaps branch from 3bb956f to 66c7b86 on December 18, 2024 03:06
@alecps marked this pull request as ready for review December 18, 2024 03:07
@alecps requested review from piersy and palango December 18, 2024 03:07
alecps (author) commented Dec 18, 2024

I added checks for all the necessary data (receipts, headers, total difficulty, etc.) for both ancient and non-ancient blocks. Previously, the script only checked for gaps in headers. I haven't been able to test with gaps in ancients because I haven't been able to get a node to freeze blocks beyond the first gap. It's unclear whether it's possible to actually have a gap in the ancient data.

alecps (author) commented Dec 18, 2024

@palango After thinking about it more, I do actually think it would be nice to have a command for testing data integrity. Though it isn't strictly necessary, it might be appreciated by anyone who runs into data corruption issues and wants to quickly check whether the database they're using is okay before trying a second time. We could even consider recommending that people run it first as a precaution, but that might not be necessary. I think it might make sense to add in a follow up PR so that this one doesn't get too big. What do you think?

palango commented Dec 18, 2024

> @palango After thinking about it more, I do actually think it would be nice to have a command for testing data integrity. Though it isn't strictly necessary, it might be appreciated by anyone who runs into data corruption issues and wants to quickly check whether the database they're using is okay before trying a second time. We could even consider recommending that people run it first as a precaution, but that might not be necessary. I think it might make sense to add in a follow up PR so that this one doesn't get too big. What do you think?

Totally agree, this can be added separately. And maybe this doesn't need to be coupled to the migration at all, even though it would make sense to reuse code if possible.

@alecps alecps marked this pull request as draft December 19, 2024 22:40
github-actions bot commented Jan 4, 2025

This PR is stale because it has been open 14 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions bot added the Stale label Jan 4, 2025
@palango removed the Stale label Jan 6, 2025
@alecps requested a review from piersy January 17, 2025 22:52
Comment on lines 43 to 45
bodies: [][]byte{[]byte("body0"), []byte("body1"), []byte("body2"), []byte("body3")},
receipts: [][]byte{[]byte("receipt0"), []byte("receipt1"), []byte("receipt2"), []byte("receipt3")},
tds: [][]byte{[]byte("td0"), []byte("td1"), []byte("td2"), []byte("td3")},

Adding all these fields here is verbose and confusing because it makes it look like the test is checking some of these values when it is not.

I suggest using a function like makeRange(start int, hashes, encodedHeaders [][]byte) that ensures that all other fields have the correct length

alecps (author) commented Jan 23, 2025

Got it. Let me know if what I have now looks better to you. I used a makeRange helper to make the tests more concise, but since many of the tests are specifically checking for differences in the lengths of the various other fields, I'm still passing slices into the helper for all the fields (not just hashes and encodedHeaders). This ended up looking easier to read, in my opinion, than passing the desired lengths as int parameters. I can't see how the signature you suggested above, makeRange(start int, hashes, encodedHeaders [][]byte), could support the test cases where we vary the lengths of the various fields. Let me know if you had something else in mind!
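
A rough sketch of the helper shape being discussed (the struct and field names are illustrative stand-ins, not the exact code in the PR):

// rlpBlockRange is an illustrative stand-in for the block range struct used in the tests.
type rlpBlockRange struct {
	start                                  uint64
	hashes, headers, bodies, receipts, tds [][]byte
}

// makeRange builds a test range; callers pass every slice explicitly so that
// individual tests can deliberately mismatch the slice lengths.
func makeRange(start uint64, hashes, headers, bodies, receipts, tds [][]byte) rlpBlockRange {
	return rlpBlockRange{start: start, hashes: hashes, headers: headers, bodies: bodies, receipts: receipts, tds: tds}
}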

piersy commented Jan 22, 2025

Hey @alecps, I see what you're trying to do here, but this has become quite huge: 861 lines to check that each block follows its parent.

The db continuity check runs pretty fast for me (1 minute for alfajores), so I would be in favour of reverting the changes to the pre-existing migration code and just adding the db check as a separate command. This gives us the benefit of not having to modify the audited migration code. I don't want to unilaterally decide this though, so please chime in @karlb & @palango.

@alecps force-pushed the alecps/migrationBlockGaps branch 2 times, most recently from 1715446 to 6aaf1e3 on January 22, 2025 23:14
@alecps changed the title from "Check for gaps in block numbers and throw if found during migration" to "Add Continuity Check Script to celo-migrate" Jan 22, 2025
@@ -216,7 +216,7 @@ func writeAncientBlocks(ctx context.Context, freezer *rawdb.Freezer, in <-chan R
return fmt.Errorf("failed to write block range: %w", err)
}
blockRangeEnd := blockRange.start + uint64(len(blockRange.hashes)) - 1
log.Info("Wrote ancient blocks", "start", blockRange.start, "end", blockRangeEnd, "count", len(blockRange.hashes), "remaining", totalAncientBlocks-blockRangeEnd)
log.Info("Wrote ancient blocks", "start", blockRange.start, "end", blockRangeEnd, "count", len(blockRange.hashes), "remaining", totalAncientBlocks-(blockRangeEnd+1))
alecps (author):

This indexing error was missed by the audit
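
To illustrate with made-up numbers: if totalAncientBlocks is 100 and the range just written ends at blockRangeEnd = 49 (i.e. blocks 0 through 49 have been written), then 50 blocks remain, which is 100 - (49 + 1); the previous expression would have logged 51.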

@@ -61,7 +92,7 @@ func openDB(chaindataPath string, readOnly bool) (ethdb.Database, error) {

// Opens a database without access to AncientsDb
func openDBWithoutFreezer(chaindataPath string, readOnly bool) (ethdb.Database, error) {
if _, err := os.Stat(chaindataPath); errors.Is(err, os.ErrNotExist) {
if _, err := os.Stat(chaindataPath); err != nil {
alecps (author) commented Jan 22, 2025

Fixing this type of error check was part of the audit feedback. I fixed several others in this PR, but somehow missed this one.
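
(For context, a general Go note rather than anything stated in the PR: os.Stat can fail for reasons other than the path not existing, such as permission errors, so matching only os.ErrNotExist would silently ignore those failures.)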

alecps (author) commented Jan 23, 2025

@piersy Okay, I've reverted all the changes to the migration code path. What remains is just the code for the continuity script, the unit tests for CheckContinuity, a fix for an indexing error, and some other super minor drive-by changes.

I've moved the rest of the changes that weave the continuity checks into the migration logic to alecps/addContinuityCheckToMigration

@alecps force-pushed the alecps/migrationBlockGaps branch from a10e683 to 1e8c9cc on January 23, 2025 19:25
for i := startBlock; i < endBlock; i += batchSize {
count := min(batchSize, endBlock-i+1)
count := min(batchSize, endBlock-i)
alecps (author) commented Jan 23, 2025

The audit missed this indexing error. It should be endBlock - i because endBlock is not inclusive. This didn't cause any problems before because the freezer.AncientRange function returns at most count elements, and this indexing error could only have caused a problem when reading the last section of the ancients (when the number of remaining blocks is less than batchSize).
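
To illustrate with made-up numbers: if startBlock is 0, endBlock is 25000 (exclusive) and batchSize is 10000, the final iteration starts at i = 20000; the correct count is endBlock - i = 5000, whereas endBlock - i + 1 would request 5001 blocks.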

var err error

blockRange.hashes, err = freezer.AncientRange(rawdb.ChainFreezerHashTable, start, count, 0)
blockRange, err := loadAncientRange(freezer, start, count)
alecps (author):

@piersy This is the only refactor to the migration code path from the audited code that I haven't reverted. It's straightforward enough that I thought it might be okay to keep, since we need to use loadAncientRange in the continuity script. Happy to revert it though if you think that's best

@alecps requested a review from piersy January 23, 2025 21:23
// Follows checks if the current block has a number one greater than the previous block
// and if the parent hash of the current block matches the hash of the previous block.
func (e *RLPBlockElement) Follows(prev *RLPBlockElement) (err error) {
if e.Header().Number.Uint64() != prev.Header().Number.Uint64()+1 {
palango commented Jan 24, 2025

There are 12 instances of .Header().Number.Uint64(); maybe worth putting this in a small function for legibility.
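
For example, something like the following (an illustrative sketch of the suggestion, not code from the PR):

// number returns the block number of this element, wrapping the repeated
// Header().Number.Uint64() call.
func (e *RLPBlockElement) number() uint64 {
	return e.Header().Number.Uint64()
}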
