Add Continuity Check Script to celo-migrate #282
base: celo10
Conversation
Looks good, but I wonder if we should have a separate command to check a datadir before the migration, independent of the pre-migration. What do you think?
Force-pushed 6842c02 to 8c18e28
@palango Interesting idea. I need to think about that a bit more. My first reaction is that it would add an extra step + time to the migration process. I'm also trying to work out if it's even possible to have gaps in ancient blocks, as the ancient db is append only. Having a separate script could be a quick way to bring extra peace of mind. Will give this some more thought tomorrow
Not a requirement, but it might be too late for a full resync if people run the pre-migration shortly before the scheduled block.
My feeling is that a separate command for this is unnecessary; if we want people to check this earlier then we should just message them to run the pre-migration earlier.
I've observed some strange behavior that I want to look into more. It seems there may be an earlier gap, not detected by this branch, that is preventing blocks from freezing beyond a certain block.
Force-pushed 3bb956f to 66c7b86
I added checks for all the necessary data (receipts, headers, total difficulty, etc.) for both ancient and non-ancient blocks. The script before only checked for gaps in headers. I haven't been able to test with gaps in ancients because I haven't been able to get a node to freeze blocks beyond the first gap. It's unclear whether it's possible to actually have a gap in the ancient data.
@palango After thinking about it more, I do actually think it would be nice to have a command for testing data integrity. Though it isn't strictly necessary, it might be appreciated by anyone who runs into data corruption issues and wants to quickly check whether the database they're using is okay before trying a second time. We could even consider recommending that people run it first as a precaution, but that might not be necessary. I think it might make sense to add it in a follow-up PR so that this one doesn't get too big. What do you think?
Totally agree, this can be added separately. And maybe this doesn't need to be coupled to the migration at all, even though it would make sense to reuse code if possible.
bodies:   [][]byte{[]byte("body0"), []byte("body1"), []byte("body2"), []byte("body3")},
receipts: [][]byte{[]byte("receipt0"), []byte("receipt1"), []byte("receipt2"), []byte("receipt3")},
tds:      [][]byte{[]byte("td0"), []byte("td1"), []byte("td2"), []byte("td3")},
Adding all these fields here is verbose and confusing because it makes it look like the test is checking some of these values when it is not. I suggest using a function like makeRange(start int, hashes, encodedHeaders [][]byte) that ensures all the other fields have the correct length.
Got it. Let me know if what I have now looks better to you. I used a makeRange helper to make the tests more concise, but since many of the tests are specifically checking for differences in the lengths of the various other fields, I'm still passing slices into the helper for all the fields (not just hashes and encodedHeaders). This ended up being easier to read, in my opinion, than passing the desired lengths as int parameters. I can't see how the suggested signature makeRange(start int, hashes, encodedHeaders [][]byte) could support the test cases where we vary the lengths of the various fields. Let me know if you had something else in mind!
Hey @alecps, I see what you're trying to do here, but this has become quite huge: 861 lines to check that each block follows its parent. The db continuity check runs pretty fast for me (1 minute for alfajores), so I would be in favour of reverting the changes to the pre-existing migration code and just adding the db check as a separate command. This gives us the benefit of not having to modify the audited migration code. I don't want to decide this unilaterally though, so please chime in @karlb & @palango
Force-pushed 1715446 to 6aaf1e3
@@ -216,7 +216,7 @@ func writeAncientBlocks(ctx context.Context, freezer *rawdb.Freezer, in <-chan R
 		return fmt.Errorf("failed to write block range: %w", err)
 	}
 	blockRangeEnd := blockRange.start + uint64(len(blockRange.hashes)) - 1
-	log.Info("Wrote ancient blocks", "start", blockRange.start, "end", blockRangeEnd, "count", len(blockRange.hashes), "remaining", totalAncientBlocks-blockRangeEnd)
+	log.Info("Wrote ancient blocks", "start", blockRange.start, "end", blockRangeEnd, "count", len(blockRange.hashes), "remaining", totalAncientBlocks-(blockRangeEnd+1))
This indexing error was missed by the audit
@@ -61,7 +92,7 @@ func openDB(chaindataPath string, readOnly bool) (ethdb.Database, error) {

 // Opens a database without access to AncientsDb
 func openDBWithoutFreezer(chaindataPath string, readOnly bool) (ethdb.Database, error) {
-	if _, err := os.Stat(chaindataPath); errors.Is(err, os.ErrNotExist) {
+	if _, err := os.Stat(chaindataPath); err != nil {
Fixing this type of error check was part of the audit feedback. I fixed several others in this PR, but somehow missed this one.
@piersy Okay I've reverted all the changes to the migration code path. What remains is just the code for the continuity script, the unit tests for CheckContinuity, a fix for an indexing error, and some other super minor drive-by changes. I've moved the rest of the changes that weave the continuity checks into the migration logic to alecps/addContinuityCheckToMigration |
Force-pushed fc28d38 to a10e683
Force-pushed a10e683 to 1e8c9cc
 	for i := startBlock; i < endBlock; i += batchSize {
-		count := min(batchSize, endBlock-i+1)
+		count := min(batchSize, endBlock-i)
The audit missed this indexing error. It should be endBlock - i because endBlock is not inclusive. This didn't cause any problems before because the freezer.AncientRange function returns at most count elements, so this error could only matter when reading the last section of the ancients (when the number of remaining blocks is less than batchSize).
-	var err error
-
-	blockRange.hashes, err = freezer.AncientRange(rawdb.ChainFreezerHashTable, start, count, 0)
+	blockRange, err := loadAncientRange(freezer, start, count)
@piersy This is the only refactor to the migration code path from the audited code that I haven't reverted. It's straightforward enough that I thought it might be okay to keep, since we need to use loadAncientRange in the continuity script. Happy to revert it though if you think that's best
// Follows checks if the current block has a number one greater than the previous block
// and if the parent hash of the current block matches the hash of the previous block.
func (e *RLPBlockElement) Follows(prev *RLPBlockElement) (err error) {
	if e.Header().Number.Uint64() != prev.Header().Number.Uint64()+1 {
There are 12 instances of .Header().Number.Uint64(); it may be worth putting this in a small function for legibility.
Adds a standalone command for checking db continuity without performing a migration.
Unit tests are added for the CheckContinuity function
The continuity script can be run in "fail-fast" mode to quickly check whether the db is corrupted, or run normally to print out every gap detected in the data.
Drive-by changes:
Fixes https://github.com/celo-org/celo-blockchain-planning/issues/832
Note: This PR originally added the continuity check logic to the actual migration code path (in addition to having a standalone script), but in an effort to reduce unnecessary changes to the migration logic those changes were reverted and moved to #297. This means that the migration script itself will not detect any gaps in the data being migrated, so the standalone continuity script should be run on the source db before running the migration script.