
Shreds inserted by master causes panic when read by v1.6 #17136

Closed
ryoqun opened this issue May 10, 2021 · 15 comments
@ryoqun
Member

ryoqun commented May 10, 2021

Problem

Maybe, #16602 started to cause this problem? (CC: @steviez )

$ solana-ledger-tool --ledger bad-deploy-ledger-copy slot 77052081
[2021-05-10T06:40:35.791148527Z INFO  solana_ledger_tool] solana-ledger-tool 1.6.6 (src:fe775a97; feat:3714435735)
[2021-05-10T06:40:35.791215160Z INFO  solana_ledger::blockstore] Maximum open file descriptors: 500000
[2021-05-10T06:40:35.791223466Z INFO  solana_ledger::blockstore] Opening database at "/home/sol/bad-deploy-ledger-copy/rocksdb"
[2021-05-10T06:40:35.815238897Z INFO  solana_ledger::blockstore] "/home/sol/bad-deploy-ledger-copy/rocksdb" open took 24ms
Slot 77052081
thread 'blockstore_0' panicked at 'assertion failed: payload.len() >= expected_data_size', ledger/src/shred.rs:279:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'blockstore_1' panicked at 'assertion failed: payload.len() >= expected_data_size', ledger/src/shred.rs:279:9
thread 'blockstore_2' panicked at 'assertion failed: payload.len() >= expected_data_size', ledger/src/shred.rs:279:9
thread 'blockstore_3' panicked at 'assertion failed: payload.len() >= expected_data_size', ledger/src/shred.rs:279:9
thread 'blockstore_4' panicked at 'assertion failed: payload.len() >= expected_data_size', ledger/src/shred.rs:279:9
thread 'blockstore_5' panicked at 'assertion failed: payload.len() >= expected_data_size', ledger/src/shred.rs:279:9
thread 'blockstore_6' panicked at 'assertion failed: payload.len() >= expected_data_size', ledger/src/shred.rs:279:9
thread 'blockstore_7' panicked at 'assertion failed: payload.len() >= expected_data_size', ledger/src/shred.rs:279:9

Proposed Solution

Support reading old format shreds from rocksdb?

@ryoqun
Member Author

ryoqun commented May 10, 2021

@mvines Hi, I think this is a minor blocker for the testnet v1.7 rollout. It effectively makes an emergency downgrade impossible. We definitely want to keep that recovery measure for the eventual v1.6 => v1.7 transition on mainnet-beta.

@steviez
Contributor

steviez commented May 10, 2021

@ryoqun - To confirm my understanding of this issue: was the node here running master (inserting shreds into the blockstore), and then switched to running v1.6 (i.e. the invocation of solana-ledger-tool --ledger bad-deploy-ledger-copy slot 77052081)?

@ryoqun
Member Author

ryoqun commented May 10, 2021

@steviez yeah, that's correct; the two invocations are using the same ledger. Also, note that solana-validator v1.6 should be affected as well, just like solana-ledger-tool....

@steviez
Contributor

steviez commented May 10, 2021

@ryoqun - Thanks for the confirmation. Comparing the branches, this is expected behavior (a scenario I'll admit I didn't really consider at the time of that PR) given what is currently in v1.6. Thinking this through, there are 4 possible scenarios:
a) shred inserted by master, shred retrieved by master
b) shred inserted by v1.6, shred retrieved by v1.6
c) shred inserted by master, shred retrieved by v1.6
d) shred inserted by v1.6, shred retrieved by master

a) and b) are obviously not an issue since each branch is compatible with itself.
c) is the issue we hit here; master inserted a shred without the zero padding, v1.6 retrieved the shred (and didn't zero-pad it back out as master does), and the assertion failed because the payload was too short.
d) is not an issue; master will zero-pad shreds out to SHRED_PAYLOAD_SIZE when retrieving from the blockstore so if the shred was inserted by v1.6, the shred will still have the padding and the extend will just be a no-op.
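The four scenarios come down to whether the reader re-pads the payload before any size check runs. A minimal Rust sketch of that difference (the constant value and function names here are illustrative assumptions, not the actual shred.rs definitions):

```rust
// Illustrative full shred size; the real constant lives in ledger/src/shred.rs.
const SHRED_PAYLOAD_SIZE: usize = 1228;

// v1.6-style check: mirrors the assertion at ledger/src/shred.rs:279, which
// requires the stored payload to be at least the expected size.
fn payload_ok_v1_6(payload: &[u8], expected_data_size: usize) -> bool {
    payload.len() >= expected_data_size
}

// master-style fetch: zero-pad the stored payload back out to full size
// before any size checks, so shreds trimmed on insert still pass (scenario c).
fn repad_on_fetch(mut payload: Vec<u8>) -> Vec<u8> {
    payload.resize(SHRED_PAYLOAD_SIZE, 0);
    payload
}

fn main() {
    let trimmed = vec![1u8; 1000]; // inserted by master without zero padding
    // v1.6 reads the trimmed payload directly, so the assertion fails:
    assert!(!payload_ok_v1_6(&trimmed, SHRED_PAYLOAD_SIZE));
    // master re-pads on fetch, so the same check passes:
    assert!(payload_ok_v1_6(&repad_on_fetch(trimmed), SHRED_PAYLOAD_SIZE));
}
```

This is why scenario d) is harmless: for an already full-size v1.6-inserted shred, the `resize` on fetch is a no-op.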

To resolve c), I believe backporting the commit to v1.6 would solve the issue: bc31378

@steviez
Contributor

steviez commented May 10, 2021

To resolve c), I believe backporting the commit to v1.6 would solve the issue: bc31378

This is obviously something I'd like to test out first though, preparing a branch now

@mvines
Member

mvines commented May 10, 2021

To resolve c), I believe backporting the commit to v1.6 would solve the issue: bc31378

This is obviously something I'd like to test out first though, preparing a branch now

Would backporting cause the tip of v1.6 to no longer be compatible with the latest v1.6.8 release?

@steviez
Contributor

steviez commented May 10, 2021

Would backporting cause the tip of v1.6 to no longer be compatible with the latest v1.6.8 release?

Good point - a straight cherry-pick of the commit would reintroduce the issue Ryo initially described, this time between the tip of v1.6 and the v1.6.8 release. Rather, I think I can be a little more surgical and cherry-pick part of the commit such that it stays backwards compatible.

@mvines
Member

mvines commented May 10, 2021

Why cherry-pick at all then? If we can make the tip of v1.6 compatible with v1.6.8, then we can do the same with the tip of master.

@steviez
Contributor

steviez commented May 10, 2021

Why cherry-pick at all then? If we can make the tip of v1.6 compatible with v1.6.8, then we can do the same with the tip of master.

master has this change to strip the zero-padding from shreds before insertion into the blockstore, and to re-zero-pad on fetch (the padding is necessary for erasure decoding); this prevents us from bloating the ledger with a bunch of 0's. This is a compatibility break with v1.6 because v1.6 doesn't have the logic to zero-pad the shreds on fetch.

Making the tip of master compatible with v1.6 (tip or v1.6.8) would require master to leave the zero-padding on the shred on insertion, which would defeat the purpose of the space saving measure
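As a sketch of the trade-off being described (helper names and the size constant are hypothetical; the real trim/pad logic lives in the blockstore insert and fetch paths):

```rust
// Illustrative full shred size, stand-in for SHRED_PAYLOAD_SIZE in shred.rs.
const SHRED_PAYLOAD_SIZE: usize = 1228;

// Insert path on master: drop the trailing zero padding so rocksdb only
// stores the meaningful bytes instead of a bunch of 0's.
fn strip_padding_on_insert(payload: &[u8], data_size: usize) -> Vec<u8> {
    payload[..data_size.min(payload.len())].to_vec()
}

// Fetch path on master: restore the padding that erasure decoding needs.
// For a full-size v1.6-inserted shred this resize is a no-op.
fn pad_on_fetch(mut stored: Vec<u8>) -> Vec<u8> {
    stored.resize(SHRED_PAYLOAD_SIZE, 0);
    stored
}

fn main() {
    // master round-trip: trim on insert, restore on fetch.
    let full = vec![7u8; SHRED_PAYLOAD_SIZE];
    let stored = strip_padding_on_insert(&full, 900);
    assert_eq!(stored.len(), 900); // ~300 fewer zero bytes written to disk
    assert_eq!(pad_on_fetch(stored).len(), SHRED_PAYLOAD_SIZE);

    // v1.6-inserted shred read by master: already full size, resize is a no-op.
    assert_eq!(pad_on_fetch(full.clone()).len(), full.len());
}
```

v1.6's fetch path has no equivalent of `pad_on_fetch`, which is exactly the break: it hands the trimmed payload straight to code that asserts the full size.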

@mvines
Member

mvines commented May 10, 2021

Ok cool. Goal: master <-> latest release on mainnet are always update/downgrade compatible, and interoperable at runtime.

@steviez
Contributor

steviez commented May 10, 2021

Ok cool. Goal: master <-> latest release on mainnet are always update/downgrade compatible, and interoperable at runtime.

Understood. For a change like this where there is an explicit break, what would the proper protocol be? Maybe push an "auxiliary commit" that would provide upgrade compatibility from latest release branch (v1.6 in this case) to master, wait for a release in that release branch (ie v1.6.9), and then push the intended change into master?

@mvines
Member

mvines commented May 10, 2021

Oh sorry, yeah in this case it's not a runtime feature at all.

So yes, "auxiliary commit" that would provide upgrade compatibility. Then wait for that to roll out to mainnet, and then land the other half on master

@steviez
Contributor

steviez commented May 10, 2021

Oh sorry, yeah in this case it's not a runtime feature at all.

No worries, I wasn't aware of FeatureSet, so it's a good thing to learn about in case I need it later.

So yes, "auxiliary commit" that would provide upgrade compatibility. Then wait for that to roll out to mainnet, and then land the other half on master

Got it - noted for the future

@steviez
Contributor

steviez commented May 14, 2021

#17147 fixes the compatibility issue

@steviez closed this as completed May 14, 2021

3 participants