Tools that help recover corrupted ledger store file #26813
Comments
To reduce downtime while keeping the validator in a consistent state, another option is to purge all data at or after the slot range of the corrupted file.
I saw that rocksdb has some repair functions, and it looks like at least one of them is hooked up in the Rust wrapper too.
Both approaches you mentioned seem to involve a large amount of manual work. Moreover, they require an operator to readily have access to a good version of that slot, either through direct access to additional node(s) or through the community. The community in general is really helpful, but again, manual intervention isn't great. Assuming we can reliably determine the range of a corrupted block, an idea that comes to mind would be to wipe that range from the ledger altogether (i.e., via purge to erase across all column families).
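To make the purge idea concrete, here is a minimal sketch that models each column family as an in-memory `BTreeMap` keyed by slot. The type alias `ColumnFamily`, the function `purge_slot_range`, and the names `cf_a`/`cf_b` are hypothetical and exist only for illustration; this is not the blockstore's actual purge API.

```rust
use std::collections::BTreeMap;

/// Toy stand-in for one column family: slot -> opaque bytes (assumption).
type ColumnFamily = BTreeMap<u64, Vec<u8>>;

/// Removes every entry whose slot lies in `from_slot..=to_slot` from all
/// column families and returns how many entries were dropped in total.
fn purge_slot_range(
    column_families: &mut BTreeMap<String, ColumnFamily>,
    from_slot: u64,
    to_slot: u64,
) -> usize {
    let mut removed = 0;
    for cf in column_families.values_mut() {
        let before = cf.len();
        // Keep only the entries outside the corrupted slot range.
        cf.retain(|slot, _| *slot < from_slot || *slot > to_slot);
        removed += before - cf.len();
    }
    removed
}

fn main() {
    let mut ledger: BTreeMap<String, ColumnFamily> = BTreeMap::new();
    for name in ["cf_a", "cf_b"] {
        let cf: ColumnFamily = (100u64..110).map(|slot| (slot, vec![0u8])).collect();
        ledger.insert(name.to_string(), cf);
    }
    // Suppose the corrupted file was found to cover slots 103..=105.
    let removed = purge_slot_range(&mut ledger, 103, 105);
    println!("removed {removed} entries");
}
```

Passing `u64::MAX` as `to_slot` gives the other variant discussed in this thread, where everything from the first corrupted slot onward is dropped.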
One more note: our recovery process might vary depending on the CF. For example, if we have the shreds, we should be able to reconstruct the metadata fields (SlotMeta, Index, etc.) on the same machine by re-inserting. There could also be some column families we don't care about (I think ProgramCosts would be fine to wipe), as well as some that could be more problematic for RPC use cases that maintain long history (e.g., TransactionStatus).
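The per-CF distinction could be captured as a small policy table. The sketch below is only an illustration: the `Recovery` enum and `recovery_for` function are hypothetical, the strings are the type names used in this comment rather than the on-disk column family names, and defaulting unknown columns to a copy from a healthy node is an assumption.

```rust
/// Possible ways to restore a damaged column family (hypothetical).
#[derive(Debug, PartialEq, Eq)]
enum Recovery {
    /// Metadata that can be regenerated locally by re-inserting shreds.
    RebuildFromShreds,
    /// Data the validator can simply drop.
    Wipe,
    /// Data that must be fetched from a node that still has it.
    CopyFromHealthyNode,
}

/// Maps a column name to a recovery strategy, following the comment above.
fn recovery_for(column: &str) -> Recovery {
    match column {
        "SlotMeta" | "Index" => Recovery::RebuildFromShreds,
        "ProgramCosts" => Recovery::Wipe,
        // Long-history columns such as TransactionStatus, and anything
        // unknown, fall back to the most conservative option (assumption).
        _ => Recovery::CopyFromHealthyNode,
    }
}

fn main() {
    for column in ["SlotMeta", "Index", "ProgramCosts", "TransactionStatus"] {
        println!("{column}: {:?}", recovery_for(column));
    }
}
```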
Problem
The Solana ledger store uses rocksdb as its underlying storage. In some rare cases, such as hardware failure, rocksdb might report data corruption or a block checksum mismatch on one of its sst files. See #9009 for one example error log.
When this happens, the validator will not be able to continue even if all other sst files are still readable and healthy.
Currently, a clean restart might be the only way to recover.
Proposed Solution
A set of tools that provide a way to recover the corrupted file would be a better solution than a clean restart
as it allows the validator to recover without losing its local data.
The key idea of recovering the corrupted sst file is to first obtain the column family information and key range
of the corrupted file, provided its metadata blocks are still healthy. Then, based on the column family name and the
key range, we can copy the data within that range from a healthy validator and replace the corrupted file.
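As a rough sketch of the first step, the slot range could be derived from the file's smallest and largest keys. This assumes a column family whose keys begin with the slot as a big-endian `u64` (for example a `(slot, index)` key); columns with a different key layout would need their own decoding. The helpers `slot_prefix`, `slot_range`, and `make_key` are hypothetical names, not existing ledger-tool functions.

```rust
/// Reads the leading 8 bytes of a key as a big-endian slot number.
/// Returns `None` if the key is shorter than 8 bytes.
fn slot_prefix(key: &[u8]) -> Option<u64> {
    let bytes: [u8; 8] = key.get(..8)?.try_into().ok()?;
    Some(u64::from_be_bytes(bytes))
}

/// Derives the inclusive slot range covered by a file from its smallest and
/// largest keys, assuming slot-prefixed keys (see the note above).
fn slot_range(smallest_key: &[u8], largest_key: &[u8]) -> Option<(u64, u64)> {
    let first = slot_prefix(smallest_key)?;
    let last = slot_prefix(largest_key)?;
    if first <= last {
        Some((first, last))
    } else {
        None
    }
}

/// Builds an illustrative `(slot, index)` key in big-endian byte order.
fn make_key(slot: u64, index: u64) -> Vec<u8> {
    [slot.to_be_bytes(), index.to_be_bytes()].concat()
}

fn main() {
    // Key bounds as they might be read from the file's metadata.
    let smallest = make_key(42, 0);
    let largest = make_key(57, 9);
    match slot_range(&smallest, &largest) {
        Some((first, last)) => println!("copy slots {first}..={last} from a healthy validator"),
        None => println!("could not derive a slot range from the key bounds"),
    }
}
```

Because big-endian encoding preserves numeric order under byte-wise comparison, the smallest and largest keys bound every slot stored in the file.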
Here are possible solutions:
ledger-tool based solution
lower level solution
Another solution might be introducing a new RPC call to obtain data within the slot range, but I feel RPC calls are designed for solving real-time tasks and are less suitable for offline recovery tools.