Tools that help recover corrupted ledger store file #26813

Closed
yhchiang-sol opened this issue Jul 27, 2022 · 4 comments

@yhchiang-sol
Contributor

Problem

Solana ledger store uses rocksdb as its underlying storage. In some rare cases such as hardware failure,
rocksdb might report data corruption or a block checksum mismatch on one of its sst files. Below is one
example error log from #9009:

[2020-03-22T10:22:48.766837593Z ERROR solana_core::window_service] thread 
Some("solana-window-insert") error BlockstoreError(RocksDb(Error { 
message: "Corruption: block checksum mismatch: expected 3583270445, got 3398136873  
in /mnt/vol1/ledger/rocksdb/165855.sst offset 25107936 size 3758" }))

When this happens, the validator will not be able to continue even if all other sst files are still readable and healthy.
Currently, a clean restart might be the only way to recover.

Proposed Solution

A set of tools that provide a way to recover the corrupted file would be a better solution than a clean restart
as it allows the validator to recover without losing its local data.

The key idea of recovering the corrupted sst file is to first obtain the column family information and key range
of the corrupted file, assuming its metadata blocks are still healthy. Then, based on the column family name and
the key range, we can copy the data within that range from a healthy validator and replace the corrupted file.
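For reference, a minimal sketch of that first step, assuming the rust-rocksdb crate (which exposes rocksdb_livefiles() as DB::live_files()) and that the damaged store can still be opened read-only; the path is a placeholder and this is not the #26790 implementation:

```rust
// Minimal sketch: list every live sst file with its column family and key range.
// Assumes the rust-rocksdb crate and that the corrupted store still opens read-only.
use rocksdb::{Options, DB};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Placeholder path; point this at <ledger>/rocksdb.
    let path = "/mnt/vol1/ledger/rocksdb";

    // Discover the existing column families so the read-only open succeeds.
    let cf_names = DB::list_cf(&Options::default(), path)?;
    let db = DB::open_cf_for_read_only(&Options::default(), path, &cf_names, false)?;

    // live_files() wraps rocksdb_livefiles(): one entry per live sst file,
    // including its column family name and the smallest/largest key recorded
    // in the file's metadata block.
    for f in db.live_files()? {
        println!(
            "file={} cf={} level={} entries={} start_key={:?} end_key={:?}",
            f.name, f.column_family_name, f.level, f.num_entries, f.start_key, f.end_key
        );
    }
    Ok(())
}
```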

Here are possible solutions:

ledger-tool based solution

  • Obtain the column family and key range information of the corrupted sst file via the rocksdb_livefiles() API (sketched above). #26790 (Add ledger tool command print-file-metadata) includes a sketch implementation.
  • Convert the key range to a slot range (a sketch of this conversion follows the list).
  • On a healthy validator, use the copy command of the ledger-tool to copy the data from the above key range.
  • Do a full compaction on the above output ledger store. After this, each column family will contain only one sst file.
  • Replace the corrupted sst file using the sst file in the corresponding column family from the above output ledger store.
  • TO DISCUSS: is it okay to directly copy data to an empty ledger store?
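A minimal sketch of the key-to-slot conversion from the second bullet, under the assumption that the affected column's keys begin with the slot encoded as a big-endian u64 (true for most blockstore columns; columns keyed differently would need their own decoding):

```rust
// Sketch: map an sst file's key range to a slot range.
// Assumption: the column's key starts with the slot as a big-endian u64.
fn key_to_slot(key: &[u8]) -> Option<u64> {
    let bytes: [u8; 8] = key.get(..8)?.try_into().ok()?;
    Some(u64::from_be_bytes(bytes))
}

fn key_range_to_slot_range(start_key: &[u8], end_key: &[u8]) -> Option<(u64, u64)> {
    Some((key_to_slot(start_key)?, key_to_slot(end_key)?))
}
```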

lower level solution

  • Obtain the column family and key range information of the corrupted sst file via rocksdb_livefiles just like the previous solution.
  • On a healthy validator, create a new rocksdb database using the column family options of the above column family.
  • Copy all the data within the key range in the target column family from the healthy validator to the newly created rocksdb instance.
  • Do a full compaction of the output rocksdb so that it results in a single sst file (see the copy-and-compact sketch after this list).
  • Replace the corrupted sst file using the sst file in the corresponding column family from the above output ledger store.
  • TO DISCUSS: this solution might run faster, but the user might need to pick a healthy validator on the same fork in order to keep the data consistent.
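A rough sketch of the copy-and-compact steps above, assuming the rust-rocksdb crate; cf_name, start_key, and end_key are assumed to come from the live_files() step, and a real tool would reuse the blockstore's column family options rather than the defaults used here:

```rust
// Rough sketch of the lower-level copy-and-compact flow, assuming rust-rocksdb.
use rocksdb::{Direction, IteratorMode, Options, DB};

fn copy_range_and_compact(
    src_path: &str,   // healthy validator's <ledger>/rocksdb
    dst_path: &str,   // scratch db that will hold the replacement data
    cf_name: &str,
    start_key: &[u8],
    end_key: &[u8],
) -> Result<(), Box<dyn std::error::Error>> {
    // Open the healthy blockstore read-only.
    let src_cfs = DB::list_cf(&Options::default(), src_path)?;
    let src = DB::open_cf_for_read_only(&Options::default(), src_path, &src_cfs, false)?;
    let src_cf = src.cf_handle(cf_name).ok_or("missing column family")?;

    // Create a scratch db containing just the affected column family.
    let mut opts = Options::default();
    opts.create_if_missing(true);
    opts.create_missing_column_families(true);
    let dst = DB::open_cf(&opts, dst_path, [cf_name])?;
    let dst_cf = dst.cf_handle(cf_name).ok_or("missing column family")?;

    // Copy every entry in [start_key, end_key].
    for item in src.iterator_cf(src_cf, IteratorMode::From(start_key, Direction::Forward)) {
        let (key, value) = item?;
        if key.as_ref() > end_key {
            break;
        }
        dst.put_cf(dst_cf, &key, &value)?;
    }

    // Full compaction so the column family ends up with as few sst files as
    // possible (ideally one), which can then replace the corrupted file.
    dst.compact_range_cf(dst_cf, None::<&[u8]>, None::<&[u8]>);
    Ok(())
}
```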

Another solution might be introducing a new RPC call to obtain data within the slot range, but I feel RPC calls are designed for solving real-time tasks and are less suitable for offline recovery tools.

@yhchiang-sol
Contributor Author

To simply reduce the downtime while keeping the validator in a consistent state, another option is to purge all data up to the ending slot of the corrupted file's slot range:

  • Obtain the column family and slot range information of the corrupted sst file via rocksdb_livefiles just like the previous solution.
  • From the above slot range, purge all data that is no later than the ending slot of the corrupted file. This can be done by implementing a new ledger-tool command that essentially calls delete_files_in_range(), where the range is 0 to the ending slot of the corrupted file (see the sketch after this list). The implementation would be similar to #26651 (Delete files older than the lowest_cleanup_slot in LedgerCleanupService::cleanup_ledger).
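A minimal sketch of such a command's core, assuming rust-rocksdb's delete_file_in_range_cf binding (which wraps DeleteFilesInRange) and the big-endian slot key prefix from the earlier sketch; columns not keyed by slot would need separate handling:

```rust
// Minimal sketch of the purge command's core. Whole sst files fully contained
// in the range are unlinked, which should include the corrupted one; data in
// files straddling the range boundary is left in place.
use rocksdb::{Options, DB};

fn purge_files_through_slot(
    ledger_rocksdb: &str,
    ending_slot: u64,
) -> Result<(), Box<dyn std::error::Error>> {
    let cf_names = DB::list_cf(&Options::default(), ledger_rocksdb)?;
    let db = DB::open_cf(&Options::default(), ledger_rocksdb, &cf_names)?;

    // Keys strictly below this bound cover slots 0..=ending_slot.
    let upper = (ending_slot + 1).to_be_bytes();

    for name in &cf_names {
        let cf = db.cf_handle(name).ok_or("missing column family")?;
        db.delete_file_in_range_cf(cf, [0u8; 8], upper)?;
    }
    Ok(())
}
```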

@steviez
Contributor

steviez commented Aug 31, 2022

The key idea of recovering the corrupted sst file is to first obtain the column family information and key range
of the corrupted file if its metadata blocks are still healthy

  • Supposing the metadata blocks are not healthy, do we have any recourse aside from wiping ledger? I assume manually deleting the SST file would be risky.
  • If the metadata blocks are still alright, we're able to recover most of the SST file right? That is, we can tell which blocks are corrupted and only need to discard those while keeping the healthy ones? Or, is the entire SST file lost? I haven't looked at SST file format in depth in a little while so might need to refresh myself

I saw rocksdb has some repair functions, and it looks like at least one of them is hooked up in the Rust wrapper too:
https://github.com/facebook/rocksdb/blob/e7525a1fffd0def3cc4c804e0c6070f7dae0d06a/include/rocksdb/db.h#L1818-L1840
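For what it's worth, a minimal sketch of calling that repair path, assuming the rust-rocksdb DB::repair binding; RepairDB is best-effort and may discard data it cannot salvage, so it would have to be run on a copy of the ledger:

```rust
// Minimal sketch, assuming the rust-rocksdb DB::repair binding (which wraps
// rocksdb_repair_db / RepairDB). RepairDB rebuilds the db from whatever sst
// files it can still read; unrecoverable data may be dropped.
use rocksdb::{Options, DB};

fn try_repair(ledger_rocksdb: &str) -> Result<(), rocksdb::Error> {
    DB::repair(&Options::default(), ledger_rocksdb)
}
```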

@steviez
Contributor

steviez commented Aug 31, 2022

  • On a healthy validator, use the copy command of the ledger-tool to copy the data from the above key range.
  • Do a full compaction on the above output ledger store. After this, each column family will contain only one sst file.
  • Replace the corrupted sst file using the sst file in the corresponding column family from the above output ledger store.

Both approaches you mentioned seem to involve a large amount of manual work. Moreover, they require an operator to readily have access to a good version of that slot, either through direct access to additional node(s) or through the community. The community in general is really helpful, but again, manual intervention isn't great.

Assuming we can reliably determine the range of a corrupted block, an idea that comes to mind would be to wipe that range from the ledger altogether (i.e. via purge to erase across all column families; a sketch follows the bullets below).

  • If the corrupted range is older than the most recent snapshot, we'll continue on anyways
    • This doesn't cover RPC / warehouse nodes that want deep history
  • If the corrupted range is newer than the most recent snapshot, repair requests could fill those slots, same as for validators that were offline and missed turbine blast
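A rough sketch of that purge, assuming rust-rocksdb's delete_range_cf and the big-endian slot key prefix from the earlier sketches; a production tool would more likely go through the blockstore's existing purge code, which knows each column's actual key layout:

```rust
// Rough sketch of purging a slot range across all column families.
// Assumption: keys start with the slot as a big-endian u64, which does not
// hold for every blockstore column.
use rocksdb::{Options, DB};

fn wipe_slot_range(
    ledger_rocksdb: &str,
    start_slot: u64,
    end_slot: u64,
) -> Result<(), Box<dyn std::error::Error>> {
    let cf_names = DB::list_cf(&Options::default(), ledger_rocksdb)?;
    let db = DB::open_cf(&Options::default(), ledger_rocksdb, &cf_names)?;

    let from = start_slot.to_be_bytes();
    let to = (end_slot + 1).to_be_bytes(); // delete_range_cf's upper bound is exclusive

    for name in &cf_names {
        let cf = db.cf_handle(name).ok_or("missing column family")?;
        db.delete_range_cf(cf, from, to)?;
    }
    Ok(())
}
```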

@steviez
Contributor

steviez commented Aug 31, 2022

One more note, our recovery process might vary depending on the CF. For example, if we have the shreds, we should be able to reconstruct the metadata fields (SlotMeta, Index, etc.) on the same machine by re-inserting them. There could also be some column families we don't care about (I think ProgramCosts would be fine to wipe) as well as some that could be more problematic for RPC use cases that maintain long history (i.e. TransactionStatus).

@github-actions github-actions bot added the stale label on Sep 4, 2023
@github-actions github-actions bot closed this as not planned on Sep 11, 2023