Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Construct the initial bank state from the snapshot/ and accounts file… #28745

Closed
wants to merge 36 commits into from

Conversation

xiangzhu70
Copy link
Contributor

@xiangzhu70 xiangzhu70 commented Nov 4, 2022

Construct the initial bank state from the bank snapshot in the snapshot/, instead of unarchiving from an archive.

If not, it falls back to using the archives.

Problem

The old way is to from the snapshot archive. Decompressing and unarchiving takes 90s+. This is a big chunk
of the validator startup time.
This new way directly uses the snapshot/ directory, shorten the time to 5s.

Summary of Changes

Build the snapshot// directory to contain a complete state for a bank snapshot. Previously it contains only the bank snapshot file. Now it has the status_cache and the hard-links of all matching accounts/ appendvec files.
With the new argument --snapshot-from-files, when starting up, the validator finds the highest bank snapshot directory, and construct the bank from it. For precaution, if it fails, it will fall back to the original unarchiving route. It really should not fail. Once we have verified this as a solid route, we can remove this argument and make this route as the default.

Testing done:

On a devserver, run without --snapshot-from-file
snapshot untar took 92.4s
with it,
full snapshot deserialize took 5.5s

./target/release/solana-validator --identity ~/validator-keypair.json --vote-account ~/vote-account-keypair.json --rpc-port 8899 --entrypoint entrypoint.devnet.solana.com:8001 --limit-ledger-size --log ~/solana-validator.log --no-poh-speed-test --full-snapshot-interval-slots 200 --maximum-full-snapshots-to-retain 5 --snapshot-from-file

Fixes #

@xiangzhu70 xiangzhu70 force-pushed the bank_from_files branch 2 times, most recently from 9df0000 to 55312c2 Compare November 21, 2022 16:09
@xiangzhu70 xiangzhu70 marked this pull request as ready for review November 21, 2022 19:35
@brooksprumo
Copy link
Contributor

@xiangzhu70 Can you fill out the PR description some more please? Both the Problem section, to provide some context (can point to a GH issue as well), and the Summary of Changes section to get a high level of what has changed. Thanks!

Copy link
Contributor

@apfitzge apfitzge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got through a decent chunk of this, but still need look over the meat of snapshot_utils and bank_fork_utils.

validator/src/main.rs Outdated Show resolved Hide resolved
ledger-tool/src/main.rs Outdated Show resolved Hide resolved
runtime/src/append_vec.rs Outdated Show resolved Hide resolved
runtime/src/bank.rs Outdated Show resolved Hide resolved
runtime/src/bank.rs Outdated Show resolved Hide resolved
runtime/src/snapshot_config.rs Show resolved Hide resolved
runtime/src/snapshot_utils.rs Show resolved Hide resolved
runtime/src/snapshot_utils.rs Outdated Show resolved Hide resolved
runtime/src/snapshot_utils.rs Outdated Show resolved Hide resolved
runtime/src/snapshot_utils.rs Outdated Show resolved Hide resolved
@apfitzge apfitzge self-requested a review November 21, 2022 22:23
@@ -199,11 +206,9 @@ fn bank_forks_from_snapshot(
process::exit(1);
}

let (deserialized_bank, full_snapshot_archive_info, incremental_snapshot_archive_info) =
snapshot_utils::bank_from_latest_snapshot_archives(
let file_ret = if snapshot_config.snapshot_from == SnapshotFrom::File {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these blocks can be cleaned up - it's odd to have several maps from Ok(good_ret) => Some(good_ret) and makes it slightly more confusing as what's happening here, i.e. if SnapshotFrom::File try loading from files, fallback to loading archives, expect we have a valid bank at the end, regardless of path.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here the flow is:
if the snapshot_config.snapshot_from is not File, then the first stage result is none.
otherwise, try the file route, which can return valid result or none.

For both the above, if the result is none, do the archive way.

Not sure how I can express this better in Rust.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could do something like:
(snapshot_config.snapshot_from == SnapshotFrom::File).map(|| ...).map_err(|err| warn!(...) err).ok().unwrap_or(|| ...);

map_err is kinda a hack here though, we'd really want inspect_err which isn't stable yet, unfortunately

@@ -2049,6 +2587,22 @@ pub fn verify_snapshot_archive<P, Q, R>(
assert!(crate::serde_snapshot::compare_two_serialized_banks(&p1, &p2).unwrap());
std::fs::remove_file(p1).unwrap();
std::fs::remove_file(p2).unwrap();

// The new the status_cache file is inside the slot directory together with the snapshot file.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we move the file, won't that break it if we run this verify multiple times? Since it won't bethere the second time.

@apfitzge
Copy link
Contributor

Could you post in the description how much faster loading via files is? I'd assume the speed up is pretty substantial.

@apfitzge apfitzge self-requested a review November 29, 2022 17:14
@xiangzhu70
Copy link
Contributor Author

updated th

Now I have updated the PR summary. Thanks!

@xiangzhu70 xiangzhu70 closed this Dec 9, 2022
@xiangzhu70 xiangzhu70 reopened this Dec 9, 2022
@xiangzhu70 xiangzhu70 force-pushed the bank_from_files branch 2 times, most recently from e3f1a59 to 8db5357 Compare December 12, 2022 17:39
Copy link
Contributor

@apfitzge apfitzge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have a few concerns I commented, but overall we're getting close on this I think.

runtime/src/snapshot_utils/snapshot_storage_rebuilder.rs Outdated Show resolved Hide resolved
runtime/src/snapshot_utils/snapshot_storage_rebuilder.rs Outdated Show resolved Hide resolved
core/src/validator.rs Show resolved Hide resolved
ledger/src/bank_forks_utils.rs Outdated Show resolved Hide resolved
runtime/src/accounts_background_service.rs Outdated Show resolved Hide resolved
runtime/src/snapshot_utils.rs Outdated Show resolved Hide resolved
runtime/src/snapshot_utils.rs Show resolved Hide resolved
validator_config.snapshot_config.snapshot_from == Some(SnapshotFrom::File);

if !snapshot_from_file {
download_snapshots(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be out of scope for this PR. If we fail our load from files, we try from the existing snapshot archives, but we don't download new ones. Is that correct?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should consider checking the current snapshot slot. If it's too old, then we download and use the downloaded archive regardless of snapshot_from_file?

validator/src/main.rs Outdated Show resolved Hide resolved
ledger-tool/src/main.rs Outdated Show resolved Hide resolved
xiangzhu70 and others added 2 commits December 14, 2022 21:05
…s, instead

of the archives.

Some additional checking is done to ensure the files are in a consistent state.
If not, it falls back to using the archives.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stale [bot only] Added to stale content; results in auto-close after a week.
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

3 participants