-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
AHV processes the snapshot dirs in place #30978
AHV processes the snapshot dirs in place #30978
Conversation
core/tests/snapshots.rs
Outdated
@@ -471,17 +468,6 @@ fn test_concurrent_snapshot_packaging( | |||
} | |||
} | |||
|
|||
// Purge all the outdated snapshots, including the ones needed to generate the package |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The purge function no longer purges the PRE-stage bank snapshot dirs. So this test is no longer valid. Removed it. Purging has been tested in other tests. No need to test it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not in this PR any more.
Codecov Report
@@ Coverage Diff @@
## master #30978 +/- ##
=========================================
- Coverage 81.5% 81.5% -0.1%
=========================================
Files 734 734
Lines 207153 207157 +4
=========================================
- Hits 168954 168939 -15
- Misses 38199 38218 +19 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some initial comments on my first pass. I will have to go over this change again, and will do so soon
let storage_entries: Vec<_> = (0..=slot) | ||
.map(|i| Arc::new(AccountStorageEntry::new(&accounts_dir, 0, i as u32, 10))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i
is an AppendVecId
, the slot is 0 for all these entries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. Reverted the change here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not in this PR any more.
// Create one fake snapshot. | ||
let snapshots_paths: Vec<_> = (slot..=slot) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
setting up the dir, then just having let snapshot_paths = vec![snapshot_path];
seems much more clear than this single-iteration loop
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not in this PR any more.
runtime/src/snapshot_utils.rs
Outdated
do_purge(get_bank_snapshots_pre(&bank_snapshots_dir)); | ||
// Only purge the POST snapshot dirs. | ||
// Now that the account packages use the snapshot dir in place, The AHV thread will process the PRE snapshot dirs | ||
// and turn them into POST snapshot dirs, so they will be purged eventually, and there will be no leaking. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
during development if there is an issue in the PRE -> POST transition time, then do the PRE files ever get cleaned up?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What can this issue be? This does have the assumption that AHV keeps running so it sooner or later advances states of the snapshot dirs from the PRE state into the POST state. Are you saying this assumption could be broken?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AHV won't keep running if the validator crashes/shutdown.
If these PRE snapshots are never cleaned up except by the PRE -> POST transition, then my directory could just keep growing, right?
Maybe I'm missing something where these are initially removed at start up or something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! There could be left-over uncleaned PRE snapshot dirs.
The highest POST snapshot dir will be used to construct a bank, so the validator starts from that bank. From then on, it loops to create new PRE snapshot dirs, which are sent to AHV to be converted to POST snapshot dirs.
The problem is with the old PRE snapshot dirs which are higher than the highest POST dirs. add_bank_snapshot has a check to remove any existing directory before setting up a new PRE dir. But the new PRE slot may not be the same as the old ones. In that case, the old PRE snapshot dirs are left in the system, never got cleaned up.
So, after we boot from a POST snapshot dir, we should clean up all the existing old PRE dirs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
after we boot from a POST snapshot dir, we should clean up all the existing old PRE dirs.
After the PR to add the function bank_from_latest_snapshot_dir is committed, I will add the PRE dir cleanup logic in that function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After #31191 is committed, will use that to do type selective purging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -394,42 +392,56 @@ mod tests { | |||
fs::create_dir_all(&incremental_snapshot_archives_dir).unwrap(); | |||
|
|||
fs::create_dir_all(&accounts_dir).unwrap(); | |||
|
|||
let slot: Slot = 4; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- It's idiomatic in Rust to elide type declarations that can be inferred by the compiler.
- Please tie this value to the storage entries below to ensure they stay in sync. So the line below can be:
let storage_entries: Vec<_> = (0..=slot)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not in this PR any more.
let mut fake_snapshot_file = OpenOptions::new() | ||
.read(true) | ||
.write(true) | ||
.create(true) | ||
.open(&fake_snapshot_path) | ||
.unwrap(); | ||
fake_snapshot_file.write_all(b"Hello, world!").unwrap(); | ||
let fake_status_cache = snapshot_dir.join(SNAPSHOT_STATUS_CACHE_FILENAME); | ||
let mut fake_status_cache_file = OpenOptions::new() | ||
.read(true) | ||
.write(true) | ||
.create(true) | ||
.open(fake_status_cache) | ||
.unwrap(); | ||
fake_status_cache_file | ||
.write_all(b"Hello, status cache!") | ||
.unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realize it was using OpenOptions
before, but this seems equivalent and a lot less cluttered.
let mut fake_snapshot_file = File::create(&fake_snapshot_path).unwrap();
fake_snapshot_file.write_all(b"Hello, world!").unwrap();
let fake_status_cache = snapshot_dir.join(SNAPSHOT_STATUS_CACHE_FILENAME);
let mut fake_status_cache_file = File::create(&fake_status_cache).unwrap();
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless I am missing something that makes this different.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not in this PR any more.
hey @xiangzhu70, could you have a description of the "life of a snapshot". Brooks is probably more familiar than I am, and I'm struggling to understand the reason the change is necessary. Just like snapshot is created -> pre, when does this get transitioned to post, when do they get deleted. |
Sure. Here is the existing code: ABS does the following in order: AHV: SPS: NOTE that the snapshot type being POST only happens in the tmp dir, used for archiving. The original snapshot dir always has the PRE type, never updated, never has the hash calculated. == Below is the change in this PR: |
runtime/src/snapshot_package.rs
Outdated
@@ -55,7 +50,6 @@ impl AccountsPackage { | |||
package_type: AccountsPackageType, | |||
bank: &Bank, | |||
bank_snapshot_info: &BankSnapshotInfo, | |||
bank_snapshots_dir: impl AsRef<Path>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like BankSnapshotInfo
has had snapshot_dir
since #29409.
Seems like removing this parameter could be split out from the other parts of this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To split this parameter out, then I need to derive bank_snapshots_dir from bank_snapshot_info.snapshot_dir.parent(), and handle the parent being None error case by defining a new error. But this PR is removing the need of bank_snapshots_dir. So I am not sure if it is worth to get bank_snapshots_dir error handling to work first and then remove it.
Another way of splitting is to commit this PR first with bank_snapshots_dir kept as _bank_snapshots_dir, and then after this PR is committed, create another PR just to remove _bank_snapshots_dir and all the calling parts.
Please let me know your preference. @apfitzge
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I see that parent() has to be done for the snapshot_links anyway. will split it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created PR #31249 for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not in this PR any more.
runtime/src/snapshot_utils.rs
Outdated
do_purge(get_bank_snapshots_pre(&bank_snapshots_dir)); | ||
// Only purge the POST snapshot dirs. | ||
// Now that the account packages use the snapshot dir in place, The AHV thread will process the PRE snapshot dirs | ||
// and turn them into POST snapshot dirs, so they will be purged eventually, and there will be no leaking. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AHV won't keep running if the validator crashes/shutdown.
If these PRE snapshots are never cleaned up except by the PRE -> POST transition, then my directory could just keep growing, right?
Maybe I'm missing something where these are initially removed at start up or something?
runtime/src/snapshot_package.rs
Outdated
@@ -197,9 +175,12 @@ impl AccountsPackage { | |||
/// NOTE 2: This fn will panic if the AccountsPackage is of type EpochAccountsHash. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cannot comment outside of line range....
NOTE 1 is now wrong. It's not within a TempDir
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed NOTE 1
673a094
to
5883b30
Compare
c4204dd
to
5fe622b
Compare
Let account pacakge use the snapshot dir, so AHV computes the accounts hash and turns the pre snapshot dir into a post snapshot dir
… snapshot dir archiving
fdf4caf
to
50ea692
Compare
|
||
let snapshot_info = SupplementalSnapshotInfo { | ||
snapshot_links, | ||
snapshot_links: bank_snapshots_dir, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a follow-up PR (not this PR), let's rename this field to bank_snapshot_dir
(no s
), since we're not making any new hard/sym links for AccountsPackage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. WIll do it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not catch any problems while reading through, but one of the tests is now failing CI.
Fix that one and I think this is ready to ship.
Yes, after switching from hardlink to symlink, archive_snapshot_package() is failing. Debugging it. |
Problem
For booting a bank from a snapshot dir, we need a post snapshot with hash calculated. Previously, the account package is generated with a tmp dir, and the hash was calculated for the tmp dir and packaged into the archive. The snapshot dir under snapshot/ stays at the PRE stage without the account hash.
Summary of Changes
Let account package use the snapshot dir in-place, so AHV computes the accounts hash and turns the pre snapshot dir into a post snapshot dir.
Change SnapshotPackage::snapshot_links from TempDir to PathBuf, let it point to the snapshot dir. Remove the tempdir setup.
Build archive staging directory using the snashot dir info from the updated SnapshotPackage.
Fixes #