Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AHV processes the snapshot dirs in place #30978

Conversation

xiangzhu70
Copy link
Contributor

@xiangzhu70 xiangzhu70 commented Mar 30, 2023

Problem

For booting a bank from a snapshot dir, we need a post snapshot with hash calculated. Previously, the account package is generated with a tmp dir, and the hash was calculated for the tmp dir and packaged into the archive. The snapshot dir under snapshot/ stays at the PRE stage without the account hash.

Summary of Changes

Let account package use the snapshot dir in-place, so AHV computes the accounts hash and turns the pre snapshot dir into a post snapshot dir.

Change SnapshotPackage::snapshot_links from TempDir to PathBuf, let it point to the snapshot dir. Remove the tempdir setup.
Build archive staging directory using the snashot dir info from the updated SnapshotPackage.

Fixes #

@@ -471,17 +468,6 @@ fn test_concurrent_snapshot_packaging(
}
}

// Purge all the outdated snapshots, including the ones needed to generate the package
Copy link
Contributor Author

@xiangzhu70 xiangzhu70 Apr 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The purge function no longer purges the PRE-stage bank snapshot dirs. So this test is no longer valid. Removed it. Purging has been tested in other tests. No need to test it here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not in this PR any more.

@codecov
Copy link

codecov bot commented Apr 3, 2023

Codecov Report

Merging #30978 (d750aeb) into master (466c3c3) will decrease coverage by 0.1%.
The diff coverage is 63.3%.

@@            Coverage Diff            @@
##           master   #30978     +/-   ##
=========================================
- Coverage    81.5%    81.5%   -0.1%     
=========================================
  Files         734      734             
  Lines      207153   207157      +4     
=========================================
- Hits       168954   168939     -15     
- Misses      38199    38218     +19     

@xiangzhu70 xiangzhu70 marked this pull request as ready for review April 3, 2023 20:21
Copy link
Contributor

@apfitzge apfitzge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some initial comments on my first pass. I will have to go over this change again, and will do so soon

Comment on lines 399 to 400
let storage_entries: Vec<_> = (0..=slot)
.map(|i| Arc::new(AccountStorageEntry::new(&accounts_dir, 0, i as u32, 10)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i is an AppendVecId, the slot is 0 for all these entries.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Reverted the change here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not in this PR any more.

Comment on lines 403 to 404
// Create one fake snapshot.
let snapshots_paths: Vec<_> = (slot..=slot)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

setting up the dir, then just having let snapshot_paths = vec![snapshot_path]; seems much more clear than this single-iteration loop

Copy link
Contributor Author

@xiangzhu70 xiangzhu70 Apr 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not in this PR any more.

do_purge(get_bank_snapshots_pre(&bank_snapshots_dir));
// Only purge the POST snapshot dirs.
// Now that the account packages use the snapshot dir in place, The AHV thread will process the PRE snapshot dirs
// and turn them into POST snapshot dirs, so they will be purged eventually, and there will be no leaking.
Copy link
Contributor

@apfitzge apfitzge Apr 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

during development if there is an issue in the PRE -> POST transition time, then do the PRE files ever get cleaned up?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What can this issue be? This does have the assumption that AHV keeps running so it sooner or later advances states of the snapshot dirs from the PRE state into the POST state. Are you saying this assumption could be broken?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AHV won't keep running if the validator crashes/shutdown.

If these PRE snapshots are never cleaned up except by the PRE -> POST transition, then my directory could just keep growing, right?

Maybe I'm missing something where these are initially removed at start up or something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! There could be left-over uncleaned PRE snapshot dirs.

The highest POST snapshot dir will be used to construct a bank, so the validator starts from that bank. From then on, it loops to create new PRE snapshot dirs, which are sent to AHV to be converted to POST snapshot dirs.

The problem is with the old PRE snapshot dirs which are higher than the highest POST dirs. add_bank_snapshot has a check to remove any existing directory before setting up a new PRE dir. But the new PRE slot may not be the same as the old ones. In that case, the old PRE snapshot dirs are left in the system, never got cleaned up.

So, after we boot from a POST snapshot dir, we should clean up all the existing old PRE dirs.

Copy link
Contributor Author

@xiangzhu70 xiangzhu70 Apr 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after we boot from a POST snapshot dir, we should clean up all the existing old PRE dirs.

After the PR to add the function bank_from_latest_snapshot_dir is committed, I will add the PRE dir cleanup logic in that function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After #31191 is committed, will use that to do type selective purging.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After #31191 is committed, will use that to do type selective purging.

Will do purging in #30978.

core/src/accounts_hash_verifier.rs Outdated Show resolved Hide resolved
core/src/snapshot_packager_service.rs Outdated Show resolved Hide resolved
@@ -394,42 +392,56 @@ mod tests {
fs::create_dir_all(&incremental_snapshot_archives_dir).unwrap();

fs::create_dir_all(&accounts_dir).unwrap();

let slot: Slot = 4;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. It's idiomatic in Rust to elide type declarations that can be inferred by the compiler.
  2. Please tie this value to the storage entries below to ensure they stay in sync. So the line below can be:
let storage_entries: Vec<_> = (0..=slot)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not in this PR any more.

core/src/snapshot_packager_service.rs Outdated Show resolved Hide resolved
@apfitzge apfitzge self-requested a review April 7, 2023 18:38
Comment on lines 408 to 275
let mut fake_snapshot_file = OpenOptions::new()
.read(true)
.write(true)
.create(true)
.open(&fake_snapshot_path)
.unwrap();
fake_snapshot_file.write_all(b"Hello, world!").unwrap();
let fake_status_cache = snapshot_dir.join(SNAPSHOT_STATUS_CACHE_FILENAME);
let mut fake_status_cache_file = OpenOptions::new()
.read(true)
.write(true)
.create(true)
.open(fake_status_cache)
.unwrap();
fake_status_cache_file
.write_all(b"Hello, status cache!")
.unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize it was using OpenOptions before, but this seems equivalent and a lot less cluttered.

        let mut fake_snapshot_file = File::create(&fake_snapshot_path).unwrap();
        fake_snapshot_file.write_all(b"Hello, world!").unwrap();

        let fake_status_cache = snapshot_dir.join(SNAPSHOT_STATUS_CACHE_FILENAME);
        let mut fake_status_cache_file = File::create(&fake_status_cache).unwrap();

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless I am missing something that makes this different.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not in this PR any more.

@apfitzge
Copy link
Contributor

apfitzge commented Apr 7, 2023

hey @xiangzhu70, could you have a description of the "life of a snapshot". Brooks is probably more familiar than I am, and I'm struggling to understand the reason the change is necessary.

Just like snapshot is created -> pre, when does this get transitioned to post, when do they get deleted.

@xiangzhu70
Copy link
Contributor Author

xiangzhu70 commented Apr 7, 2023

hey @xiangzhu70, could you have a description of the "life of a snapshot". Brooks is probably more familiar than I am, and I'm struggling to understand the reason the change is necessary.

Just like snapshot is created -> pre, when does this get transitioned to post, when do they get deleted.

Sure.

Here is the existing code:

ABS does the following in order:
Call add_bank_snapshot() which creates a snapshot dir (with a pre type snapshot file in it)
It creates a tmp dir with the snapshot pre file, creates an AccountsPacakge pointing to the tmp dir, sends the account package to AHV
It calls purge_old_bank_snapshots, which delete the old snapshot_dirs.

AHV:
process the received AccountsPackage, calculate the snapshot hash, and update the pre-type snapshot file in the tmp dir, to post type.
sends a snapshot package to SPS

SPS:
process the snapshot package and creates an archive.

NOTE that the snapshot type being POST only happens in the tmp dir, used for archiving. The original snapshot dir always has the PRE type, never updated, never has the hash calculated.

== Below is the change in this PR:
When ABS creates an account package, instead of creating a tmp dir, use the original snapshot dir. Consider the snapshot/<slot> dirs as having states. They are created in PRE state. AHV processes them and turn them into POST state, which have the correct hashes, and are good for constructing the banks.

@@ -55,7 +50,6 @@ impl AccountsPackage {
package_type: AccountsPackageType,
bank: &Bank,
bank_snapshot_info: &BankSnapshotInfo,
bank_snapshots_dir: impl AsRef<Path>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like BankSnapshotInfo has had snapshot_dir since #29409.
Seems like removing this parameter could be split out from the other parts of this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To split this parameter out, then I need to derive bank_snapshots_dir from bank_snapshot_info.snapshot_dir.parent(), and handle the parent being None error case by defining a new error. But this PR is removing the need of bank_snapshots_dir. So I am not sure if it is worth to get bank_snapshots_dir error handling to work first and then remove it.

Another way of splitting is to commit this PR first with bank_snapshots_dir kept as _bank_snapshots_dir, and then after this PR is committed, create another PR just to remove _bank_snapshots_dir and all the calling parts.

Please let me know your preference. @apfitzge

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I see that parent() has to be done for the snapshot_links anyway. will split it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created PR #31249 for this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not in this PR any more.

do_purge(get_bank_snapshots_pre(&bank_snapshots_dir));
// Only purge the POST snapshot dirs.
// Now that the account packages use the snapshot dir in place, The AHV thread will process the PRE snapshot dirs
// and turn them into POST snapshot dirs, so they will be purged eventually, and there will be no leaking.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AHV won't keep running if the validator crashes/shutdown.

If these PRE snapshots are never cleaned up except by the PRE -> POST transition, then my directory could just keep growing, right?

Maybe I'm missing something where these are initially removed at start up or something?

@@ -197,9 +175,12 @@ impl AccountsPackage {
/// NOTE 2: This fn will panic if the AccountsPackage is of type EpochAccountsHash.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cannot comment outside of line range....

NOTE 1 is now wrong. It's not within a TempDir

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed NOTE 1

@xiangzhu70 xiangzhu70 force-pushed the account_package_use_snapshot_dir_in_place branch from fdf4caf to 50ea692 Compare April 25, 2023 23:03

let snapshot_info = SupplementalSnapshotInfo {
snapshot_links,
snapshot_links: bank_snapshots_dir,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a follow-up PR (not this PR), let's rename this field to bank_snapshot_dir (no s), since we're not making any new hard/sym links for AccountsPackage.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. WIll do it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

runtime/src/snapshot_utils.rs Outdated Show resolved Hide resolved
brooksprumo
brooksprumo previously approved these changes Apr 26, 2023
Copy link
Contributor

@brooksprumo brooksprumo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Copy link
Contributor

@apfitzge apfitzge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not catch any problems while reading through, but one of the tests is now failing CI.

Fix that one and I think this is ready to ship.

@xiangzhu70
Copy link
Contributor Author

one of the tests is now failing CI.

Yes, after switching from hardlink to symlink, archive_snapshot_package() is failing. Debugging it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

3 participants