Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make background services aware of incremental snapshots #19401

Merged

Conversation

brooksprumo
Copy link
Contributor

@brooksprumo brooksprumo commented Aug 24, 2021

AccountsBackgroundService now knows about incremental snapshots. It is
now also in charge of deciding if an AccountsPackage is destined to be a
SnapshotPackage or not (or just used by AccountsHashVerifier).

!!! New behavior changes !!!

Full snapshots (both bank and archive) MUST succeed.

This is required because of how the last full snapshot slot is
calculated, which is used by AccountsBackgroundService when calling
clean_accounts().

File system calls are now unwrapped and will result in a crash. As Trent told me:

Well I think if a snapshot fails due to some IO error, it's very likely that the operator is going to have to intervene before it works. We should exit error in this case, otherwise the validator might happily spin for several more hours, never successfully writing a complete snapshot, before something else brings it down. This would leave the validator's last local snapshot many more slots behind than it would be had we exited outright and potentially force the operator to abandon ledger continuity in favor of a quick catchup

Fixes #19167
Fixes #19168

@brooksprumo brooksprumo force-pushed the iss-accounts-background-service-4 branch from 0c5eb27 to 9b5d5f9 Compare August 24, 2021 18:20
AccountsBackgroundService now knows about incremental snapshots.  It is
now also in charge of deciding if an AccountsPackage is destined to be a
SnapshotPackage or not (or just used by AccountsHashVerifier).

!!! New behavior changes !!!

Full snapshots (both bank and archive) **MUST** succeed.  If there are
failures, the functions will be retried.  If they still fail, an assert
will be triggered.

This is required because of how the last full snapshot slot is
calculated, which is used by AccountsBackgroundService when calling
`clean_accounts()`.
@brooksprumo brooksprumo force-pushed the iss-accounts-background-service-4 branch from 9b5d5f9 to 6d24603 Compare August 24, 2021 18:20
Copy link
Contributor Author

@brooksprumo brooksprumo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a lot going on in this PR. I've tried to call out most important bits!

Comment on lines 177 to 192
let block_height = snapshot_root_bank.block_height();
let snapshot_type = if block_height
% self.snapshot_config.full_snapshot_archive_interval_slots
== 0
{
Some(SnapshotType::FullSnapshot)
} else if block_height
% self
.snapshot_config
.incremental_snapshot_archive_interval_slots
== 0
{
Some(SnapshotType::IncrementalSnapshot(last_full_snapshot_slot))
} else {
None
};
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now AccountsBackgroundService is the one responsible for setting the snapshot type (previously this was in AccountsHashVerifier).

Note that the None here is for AccountsPackages that will not be archived, which is possible if the incremental snapshot interval does not equal the accounts hash interval.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah using the block_height is a lot simpler than trying to find the first slot that crossed the full snapshot boundary. Interesting bit here is the full snapshots will be correlated with skip rate, i.e. the higher the skip rate, the less often - in wallclock time - we will take full snapshots. I think this is fine for v1.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah! That is interesting. Good to note.

runtime/src/accounts_background_service.rs Outdated Show resolved Hide resolved
runtime/src/accounts_background_service.rs Outdated Show resolved Hide resolved

let snapshot_package = SnapshotPackage::from(accounts_package);
let pending_snapshot_package = pending_snapshot_package.unwrap();
let _snapshot_config = snapshot_config.unwrap();
Copy link
Contributor Author

@brooksprumo brooksprumo Aug 24, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may be able to remove SnapshotConfig from AccountsHashVerifier entirely too. I'm thinking about taking some things out of AccountsPackage that live in SnapshotConfig, which I may then use here, so I'm going to leave the SnapshotConfig here for now.

core/src/snapshot_packager_service.rs Outdated Show resolved Hide resolved
Comment on lines +1013 to +1020
assert_eq!(deserialized_bank.slot(), LAST_SLOT,);
assert_eq!(
deserialized_bank.slot(),
EXPECTED_SLOT_FOR_LAST_SNAPSHOT_ARCHIVE
deserialized_bank,
**bank_forks
.read()
.unwrap()
.get(deserialized_bank.slot())
.unwrap()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yay! Now we check the banks are equal too!

Comment on lines 183 to 188
/// Snapshots come in two flavors, Full and Incremental. The IncrementalSnapshot has a Slot field,
/// which is the incremental snapshot base slot.
#[derive(Clone, Copy, Debug, Eq, PartialEq)]
pub enum SnapshotType {
FullSnapshot,
IncrementalSnapshot,
}

impl SnapshotType {
/// Get the string prefix of the snapshot type
pub fn to_prefix(&self) -> &'static str {
match self {
SnapshotType::FullSnapshot => TMP_FULL_SNAPSHOT_PREFIX,
SnapshotType::IncrementalSnapshot => TMP_INCREMENTAL_SNAPSHOT_PREFIX,
}
}
IncrementalSnapshot(Slot),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By putting the Slot in the IncrementalSnapshot, I can always guarantee there is a last full snapshot if the snapshot type is Incremental. Otherwise I'd have asserts all over the place to ensure that was the case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

SnapshotType::IncrementalSnapshot => TMP_INCREMENTAL_SNAPSHOT_PREFIX,
}
}
IncrementalSnapshot(Slot),
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While looking through the actual archive and unarchive functions more, I realized that the prefix didn't need to be unique between full and incremental snapshots. So that just makes this a bit simpler.

Comment on lines +1542 to +1550
let snapshot_storages = snapshot_type.map_or_else(SnapshotStorages::default, |snapshot_type| {
let incremental_snapshot_base_slot = match snapshot_type {
SnapshotType::IncrementalSnapshot(incremental_snapshot_base_slot) => {
Some(incremental_snapshot_base_slot)
}
_ => None,
};
root_bank.get_snapshot_storages(incremental_snapshot_base_slot)
});
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the snapshot type, get the snapshot storages. In the None case, that means the "snapshot" is not meant to be archived, and will only be used to send to AccountsHashVerifier to push the hash out to the cluster. In that case, we don't need to snapshot any of the storages even.

runtime/src/snapshot_utils.rs Outdated Show resolved Hide resolved
@codecov
Copy link

codecov bot commented Aug 24, 2021

Codecov Report

Merging #19401 (e8c1aad) into master (6d93981) will decrease coverage by 0.1%.
The diff coverage is 63.0%.

@@            Coverage Diff            @@
##           master   #19401     +/-   ##
=========================================
- Coverage    82.8%    82.7%   -0.2%     
=========================================
  Files         457      459      +2     
  Lines      130682   130630     -52     
=========================================
- Hits       108289   108115    -174     
- Misses      22393    22515    +122     

@brooksprumo brooksprumo marked this pull request as ready for review August 24, 2021 20:18
core/src/accounts_hash_verifier.rs Outdated Show resolved Hide resolved
pub hash: Hash, // temporarily here while we still have to calculate hash before serializing bank
pub archive_format: ArchiveFormat,
pub snapshot_version: SnapshotVersion,
pub snapshot_archives_dir: PathBuf,
pub expected_capitalization: u64,
pub hash_for_testing: Option<Hash>,
pub cluster_type: ClusterType,
pub snapshot_type: Option<SnapshotType>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this optional? Shouldn't we always know the snapshot type?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment should hopefully give an answer:
#19401 (comment)

And some more context here too:
#19401 (comment)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm. We are overloading the snapshot type for some other purpose -- maybe a separate flag for the Hash only. And when it is None, what is really the type? Is it a full or incremental? Does not matter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I debated how far to take this... There's a lot of coupling and implied behavior between bank snapshots, snapshot archives, and the hashing... plus how those three interact with the background services... Ultimately I figured that to do it "right" would likely require redoing the whole background services from the ground up, and that would probably be out of scope for this task, so I tried to make it least bad :)

This snapshot_type is used to decide which snapshot storages to get, when to take a snapshot/what time, and if it should be archived by SnapshotPackagerService.

core/src/accounts_hash_verifier.rs Outdated Show resolved Hide resolved
runtime/src/accounts_background_service.rs Outdated Show resolved Hide resolved
runtime/src/accounts_background_service.rs Outdated Show resolved Hide resolved
runtime/src/accounts_background_service.rs Outdated Show resolved Hide resolved
// Hard link the snapshot into a tmpdir, to ensure its not removed prior to packaging.
let snapshot_links = tempfile::Builder::new()
.prefix(&format!("{}{}-", TMP_BANK_SNAPSHOT_PREFIX, bank.slot()))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do full snapshot and incremental snapshot look any different now for the tmp dir entry?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope. None of the different temp directories need different prefixes actually, but now we're only differentiating between bank snapshots and snapshot archives.

Comment on lines 183 to 188
/// Snapshots come in two flavors, Full and Incremental. The IncrementalSnapshot has a Slot field,
/// which is the incremental snapshot base slot.
#[derive(Clone, Copy, Debug, Eq, PartialEq)]
pub enum SnapshotType {
FullSnapshot,
IncrementalSnapshot,
}

impl SnapshotType {
/// Get the string prefix of the snapshot type
pub fn to_prefix(&self) -> &'static str {
match self {
SnapshotType::FullSnapshot => TMP_FULL_SNAPSHOT_PREFIX,
SnapshotType::IncrementalSnapshot => TMP_INCREMENTAL_SNAPSHOT_PREFIX,
}
}
IncrementalSnapshot(Slot),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

core/src/accounts_hash_verifier.rs Outdated Show resolved Hide resolved
pub hash: Hash, // temporarily here while we still have to calculate hash before serializing bank
pub archive_format: ArchiveFormat,
pub snapshot_version: SnapshotVersion,
pub snapshot_archives_dir: PathBuf,
pub expected_capitalization: u64,
pub hash_for_testing: Option<Hash>,
pub cluster_type: ClusterType,
pub snapshot_type: Option<SnapshotType>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm. We are overloading the snapshot type for some other purpose -- maybe a separate flag for the Hash only. And when it is None, what is really the type? Is it a full or incremental? Does not matter?

@brooksprumo
Copy link
Contributor Author

@carllin @lijunwangs Thanks for the reviews so far! I've fixed or commented on everything, so this PR should be ready for another round. Thanks in advance!

@brooksprumo brooksprumo merged commit fe9ee91 into solana-labs:master Aug 31, 2021
@brooksprumo brooksprumo deleted the iss-accounts-background-service-4 branch August 31, 2021 23:33
frits-metalogix added a commit to identity-com/solana that referenced this pull request Nov 24, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants