Incremental Snapshots #17088

Closed
29 tasks done
brooksprumo opened this issue May 6, 2021 · 40 comments

@brooksprumo
Contributor

brooksprumo commented May 6, 2021

Problem

Node startup is slow when a large snapshot must be downloaded.

Proposed Solution

Incremental snapshots! Instead of always downloading a large full snapshot, download a full snapshot once (or much less often), then download a small incremental snapshot on top of it. The expectation is that only a small number of accounts are touched frequently, so incremental snapshots optimize for that behavior. At startup, a node that already has a full snapshot only needs to download a new, small incremental snapshot.

  • Create full snapshots much less often (?)
    • every 100,000 slots? at epoch? SLOTS_PER_EPOCH / 2 - 1?
  • Create incremental snapshots every 100 slots (?)
  • Each incremental snapshot is the difference from the last full snapshot
  • Old incremental snapshots can be cleaned up, but save at least one extra as fallback
  • Add a new snapshot field to gossip to differentiate between full and incremental snapshots
    • The gossip info for incremental snapshots will need to include the slot of the full snapshot that this incremental snapshot is based on

Example

slot 100,000: full snapshot (A)
slot 100,100: incremental snapshot (B)
slot 100,200: incremental snapshot (C)
...
slot 1xx,x00: incremental snapshot (D)
...
slot 200,000: full snapshot (E)
  • Incremental snapshot (ISS for short) B is the diff between full snapshot (FSS) A and slot 100,100, i.e. ISS B = diff(A, B). Similarly, ISS C = diff(A, C), and so on.
  • The latest snapshot is still the valid snapshot. If the latest snapshot is an incremental snapshot, replay the FSS then the ISS.
  • Incremental snapshots older than a full snapshot can be deleted (i.e. FSS E supersedes FSS A, and ISS B, C, and D).
  • When ISS D is created, ISS B can be deleted.
  • If the cluster is at a slot between D and E, a new node would query gossip for FSS A and then ISS D.

Details

Storing an Incremental Snapshot

  1. Get the slot from the last full snapshot
  2. Snapshot the bank (same as for FSS)
  3. Snapshot the status cache (slot deltas) (same as FSS)
  4. Package up the storages (AppendVecs) from after the FSS (see the sketch after this list)
  5. Make archive
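
Step 4 is the part that differs from a full snapshot: only storages newer than the last full snapshot slot get packaged. A rough sketch, using hypothetical stand-in types rather than the real solana-runtime structs:

```rust
// Sketch of step 4: pick only the account storages created after the last
// full snapshot. `SnapshotStorage` is a hypothetical stand-in type.

type Slot = u64;

struct SnapshotStorage {
    slot: Slot,
    path: String, // would reference an AppendVec in the real code
}

/// Keep storages for slots in (full_snapshot_slot, incremental_snapshot_slot].
fn storages_for_incremental_snapshot(
    all_storages: Vec<SnapshotStorage>,
    full_snapshot_slot: Slot,
    incremental_snapshot_slot: Slot,
) -> Vec<SnapshotStorage> {
    all_storages
        .into_iter()
        .filter(|storage| {
            storage.slot > full_snapshot_slot && storage.slot <= incremental_snapshot_slot
        })
        .collect()
}

fn main() {
    let storages = vec![
        SnapshotStorage { slot: 100_000, path: "accounts/100000.0".into() },
        SnapshotStorage { slot: 100_050, path: "accounts/100050.0".into() },
        SnapshotStorage { slot: 100_100, path: "accounts/100100.0".into() },
    ];
    let iss = storages_for_incremental_snapshot(storages, 100_000, 100_100);
    // The storage at the full snapshot slot itself is excluded.
    assert_eq!(iss.len(), 2);
    println!("packaging: {:?}", iss.iter().map(|s| &s.path).collect::<Vec<_>>());
}
```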

Loading from an Incremental Snapshot

  1. Get the highest full snapshot as done now
  2. Get the highest incremental snapshot based on the full snapshot from above (see the sketch after this list)
  3. Extract full snapshot
  4. Extract incremental snapshot
  5. Rebuild the AccountsDb from the storages in both FSS and ISS
  6. Rebuild the Bank from the ISS
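
Step 2 is the new selection logic: an incremental snapshot is only usable if its base slot matches the full snapshot being loaded. A minimal sketch with illustrative types (not the actual snapshot_utils structs):

```rust
// Sketch of step 2: choose the newest incremental snapshot whose base slot
// matches the full snapshot we are loading. Types are illustrative only.

type Slot = u64;

struct FullSnapshotArchive { slot: Slot }
struct IncrementalSnapshotArchive { base_slot: Slot, slot: Slot }

fn highest_incremental_for<'a>(
    full: &FullSnapshotArchive,
    incrementals: &'a [IncrementalSnapshotArchive],
) -> Option<&'a IncrementalSnapshotArchive> {
    incrementals
        .iter()
        .filter(|iss| iss.base_slot == full.slot) // must be based on this FSS
        .max_by_key(|iss| iss.slot)               // take the newest one
}

fn main() {
    let full = FullSnapshotArchive { slot: 100_000 };
    let incrementals = [
        IncrementalSnapshotArchive { base_slot: 100_000, slot: 100_100 },
        IncrementalSnapshotArchive { base_slot: 100_000, slot: 100_200 },
        IncrementalSnapshotArchive { base_slot: 90_000, slot: 100_300 }, // wrong base, ignored
    ];
    let best = highest_incremental_for(&full, &incrementals).expect("matching ISS");
    assert_eq!(best.slot, 100_200);
}
```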

Validator

  • new CLI args for setting ISS interval
  • loading ISS at startup
  • creating ISS periodically
  • discovering and downloading ISS in bootstrap

Background Services

  • AccountsBackgroundService will need to know about the last FSS slot, so as not to clean past it
  • AccountsBackgroundService will now decide based on the full/incremental snapshot interval if the snapshot package will be a FSS or an ISS
  • AccountsHashVerifier no longer needs to decide full vs incremental
  • SnapshotPackagerService is largely unchanged

AccountsDb

  • Update clean_accounts() to take a new parameter, last_full_snapshot_slot, so that zero-lamport accounts above the last FSS slot are not cleaned (see the sketch below)
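
A minimal sketch of that guard, with hypothetical types: a zero-lamport update newer than the last full snapshot must survive clean so the incremental snapshot can still convey the deletion on top of the full snapshot.

```rust
// Sketch of the clean guard; `AccountEntry` is a hypothetical type for illustration.

type Slot = u64;

struct AccountEntry { slot: Slot, lamports: u64 }

fn can_purge_zero_lamport(entry: &AccountEntry, last_full_snapshot_slot: Option<Slot>) -> bool {
    if entry.lamports != 0 {
        return false; // not a zero-lamport account; nothing to purge here
    }
    match last_full_snapshot_slot {
        // Only purge if the zero-lamport update is at or before the last FSS.
        Some(full_slot) => entry.slot <= full_slot,
        // No full snapshot taken yet: safe to purge.
        None => true,
    }
}

fn main() {
    let last_full = Some(100_000);
    assert!(can_purge_zero_lamport(&AccountEntry { slot: 99_900, lamports: 0 }, last_full));
    assert!(!can_purge_zero_lamport(&AccountEntry { slot: 100_050, lamports: 0 }, last_full));
}
```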

Ledger Tool

  • Update CLI args to set maximum number of incremental snapshots to retain

RPC

  • Add support for downloading incremental snapshots

Gossip

  • Add incremental snapshot hashes to CrdsData
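
A hedged sketch of what that could carry; the real CrdsData enum lives in the gossip code, and the variant/field names here are illustrative only. The key point is that each incremental snapshot hash is advertised together with the slot (and hash) of the full snapshot it is based on.

```rust
// Illustrative sketch only, not the actual gossip types.

type Slot = u64;
type Hash = [u8; 32];

#[allow(dead_code)]
enum CrdsDataSketch {
    // Existing: full snapshot (slot, hash) pairs.
    SnapshotHashes(Vec<(Slot, Hash)>),
    // New: incremental snapshot hashes, tied to the full snapshot they build on.
    IncrementalSnapshotHashes {
        base: (Slot, Hash),        // slot and hash of the base full snapshot
        hashes: Vec<(Slot, Hash)>, // the incremental snapshots on top of it
    },
}

fn main() {
    // A node advertising one incremental snapshot on top of the FSS at slot 100,000.
    let _value = CrdsDataSketch::IncrementalSnapshotHashes {
        base: (100_000, [0u8; 32]),
        hashes: vec![(100_100, [1u8; 32])],
    };
}
```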

Bootstrap

  • Discover and download incremental snapshots during bootstrap

Testing

Unit Tests

  • snapshot_utils: roundtrip bank to snapshot to bank for FSS
  • snapshot_utils: roundtrip bank to snapshot to bank for ISS
  • snapshot_utils: cleanup zero-lamport accounts in slots after FSS

Integration Tests

core/tests/snapshots.rs
  • Make a test similar to the existing bank forks test, but one that also creates incremental snapshots
  • Make a new test that spins up all the background services and ensures FSS and ISS are taken at the correct intervals and deserialize correctly
local_cluster
  • Make a test that generates an ISS on one node, which another node then downloads and loads from
  • Make a test for startup processing new roots past the full snapshot interval

Questions

  • how often should incremental snapshots be created?
    • more is good when (re)joining (faster startup time?), but less is good for the running node (less resource utilization?)
    • does it matter when incremental snapshots would be made? Like after/before certain cleanup code?
  • should incremental snapshots only exist locally, or should they also be sent to new nodes?
    • I'm guessing we want to send incremental snapshots to new nodes as well, so they start up faster
  • what goes in the incremental snapshot?
    • is it all the same data types as a full snapshot, just the delta since the last snapshot?
  • should full snapshots be created less/same/more frequently now?
    • likely not more... but there for completeness
    • still need full snapshots for a new node joining the network
  • what tests are needed?
    • obviously a test to make sure it works
    • ensure fallback to full snapshot works if an incremental snapshot is borked

Related Work

Original Snapshot Work

Future Work

  • Dynamically decide when to generate full and incremental snapshots.
  • With the current implementation, it's highly beneficial for nodes to use the same full snapshot interval: at bootstrap, a node that already has a full snapshot will then most likely only need to download an incremental snapshot, not another full one. More discovery methods or decisions could be added to RPC/bootstrap to better support differing full snapshot intervals.

Tasks

@sakridge
Member

sakridge commented May 6, 2021

Actually, the only problem this fixes is that startup takes longer because the node might have to download a large snapshot. Now, if the node has the large 'full' snapshot, it just needs to download the correct incremental one to apply on top of that. Snapshot extraction speed would be exactly the same, maybe worse, because the node now has to combine both snapshots, and the total amount of data ingested is the same at the end of the computation.

An incremental snapshot would be on top of a full snapshot. The idea is that a validator creates a full snapshot maybe every 100,000 slots, and then creates an incremental snapshot every 100 slots. A node joining the network from nothing would have to download the full snapshot and then the latest incremental that applies on top of it.

Some solution steps/ideas:

  • Add another snapshot field to gossip. This will then have the snapshot slot and hash, but also the parent 'full' snapshot slot so that the client node knows which incremental options it has and which match up with what 'full' snapshot it has.
  • Download logic changed to see the above and choose whether to download a 'full'+incremental or an incremental based on a 'full' that it already has.
  • Snapshots today become 'full' snapshots. Add a new flag for when the snapshot service needs to create an incremental one.
  • On full snapshot creation, the logic is the same as today. Full clean to try to get the state as small as possible.
  • After the full snapshot is created, clean_accounts cannot operate on the range up to and including the last 'full' slot. It needs to operate only on newer slots so that the append-vecs apply correctly onto the 'full' files. The delta pass will now clean in this newer range and then collect all append-vecs present in it to create the new delta snapshot.

@sakridge
Member

sakridge commented May 6, 2021

  • how often should incremental snapshots be created?
    100 slots is what we try to create snapshots at now, I think that's a fine start. Although today, it takes longer than 100 slots to create a snapshot on mainnet-beta.

  • does it matter when incremental snapshots would be made? Like after/before certain cleanup code?
    I think the current place in accounts_background_service is fine.

  • should incremental snapshots only exist locally, or should they also be sent to new nodes
    Yes, definitely want to share with other nodes, that's one of the main benefits.

  • what goes in the incremental snapshot?
    Yes, I would say the same data, just a different set of updates. The current snapshot code collects append-vecs for all rooted slots below the snapshot target slot, this would collect slots from last_full_snapshot_slot to incremental_snapshot_slot

  • should full snapshots be created less/same/more frequently now?
    Less, way less. Once we have some data about how big the incremental grows, then we can better tell.

There are a couple of strategies here. One is a fixed interval, where all nodes in the network try to create snapshots at the same block height. That is nice because a node can get a full snapshot from one node and then likely find an incremental one later from another node which used the same full snapshot slot. Incremental snapshots with different parents will obviously be incompatible.

Another is to create them dynamically based on how quickly the incremental grows. The incremental is expected to keep getting larger and larger, potentially as large as the full snapshot. At that point, or maybe at 50% of the full snapshot's size, you roll up the state and just create a new full snapshot. This may be necessary if the fixed interval doesn't work well: if a node ends up downloading a full snapshot and an incremental one just as big, that's not great.
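
A tiny sketch of that size-based roll-up rule, assuming a 50% threshold and that the archive sizes in bytes are available (the function is illustrative, not an existing API):

```rust
// Once the incremental archive reaches some fraction of the full archive
// (50% here, an assumed threshold), take a new full snapshot instead.

fn should_roll_up_to_full(full_archive_bytes: u64, incremental_archive_bytes: u64) -> bool {
    incremental_archive_bytes.saturating_mul(2) >= full_archive_bytes
}

fn main() {
    // 1 GB incremental vs a 10 GB full snapshot: keep going incrementally.
    assert!(!should_roll_up_to_full(10_000_000_000, 1_000_000_000));
    // 6 GB incremental vs a 10 GB full snapshot: time to roll up.
    assert!(should_roll_up_to_full(10_000_000_000, 6_000_000_000));
}
```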

  • still need full snapshots for a new node joining the network
    yes, hopefully the node has one locally. Since they don't update as often, then hopefully that will be the case.

  • what tests are needed?
    obviously a test to make sure it works - yes
    ensure fallback to full snapshot works if an incremental snapshot is borked - yes. Maybe it could just download a new incremental?

Another consideration: we re-store part of the account state on each slot to keep the append-vec count under the number of slots in an epoch. This isn't great, because it will bloat the size of the incremental snapshots, since those re-stores look like new updates to the accounts system. I think it would be good to remove this and then support combining accounts from different slots into a single append-vec.

This is the account store code as part of the rent collection:

self.store_account(&pubkey, &account);

@brooksprumo
Contributor Author

> * what goes in the incremental snapshot?
>   Yes, I would say the same data, just a different set of updates. The current snapshot code collects append-vecs for all rooted slots below the snapshot target slot, this would collect slots from `last_full_snapshot_slot` to `incremental_snapshot_slot`

@sakridge Are you envisioning just a single incremental snapshot between full snapshots (a new incremental snapshot replaces the existing one), or multiple? I was assuming multiple, but it doesn't need to be that way.

Multiple

slot 1,000: full snapshot (A)
slot 1,100: incremental snapshot (B)
slot 1,200: incremental snapshot (C)
...
slot 1,x00: incremental snapshot (S)
...
Slot 2,000: full snapshot (V)

Incremental snapshot B is the diff from A to B, incremental snapshot C is the diff from B to C, etc.

Single

Same picture as above, but when incremental snapshot C is created, it is the diff from A to C, and then B is deleted. Same for incremental snapshot S, which is the diff from A to S, and all other incremental snapshots are removed.

@sakridge
Member

sakridge commented May 6, 2021

> @sakridge Are you envisioning just a single incremental snapshot between full snapshots (a new incremental snapshot replaces the existing one), or multiple? I was assuming multiple, but it doesn't need to be that way.
>
> Multiple
>
> slot 1,000: full snapshot (A)
> slot 1,100: incremental snapshot (B)
> slot 1,200: incremental snapshot (C)
> ...
> slot 1,x00: incremental snapshot (S)
> ...
> Slot 2,000: full snapshot (V)
>
> Incremental snapshot B is the diff from A to B, incremental snapshot C is the diff from B to C, etc.
>
> Single
>
> Same picture as above, but when incremental snapshot C is created, it is the diff from A to C, and then B is deleted. Same for incremental snapshot S, which is the diff from A to S, and all other incremental snapshots are removed.

I was thinking the node would just have a single incremental which would replace all previous incremental ones, but I'm open to arguments for the multiple design (or others?)

I just think that with multiple, you will have a lot of duplicated state updates in the subsequent snapshots, and only one is useful. So there will be a lot of overlap. I think there is a small set of accounts updated a lot, like once per slot, and many more accounts which are updated very infrequently, like once in a million or more slots.

Hopefully we capture those infrequently updated accounts with the full snapshot, and the incremental captures the frequently updated ones.

edit: Maybe keep 2 incrementals, kind of like we do today, so if the newest one is bad, you can fall back to the old one.

@brooksprumo
Contributor Author

> I was thinking the node would just have a single incremental which would replace all previous incremental ones,

Sounds good!

> Maybe keep 2 incrementals, kind of like we do today, so if the newest one is bad, you can fall back to the old one.

I like it.

@carllin
Contributor

carllin commented May 7, 2021

I was thinking about the interval as well, and thought maybe, if feasible, it would be cool to have snapshots cascading into different intervals. For instance, a couple that are 100 from the tip, a couple at 200, then 400, 800, etc.

You could even have different threads packaging these at different intervals, or maybe different validators package snapshots at these varying intervals. I.e. some package at a faster rate than others.

The benefit here is that I think most validators who shut down recently don't need to download a large incremental snapshot that's 1/2 the size of the full snapshot, and can grab a few smaller ones to catch up. This might also be useful for nodes trying to catch up if they can fast-sync small incremental snapshots from near the tip of the network.

@brooksprumo
Contributor Author

> I was thinking about the interval as well, and thought maybe, if feasible, it would be cool to have snapshots cascading into different intervals. For instance, a couple that are 100 from the tip, a couple at 200, then 400, 800, etc.
>
> The benefit here is that I think most validators who shut down recently don't need to download a large incremental snapshot that's 1/2 the size of the full snapshot, and can grab a few smaller ones to catch up.

@carllin This design sounds like it falls under the "Multiple" category of number-of-incremental-snapshots-between-full-snapshots. Is that right?

@sakridge
Member

sakridge commented May 7, 2021

Another way to do it:

Split the snapshot across a range of slots, keep track of every instance of clean_accounts and shrink_slots, and maintain a dirty set of stores so that new snapshot slices are regenerated only for those slots that had changed data.

@carllin
Contributor

carllin commented May 7, 2021

I think one of the nice things about how we've organized storage entries by slot currently is that the set of storage entries forms a natural diff tracker.

So for instance if we were taking a snapshot every 100 slots:

Slot 100, Slot 200, Slot 300

Then as long as you guarantee clean doesn't progress past the slot while you're grabbing a copy of the storage entries (max_clean_root acts as the guard 😃 ), the diff from 100 to 200 would just be the storage entries for slots in the range (100, 200], the diff from 200 to 300 would just be the storage entries in the range (200, 300].

And this can be expanded to arbitrary intervals, 200, 400, etc. Also, while you're generating the 100-slot diff, let's say from (200, 300], you could use it to then generate the 200-slot diff from (100, 300] (because they utilize the same set of storage entries in the overlapping range), which could be the "accumulator" incremental snapshot Stephen described earlier.

For v1, one approach is to just focus on using a fixed, configurable interval N. Using this, I could imagine a tiered system where certain validators generate snapshots at varying intervals to provide better coverage of the space. For instance, let's say we had one set of validators generating every 100 slots, another set every 1,000 slots, and another every 10,000 slots. Then once the 100-interval validators have generated 10 such diffs, they can start dropping the earliest one, since they know that range is now covered by the validators who generate at the 1,000-slot interval. And the validators who generate at 1,000-slot intervals could do the same by relying on the validators with 10,000-slot coverage.
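
To make the half-open ranges concrete, here is a small sketch of selecting storages for a (start, end] diff and of reusing the same entries for a wider range; the map-of-paths representation is an illustrative stand-in, not the real AccountsDb layout:

```rust
use std::collections::BTreeMap;

type Slot = u64;
type StoragesBySlot = BTreeMap<Slot, Vec<String>>; // slot -> storage files

/// Storages for the half-open slot range (start, end].
fn storages_in_range(all: &StoragesBySlot, start: Slot, end: Slot) -> StoragesBySlot {
    all.range(start + 1..=end)
        .map(|(slot, paths)| (*slot, paths.clone()))
        .collect()
}

fn main() {
    let mut all = StoragesBySlot::new();
    for slot in [150u64, 250, 300] {
        all.insert(slot, vec![format!("accounts/{slot}.0")]);
    }
    let diff_200_300 = storages_in_range(&all, 200, 300); // the 100-slot diff
    let diff_100_300 = storages_in_range(&all, 100, 300); // reuses the same entries
    assert_eq!(diff_200_300.len(), 2);
    assert_eq!(diff_100_300.len(), 3);
}
```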

@ryoqun
Member

ryoqun commented May 12, 2021

hey, late for the party. :)

I thought about this a bit, and I may be able to add some color.

While still inheriting the delta slot interval design discussed so far, I think we can forgo generating those delta snapshots at guessed intervals, and then realize even faster startup. My idea is to reuse the accounts dir across reboots and add a rather dumb (and hopefully easy-to-secure) on-demand delta snapshot API endpoint like ./snapshot/accounts/1111X00-1111Y00.zst, with little system load for trusted validators.

This design can come later, but it mostly overlaps with the delta snapshot archive generation and download code.

the new restart flow:

  • exiting node...
    • serialize root bank and the whole index into ledger dir (yeah, would be 10-15G) (this could write secondary index as well)
    • and prune unrooted older slots?
    • then finally mark the dir as successfully finalized for next reboot.
  • booting node...
    • deserialize rooted bank
    • only remove newer appendvecs after the serialized rooted bank
    • read the serialized index (would be 10-15G disk io, but should be faster than recomputing it. maybe with borsh? lol)
    • (optional: check accounts hash?)
    • request /snapshot/accounts/1111XX00-1111YY00.zst from one of the trusted validators; XX is from the deserialized rooted bank and YY is from gossip.
  • requested trusted validator...
    • Some sanity range check against too large slot range
    • Grab Arc<AppendVec> like snapshot generation process
    • Read those appendvecs in order while removing updated accounts (this is to compress well the final delta snapshot)
      • for this, we can reuse AccountsIndex reconstruction code needing relatively small memory.
      • for zero-lamport accounts, I think we need to pause clean a bit?
    • Then create final sorted and shrunken appendvec and stream it to the booting node via zstd compression
  • booting node...
    • apply the on-the-fly delta snapshot appendvecs to ./accounts and the restored indexes while streaming them in.
  • done

my assumptions and observations:

  • accounts are updated in two extremely distinct patterns: very frequently and large (serum state, any future fun dapp game's world state, etc.) vs. very infrequently and small (spl-token holdings, user states self-custodied by end-user keypairs)
    • this write pattern should hold for a very long time; we will just see an increased number of such updates per slot as we grow.
    • so, omitting the older states of frequently updated accounts should be very effective for compression.
  • network bandwidth should be the bottleneck, so minimize the delta snapshot as much as possible.
    • also, legitimate access shouldn't be high, because validators don't restart that often.
  • the needed RPC endpoint should be rather primitive and easy to secure.
    • it should also be less resource intensive (maybe IO bound)
    • the increased threat of DoS for the trusted node isn't significant (it's already quite easy to saturate a victim's outgoing network with bogus requests)
  • a delta snapshot is almost always needed for quick operator-initiated restarts (so an older delta snapshot archive's utility wanes quickly).
    • and the delta range END will almost always be the most recent root (so we can exploit this property so that pausing clean does less harm to replay)
    • unclean restarts should be rare.
  • accounts dir recreation is kind of wasted time to begin with.
  • index generation is another major part of the wasteful booting process.
  • this introduces accounts dir layout and index serialization compatibility requirements across reboots, but I don't think this is a big issue.

@brooksprumo
Contributor Author

OK, it looks like there are a few different ways I could go to implement incremental snapshots (or whatever we call them). I don't know how to quantify which way is the right one, though. Who should pick? Or are there additional data points I could gather that would make the decision clear?

@ryoqun
Member

ryoqun commented May 14, 2021

@carllin @sakridge could you share initial thoughts on my alternative idea?: #17088 (comment)

@carllin
Contributor

carllin commented May 17, 2021

@ryoqun

> serialize root bank and the whole index into ledger dir (yeah, would be 10-15G) (this could write secondary index as well)
> and prune unrooted older slots?
> then finally mark the dir as successfully finalized for next reboot

As long as you build in a mechanism to ensure this serialization happens at a consistent point in time between clean/shrink, this should work.

> Some sanity range check against too large slot range
> Grab Arc<AppendVec> like snapshot generation process
> Read those appendvecs in order while removing updated accounts (this is to compress well the final delta snapshot)
> for this, we can reuse AccountsIndex reconstruction code needing relatively small memory.
> for zero-lamport accounts, I think we need to pause clean a bit?
> Then create final sorted and shrunken appendvec and stream it to the booting node via zstd compression

From my understanding, this is on-demand streaming of AccountsDb storages. However, it seems like you'll still need to support a full snapshot fetch (not just the accounts storage, but also the status cache, bank, etc.) for nodes that crashed or corrupted their serialized/saved state.

As for how fast on demand snapshotting is compared to incremental snapshots that happen at regular intervals, I think on demand snapshotting is strictly worse if the time to take a snapshot X is greater than the time for the interval I.

For instance, if we have some nodes packaging incremental snapshots every 100 slots, and the packaging process takes 500 slots, then:

  1. On demand snapshot will be 500 slots behind by the time the snapshots arrives
  2. If you just took the last incremental snapshot, it would only be 100 slots behind

This implies to me that having some nodes doing regular snapshotting at different varying intervals I might be better for catchup than on demand snapshotting

@ryoqun
Member

ryoqun commented May 18, 2021

> As long as you build in a mechanism to ensure this serialization happens at a consistent point in time between clean/shrink, this should work.

thanks for confirming! yeah, hopefully restarting by itself should make this synchronization requirement easy. :)

> However, it seems like you'll still need to support a full snapshot fetch (not just the accounts storage, but also the status cache, bank, etc.)

Yeah, but I think this should be rare.

As for the delta snapshot, I think the status cache and bank should be small compared to the accounts storage, so I omitted mentioning them. Maybe the on-demand snapshot endpoint needs to briefly grab the root bank, stash those binaries, and include them in the delta snapshot archive.

> As for how fast on demand snapshotting is compared to incremental snapshots that happen at regular intervals, I think on demand snapshotting is strictly worse if the time to take a snapshot X is greater than the time for the interval I.
> ...

Thanks for great analysis. :)

Firstly, I just noticed that avoiding purging accounts and recomputing the index can be realized for both incremental and on-demand snapshotting, if the node restarts with proper coordination, like saving the incremental snapshot locally before exiting (CC: @brooksprumo). Originally, I thought this was only possible because on-demand can freely specify the starting slot for fetching the snapshot dynamically. Maybe you could select and download an incremental snapshot which slightly overlaps with the local root to get the same optimization?

> This implies to me that having some nodes doing regular snapshotting at different varying intervals I might be better for catchup than on demand snapshotting

Oh, I'm getting a clue: the different varying intervals part. So, are you assuming a newly-restarted validator usually needs to fetch one incremental snapshot in the normal case? I was originally concerned about wasted bandwidth in the case of multiple incremental snapshot downloads (I blindly assumed 2-3 deltas at a hard-coded 100-slot interval). The wasted bandwidth comes from the fact that multiple incremental snapshots would contain a substantial amount of outdated (duplicated) account state between them. (The on-demand snapshot tries hard to de-duplicate it.)

> I think on demand snapshotting is strictly worse if the time to take a snapshot X is greater than the time for the interval I.

Yeah, that's true. However, I thought we could offset that delay with vastly reduced network bandwidth by de-duplicating data across equivalent multiple incremental snapshots. As said above, if we can get to one delta snapshot download per restart in the normal case, the merit of the on-demand snapshot is small, though. :)

Also, I don't think the on-demand snapshot takes that long. Recent appendvecs should generally be in the page cache, and index creation is basically linear processing, so it's bound by disk read bandwidth (unlike an incremental snapshot, there's no need to write out an archive; it only needs to grab the mmap).

@sakridge
Member

Handling the zero-lamport account updates sounds hard to me for @ryoqun's idea. The target node would have to have the same clean state as the source node, or some way to reconcile it. I think initially starting from a known state is an easier solution. These on-demand ideas might be good to explore once we have the basic mechanism working.

@brooksprumo
Contributor Author

Thanks for the input, everyone. I'm going ahead with the single incremental snapshot that will be set at an interval of 100 slots, and full snapshots at 100,000 slots.

One of the implementation details is that I'll need to pass around a slightly different set of parameters for an incremental snapshot vs. a full snapshot. I was thinking I could do this by either:

  1. Creating a new code path for everything related to incremental snapshots. So new functions that duplicate most of the existing snapshot functions, plus new channels/PendingSnapshotPackage, for incremental snapshots.

or

  2. Making a Snapshot enum with variants for Full and Incremental, and threading it all the way down. This will require updating all the existing functions too, to check which kind of snapshot it is and do the right thing (see the sketch below).

There are pros and cons to both ways, but neither seems clearly better. Given your knowledge of the codebase, and also of Rust, does one way sound better than the other?
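
For reference, a minimal sketch of what option 2 might look like, with hypothetical names (the eventual enum and its fields could differ):

```rust
// Hedged sketch of option 2: a single enum threaded through the existing
// snapshot code paths. Names are illustrative; the real thing may differ.

type Slot = u64;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SnapshotType {
    Full,
    // Carries the slot of the full snapshot this incremental is based on.
    Incremental(Slot),
}

// Existing functions would grow a `snapshot_type` parameter and branch only
// where behavior differs, e.g. when selecting which storages to package.
fn is_incremental(snapshot_type: SnapshotType) -> bool {
    matches!(snapshot_type, SnapshotType::Incremental(_))
}

fn main() {
    assert!(is_incremental(SnapshotType::Incremental(100_000)));
    assert!(!is_incremental(SnapshotType::Full));
}
```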

@brooksprumo
Contributor Author

Also, thinking about this a bit more: since the current snapshot logic is based on multiples of the accounts hash interval, I could also piggyback on it and take an incremental snapshot at every accounts hash interval.

The existing snapshot functions would need parameters added so they can handle incremental snapshots. Then AccountsHashVerifier would need some more logic to take either an incremental snapshot or a full snapshot based on the snapshot interval.

Is that easier/better than the other two ways?

@behzadnouri
Contributor

Very late to the party here : )
Last week Brooks asked me about the gossip part of this, and I just got the chance to take a look at how this impacts gossip.

From my understanding, this will require new values to be gossiped between nodes. So we will save some time and bandwidth when starting a validator, but add to gossip bandwidth usage (which is already a problem) by continuously sending new values across the entire cluster each time a node has a new incremental snapshot.

So a one-time payload between only 2 nodes is replaced by continuous gossip traffic across all nodes, all the time. Is it still clear that the trade-off is positive here? I am particularly worried about how much this is going to make gossip worse.

@sakridge
Member

> Very late to the party here : )
> Last week Brooks asked me about the gossip part of this, and I just got the chance to take a look at how this impacts gossip.
>
> From my understanding, this will require new values to be gossiped between nodes. So we will save some time and bandwidth when starting a validator, but add to gossip bandwidth usage (which is already a problem) by continuously sending new values across the entire cluster each time a node has a new incremental snapshot.
>
> So a one-time payload between only 2 nodes is replaced by continuous gossip traffic across all nodes, all the time. Is it still clear that the trade-off is positive here? I am particularly worried about how much this is going to make gossip worse.

The incremental snapshot values would be updated at roughly the same speed as snapshot values today. The regular snapshot rate will be reduced significantly.

@behzadnouri
Contributor

Can we consider this alternative implementation of incremental snapshots:

  • Nodes locally keep latest snapshot at multiples of 100,000 slots as well as recent slots.
    • So only one extra snapshot in addition to whatever snapshots they currently keep.
  • Let's say node a has a snapshot at slot 812,345.
    • So it also has snapshot at slot 800,000 as well (last multiple of 100k).
  • If node b wants to start and it already has some snapshot at multiple 100k it tells a what that slot is (and its respective hash).
    • In the case that b tells a: "I already have snapshot at 800k with this hash" and it matches what a has, then a will compute the diff between its latest snapshot (i.e. 812,345) and 800k and only return that 812,345 diff 800k.
    • Otherwise, if things do not match or b does not have any 100k snapshots, a will send back both the 800k snapshot and 812,345 diff 800k (so 2 files but same total bytes as before).

So, effectively the difference is that:

  • Incremental snapshots conceptually only exist at the time one node wants to start off another node, and are only implemented in the snapshot send/receive code.
  • Everywhere else in the code, snapshot means "full snapshot". No other part of the runtime (and in particular gossip) needs to be updated to support incremental snapshots.

Now, to save disk space, under the hood a may store 812,345 as a diff off 800k, but that is only an optional internal optimization and not exposed outside of the snapshotting code. If node a chooses to do so, it will also speed things up when responding to node b.

@sakridge
Member

sakridge commented Jun 15, 2021

How does the node know that whatever data it got for diff of slot 812,345 -> 800,000 is good and matches the rest of the network?

brooksprumo added a commit that referenced this issue Jul 29, 2021
This commit builds on PR #18504 by adding a test to core/tests/snapshot.rs for Incremental Snapshots. The test adds banks to bank forks in a loop and takes both full snapshots and incremental snapshots at intervals, and validates they are rebuild-able.

For background info about Incremental Snapshots, see #17088.

Fixes #18829 and #18972
brooksprumo added a commit to brooksprumo/solana that referenced this issue Sep 21, 2021
When reconstructing the AccountsDb, if the storages came from full and
incremental snapshots generated on different nodes, it's possible that
the AppendVec IDs could overlap/have duplicates, which would cause the
reconstruction to fail.

This commit handles this issue by unconditionally remapping the
AppendVec ID for every AppendVec.

Fixes solana-labs#17088
@brooksprumo brooksprumo reopened this Sep 24, 2021
@brooksprumo
Contributor Author

DONE!

dankelleher pushed a commit to identity-com/solana that referenced this issue Nov 24, 2021
When reconstructing the AccountsDb, if the storages came from full and
incremental snapshots generated on different nodes, it's possible that
the AppendVec IDs could overlap/have duplicates, which would cause the
reconstruction to fail.

This commit handles this issue by unconditionally remapping the
AppendVec ID for every AppendVec.

Fixes solana-labs#17088
@github-actions
Contributor

This issue has been automatically locked since there has not been any activity in past 7 days after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 30, 2022