Incremental Snapshots #17088
Comments
Actually, the only problem this fixes is that startup takes longer because the node might have to download a large snapshot. Now, if the node has the large 'full' snapshot, then it just needs to download the correct incremental one to apply on top of that. The snapshot extraction speed would be exactly the same, maybe worse, because it now has to combine both snapshots, and the amount of data it is ingesting will be the same at the end of the computation. The incremental snapshot would be on top of a full snapshot. The idea is that a validator creates a full snapshot maybe every 100,000 slots, and then creates an incremental snapshot every 100 slots. A node joining the network from nothing would have to download the full snapshot and then the latest incremental that applies on top of it. Some solution steps/ideas:
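As a rough sketch of the schedule described above (a full snapshot every 100,000 slots and an incremental every 100 slots), a hypothetical helper could decide which kind of snapshot a root slot gets; the names and constants below are illustrative only, not actual validator code:

```rust
// Hypothetical sketch: deciding which kind of snapshot to take at a root slot,
// assuming a full-snapshot interval of 100,000 slots and an incremental
// interval of 100 slots.
type Slot = u64;

const FULL_SNAPSHOT_INTERVAL: Slot = 100_000;
const INCREMENTAL_SNAPSHOT_INTERVAL: Slot = 100;

#[derive(Debug, PartialEq)]
enum SnapshotKind {
    Full,
    /// Incremental on top of the given full-snapshot slot
    Incremental { base_slot: Slot },
    None,
}

fn snapshot_kind_for_slot(slot: Slot, last_full_snapshot_slot: Option<Slot>) -> SnapshotKind {
    if slot % FULL_SNAPSHOT_INTERVAL == 0 {
        SnapshotKind::Full
    } else if slot % INCREMENTAL_SNAPSHOT_INTERVAL == 0 {
        match last_full_snapshot_slot {
            // An incremental snapshot only makes sense on top of an existing full one
            Some(base_slot) => SnapshotKind::Incremental { base_slot },
            None => SnapshotKind::None,
        }
    } else {
        SnapshotKind::None
    }
}

fn main() {
    assert_eq!(snapshot_kind_for_slot(200_000, Some(100_000)), SnapshotKind::Full);
    assert_eq!(
        snapshot_kind_for_slot(100_300, Some(100_000)),
        SnapshotKind::Incremental { base_slot: 100_000 }
    );
    assert_eq!(snapshot_kind_for_slot(100_050, Some(100_000)), SnapshotKind::None);
}
```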
There are a couple of strategies here. One is a fixed interval, having all nodes in the network try to create at the same block height. That is nice because a node can get a full snapshot from one node and then likely find an incremental one later from another node which used the same full snapshot slot. Incremental snapshots with different parents will obviously be incompatible. Another could be to create it dynamically based on how quickly the incremental grows. It's expected the incremental will keep getting larger and larger, potentially as large as the full snapshot. At that point, or maybe at 50% of the size of the full snapshot, you roll up the state and just create a new full snapshot. This may be necessary if the fixed interval doesn't work well. If a node ends up downloading a full snapshot and an incremental one just as big, then that's not great.
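To illustrate the dynamic roll-up idea (assuming the 50% threshold mentioned above), a minimal sketch of the size check might look like this; the function and constant names are made up for the example:

```rust
// Sketch of the "dynamic" strategy: keep taking incremental snapshots until
// the incremental archive grows past some fraction (e.g. 50%) of the full
// snapshot's size, then roll up into a new full snapshot.
const ROLL_UP_RATIO: f64 = 0.5;

fn should_roll_up_to_full(incremental_archive_bytes: u64, full_archive_bytes: u64) -> bool {
    // Guard against an empty full snapshot, which should never happen in practice.
    full_archive_bytes > 0
        && (incremental_archive_bytes as f64) >= ROLL_UP_RATIO * (full_archive_bytes as f64)
}

fn main() {
    // A 6 GB incremental on top of a 10 GB full snapshot would trigger a new full snapshot.
    assert!(should_roll_up_to_full(6_000_000_000, 10_000_000_000));
    assert!(!should_roll_up_to_full(1_000_000_000, 10_000_000_000));
}
```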
Another idea is that we re-store part of the account state on each slot to keep the append-vec count down to under the number of slots in an epoch. This isn't great, because it will bloat the size of the incremental snapshots: those re-stores will look like new updates to the accounts system. I think it would be good to remove this and then support combining accounts from different slots into a single append-vec. This is the account store code as part of the rent collection: Line 3571 in fa86a33
@sakridge Are you envisioning just a single incremental snapshot between full snapshots (a new incremental snapshot replaces the existing one), or multiple? I was assuming multiple, but it doesn't need to be that way.

Multiple: Incremental snapshot B is the diff from A to B, incremental snapshot C is the diff from B to C, etc.

Single: Same picture as above, but when incremental snapshot C is created, it is the diff from A to C, and then B is deleted. Same for incremental snapshot S, which is the diff from A to S, and all other incremental snapshots are removed.
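A tiny sketch of the difference between the two designs, in terms of which slot a new incremental snapshot diffs against; the types and names are illustrative, not actual code:

```rust
// "Multiple": the base is the previous incremental snapshot.
// "Single": the base is always the last full snapshot, and the previous
// incremental is discarded.
type Slot = u64;

enum Design {
    Multiple,
    Single,
}

fn diff_base_slot(
    design: &Design,
    last_full_snapshot_slot: Slot,
    last_incremental_snapshot_slot: Option<Slot>,
) -> Slot {
    match design {
        Design::Multiple => last_incremental_snapshot_slot.unwrap_or(last_full_snapshot_slot),
        Design::Single => last_full_snapshot_slot,
    }
}

fn main() {
    // Full snapshot at A = 100_000, previous incremental at B = 100_100.
    // "Multiple": C diffs from B; "Single": C diffs from A (and B is deleted).
    assert_eq!(diff_base_slot(&Design::Multiple, 100_000, Some(100_100)), 100_100);
    assert_eq!(diff_base_slot(&Design::Single, 100_000, Some(100_100)), 100_000);
}
```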
I was thinking the node would just have a single incremental which would replace all previous incremental ones, but I'm open to arguments for the multiple design (or others?). I just think with the multiple design you will have a lot of duplicated state updates in the subsequent snapshots, and only one is useful, so there will be a lot of overlap. I think there is some small set of accounts updated a lot, like once per slot, and there are many more accounts which are only updated very infrequently, like once in a million or more slots. Hopefully we capture those infrequently updated accounts with the full snapshot, and the incremental captures the frequently updated ones. edit: Maybe keep 2 incrementals, kind of like we have today, so if the newest one is bad, you can fall back to the old one.
Sounds good!
I like it.
I was thinking about the interval as well, and thought maybe, if feasible, it would be cool to have snapshots cascading into different intervals. For instance, a couple that are 100 from the tip, a couple that are 200, then 400, 800, etc. You could even have different threads packaging these at different intervals, or maybe different validators package snapshots at these varying intervals, i.e. some package at a faster rate than others. The benefit here is that I think most validators who shut down recently don't need to download a large incremental snapshot that's 1/2 the size of the full snapshot, and can instead grab a few smaller ones to catch up. This might also be useful for nodes trying to catch up if they can fast-sync small incremental snapshots from near the tip of the network.
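A quick illustration of the cascading-interval idea, assuming intervals doubling from 100 slots; the helpers below are hypothetical and only show how a restarting node might pick the smallest incremental that covers its gap:

```rust
// Snapshots kept at distances of roughly 100, 200, 400, 800, ... slots from
// the tip: a node that is only a little behind grabs the smallest incremental
// that still covers its gap, instead of one half the size of the full snapshot.
fn cascading_intervals(max_interval: u64) -> Vec<u64> {
    std::iter::successors(Some(100u64), |i| Some(i * 2))
        .take_while(|&i| i <= max_interval)
        .collect()
}

fn smallest_covering_interval(slots_behind: u64, max_interval: u64) -> Option<u64> {
    cascading_intervals(max_interval)
        .into_iter()
        .find(|&interval| interval >= slots_behind)
}

fn main() {
    assert_eq!(cascading_intervals(800), vec![100, 200, 400, 800]);
    // A node 150 slots behind the tip only needs the 200-slot incremental.
    assert_eq!(smallest_covering_interval(150, 800), Some(200));
    // A node 5000 slots behind needs something bigger than any cascade here.
    assert_eq!(smallest_covering_interval(5_000, 800), None);
}
```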
@carllin This design sounds like it falls under the "Multiple" category of number-of-incremental-snapshots-between-full-snapshots. Is that right?
Another way to do it: split the snapshot across a range of slots, then keep track of every instance of clean_accounts and shrink_slots, and keep a dirty set of stores to regenerate the new snapshot slices for only those slots that had changed data.
I think one of the nice things about how we've organized storage entries by slot currently is that the set of storage entries forms a natural diff tracker. So for instance if we were taking a snapshot every 100 slots:
Then as long as you guarantee clean doesn't progress past the slot while you're grabbing a copy of the storage entries, this should work. And this can be expanded to arbitrary intervals: 200, 400, etc. Also, while you're generating the 100-slot diff, let's say from (200, 300], you could use that to then generate the 200-slot diff from (100, 300]. For v1, one approach is to just focus on using a fixed, configurable interval.
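A minimal sketch of the "storages as a natural diff tracker" point, assuming storages keyed by slot; selecting the storages for a half-open range like (200, 300] is just a range query. The Storage type here is a stand-in for the real AppendVec-backed entries:

```rust
use std::collections::BTreeMap;

type Slot = u64;

#[derive(Debug, Clone)]
struct Storage {
    slot: Slot,
    // ... account data would live here in the real AccountsDb
}

/// Storages for the half-open range (base_slot, top_slot], e.g. (200, 300].
fn storages_for_incremental(
    storages_by_slot: &BTreeMap<Slot, Storage>,
    base_slot: Slot,
    top_slot: Slot,
) -> Vec<Storage> {
    storages_by_slot
        .range(base_slot + 1..=top_slot)
        .map(|(_, storage)| storage.clone())
        .collect()
}

fn main() {
    let storages: BTreeMap<Slot, Storage> =
        (100..=300).step_by(50).map(|slot| (slot, Storage { slot })).collect();
    // The (200, 300] diff picks up only the storages for slots 250 and 300;
    // the (100, 300] diff can be built by extending it with (100, 200].
    let diff = storages_for_incremental(&storages, 200, 300);
    assert_eq!(diff.iter().map(|s| s.slot).collect::<Vec<_>>(), vec![250, 300]);
}
```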
Hey, late to the party. :) I thought about this a bit, and I may be able to add some more color. Still inheriting the delta slot interval design discussed so far, I think we can forgo generating those delta snapshots at some guessed intervals; then we can realize even faster startup. My idea is to reuse. This design can come later, but it mostly overlaps with the delta snapshot archive generation and download code. The new restart flow:
my assumptions and observations:
OK, it looks like there are a few different ways I could go to implement incremental snapshots (or whatever we call it). I don't know how to quantify which way is the right way to go, though. Who should be the one to pick? Or are there additional data points I can gather that would make the decision clear?
@carllin @sakridge could you share initial thoughts on my alternative idea? #17088 (comment)
As long as you build in a mechanism to ensure this serialization happens at a consistent point in time between clean/shrink, this should work.
From my understanding, this is an on-demand streaming of AccountsDb storages. However, it seems like you'll still need to support a full snapshot fetch (not just the accounts storage, but also the status cache, bank, etc.) for nodes that crashed/corrupted their serialized/saved state. As for how fast on-demand snapshotting is compared to incremental snapshots that happen at regular intervals, I think on-demand snapshotting is strictly worse if the time to take a snapshot is longer than the snapshot interval. For instance, if we have some nodes packaging incremental snapshots every 100 slots, and the packaging process takes 500 slots, then:
This implies to me that having some nodes doing regular snapshotting at different varying intervals still seems preferable.
thanks for confirming! yeah, hopefully restarting by itself should make this synchronization requirement easy. :)
Yeah, but I think this should be rare. As for the delta snapshot, I think the status cache and bank should be small compared to the accounts storage, so I omitted to mention them. Maybe the on-demand snapshot endpoint processing needs to briefly grab the root bank and stash those binaries and include them in the delta snapshot archive.
Thanks for the great analysis. :) Firstly, I just noticed that avoiding purging accounts and recomputing the index can be realized for both incremental and on-demand snapshotting if both just restart with proper coordination, like saving the incremental snapshot locally before exiting (CC: @brooksprumo). Originally, I thought this was only possible because on-demand can freely specify the start slot for fetching a snapshot dynamically. Maybe you can select and download an incremental snapshot which slightly overlaps with the local root to get the same optimization?
Oh, I'm getting a clue. The
Yeah, that's true. However, I thought we could offset that delay with vastly reduced network bandwidth by de-duplicating data across equivalent multiple incremental snapshots. As said above, if we can realize one delta snapshot download per restart in the normal case, the merit of on-demand snapshots is small, though. :) Also, I don't think an on-demand snapshot takes so long: recent AppendVecs should generally be in the page cache, and index creation is basically linear processing, so it's bound by disk read bandwidth (unlike an incremental snapshot, there is no need to write an archive; it only needs to grab the mmap).
Handling the zero-lamport account updates sounds hard to me for @ryoqun's idea. The target node would have to have the same clean state as the source node, or some way to reconcile it. I think initially starting from a known state is an easier solution. These on-demand ideas might be good to explore once we have the basic mechanism working.
Thanks for the input, everyone. I'm going ahead with the single incremental snapshot, set at an interval of 100 slots, with full snapshots at 100,000 slots. One of the implementation details is that I'll need to pass around a slightly different set of parameters for an incremental snapshot vs a full snapshot. I was thinking I could do this either by:
or
There are some pros and cons to both ways, but neither seems like the clear better way. Given your knowledge of the codebase, and also Rust, does one way sound better than the other?
Also, thinking about this a bit more: since the current snapshot logic is based on multiples of the accounts hash interval, I could also piggyback on it and do an incremental snapshot at every accounts hash interval. The existing snapshot functions would need parameters added to handle incremental snapshots. Then AccountsHashVerifier would need some more logic to do either an incremental snapshot or a full snapshot based on the snapshot interval. Is that easier/better than the other two ways?
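For what it's worth, one way to model the "slightly different set of parameters" would be a single enum threaded through the snapshot code paths, which AccountsHashVerifier could match on; this is only a sketch of the idea, not the code that eventually landed:

```rust
// Hypothetical: a single request type distinguishing full vs incremental
// snapshots, carrying the base slot only in the incremental case.
type Slot = u64;

#[derive(Debug, Clone, Copy, PartialEq)]
enum SnapshotRequest {
    Full,
    /// Incremental on top of the full snapshot taken at `full_snapshot_slot`
    Incremental { full_snapshot_slot: Slot },
}

fn describe(request: SnapshotRequest, slot: Slot) -> String {
    match request {
        SnapshotRequest::Full => format!("full snapshot at slot {slot}"),
        SnapshotRequest::Incremental { full_snapshot_slot } => {
            format!("incremental snapshot at slot {slot}, based on slot {full_snapshot_slot}")
        }
    }
}

fn main() {
    println!("{}", describe(SnapshotRequest::Full, 200_000));
    println!("{}", describe(SnapshotRequest::Incremental { full_snapshot_slot: 200_000 }, 200_300));
}
```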
Very late to the party here :) From my understanding, this will require new values to be gossiped between nodes. So we will save some time and bandwidth when starting a validator, but then add to gossip bandwidth usage (which is already a problem) by continuously sending new values across the entire cluster each time a node has a new incremental snapshot. So a one-time payload between only 2 nodes is replaced by continuous gossip traffic across all nodes all the time. Is it still a given that the trade-off is positive here? I am in particular worried about how much this is going to make gossip worse.
The incremental snapshot values would be updated at roughly the same speed as snapshot values today. The regular snapshot rate will be reduced significantly. |
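For illustration, the gossiped value could look roughly like the following, where the full-snapshot entry changes rarely and only the incremental entries update at today's snapshot-hash rate; the type and field names here are assumptions, not the actual CrdsData variant:

```rust
// Hypothetical shape of a gossiped incremental-snapshot-hashes value.
type Slot = u64;
type Hash = [u8; 32];

#[derive(Debug, Clone)]
struct IncrementalSnapshotHashes {
    full: (Slot, Hash),
    incremental: Vec<(Slot, Hash)>,
}

fn main() {
    // The full entry below would change only every ~100,000 slots, while the
    // incremental entries churn at roughly today's snapshot-hash gossip rate.
    let advertised = IncrementalSnapshotHashes {
        full: (200_000, [0u8; 32]),
        incremental: vec![(200_100, [1u8; 32]), (200_200, [2u8; 32])],
    };
    println!("{advertised:?}");
}
```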
Can we consider this alternative implementation of incremental snapshots:
So, effectively the difference is that:
Now, to save disk space, under the hood |
How does the node know that whatever data it got for diff of slot |
This commit builds on PR #18504 by adding a test to core/tests/snapshot.rs for Incremental Snapshots. The test adds banks to bank forks in a loop and takes both full snapshots and incremental snapshots at intervals, and validates they are rebuild-able. For background info about Incremental Snapshots, see #17088. Fixes #18829 and #18972
When reconstructing the AccountsDb, if the storages came from full and incremental snapshots generated on different nodes, it's possible that the AppendVec IDs could overlap/have duplicates, which would cause the reconstruction to fail. This commit handles this issue by unconditionally remapping the AppendVec ID for every AppendVec. Fixes solana-labs#17088
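A small sketch of the remapping this commit describes, assuming a shared counter hands out fresh AppendVec IDs during reconstruction; the names are simplified for the example:

```rust
// When storages from a full snapshot and an incremental snapshot (possibly
// generated on different nodes) are combined, their AppendVec IDs may collide,
// so every AppendVec is unconditionally assigned a fresh ID.
use std::collections::HashMap;
use std::sync::atomic::{AtomicU32, Ordering};

type AppendVecId = u32;

fn remap_append_vec_ids(
    original_ids: &[AppendVecId],
    next_id: &AtomicU32,
) -> HashMap<AppendVecId, AppendVecId> {
    original_ids
        .iter()
        .map(|&old_id| (old_id, next_id.fetch_add(1, Ordering::Relaxed)))
        .collect()
}

fn main() {
    let next_id = AtomicU32::new(0);
    // IDs 3 and 7 from the full snapshot, 3 and 9 from the incremental: the
    // duplicate 3 would otherwise clash, so everything gets remapped.
    let full = remap_append_vec_ids(&[3, 7], &next_id);
    let incremental = remap_append_vec_ids(&[3, 9], &next_id);
    assert_ne!(full[&3], incremental[&3]);
    println!("full: {full:?}, incremental: {incremental:?}");
}
```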
DONE!
Problem
Startup time for nodes is slow when having to download a large snapshot.
Proposed Solution
Incremental snapshots! Instead of always having to download large, full snapshots, download a full snapshot once (or less often), then download a small incremental snapshot. The expectation/hope is that only a small number of accounts are touched often, so incremental snapshots optimize for that behavior. At startup, a node with an existing full snapshot now only needs to download a new, small incremental snapshot.
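To make the startup flow concrete, here is a hypothetical sketch of how a node with a full snapshot on disk would pick the one incremental snapshot it still needs; the types and selection logic are assumptions for the example:

```rust
// A bootstrapping node only needs the newest incremental snapshot whose base
// matches the slot of the full snapshot it already has.
type Slot = u64;

#[derive(Debug, Clone, Copy, PartialEq)]
struct IncrementalSnapshot {
    base_slot: Slot, // slot of the full snapshot it applies on top of
    slot: Slot,      // slot this incremental snapshot was taken at
}

fn pick_incremental(
    full_snapshot_slot: Slot,
    available: &[IncrementalSnapshot],
) -> Option<IncrementalSnapshot> {
    available
        .iter()
        .filter(|iss| iss.base_slot == full_snapshot_slot)
        .max_by_key(|iss| iss.slot)
        .copied()
}

fn main() {
    let available = [
        IncrementalSnapshot { base_slot: 100_000, slot: 100_500 },
        IncrementalSnapshot { base_slot: 200_000, slot: 200_300 },
        IncrementalSnapshot { base_slot: 200_000, slot: 200_400 },
    ];
    // With the full snapshot from slot 200_000 on disk, only the 200_400
    // incremental needs to be downloaded at startup.
    assert_eq!(
        pick_incremental(200_000, &available),
        Some(IncrementalSnapshot { base_slot: 200_000, slot: 200_400 })
    );
}
```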
SLOTS_PER_EPOCH / 2 - 1?

Example
ISS C = diff(A, C), and so on.

Details
Storing an Incremental Snapshot
Loading from an Incremental Snapshot
Validator
Background Services
AccountsDb
clean_accounts() to add a new parameter, last_full_snapshot_slot, to not clean zero-lamport accounts above the last FSS slot (see the sketch after this Details section)
Ledger Tool
RPC
Gossip
CrdsData
Bootstrap
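Referring back to the clean_accounts()/last_full_snapshot_slot item under AccountsDb above, a minimal sketch of the intended check, with simplified stand-in types: zero-lamport accounts newer than the last full snapshot slot are kept so the incremental snapshot still carries their "tombstones":

```rust
// Simplified stand-ins for the real AccountsDb structures; only the decision
// of whether a zero-lamport account may be purged is modeled here.
type Slot = u64;

struct AccountEntry {
    slot: Slot,
    lamports: u64,
}

fn can_purge_zero_lamport_account(
    entry: &AccountEntry,
    last_full_snapshot_slot: Option<Slot>,
) -> bool {
    entry.lamports == 0
        && match last_full_snapshot_slot {
            // Only purge zero-lamport accounts at or below the last FSS slot.
            Some(fss_slot) => entry.slot <= fss_slot,
            // No full snapshot taken yet: keep the previous cleaning behavior.
            None => true,
        }
}

fn main() {
    let old = AccountEntry { slot: 199_900, lamports: 0 };
    let new = AccountEntry { slot: 200_150, lamports: 0 };
    assert!(can_purge_zero_lamport_account(&old, Some(200_000)));
    assert!(!can_purge_zero_lamport_account(&new, Some(200_000)));
}
```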
Testing
Unit Tests
snapshot_utils: roundtrip bank to snapshot to bank for FSS
snapshot_utils: roundtrip bank to snapshot to bank for ISS
snapshot_utils: cleanup zero-lamport accounts in slots after FSS
Integration Tests
core/tests/snapshots.rs
local_cluster
Questions
how often should incremental snapshots be created?
  more is good when (re)joining (faster startup time?), but less is good for the running node (less resource utilization?)
does it matter when incremental snapshots would be made? Like after/before certain cleanup code?
should incremental snapshots only exist locally, or should they also be sent to new nodes?
  i'm guessing we want to send incremental snapshots to new nodes as well, so they start up faster
what goes in the incremental snapshot?
  is it all the same data types as a full snapshot, just the delta since the last snapshot?
should full snapshots be created less/same/more frequently now?
  likely not more... but there for completeness
  still need full snapshots for a new node joining the network
what tests are needed?
  obviously a test to make sure it works
  ensure fallback to full snapshot works if an incremental snapshot is borked
Related Work
Original Snapshot Work
Future Work
Tasks
snapshot_utils::bank_to_xxx_snapshot_archive() and core/tests/snapshots.rs::make_xxx_snapshot_archive() #18972