Snapshot pipeline takes longer on v1.16 over v1.14 #31955
Comments
After letting my v1.16 Stackpath node run for a while against mnb, I'm seeing that I'm getting incrementals every ~200 slots, despite the time spent to create the individual incrementals being on the order of ~20ish seconds (at recent cluster speeds, ~200 slots is roughly 80-100 seconds of wall-clock time). So, the snapshot creation cadence might just be a symptom of the upstream service that issues snapshot requests being backed up.
To more clearly and concretely illustrate why nodes are permanently being observed as behind by `getHealth`:
Compared against a v1.14 node on the same hardware:
The nodes both observed
So, if the v1.16 node compares the slot for its most recent snapshot (197983161) against the v1.14 node's (197983369), it will see itself as behind. And again, this difference is a monitoring thing only; both nodes are keeping up with the tip in terms of processing the newest blocks.
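For illustration, here is a minimal sketch of that comparison, assuming `getHealth` effectively compares the node's own latest snapshot slot against the highest snapshot slot seen from its known validators and allows some fixed slot distance before reporting unhealthy. The constant name, the 150-slot threshold, and the function shape are my assumptions for the sketch, not the actual RPC implementation:

```rust
// A minimal sketch of the comparison described above; names and the threshold
// are assumptions for illustration, not the actual agave/solana RPC code.
const HEALTH_CHECK_SLOT_DISTANCE: u64 = 150; // assumed allowable lag, in slots

#[derive(Debug)]
enum Health {
    Ok,
    Behind { num_slots: u64 },
}

// `my_snapshot_slot`: slot of this node's most recent snapshot, as reported in
// gossip (197983161 for the v1.16 node above).
// `cluster_snapshot_slot`: highest snapshot slot observed from known
// validators (197983369 from the v1.14 node above).
fn check_health(my_snapshot_slot: u64, cluster_snapshot_slot: u64) -> Health {
    let lag = cluster_snapshot_slot.saturating_sub(my_snapshot_slot);
    if lag <= HEALTH_CHECK_SLOT_DISTANCE {
        Health::Ok
    } else {
        Health::Behind { num_slots: lag }
    }
}

fn main() {
    // 197983369 - 197983161 = 208 slots of apparent lag, which exceeds the
    // assumed threshold, so the node reports unhealthy even though it is
    // processing the newest blocks.
    println!("{:?}", check_health(197983161, 197983369));
}
```

A snapshot pipeline that can't keep up with the 100-slot incremental interval keeps that reported slot stale, which is enough to trip the check even when block processing is current.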
Oh wow, interesting that the total time through the snapshot pipe is so different on v1.14 and v1.16. Are the numbers in #31955 (comment) on a bare metal machine or a VM?
Those numbers are both from VMs.
Given that this issue was comparing v1.14 vs. v1.16 and that we have moved mnb to v1.17, I think we can close this issue out. It is not clear to me if we got resolution, but with how many changes have gone in between v1.14 and v1.17, we would seemingly need to start fresh in terms of digging in. Brooks - feel free to reopen if you disagree and want to pursue anything further.
Welp, I forgot about this issue... Yeah, we can reopen (or create a new one) if needed later. |
Problem
This issue stems from Discord discussion that took place with operators upgrading their testnet nodes to v1.16. One operator reported seeing `getHealth` report their node as unhealthy much more frequently with v1.16 than v1.14; this was specifically on a Stackpath VM. Looking at metrics, the node in question was doing just fine and was running with the tip of the cluster. I was then able to obtain a Stackpath VM and reproduce the same situation; the node was keeping up with the cluster, but `getHealth` (via `solana-validator monitor`) was reporting the node as unhealthy. One obvious thing to note off the bat is the change in default behavior between v1.14 and v1.16 to put the accounts index on disk (v1.16) instead of in memory (v1.14).

Given that the node was processing blocks just fine, I began digging into `getHealth`. The `getHealth` RPC method has some subtlety in how it works; see #16957 (comment) for more details. Given that `getHealth` is dependent on snapshot slots getting reported in gossip, I then began looking at how the node was doing with snapshot creation. My nodes are using the default snapshot intervals: 25k slots for fulls and 100 slots for incrementals.

Here are some testnet logs for a full and the two incrementals before and after the full, running v1.16:
Here are some mainnet beta logs for a full and the two incrementals before and after the full, running v1.16:
100 slots is ~40-50 seconds based on recent cluster averages; if a snapshot takes longer than this to create, then there would obviously be a backup of requests. I believe when snapshots (and related work in the pipeline) take longer than this interval, snapshot requests will get "dropped". For example, `SnapshotPackagerService::get_next_snapshot_package()` will pick the highest-priority request and discard any requests with a slot older than the request being handled. `getHealth` will return healthy after a new full snapshot has been created, transition to unhealthy at some point, and continue to report unhealthy until the next full is created.
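As a rough illustration of that drop behavior (not the actual `SnapshotPackagerService` code; the type and function names below are made up for the sketch), the policy looks something like:

```rust
// A simplified sketch of the request-dropping policy described above; the
// real SnapshotPackagerService logic differs in its details.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum SnapshotKind {
    Incremental,
    Full, // full snapshots outrank incrementals
}

#[derive(Debug, Clone, Copy)]
struct SnapshotRequest {
    slot: u64,
    kind: SnapshotKind,
}

// Pick the highest-priority pending request (by kind, then slot), discard every
// remaining request whose slot is not newer than the chosen one, and return the
// chosen request plus how many requests were dropped without being handled.
fn next_snapshot_request(
    pending: &mut Vec<SnapshotRequest>,
) -> Option<(SnapshotRequest, usize)> {
    let chosen = *pending.iter().max_by_key(|r| (r.kind, r.slot))?;
    let before = pending.len();
    pending.retain(|r| r.slot > chosen.slot);
    let num_dropped = before - pending.len() - 1; // exclude the handled request itself
    Some((chosen, num_dropped))
}

fn main() {
    // If handling one request takes longer than the 100-slot incremental
    // interval, new requests pile up and only the newest survives each pass.
    let mut pending = vec![
        SnapshotRequest { slot: 100, kind: SnapshotKind::Incremental },
        SnapshotRequest { slot: 200, kind: SnapshotKind::Incremental },
        SnapshotRequest { slot: 300, kind: SnapshotKind::Incremental },
    ];
    let (handled, dropped) = next_snapshot_request(&mut pending).unwrap();
    println!("handling slot {}, dropped {} older request(s)", handled.slot, dropped);
}
```

In steady state this is fine, but if the machine's I/O can't finish a snapshot within the interval, most requests end up dropped and the node's newest published snapshot slot lags further and further behind.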
Proposed Solution
Part of writing up this issue was just to do a brain dump before the weekend. Compared to bare metal machines, I think virtual machines in general are not as performant when it comes to disk. So, there may not be much we can do in terms of further optimization except for suggesting operators use `tmpfs` if they have spare RAM. But it would still be nice if `getHealth` worked even when the node doesn't have the I/O capability to create a snapshot every 100 slots, so maybe we should instead revisit solving #16957.