
Snapshot pipeline takes longer on v1.16 over v1.14 #31955

Closed
steviez opened this issue Jun 2, 2023 · 7 comments

Comments

@steviez
Contributor

steviez commented Jun 2, 2023

Problem

This issue stems from a Discord discussion with operators upgrading their testnet nodes to v1.16. One operator reported getHealth marking their node as unhealthy much more frequently on v1.16 than on v1.14; this was specifically on a Stackpath VM. Looking at metrics, the node in question was doing just fine and was keeping up with the tip of the cluster. I was then able to obtain a Stackpath VM and reproduce the same situation: the node was keeping up with the cluster, but getHealth (via solana-validator monitor) was reporting the node as unhealthy. One obvious thing to note off the bat is the change in default behavior between v1.14 and v1.16 to put the accounts index on disk (v1.16) instead of in memory (v1.14).

Given that the node was processing blocks just fine, I began digging into getHealth. The getHealth RPC method has some subtlety in how it works; see #16957 (comment) for more details. Since getHealth depends on snapshot slots getting reported in gossip, I then began looking at how the node was doing with snapshot creation. My nodes are using the default snapshot intervals: 25k slots for fulls and 100 slots for incrementals.
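For context, here is a minimal sketch of the comparison getHealth effectively performs (per the #16957 discussion), not the actual RPC implementation: the node checks how far the newest snapshot slot it has advertised in gossip trails the newest snapshot slot advertised by its known validators. The function name, parameter names, and the threshold value below are placeholders, not the real code or CLI defaults.

```rust
/// Hypothetical stand-in for the getHealth slot-distance check; names and the
/// threshold are illustrative, not the real implementation or defaults.
fn is_healthy(
    my_latest_snapshot_slot: u64,
    cluster_latest_snapshot_slot: u64,
    health_check_slot_distance: u64,
) -> bool {
    cluster_latest_snapshot_slot.saturating_sub(my_latest_snapshot_slot)
        <= health_check_slot_distance
}

fn main() {
    // If snapshot creation lags, the node's advertised slot falls behind the
    // cluster's even though replay is fully caught up, and getHealth flips to
    // unhealthy despite the node processing blocks at the tip.
    assert!(is_healthy(200_535_406, 200_535_507, 150));
    assert!(!is_healthy(200_535_193, 200_535_507, 150));
    println!("slot-distance check behaves as expected");
}
```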

Here are some testnet logs from v1.16, showing a full snapshot plus the two incrementals immediately before and after it:

[2023-06-02T20:36:27.884227872Z INFO  solana_runtime::snapshot_utils]
Successfully created "/var/lib/data/ledger/incremental-snapshot-200508405-200535193-3fyvuBfVQMuU2gepu7p7X7TEUdCYpnhTSX9fhPN4NQqm.tar.zst".
slot: 200535193, elapsed ms: 11230, size=387555674
[2023-06-02T20:37:23.965613366Z INFO  solana_runtime::snapshot_utils]
Successfully created "/var/lib/data/ledger/incremental-snapshot-200508405-200535406-4eD5wqAzyAKDyo78BF2bq5uDMY4LRWVWYuUuhZDS789s.tar.zst".
slot: 200535406, elapsed ms: 11657, size=388754183
[2023-06-02T20:40:22.978880499Z INFO  solana_runtime::snapshot_utils]
Successfully created "/var/lib/data/ledger/snapshot-200535507-CDM4GWBBeNJW6VHP7Wwuddm9qSq8hxn4Sj1W8G8NWwPm.tar.zst".
slot: 200535507, elapsed ms: 136269, size=3441934951
[2023-06-02T20:40:30.646493166Z INFO  solana_runtime::snapshot_utils]
Successfully created "/var/lib/data/ledger/incremental-snapshot-200535507-200535708-JWc5hxQ2jkR2RiKyYtunabGu5rdMybadAc4kUAzMAM3.tar.zst".
slot: 200535708, elapsed ms: 3653, size=191278617
[2023-06-02T20:40:55.913538648Z INFO  solana_runtime::snapshot_utils]
Successfully created "/var/lib/data/ledger/incremental-snapshot-200535507-200535809-ANCQE69QFoJLyZF5dtfhYCDP2MtmRPtky8YpT53awLX9.tar.zst".
slot: 200535809, elapsed ms: 3492, size=190852145

Here are some mainnet beta logs from v1.16, showing a full snapshot plus the two incrementals immediately before and after it:

[2023-06-02T22:30:38.624975272Z INFO  solana_runtime::snapshot_utils]
Successfully created "/var/lib/data/ledger/incremental-snapshot-197400570-197432087-3ccDHZyX2cKy1Dh4V6cRZ88zuQFgmH5pQ2chBgD4jTYv.tar.zst".
slot: 197432087, elapsed ms: 70758, size=3696089330
[2023-06-02T22:31:48.230508491Z INFO  solana_runtime::snapshot_utils]
Successfully created "/var/lib/data/ledger/incremental-snapshot-197400570-197432187-2J7Hi1JtwVpkT4nE6Sr2vZBnEg1ZgHA6364TumSKaE5n.tar.zst".
slot: 197432187, elapsed ms: 69096, size=3706613807
[2023-06-02T22:56:22.589718266Z INFO  solana_runtime::snapshot_utils]
Successfully created "/var/lib/data/ledger/snapshot-197432391-9M1fFZLa8dVEpkHCPa6UkeqYawdU9MmhKt2PjHgU68u5.tar.zst".
slot: 197432391, elapsed ms: 1473812, size=46350346600
[2023-06-02T22:56:43.866687049Z INFO  solana_runtime::snapshot_utils]
Successfully created "/var/lib/data/ledger/incremental-snapshot-197432391-197435550-6sdQqTeb7cJyshr8UarTUHUtDjYJRJvRsxGX3SRxaBVU.tar.zst".
slot: 197435550, elapsed ms: 13030, size=608019097
[2023-06-02T22:57:06.684460213Z INFO  solana_runtime::snapshot_utils]
Successfully created "/var/lib/data/ledger/incremental-snapshot-197432391-197435650-5xgctxEacvKVoZnGorYLcwi43bNdtMLGS8r8XEcFR6P3.tar.zst".
slot: 197435650, elapsed ms: 10959, size=619045185

100 slots is ~40-50 seconds based on recent cluster averages; if a snapshot takes longer than this to create, then there would obviously be a backup of requests. I believe that when snapshots (and related work in the pipeline) take longer than this interval, snapshot requests will get "dropped". For example, SnapshotPackagerService::get_next_snapshot_package() will pick the highest priority request and discard any requests with a slot older than the request being handled (see the sketch after the list below).

  • For testnet, this was seemingly fairly infrequent, as the log above shows that even the largest incrementals (those with the largest slot distance from their corresponding full snapshot) were getting created in < 15 seconds. However, the logs above do show that the two incrementals before the full were ~200 slots apart despite the first one only taking ~11.5 seconds to create.
  • For mainnet beta, the largest incrementals are exceeding 60 seconds to create, whereas the smallest ones are back around 10-15 seconds. Thus, there is a pattern where getHealth will return healthy after a new full snapshot has been created, transition to unhealthy at some point, and continue to report unhealthy until the next full is created.
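Here is a simplified sketch of that drop behavior, assuming a backlog of queued requests. It is not the real get_next_snapshot_package() (which also weighs full vs. incremental packages); it only illustrates how queued requests can silently never produce an archive once the pipeline falls behind:

```rust
/// Toy stand-in for a pending snapshot request; the real type carries much more.
#[derive(Debug, Clone, Copy)]
struct SnapshotRequest {
    slot: u64,
}

/// Pick the "highest priority" pending request (reduced here to newest slot)
/// and drop every request at an older slot, mirroring the behavior described above.
fn next_snapshot_request(pending: &mut Vec<SnapshotRequest>) -> Option<SnapshotRequest> {
    let best = pending.iter().copied().max_by_key(|req| req.slot)?;
    pending.retain(|req| req.slot > best.slot); // older requests are silently dropped
    Some(best)
}

fn main() {
    // Three incremental requests piled up because packaging fell behind
    // (the middle slot is made up for illustration).
    let mut pending = vec![
        SnapshotRequest { slot: 200_535_608 },
        SnapshotRequest { slot: 200_535_708 },
        SnapshotRequest { slot: 200_535_809 },
    ];
    let before = pending.len();
    let handled = next_snapshot_request(&mut pending).unwrap();
    // Only the newest slot gets packaged; the two older requests never become archives.
    println!(
        "handling slot {}; dropped {} older request(s)",
        handled.slot,
        before - 1 - pending.len()
    );
}
```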

Proposed Solution

Part of writing up this issue was just to do a brain dump before the weekend. Compared to bare metal machines, I think virtual machines in general are not as performant when it comes to disk. So, there may not be much we can do in terms of further optimization beyond suggesting that operators use tmpfs if they have spare RAM. But it would still be nice if getHealth worked even when a node doesn't have the I/O capability to create a snapshot every 100 slots, so maybe we should instead revisit solving #16957.

@steviez
Contributor Author

steviez commented Jun 3, 2023

After letting my v1.16 Stackpath node run for a while against mnb, I'm seeing that I'm getting incrementals every ~200 slots, despite the time spent creating the individual incrementals being on the order of ~20 seconds. So, the ~200-slot snapshot cadence might just be a symptom of the upstream service that issues snapshot requests, AccountsHashVerifier, feeding SnapshotPackagerService at a slower rate.
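To illustrate the reasoning (this is not the actual validator code): in a two-stage pipeline, the end-to-end cadence is bounded by the slower stage, so a fast packager cannot produce snapshots more often than the upstream stage hands it requests. A minimal sketch, with std channels standing in for the real plumbing between the two services:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

fn main() {
    // Slot numbers stand in for snapshot requests flowing from the upstream
    // stage (AccountsHashVerifier in the real pipeline) to the packager.
    let (tx, rx) = mpsc::channel::<u64>();

    // Upstream stage: only emits a request every "200 slots".
    let producer = thread::spawn(move || {
        for slot in (0..1000u64).step_by(200) {
            thread::sleep(Duration::from_millis(50)); // pretend the upstream work takes this long
            tx.send(slot).unwrap();
        }
        // Dropping tx closes the channel and ends the consumer loop below.
    });

    // Downstream stage: a fast "packager"; its speed cannot raise the cadence.
    for slot in rx {
        println!("packaged snapshot for slot {slot}");
    }
    producer.join().unwrap();
}
```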

@steviez
Contributor Author

steviez commented Jun 5, 2023

From some Discord chatter, I learned that the num_outstanding_snapshots_packages field under the snapshot_package_service metric tracks the number of requests that SnapshotPackagerService picks from. Below is a capture of this metric over the last two days (D4MDYDFZivQjdzTFJx2NNnaqB2HG9Xqd5tUsDA4iAcbw against mnb):
[screenshot: num_outstanding_snapshots_packages over the last two days]
This metric is typically 1, which is good, but there are large backups whenever fulls are getting created. This correlates with the handling_time_us field of the same metric, which tracks how long it takes for a request to be processed. Same thing, a large jump for the fulls:
[screenshot: handling_time_us for snapshot_package_service over the same period]

@steviez
Contributor Author

steviez commented Jun 5, 2023

To more clearly and concretely illustrate why nodes are persistently observed as behind by getHealth, we can compare the timestamp of when a slot is rooted vs. when the corresponding snapshot is created. For example, 197983161 was a snapshot slot on my node. Both of these logs come from Stackpath servers. The timestamps below show when that slot was rooted and then when the snapshot was created on a v1.16 node:

[2023-06-05T22:31:23.683691593Z INFO  solana_core::replay_stage] new root 197983161

[2023-06-05T22:34:34.189225606Z INFO  solana_runtime::snapshot_utils]
Successfully created "/var/lib/data/ledger/incremental-snapshot-197971281-197983161-5UuNanGnphZKue33cnJ2GJqGB3dFksfouM5Qroa5HFPw.tar.zst". slot: 197983161, elapsed ms: 31476, size=1584284735

Compared against a v1.14 node on the same hardware:

[2023-06-05T22:31:23.983407979Z INFO  solana_core::replay_stage] new root 197983161

[2023-06-05T22:32:36.157435573Z INFO  solana_runtime::snapshot_utils]
Successfully created "/var/lib/data/ledger/incremental-snapshot-197971281-197983161-5UuNanGnphZKue33cnJ2GJqGB3dFksfouM5Qroa5HFPw.tar.zst". slot: 197983161, elapsed ms: 30746, size=1573826677

The nodes both observed 197983161 at effectively the same time. However, the v1.14 node was able to get the accompanying snapshot out ~73 seconds after rooting, whereas the v1.16 node took ~191 seconds. In that time, the v1.14 node had already pushed out several more snapshots:

[2023-06-05T22:33:20.691685668Z INFO  solana_runtime::snapshot_utils]
Successfully created "/var/lib/data/ledger/incremental-snapshot-197971281-197983261-BnqzMpjxv7LVsW81wr7JYke4TBFpmkUUTop2nss57sZR.tar.zst". slot: 197983261, elapsed ms: 29638, size=1582767238

[2023-06-05T22:34:10.967753071Z INFO  solana_runtime::snapshot_utils]
Successfully created "/var/lib/data/ledger/incremental-snapshot-197971281-197983369-2w8cQjHYSzNuqzeM7EmxK6jDyQLK7N2cy3vCvKGdb6QB.tar.zst". slot: 197983369, elapsed ms: 29911, size=1593835013

So, if the v1.16 node compares the slot of its most recent snapshot (197983161) against that of the v1.14 node (197983369), it will see itself as behind. And again, this difference is a monitoring thing only; both nodes are keeping up with the tip in terms of processing the newest blocks.
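For reference, the root-to-snapshot deltas quoted above can be roughly reproduced from the log timestamps; a small sketch using the chrono crate (an external dependency, not part of the validator):

```rust
// Requires the `chrono` crate (e.g. chrono = "0.4") in Cargo.toml.
use chrono::DateTime;

fn main() {
    // Timestamps copied from the v1.16 log lines above: root -> snapshot delta.
    let rooted = DateTime::parse_from_rfc3339("2023-06-05T22:31:23.683691593Z").unwrap();
    let packaged = DateTime::parse_from_rfc3339("2023-06-05T22:34:34.189225606Z").unwrap();
    println!("v1.16 delta: {} s", (packaged - rooted).num_seconds());

    // Same computation for the v1.14 node.
    let rooted = DateTime::parse_from_rfc3339("2023-06-05T22:31:23.983407979Z").unwrap();
    let packaged = DateTime::parse_from_rfc3339("2023-06-05T22:32:36.157435573Z").unwrap();
    println!("v1.14 delta: {} s", (packaged - rooted).num_seconds());
}
```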

@brooksprumo
Contributor

Oh wow, interesting that the total time through the snapshot pipe is so different on v1.14 and v1.16. Are the numbers in #31955 (comment) on a bare metal machine or a VM?

@steviez
Contributor Author

steviez commented Jun 6, 2023

> Oh wow, interesting that the total time through the snapshot pipe is so different on v1.14 and v1.16. Are the numbers in #31955 (comment) on a bare metal machine or a VM?

Those numbers are both from VMs.

steviez changed the title from "Snapshot requests getting dropped affecting node getHealth status" to "Snapshot pipeline takes longer on v1.16 over v1.14" on Jun 6, 2023
@steviez
Contributor Author

steviez commented Jan 25, 2024

Given that this issue was comparing v1.14 vs. v1.16 and that we have moved mnb to v1.17, I think we can close this issue out. It is not clear to me whether we got resolution, but with how many changes have gone in between v1.14 and v1.17, we would seemingly need to start fresh in terms of digging in.

Brooks - feel free to reopen if you disagree and want to pursue anything further

steviez closed this as completed on Jan 25, 2024
@brooksprumo
Contributor

Welp, I forgot about this issue... Yeah, we can reopen (or create a new one) if needed later.
