Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

state sync: improvements from mocknet testing #12507

Merged
merged 5 commits into from
Dec 4, 2024

Conversation

saketh-are
Copy link
Collaborator

@saketh-are saketh-are commented Nov 24, 2024

This PR contains several fixes which improve the speed and robustness of state sync:

  • All part requests to peers are now made before all cloud attempts. Previously we focused on obtaining specific parts one by one, which could cause a thread to block for a long time until a particular part was uploaded to cloud storage. It takes tens of minutes after the epoch ends for dumper nodes to write all state parts to cloud storage, whereas peer hosts are ready to serve all requests as soon as the epoch ends.
  • Part request order is randomized at each syncing node, preventing spikes in demand to specific hosts.
  • Removes an unnecessary check for state headers when serving state parts. In some cases this was preventing peer hosts which do not track all shards from responding successfully to part requests.

Before these changes, it took up to 75 minutes for nodes to download parts for the largest shard (38.8 GiB in 1324 parts). After these changes:

  • Nodes consistently finish downloading parts in under 15 min,
  • State requests to peer hosts have a failure rate below 1%,
  • and 100% of parts are successfully obtained from peer hosts within three requests.
Screenshot 2024-11-24 at 7 17 39 AM

Additional minor improvements:

  • Adds a separate config parameter to specify how long to wait after a failed cloud download. This allows nodes to avoid spamming requests to cloud storage before parts have been uploaded.
  • Adds metrics recording the number of cache hits and misses when serving state part requests.
  • Distinguishes different types of errors collected in the near_state_sync_download_result metric.

Closes issues #12497, #12498, #12499

Once merged this PR should be cherry-picked into the 2.4 release. cc @staffik

Copy link

codecov bot commented Nov 24, 2024

Codecov Report

Attention: Patch coverage is 86.17886% with 17 lines in your changes missing coverage. Please review.

Project coverage is 70.12%. Comparing base (fc9aaa3) to head (0dd7a1c).
Report is 1 commits behind head on master.

Files with missing lines Patch % Lines
chain/chain/src/metrics.rs 50.00% 7 Missing ⚠️
chain/client/src/sync/state/network.rs 0.00% 4 Missing ⚠️
chain/client/src/sync/state/external.rs 84.21% 2 Missing and 1 partial ⚠️
chain/client/src/sync/state/downloader.rs 91.66% 1 Missing and 1 partial ⚠️
chain/chain/src/chain.rs 50.00% 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##           master   #12507   +/-   ##
=======================================
  Coverage   70.12%   70.12%           
=======================================
  Files         841      841           
  Lines      169839   169901   +62     
  Branches   169839   169901   +62     
=======================================
+ Hits       119099   119149   +50     
- Misses      45597    45614   +17     
+ Partials     5143     5138    -5     
Flag Coverage Δ
backward-compatibility 0.16% <0.00%> (-0.01%) ⬇️
db-migration 0.16% <0.00%> (-0.01%) ⬇️
genesis-check 1.29% <4.95%> (+<0.01%) ⬆️
linux 69.36% <80.48%> (+0.01%) ⬆️
linux-nightly 69.72% <86.17%> (+<0.01%) ⬆️
pytests 1.59% <4.95%> (+<0.01%) ⬆️
sanity-checks 1.40% <4.95%> (+<0.01%) ⬆️
unittests 69.94% <86.17%> (+<0.01%) ⬆️
upgradability 0.20% <0.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@saketh-are saketh-are marked this pull request as ready for review November 24, 2024 17:41
@saketh-are saketh-are requested a review from a team as a code owner November 24, 2024 17:41
Copy link
Contributor

@VanBarbascu VanBarbascu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

}
res
}
.instrument(tracing::debug_span!("StateSyncDownloader::ensure_shard_part_downloaded"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change name of debug span

@saketh-are saketh-are enabled auto-merge December 1, 2024 20:54
@saketh-are saketh-are added this pull request to the merge queue Dec 2, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Dec 2, 2024
@saketh-are saketh-are added this pull request to the merge queue Dec 4, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Dec 4, 2024
@saketh-are saketh-are added this pull request to the merge queue Dec 4, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Dec 4, 2024
@saketh-are saketh-are enabled auto-merge December 4, 2024 14:13
@saketh-are saketh-are added this pull request to the merge queue Dec 4, 2024
Merged via the queue into near:master with commit 71e0447 Dec 4, 2024
27 checks passed
@saketh-are saketh-are deleted the saketh-state-sync branch December 4, 2024 14:59
staffik pushed a commit that referenced this pull request Dec 4, 2024
This PR contains several fixes which improve the speed and robustness of
state sync:
- **All part requests to peers are now made before all cloud attempts**.
Previously we focused on obtaining specific parts one by one, which
could cause a thread to block for a long time until a particular part
was uploaded to cloud storage. It takes tens of minutes after the epoch
ends for dumper nodes to write all state parts to cloud storage, whereas
peer hosts are ready to serve all requests as soon as the epoch ends.
- **Part request order is randomized at each syncing node**, preventing
spikes in demand to specific hosts.
- **Removes an unnecessary check for state headers when serving state
parts**. In some cases this was preventing peer hosts which do not track
all shards from responding successfully to part requests.

Before these changes, it took up to 75 minutes for nodes to download
parts for the largest shard (38.8 GiB in 1324 parts). After these
changes:
* Nodes consistently finish downloading parts in under 15 min,
* State requests to peer hosts have a failure rate below 1%,
* and 100% of parts are successfully obtained from peer hosts within
three requests.
<img width="1374" alt="Screenshot 2024-11-24 at 7 17 39 AM"
src="https://github.com/user-attachments/assets/90537548-514b-49b6-87aa-e08b21a24f86">

-----

Additional minor improvements:
- Adds a separate config parameter to specify how long to wait after a
failed cloud download. This allows nodes to avoid spamming requests to
cloud storage before parts have been uploaded.
- Adds metrics recording the number of cache hits and misses when
serving state part requests.
- Distinguishes different types of errors collected in the
`near_state_sync_download_result` metric.

Closes issues #12497, #12498, #12499

Once merged this PR should be cherry-picked into the 2.4 release. cc
@staffik
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants