Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Epoch Sync] Make epoch sync happen before header sync on AwaitingPeers. #12563

Merged
merged 1 commit into from
Dec 6, 2024

Conversation

robin-near
Copy link
Contributor

@robin-near robin-near commented Dec 5, 2024

I think if I understand correctly, the way the AwaitingPeers state works is simply a marker for the starting state. The mechanism by which we transition away from the AwaitingPeers state is by header sync replacing it with HeaderSync when there are enough peers to run the header sync code at all.

So, before this PR, what would happen is that we start with AwaitingPeers, and epoch sync will see that and say "oh we don't have enough peers, so let's skip", but then header sync takes the stage and starts syncing headers. This ruins the header_head by moving it away from genesis, making epoch sync no longer eligible. In fact, this happens pretty reliably because at startup we would always perform header sync first before performing epoch sync, and since epoch sync is most likely slower than the first header sync response, we're continuing epoch sync with an incorrect header_head (causing either an almost-correct proof application, or a stall if the epoch sync request fails).

There are a few more hardening fixes that we should consider, but for now, this should fix the root cause, by no longer treating AwaitingPeers as special. By the way we'll also not treat StateSync as special, because that just can't be possible if the header_head is at genesis.

@robin-near robin-near requested a review from a team as a code owner December 5, 2024 21:36
@robin-near robin-near added this pull request to the merge queue Dec 6, 2024
Merged via the queue into near:master with commit ebb4717 Dec 6, 2024
22 checks passed
@robin-near robin-near deleted the esyncfix branch December 6, 2024 05:35
staffik pushed a commit that referenced this pull request Dec 6, 2024
…rs. (#12563)

I think if I understand correctly, the way the AwaitingPeers state works
is simply a marker for the starting state. The mechanism by which we
transition away from the AwaitingPeers state is by header sync replacing
it with HeaderSync when there are enough peers to run the header sync
code at all.

So, before this PR, what would happen is that we start with
AwaitingPeers, and epoch sync will see that and say "oh we don't have
enough peers, so let's skip", but then header sync takes the stage and
starts syncing headers. This ruins the header_head by moving it away
from genesis, making epoch sync no longer eligible. In fact, this
happens pretty reliably because at startup we would always perform
header sync first before performing epoch sync, and since epoch sync is
most likely slower than the first header sync response, we're continuing
epoch sync with an incorrect header_head (causing either an
almost-correct proof application, or a stall if the epoch sync request
fails).

There are a few more hardening fixes that we should consider, but for
now, this should fix the root cause, by no longer treating AwaitingPeers
as special. By the way we'll also not treat StateSync as special,
because that just can't be possible if the header_head is at genesis.
staffik added a commit that referenced this pull request Dec 6, 2024
[Epoch Sync] Make epoch sync happen before header sync on AwaitingPeers.
(#12563)

I think if I understand correctly, the way the AwaitingPeers state works
is simply a marker for the starting state. The mechanism by which we
transition away from the AwaitingPeers state is by header sync replacing
it with HeaderSync when there are enough peers to run the header sync
code at all.

So, before this PR, what would happen is that we start with
AwaitingPeers, and epoch sync will see that and say "oh we don't have
enough peers, so let's skip", but then header sync takes the stage and
starts syncing headers. This ruins the header_head by moving it away
from genesis, making epoch sync no longer eligible. In fact, this
happens pretty reliably because at startup we would always perform
header sync first before performing epoch sync, and since epoch sync is
most likely slower than the first header sync response, we're continuing
epoch sync with an incorrect header_head (causing either an
almost-correct proof application, or a stall if the epoch sync request
fails).

There are a few more hardening fixes that we should consider, but for
now, this should fix the root cause, by no longer treating AwaitingPeers
as special. By the way we'll also not treat StateSync as special,
because that just can't be possible if the header_head is at genesis.

Co-authored-by: robin-near <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants