fix: Ignore sync errors when the block is already verified #980
Conversation
looks good to me, I just have one suggestion on how we could reduce the indentation a little bit.
Force-pushed b528720 to 337d9f8:

If we get an error for a block that is already in our state, we don't need to restart the sync. It was probably a duplicate download.

Also: process any ready tasks before reset, so the logs and metrics are up to date (but ignore the errors, because we're about to reset). Improve sync logging and metrics during the download and verify task.

Co-authored-by: Jane Lusby <[email protected]>
I don't think this is a good change, because the error that it ignores is a sign that something has gone wrong with the sync process, and so the sync state is no longer valid. Rather than continuing, it should restart the sync process and discard the invalid state.

Because the sync service discards state, each sync failure causes the next sync to download some duplicate blocks. This is particularly common during checkpoint verification, when there are a lot of queued blocks. If we restart the next sync on duplicate blocks, then we cause repeated sync failures, even though there was only one original error. Because of these repeated failures, each triggering the next failure, Zebra can get into a state where it is only failing, and never making progress.

Can we fix that problem directly, either by (a) cancelling requests we're no longer interested in, or (b) making our verification checks handle duplicate blocks better, rather than expanding the state machine of the sync component? Adding attempted recovery from partial failures adds a bunch of new state transitions, but the state machine for the sync component is already very complicated. This error occurs only when the sync invariants have already been violated, meaning that the sync component is in an invalid state. I'm not sure how we can keep going, if we know that our current state is invalid.

The concern about the retry in #993 (comment) is essentially the same kind of concern about state machine complexity.
I agree with your concern about complexity. But I'm not sure I understand your analysis of the edge cases and implementation details here. Here's my understanding of our current design constraints:
We spawn a task for each download and verify. I don't know how to cancel a spawned task.
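One common workaround for uncancellable spawned tasks is cooperative cancellation: the task periodically checks a shared flag and exits early when it is set. The sketch below is a std-only analogy using threads, not Zebra's actual async task handling; the names (`spawn_cancellable`, the 5ms "block" delay) are hypothetical, and async runtimes have their own cancellation mechanisms.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Spawn a worker that checks a shared cancellation flag between units
/// of work, and stops as soon as the flag is set. Returns the number of
/// "blocks" it processed before being cancelled.
fn spawn_cancellable(cancel: Arc<AtomicBool>) -> thread::JoinHandle<u32> {
    thread::spawn(move || {
        let mut blocks_processed = 0;
        while !cancel.load(Ordering::Relaxed) {
            // Simulate downloading and verifying one block.
            thread::sleep(Duration::from_millis(5));
            blocks_processed += 1;
        }
        blocks_processed
    })
}

fn main() {
    let cancel = Arc::new(AtomicBool::new(false));
    let handle = spawn_cancellable(cancel.clone());

    // Let a few "blocks" through, then request cancellation:
    // the worker notices the flag on its next loop iteration.
    thread::sleep(Duration::from_millis(30));
    cancel.store(true, Ordering::Relaxed);

    let done = handle.join().unwrap();
    println!("worker processed {} blocks before cancellation", done);
}
```

The key point is that cancellation is only observed at the checkpoints the task itself chooses, so a task that is blocked inside a single long operation still cannot be interrupted mid-operation.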
What should we do when we receive a duplicate block? In the BlockVerifier:
In the CheckpointVerifier:
We're going to have similar issues with contextual verification in the state, because it also has a queue of blocks. So this is definitely a problem worth solving.
What if we have already reset the state corresponding to the invalid blocks? Because of the race conditions between the sync service and the verifiers, we don't know that our current state is invalid. All we know is that at least one of our current or previous states was invalid. I don't know how to reset sync state, download state, and verifier state at the same time. And without synchronised resets, we can't rely on errors to tell us about sync invariant violations, because the errors depend on state outside the syncer.
Also, the inbound service could potentially download the same blocks as the sync service, leading to duplicate block errors in whichever service loses the race. I'm starting to wonder if a different design would be helpful here:
The Queue never resets state: its state clears automatically after the task timeout. If the Sync or Inbound services don't like the results they are getting from the Queue, it's their job to clear their own state, disconnect peers, or otherwise manage the requests they are making to the Queue. I think this design would give us 3 simpler state machines with clear dependencies, rather than a single Sync state machine that's trying to do too much (including things the inbound service wants to know about).
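One core behaviour of a shared queue like this is deduplicating requests by block hash, so that the Sync and Inbound services racing for the same block produce one download instead of a duplicate-block error. The sketch below is a hypothetical, simplified model of that idea (the `DownloadQueue` type and its methods are my invention, not Zebra's API), with no timeouts or actual networking.

```rust
use std::collections::HashMap;

/// Hypothetical download/verify queue that deduplicates requests by
/// block hash: a second request for an in-flight hash joins the
/// existing download instead of starting another one.
struct DownloadQueue {
    // hash -> number of callers interested in this in-flight download
    in_flight: HashMap<String, usize>,
}

impl DownloadQueue {
    fn new() -> Self {
        DownloadQueue { in_flight: HashMap::new() }
    }

    /// Returns true if this call actually started a new download,
    /// false if it was coalesced into an existing in-flight request.
    fn request(&mut self, hash: &str) -> bool {
        let count = self.in_flight.entry(hash.to_string()).or_insert(0);
        *count += 1;
        *count == 1
    }

    /// Called when the download/verify task for `hash` finishes or
    /// times out; every interested caller is considered notified.
    fn complete(&mut self, hash: &str) {
        self.in_flight.remove(hash);
    }

    fn in_flight_len(&self) -> usize {
        self.in_flight.len()
    }
}

fn main() {
    let mut queue = DownloadQueue::new();

    // Syncer asks for a block, then Inbound races it for the same block:
    assert!(queue.request("00abc"));   // new download started
    assert!(!queue.request("00abc"));  // coalesced, no second download
    assert_eq!(queue.in_flight_len(), 1);

    queue.complete("00abc");
    assert_eq!(queue.in_flight_len(), 0);
    println!("duplicate request was coalesced into one download");
}
```

Because deduplication lives in the Queue, neither caller needs to know about the other, which is what lets the Sync and Inbound state machines stay independent.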
Here are the invariants that are preserved by this design:
Here are some other useful properties, which may vary over time or state:
Here are some quirks we might want to fix later:
If we get an error for a block that is already in our state, we don't
need to restart the sync. It was probably a duplicate download.
This change is part of a set of related changes which increase sync
reliability.
Also:
Process any ready tasks before reset, so the logs and metrics are
up to date. (But ignore the errors, because we're about to reset.)
Improve sync logging and metrics during the download and verify task.
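The decision the commit message describes, which treats "block already in state" errors as harmless duplicates rather than sync failures, can be sketched roughly as follows. The `SyncError` enum and `should_restart_sync` function are hypothetical stand-ins, not Zebra's actual types.

```rust
/// Hypothetical stand-in for the errors a block download/verify
/// task can return (not Zebra's real error type).
#[derive(Debug, PartialEq)]
enum SyncError {
    /// The block was already committed to our state: most likely
    /// a harmless duplicate download.
    AlreadyInState,
    /// A real verification failure.
    BadSignature,
}

/// Decide whether a task error should restart the whole sync.
/// Errors for blocks we already verified are ignored, because
/// restarting on them would discard valid sync state for nothing.
fn should_restart_sync(result: Result<(), SyncError>) -> bool {
    match result {
        Ok(()) => false,
        Err(SyncError::AlreadyInState) => false, // probably a duplicate download
        Err(_) => true, // anything else invalidates the sync state
    }
}

fn main() {
    assert!(!should_restart_sync(Ok(())));
    assert!(!should_restart_sync(Err(SyncError::AlreadyInState)));
    assert!(should_restart_sync(Err(SyncError::BadSignature)));
    println!("only unexpected errors restart the sync");
}
```

This directly targets the failure cascade described earlier: a duplicate download after a reset no longer triggers the next reset.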