
fix: Ignore sync errors when the block is already verified #980

Merged

Conversation

@teor2345 (Contributor) commented Sep 1, 2020

If we get an error for a block that is already in our state, we don't
need to restart the sync. It was probably a duplicate download.

This change is part of a set of related changes which increase sync
reliability.

Also:

  • Process any ready tasks before reset, so the logs and metrics are up to date. (But ignore the errors, because we're about to reset.)
  • Improve sync logging and metrics during the download-and-verify task.

@teor2345 added the E-hard, A-rust (Area: Updates to Rust code), and C-enhancement (Category: This is an improvement) labels on Sep 1, 2020
@teor2345 requested a review from a team on September 1, 2020 08:40
@teor2345 self-assigned this on Sep 1, 2020
@teor2345 requested a review from a team on September 2, 2020 01:00
@yaahc (Contributor) previously approved these changes Sep 3, 2020

Looks good to me; I just have one suggestion on how we could reduce the indentation a little bit.

teor2345 and others added 6 commits September 3, 2020 12:36
@teor2345 force-pushed the ignore-duplicate-block-errors branch from b528720 to 337d9f8 on September 3, 2020 02:36
@teor2345 merged commit 48497d4 into ZcashFoundation:main on Sep 3, 2020
@hdevalence (Contributor)

I don't think this is a good change, because the error that it ignores is a sign that something has gone wrong with the sync process and so the sync state is no longer valid.

Rather than continuing, it should restart the sync process and discard the invalid state.

@teor2345 (Contributor, Author) commented Sep 7, 2020

> I don't think this is a good change, because the error that it ignores is a sign that something has gone wrong with the sync process and so the sync state is no longer valid.
>
> Rather than continuing, it should restart the sync process and discard the invalid state.

Because the sync service discards state, each sync failure causes the next sync to download some duplicate blocks.

This is particularly common during checkpoint verification, when there are a lot of queued blocks.

If we restart the next sync whenever we see duplicate blocks, we cause repeated sync failures, even though there was only one original error.

Because each of these repeated failures triggers the next one, Zebra can get into a state where it only fails and never makes progress.

@hdevalence (Contributor)

Can we fix that problem directly, either by (a) cancelling requests we're no longer interested in or (b) making our verification checks handle duplicate blocks better, rather than expanding the state machine of the sync component?

Adding attempted recovery from partial failures adds a bunch of new state transitions, but the state machine for the sync component is already very complicated. This error occurs only when the sync invariants have already been violated, meaning that the sync component is in an invalid state. I'm not sure how we can keep going, if we know that our current state is invalid.

@hdevalence (Contributor)

The concern about the retry in #993 (comment) is essentially the same kind of concern about state machine complexity.

@teor2345 (Contributor, Author) commented Sep 9, 2020

I agree with your concern about complexity. But I'm not sure I understand your analysis of the edge cases and implementation details here.

Here's my understanding of our current design constraints:

> Can we fix that problem directly, either by (a) cancelling requests we're no longer interested in

We spawn a task for each download and verify. I don't know how to cancel a spawned task.
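
One pattern that could work here (not what this PR does) is to race the spawned work against a oneshot cancellation signal inside the task; newer tokio releases also added `JoinHandle::abort`. A minimal sketch, with hypothetical names:

```rust
use tokio::sync::oneshot;

// Hypothetical sketch: wrap the download-and-verify future in a select!
// so the caller can cancel it. `download_and_verify` stands in for the
// real work; it is not Zebra's actual function.
async fn download_and_verify() { /* download, verify, commit */ }

fn spawn_cancellable() -> oneshot::Sender<()> {
    let (cancel_tx, cancel_rx) = oneshot::channel::<()>();
    tokio::spawn(async move {
        tokio::select! {
            _ = download_and_verify() => { /* finished normally */ }
            _ = cancel_rx => { /* cancelled: the work is dropped mid-flight */ }
        }
    });
    // Sending on (or dropping) this sender cancels the task, so the
    // caller must hold it for as long as the work should continue.
    cancel_tx
}
```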

> (b) making our verification checks handle duplicate blocks better

What should we do when we receive a duplicate block?

In the BlockVerifier:

  • we don't want to verify blocks twice, because that wastes CPU
  • if the block is in the state, we can return success (or an error) and drop the block

In the CheckpointVerifier:

  • if the block is in the queue, we don't want to keep it around until we know the result of the checkpoint, because that wastes RAM
    • I don't know how to verify a block once, but send the result on two channels
  • if the block is in the state, we can return success (or an error) and drop the block (sketched below)

We're going to have similar issues with contextual verification in the state, because it also has a queue of blocks. So this is definitely a problem worth solving.
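
A minimal sketch of the "return success and drop the block" option, with an in-memory `HashSet` standing in for the state service (all types here are illustrative, not Zebra's actual API):

```rust
use std::collections::HashSet;

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct BlockHash([u8; 32]);

struct Block { /* header, transactions, ... */ }

impl Block {
    fn hash(&self) -> BlockHash {
        BlockHash([0; 32]) // placeholder; real code hashes the header
    }
}

#[derive(Debug)]
struct VerifyError;

fn verify(_block: &Block) -> Result<(), VerifyError> {
    Ok(()) // placeholder for the real consensus checks
}

/// Skip verification for blocks that are already in the state, so a
/// duplicate download reports success instead of wasting CPU and RAM.
fn check_or_verify(
    state: &mut HashSet<BlockHash>,
    block: Block,
) -> Result<BlockHash, VerifyError> {
    let hash = block.hash();
    if state.contains(&hash) {
        return Ok(hash); // duplicate: already verified and committed
    }
    verify(&block)?;
    state.insert(hash);
    Ok(hash)
}
```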

> Adding attempted recovery from partial failures adds a bunch of new state transitions, but the state machine for the sync component is already very complicated. This error occurs only when the sync invariants have already been violated, meaning that the sync component is in an invalid state. I'm not sure how we can keep going, if we know that our current state is invalid.

What if we have already reset the state corresponding to the invalid blocks?

Because of the race conditions between the sync service and the verifiers, we don't know that our current state is invalid. All we know is that at least one of our current or previous states was invalid.

I don't know how to reset sync state, download state, and verifier state at the same time. And without synchronised resets, we can't rely on errors to tell us about sync invariant violations, because the errors depend on state outside the syncer.

@teor2345 (Contributor, Author) commented Sep 9, 2020

Also, the inbound service could potentially download the same blocks as the sync service, leading to duplicate block errors in whichever service loses the race.

I'm starting to wonder if a different design would be helpful here:

  • Split the "download and verify" task handling into a separate "Queue" component
  • The sync and inbound services submit block hashes to the new Queue component
  • If a duplicate hash is submitted, the Queue hands out a clone of a "watch" (spmc) receiver for that hash (see the sketch below): https://docs.rs/tokio/0.2.22/tokio/sync/index.html#watch-channel
  • if the block downloads, verifies, and commits to the state within a timeout, all the watchers get a success
  • otherwise, all the watchers get a failure
  • before downloading each block, the Queue checks the state, and returns success if the block is in the state
  • if there are too many pending blocks, the Queue exerts backpressure on both the sync and inbound services

The Queue never resets state: its state clears automatically after the task timeout.

If the Sync or Inbound services don't like the results they are getting from the Queue, it's their job to clear their own state, disconnect peers, or otherwise manage the requests they make to the Queue.

I think this design would give us three simpler state machines with clear dependencies, rather than a single Sync state machine that's trying to do too much, including things the inbound service wants to know about.
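
Here's a sketch of the duplicate-handling part of this Queue design. It uses the current tokio 1.x watch API (`send`/`changed`; the 0.2 API linked above differs slightly), and all names are illustrative:

```rust
use std::collections::HashMap;
use tokio::sync::watch;

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct BlockHash([u8; 32]);

/// `None` while the task is pending, `Some(result)` once it finishes.
type VerifyResult = Option<Result<(), ()>>;

struct Queue {
    pending: HashMap<BlockHash, watch::Receiver<VerifyResult>>,
}

impl Queue {
    /// Submit a hash for download and verification. Duplicate hashes get
    /// a clone of the existing receiver, so every concurrent requester
    /// observes the same final result.
    fn submit(&mut self, hash: BlockHash) -> watch::Receiver<VerifyResult> {
        if let Some(rx) = self.pending.get(&hash) {
            return rx.clone();
        }
        let (tx, rx) = watch::channel(None);
        self.pending.insert(hash, rx.clone());
        tokio::spawn(async move {
            // Download, verify, and commit here, bounded by a timeout;
            // then broadcast the outcome to all watchers at once.
            let outcome: Result<(), ()> = Ok(());
            let _ = tx.send(Some(outcome));
        });
        // Real code would also evict `pending` entries after the timeout.
        rx
    }
}

/// How a sync or inbound caller would await the shared result.
async fn wait_for(mut rx: watch::Receiver<VerifyResult>) -> Result<(), ()> {
    loop {
        if let Some(result) = *rx.borrow() {
            return result;
        }
        if rx.changed().await.is_err() {
            return Err(()); // the Queue dropped the sender without a result
        }
    }
}
```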

@teor2345 (Contributor, Author) commented Sep 9, 2020

Here are the invariants that are preserved by this design:

  • all concurrent requests get the same response
  • once a hash has succeeded, it always succeeds (unless it is pruned from the in-memory state as a side-chain)
  • there are no race conditions between checking the result of a task, and checking the state

Here are some other useful properties, which may vary over time or state:

  • a block that depends on other blocks will wait for them until a timeout
  • the timeout rate-limits some failures (see the sketch below)

Here are some quirks we might want to fix later:

  • a failed block can be retried immediately (there is no negative response cache)
  • there is no rate limit on immediate download or verify failures
  • a block that depends on the local node time will fail immediately, rather than awaiting a future time at which it would succeed
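
A sketch of that timeout behaviour, using `tokio::time::timeout`; the constant and function names are illustrative, not Zebra's:

```rust
use std::time::Duration;
use tokio::time::timeout;

// Illustrative value, not a real Zebra constant.
const BLOCK_VERIFY_TIMEOUT: Duration = Duration::from_secs(180);

async fn download_verify_commit() -> Result<(), ()> {
    Ok(()) // placeholder for the real pipeline
}

/// Bound the whole download-verify-commit pipeline with one timeout, so
/// a stalled task turns into a failure that all watchers can observe,
/// and the timeout itself rate-limits how fast such failures repeat.
async fn run_with_timeout() -> Result<(), ()> {
    match timeout(BLOCK_VERIFY_TIMEOUT, download_verify_commit()).await {
        Ok(result) => result,     // finished in time (success or failure)
        Err(_elapsed) => Err(()), // timed out: broadcast a failure
    }
}
```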
