fix: Ignore sync errors when the block is already verified #980
Conversation
looks good to me, I just have one suggestion on how we could reduce the indentation a little bit.
Force-pushed b528720 to 337d9f8:

If we get an error for a block that is already in our state, we don't need to restart the sync. It was probably a duplicate download.

Also: process any ready tasks before reset, so the logs and metrics are up to date (but ignore the errors, because we're about to reset). Improve sync logging and metrics during the download and verify task.

Co-authored-by: Jane Lusby <[email protected]>
I don't think this is a good change, because the error that it ignores is a sign that something has gone wrong with the sync process, and so the sync state is no longer valid. Rather than continuing, it should restart the sync process and discard the invalid state.

Because the sync service discards state, each sync failure causes the next sync to download some duplicate blocks. This is particularly common during checkpoint verification, when there are a lot of queued blocks. If we restart the next sync on duplicate blocks, then we cause repeated sync failures, even though there was only one original error. Because of these repeated failures, each triggering the next failure, Zebra can get into a state where it is only failing, and never making progress.

Can we fix that problem directly, either by (a) cancelling requests we're no longer interested in, or (b) making our verification checks handle duplicate blocks better, rather than expanding the state machine of the sync component? Adding attempted recovery from partial failures adds a bunch of new state transitions, but the state machine for the sync component is already very complicated. This error occurs only when the sync invariants have already been violated, meaning that the sync component is in an invalid state. I'm not sure how we can keep going, if we know that our current state is invalid.

The concern about the retry in #993 (comment) is essentially the same kind of concern about state machine complexity.
I agree with your concern about complexity. But I'm not sure I understand your analysis of the edge cases and implementation details here. Here's my understanding of our current design constraints:
We spawn a task for each download and verify. I don't know how to cancel a spawned task.
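One common workaround for uncancellable spawned tasks is cooperative cancellation: the task periodically checks a shared flag and exits early when it is set. The sketch below is a std-only analogy using threads, not Zebra's actual async task handling; the names (`spawn_cancellable`, the 5ms "block" delay) are hypothetical, and async runtimes have their own cancellation mechanisms.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Spawn a worker that checks a shared cancellation flag between units
/// of work, and stops as soon as the flag is set. Returns the number of
/// "blocks" it processed before being cancelled.
fn spawn_cancellable(cancel: Arc<AtomicBool>) -> thread::JoinHandle<u32> {
    thread::spawn(move || {
        let mut blocks_processed = 0;
        while !cancel.load(Ordering::Relaxed) {
            // Simulate downloading and verifying one block.
            thread::sleep(Duration::from_millis(5));
            blocks_processed += 1;
        }
        blocks_processed
    })
}

fn main() {
    let cancel = Arc::new(AtomicBool::new(false));
    let handle = spawn_cancellable(cancel.clone());

    // Let a few "blocks" through, then request cancellation:
    // the worker notices the flag on its next loop iteration.
    thread::sleep(Duration::from_millis(30));
    cancel.store(true, Ordering::Relaxed);

    let done = handle.join().unwrap();
    println!("worker processed {} blocks before cancellation", done);
}
```

The key point is that cancellation is only observed at the checkpoints the task itself chooses, so a task that is blocked inside a single long operation still cannot be interrupted mid-operation.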
What should we do when we receive a duplicate block? In the BlockVerifier:
In the CheckpointVerifier:
We're going to have similar issues with contextual verification in the state, because it also has a queue of blocks. So this is definitely a problem worth solving.
What if we have already reset the state corresponding to the invalid blocks? Because of the race conditions between the sync service and the verifiers, we don't know that our current state is invalid. All we know is that at least one of our current or previous states was invalid. I don't know how to reset sync state, download state, and verifier state at the same time. And without synchronised resets, we can't rely on errors to tell us about sync invariant violations, because the errors depend on state outside the syncer.
Also, the inbound service could potentially download the same blocks as the sync service, leading to duplicate block errors in whichever service loses the race. I'm starting to wonder if a different design would be helpful here:
The Queue never resets state: its state clears automatically after the task timeout. If the Sync or Inbound services don't like the results they are getting from the Queue, it's their job to clear their own state, disconnect peers, or otherwise manage the requests they are making to the Queue. I think this design would give us 3 simpler state machines with clear dependencies, rather than a single Sync state machine that's trying to do too much (including things the inbound service wants to know about).
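One core behaviour of a shared queue like this is deduplicating requests by block hash, so that the Sync and Inbound services racing for the same block produce one download instead of a duplicate-block error. The sketch below is a hypothetical, simplified model of that idea (the `DownloadQueue` type and its methods are my invention, not Zebra's API), with no timeouts or actual networking.

```rust
use std::collections::HashMap;

/// Hypothetical download/verify queue that deduplicates requests by
/// block hash: a second request for an in-flight hash joins the
/// existing download instead of starting another one.
struct DownloadQueue {
    // hash -> number of callers interested in this in-flight download
    in_flight: HashMap<String, usize>,
}

impl DownloadQueue {
    fn new() -> Self {
        DownloadQueue { in_flight: HashMap::new() }
    }

    /// Returns true if this call actually started a new download,
    /// false if it was coalesced into an existing in-flight request.
    fn request(&mut self, hash: &str) -> bool {
        let count = self.in_flight.entry(hash.to_string()).or_insert(0);
        *count += 1;
        *count == 1
    }

    /// Called when the download/verify task for `hash` finishes or
    /// times out; every interested caller is considered notified.
    fn complete(&mut self, hash: &str) {
        self.in_flight.remove(hash);
    }

    fn in_flight_len(&self) -> usize {
        self.in_flight.len()
    }
}

fn main() {
    let mut queue = DownloadQueue::new();

    // Syncer asks for a block, then Inbound races it for the same block:
    assert!(queue.request("00abc"));   // new download started
    assert!(!queue.request("00abc"));  // coalesced, no second download
    assert_eq!(queue.in_flight_len(), 1);

    queue.complete("00abc");
    assert_eq!(queue.in_flight_len(), 0);
    println!("duplicate request was coalesced into one download");
}
```

Because deduplication lives in the Queue, neither caller needs to know about the other, which is what lets the Sync and Inbound state machines stay independent.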
Here are the invariants that are preserved by this design:
Here are some other useful properties, which may vary over time or state:
Here are some quirks we might want to fix later:
If we get an error for a block that is already in our state, we don't
need to restart the sync. It was probably a duplicate download.
This change is part of a set of related changes which increase sync
reliability.
Also:
Process any ready tasks before reset, so the logs and metrics are
up to date. (But ignore the errors, because we're about to reset.)
Improve sync logging and metrics during the download and verify task.
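The decision the commit message describes, which treats "block already in state" errors as harmless duplicates rather than sync failures, can be sketched roughly as follows. The `SyncError` enum and `should_restart_sync` function are hypothetical stand-ins, not Zebra's actual types.

```rust
/// Hypothetical stand-in for the errors a block download/verify
/// task can return (not Zebra's real error type).
#[derive(Debug, PartialEq)]
enum SyncError {
    /// The block was already committed to our state: most likely
    /// a harmless duplicate download.
    AlreadyInState,
    /// A real verification failure.
    BadSignature,
}

/// Decide whether a task error should restart the whole sync.
/// Errors for blocks we already verified are ignored, because
/// restarting on them would discard valid sync state for nothing.
fn should_restart_sync(result: Result<(), SyncError>) -> bool {
    match result {
        Ok(()) => false,
        Err(SyncError::AlreadyInState) => false, // probably a duplicate download
        Err(_) => true, // anything else invalidates the sync state
    }
}

fn main() {
    assert!(!should_restart_sync(Ok(())));
    assert!(!should_restart_sync(Err(SyncError::AlreadyInState)));
    assert!(should_restart_sync(Err(SyncError::BadSignature)));
    println!("only unexpected errors restart the sync");
}
```

This directly targets the failure cascade described earlier: a duplicate download after a reset no longer triggers the next reset.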