
consensus failure due to incorrect crankhash synchronization #3720

Closed
dckc opened this issue Aug 18, 2021 · 12 comments
Labels
bug Something isn't working

Comments

@dckc
Member

dckc commented Aug 18, 2021

Describe the bug

a bit vague, for now

@michaelfig restarted one validator at block=1265, and then got CONSENSUS FAILURE on it at block=1294.


Platform Environment

  • what version of the Agoric-SDK are you using? (run git describe --tags --always)
    35cb64e I think. i.e. @agoric/[email protected]


@warner
Member

warner commented Aug 18, 2021

Here's the latest experiment, with two validators ("good" and "bad"). We bounced "bad" between blocks 92 and 93. Nothing happens until the next timer wakeup, which does a timerDevice.poll(blockTime) during the BEGIN_BLOCK phase for block 114; that poll must 1: do four SO(handler).wake(time) calls (pushed onto the run-queue), and 2: do a setDeviceState() (updating the current time). The subsequent controller.run() performs four wake deliveries to the timer vat, none of which cause any syscalls (because... nobody is waiting on them? then why did the timer device choose to tell the timer vat??).

The first such delivery reports a different crankhash in the restarted ("bad") validator than the other. This suggests that one of the state writes was different in that validator. The run-queue writes are probably the same (as evidenced by the same deliveries being made), but the setDeviceState could have been different, or it could have happened a different number of times. Either would cause the crankhash to diverge.

@warner
Member

warner commented Aug 18, 2021

I instrumented storageWrapper.js to log each time a DB key is changed, and it looks like there's an extra call to kernelKeeper.incrementCrankNumber() at the end of kernel.js start(), which causes a DB write during restart that doesn't happen otherwise. That would be enough to throw off the crankhash.

Simply removing the increment call causes tests to fail: some failures are trivial, others are legitimate. I think there's a subtle fencepost problem that necessitated the increment; however, the extra DB write is a problem. @FUDCo I'll be hitting you up for whatever you can remember about that increment call.
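
For reference, the instrumentation was roughly of this shape (a minimal sketch, not the actual storageWrapper.js change; the wrapper and method names here are assumptions):

```js
// Hypothetical sketch: wrap the kernel's storage object so every key change
// is logged, which makes an extra DB write (like the incrementCrankNumber one
// above) stand out when comparing two validators.
function makeLoggingStorage(storage, log = console.log) {
  return {
    ...storage,
    set(key, value) {
      log(`SET ${key} = ${value}`);
      return storage.set(key, value);
    },
    delete(key) {
      log(`DELETE ${key}`);
      return storage.delete(key);
    },
  };
}
```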

@FUDCo
Contributor

FUDCo commented Aug 18, 2021

I'm fairly sure that increment is supposed to be there, but I think it should only be there for the initial startup (i.e., it probably should be happening but probably should be initiated from a slightly different location in the code).

@warner
Member

warner commented Aug 18, 2021

Oh, like do it in initializeKernel rather than start?

One quick solution would be to add crankNumber to the set of DB keys that are omitted from the crankhash.
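
A minimal sketch of that quick solution, assuming the crank buffer keeps a set of consensus-exempt keys (the set's real contents and this function's shape are assumptions):

```js
// Hypothetical: keys in this set are still written to the DB but never fed to
// the crankhasher, so they cannot perturb the crankhash.
const crankhashExemptKeys = new Set([
  'crankNumber', // ...plus whatever keys are already exempt
]);

function hashStateChange(crankhasher, kind, key, value = '') {
  if (crankhashExemptKeys.has(key)) {
    return; // exempt keys do not contribute to the crankhash
  }
  crankhasher.update(`${kind}\n${key}\n${value}\n`);
}
```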

@warner
Member

warner commented Aug 18, 2021

Currently, we call kk.incrementCrankNumber() at the end of start(). Then, at the end of a delivery, we do:

  • kk.commitCrank()
  • crankNum = kk.getCrankNumber() (for the slog)
  • kk.incrementCrankNumber()

which means there will be a pending (not-yet-committed) DB write of crankNumber hanging out at all times.

I think the right solution is to call kernelKeeper.incrementCrankNumber() just before commitCrank, and not call it during start. The end-of-delivery cycle will look like:

  • crankNum = kk.getCrankNumber() (for the slog)
  • kk.incrementCrankNumber()
  • kk.commitCrank()

That should remove the extra DB write during start.
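
In code, the proposed end-of-delivery path would look something like this (a sketch only; the kernelKeeper method names come from the list above, the surrounding structure is assumed):

```js
// New ordering: read the crank number for the slog, bump it, then commit.
// The crank buffer is left empty (no pending crankNumber write) between cranks.
function endOfCrank(kernelKeeper, slogCrankFinish) {
  const crankNum = kernelKeeper.getCrankNumber(); // for the slog
  slogCrankFinish(crankNum);
  kernelKeeper.incrementCrankNumber(); // now happens *before* the commit
  kernelKeeper.commitCrank();
}
```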

@FUDCo
Contributor

FUDCo commented Aug 18, 2021

I don't remember for sure but I think there may be one or two other things that read the crank number during the crank, and these need a correct number to read. Incrementing the crank number at the start of the crank is the right thing to do, but we do need to take care that it can't happen twice.

@warner
Member

warner commented Aug 18, 2021

Oh, it's deeper than that. The crankbuffer includes the crankhash, which is accumulated as DB writes and deletes happen (we keep a crankhasher object around, built out of Node.js crypto.createHash). But that state is ephemeral: it only lives in RAM, and only gets finalized during commitCrank(). On the chain, we get plenty of timer-device poll() calls (which change state) that don't result in any deliveries (because the timer device knows that nothing needs to be woken up), so they never call commitCrank(), leaving those hashed state changes lying around in RAM. The chain commits the (unchanged) activityhash, but that doesn't include the intermediate state of the crankhasher. If you bounce the node, the new process will have less stuff in its crankhasher than the predecessor did, but nobody will notice until an actual delivery happens and the crankhasher is consulted.
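
To make that concrete, the crank buffer / crankhasher arrangement is roughly of this shape (a sketch under assumed names, not the real kernel state code; only crypto.createHash and commitCrank() are named above):

```js
import { createHash } from 'crypto';

// Hypothetical crank buffer: writes and deletes are staged in RAM and folded
// into a running hash; both are only flushed/finalized by commitCrank().
function makeCrankBuffer(db) {
  let crankhasher = createHash('sha256');
  const additions = new Map();
  const deletions = new Set();
  return {
    set(key, value) {
      additions.set(key, value);
      deletions.delete(key);
      crankhasher.update(`add\n${key}\n${value}\n`); // every write feeds the hash
    },
    delete(key) {
      additions.delete(key);
      deletions.add(key);
      crankhasher.update(`delete\n${key}\n`); // deletes feed it too
    },
    commitCrank() {
      for (const [key, value] of additions) db.set(key, value);
      for (const key of deletions) db.delete(key);
      additions.clear();
      deletions.clear();
      const crankhash = crankhasher.digest('hex'); // finalized only here
      crankhasher = createHash('sha256');
      return crankhash; // folded into the activityhash by the caller
    },
  };
}
```

If the process is bounced before commitCrank() runs, everything staged here (and everything already fed to the crankhasher) is simply gone, which is exactly the divergence described above.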

The fix is going to be to commit more often. In addition to the incrementCrankNumber reordering described above, we need to make sure that we do a commitCrank() as we exit c.run(). Ideally we'd do a commit after allowing the device input function to run (this would be a fine addition to the #720 input queue). The invariant to maintain is that when swingset returns control to the application, its state is internally consistent and committed (up to the block buffer).
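
A sketch of that exit-path commit (hedged: only commitCrank(), step(), and run() are named above; the rest of this structure is assumed):

```js
// Hypothetical run() exit path: commit on every exit, even when the run-queue
// was empty (e.g. a timer poll() that woke nothing), so device state changes
// cannot linger in RAM across a host-level commit.
async function run(kernelKeeper, processQueueMessage) {
  let count = 0;
  try {
    while (!kernelKeeper.isRunQueueEmpty()) {
      await processQueueMessage(kernelKeeper.getNextMsg());
      count += 1;
    }
  } finally {
    kernelKeeper.commitCrank(); // invariant holds before the host regains control
  }
  return count;
}
```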

warner added a commit that referenced this issue Aug 18, 2021
We observed a consensus failure on the chain that was triggered by a
validator getting restarted. The root cause seems to be that device
calls (specifically `poll()` on the timer device) cause state changes, which
are accumulated in the crankhasher and linger in the crank buffer until
something causes them to be committed. This commit doesn't happen until
processQueueMessage() finishes, which only happens if the run-queue had some
work to do. This occurs in a subset of the times that `kernel.run()` is
invoked.

On the chain, each block causes a `poll()` followed by a `kernel.run()`. If
the `poll()` did not cause any timers to fire, the timer device will not
enqueue any messages to the timer vat, the run-queue remains empty, and
`run()` has no work to do. Therefore the crank buffer is not committed.

Imagine a chain in which some delivery causes everything to be committed in
block 1, then blocks 2-9 have a `poll()` but no actual deliveries, then
something happens to make block 10 have a delivery (causing a commit).

Now follow validator A (which is not restarted): when the block 10 delivery
happens, the crank buffer (and crankhasher) has accumulated all the state
changes from blocks 2-10, including all the timer device state updates from
the `poll()` calls.

Suppose validator B is restarted after block 5. When it resumes, the crank
buffer is empty, and the crankhasher is at the starting state. Blocks 6-10
add changes to both, and at block 10 the `commit()` causes the crankhasher to
be finalized. The total data in the crankhasher is different (the input is
smaller) on B than on A, and the crankhash thus diverges. This causes the
activityhash to diverge, which causes a consensus failure.

The invariant we must maintain is that the crank buffer and crankhasher must
be empty at any time the host application might commit the block buffer.
There must be no lingering changes in the crank buffer, because that buffer
is ephemeral (it lives only in RAM).

The simplest way to achieve this is to add a `commitCrank()` to the exit
paths of `kernel.step()` and `kernel.run()`, so that the invariant is
established before the host application regains control.

We must also rearrange the way that `incrementCrankNumber` is called.
Previously, it was called at the end of `start()`, leaving the "current"
crank number in the crank buffer (but not committed to the block buffer yet).
Then, at end-of-crank, the sequence was `commitCrank()` followed by
`incrementCrankNumber()`. In this approach, the crank buffer was never empty:
except for the brief moment between these two calls, it always had a pending
update.

In the new approach, we remove the call from `start()`, and we increment the
crank number just *before* the end-of-crank commit. This way, the crank
buffer remains empty between cranks.

One additional piece remains: we should really commit the crank buffer after
every device call. The concern is that an e.g. timer device `poll()` causes
some state changes but no delivery, and then the next delivery fails (i.e. it
causes the target vat to terminate). We don't want the unrelated device
changes to be discarded along with the vat's half-completed activity. This
would be a good job for the #720 kernel input queue.

refs #3720
@dckc changed the title from "CONSENSUS FAILURE at block=1294" to "consensus failure due to incorrect crankhash synchronization" Aug 18, 2021
@dckc
Member Author

dckc commented Aug 18, 2021

I promoted the diagnosis to the title.

This might be a bit premature, but it seems pretty likely.

warner added a commit that referenced this issue Aug 18, 2021
@warner
Member

warner commented Aug 18, 2021

Ok, here's an attempt to explain the issue:

The current/old code looks like this:
[diagram: 3720-old-works]

The empty/white boxes are actions that the host application takes: calling start(), adding something to the mailbox, calling the timer-device poll() function, etc. The green boxes are state changes that get added to the crank buffer. The red box is the crank buffer, which gets committed (and hashed, and emptied) when commitCrank() is called at the end of a delivery. The blue box is when the host calls controller.run() (e.g. kernel.run()). After the host calls run(), the host commits its own state, because it believes swingset has committed its own state at that point.

In this first diagram, we can survive the save/reload cycle because the mailbox deliveries both cause a push to the run-queue, which means the subsequent run() will do a commit. The crank number is incremented during start() and after the commitCrank(). The contents of the red boxes are the same in both the original and the reloaded process, so their crankhashes (and therefore activityhashes) will be the same.

But here's what happens when one of those mailbox deliveries is replaced with a timer-device poll that doesn't cause a run-queue event:

[diagram: 3720-old-fails]

Here, "timer poll 1" causes timer-device state changes but not a run-queue event. The subsequent k.run() does not call commitCrank(), so the changes remain on the crankbuffer, and the activityhash is not updated. The host commits, but there's still ephemeral data in swingset. Later, "timer poll 2" happens, which does cause run-queue activity, so the k.run() does commit, and finalizes the crankhash. But in the reloaded process, the timer state changes from "timer poll 1" are lost, and the contents of the last red crank-buffer box are different.

@warner
Member

warner commented Aug 18, 2021

The fixed version looks like this:

[diagram: 3720-new-1]

Now we always commit() in k.run(), even if there was nothing on the run-queue. And we move the incrementCrankNumber to be before the commit, and remove it from start(). The fencepost is moved, and the crank buffer is strictly empty by the time the host does a commit.

When this approach is presented with the timer-device events:

[diagram: 3720-new-2]

we commit the "timer poll 1" device state changes during the subsequent k.run(), even though it doesn't cause any deliveries to happen. The reloaded process sees the same crankbuffer inputs to "timer poll 2" as the original process, so the crankhashes (and activityhashes) should be the same.
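
From the host's point of view, each block then looks roughly like this (a hypothetical sketch; doBlock and commitBlockBuffer are made-up names, not the real cosmic-swingset API):

```js
// After the fix: by the time the host commits its own block buffer, swingset
// has already committed (and hashed) every crank-buffer change.
async function doBlock(controller, timerDevice, commitBlockBuffer, blockTime) {
  timerDevice.poll(blockTime); // may change device state without causing deliveries
  await controller.run();      // always ends with a commitCrank() now
  commitBlockBuffer();         // host commit sees a consistent, empty crank buffer
}
```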

@warner
Member

warner commented Aug 18, 2021

#3723 implements this fixed ordering

warner added a commit that referenced this issue Aug 18, 2021
@warner
Member

warner commented Aug 18, 2021

Fixed by #3723, although #3726 is a second failure with a separate solution.

@warner closed this as completed Aug 18, 2021