
consensus failure due to incorrect crankhash synchronization #3720

Closed
dckc opened this issue Aug 18, 2021 · 12 comments
Labels
bug Something isn't working

Comments

@dckc
Member

dckc commented Aug 18, 2021

Describe the bug

a bit vague, for now

@michaelfig restarted one validator at block=1265, and then got CONSENSUS FAILURE on it at block=1294.


Platform Environment

  • what version of the Agoric-SDK are you using? (run git describe --tags --always)
    35cb64e I think. i.e. @agoric/[email protected]


@warner
Member

warner commented Aug 18, 2021

Here's the latest experiment, with two validators ("good" and "bad"). We bounced "bad" between blocks 92 and 93. Nothing happens until the next timer wakeup, which does a timerDevice.poll(blockTime) during the BEGIN_BLOCK phase for block 114; that poll must 1: do four SO(handler).wake(time) calls (pushed onto the run-queue), and 2: do a setDeviceState() (updating the current time). The subsequent controller.run() performs four wake deliveries to the timer vat, none of which cause any syscalls (because... nobody is waiting on them? then why did the timer device choose to tell the timer vat??).

The first such delivery reports a different crankhash in the restarted ("bad") validator than the other. This suggests that one of the state writes was different in that validator. The run-queue writes are probably the same (as evidenced by the same deliveries being made), but the setDeviceState could have been different, or it could have happened a different number of times. Either would cause the crankhash to diverge.

@warner
Member

warner commented Aug 18, 2021

I instrumented storageWrapper.js to log each time a DB key is changed, and it looks like there's an extra call to kernelKeeper.incrementCrankNumber() at the end of kernel.js start(), which causes a DB write during restart that doesn't happen otherwise. That would be enough to throw off the crankhash.

Simply removing the increment call causes tests to fail: some failures are trivial, others are legitimate. I think there's a subtle fencepost problem that necessitated the increment; however, the extra DB write is a problem. @FUDCo I'll be hitting you up for whatever you can remember about that increment call.
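
For reference, the instrumentation was roughly of this shape (a minimal sketch, not the actual storageWrapper.js change; the wrapper and method names here are assumptions):

```js
// Hypothetical sketch: wrap the kernel's storage object so every key change
// is logged, which makes an extra DB write (like the incrementCrankNumber one
// above) stand out when comparing two validators.
function makeLoggingStorage(storage, log = console.log) {
  return {
    ...storage,
    set(key, value) {
      log(`SET ${key} = ${value}`);
      return storage.set(key, value);
    },
    delete(key) {
      log(`DELETE ${key}`);
      return storage.delete(key);
    },
  };
}
```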

@FUDCo
Contributor

FUDCo commented Aug 18, 2021

I'm fairly sure that increment is supposed to be there, but I think it should only be there for the initial startup (i.e., it probably should be happening but probably should be initiated from a slightly different location in the code).

@warner
Member

warner commented Aug 18, 2021

Oh, like do it in initializeKernel rather than start?

One quick solution would be to add crankNumber to the set of DB keys that are omitted from the crankhash.
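
A minimal sketch of that quick solution, assuming the crank buffer keeps a set of consensus-exempt keys (the set's real contents and this function's shape are assumptions):

```js
// Hypothetical: keys in this set are still written to the DB but never fed to
// the crankhasher, so they cannot perturb the crankhash.
const crankhashExemptKeys = new Set([
  'crankNumber', // ...plus whatever keys are already exempt
]);

function hashStateChange(crankhasher, kind, key, value = '') {
  if (crankhashExemptKeys.has(key)) {
    return; // exempt keys do not contribute to the crankhash
  }
  crankhasher.update(`${kind}\n${key}\n${value}\n`);
}
```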

@warner
Member

warner commented Aug 18, 2021

Currently, we call kk.incrementCrankNumber() at the end of start(). Then, at the end of a delivery, we do:

  • kk.commitCrank()
  • crankNum = kk.getCrankNumber() (for the slog)
  • kk.incrementCrankNumber()

which means there will be a pending (not-yet-committed) DB write of crankNumber hanging out at all times.

I think the right solution is to call kernelKeeper.incrementCrankNumber() just before commitCrank, and not call it during start. The end-of-delivery cycle will look like:

  • crankNum = kk.getCrankNumber() (for the slog)
  • kk.incrementCrankNumber()
  • kk.commitCrank()

That should remove the extra DB write during start.
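
In code, the proposed end-of-delivery path would look something like this (a sketch only; the kernelKeeper method names come from the list above, the surrounding structure is assumed):

```js
// New ordering: read the crank number for the slog, bump it, then commit.
// The crank buffer is left empty (no pending crankNumber write) between cranks.
function endOfCrank(kernelKeeper, slogCrankFinish) {
  const crankNum = kernelKeeper.getCrankNumber(); // for the slog
  slogCrankFinish(crankNum);
  kernelKeeper.incrementCrankNumber(); // now happens *before* the commit
  kernelKeeper.commitCrank();
}
```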

@FUDCo
Contributor

FUDCo commented Aug 18, 2021

I don't remember for sure but I think there may be one or two other things that read the crank number during the crank, and these need a correct number to read. Incrementing the crank number at the start of the crank is the right thing to do, but we do need to take care that it can't happen twice.

@warner
Member

warner commented Aug 18, 2021

Oh, it's deeper than that. The crankbuffer includes the crankhash, which is accumulated as DB writes and deletes happen (we keep a crankhasher object around, built out of Node.js crypto.createHash). But that state is ephemeral: it only lives in RAM, and only gets finalized during commitCrank(). On the chain, we get plenty of timer-device poll() calls (which change state) that don't result in any deliveries (because the timer device knows that nothing needs to be woken up), so they never call commitCrank(), leaving those hashed state changes lying around in RAM. The chain commits the (unchanged) activityhash, but that doesn't include the intermediate state of the crankhasher. If you bounce the node, the new process will have less stuff in its crankhasher than the predecessor did, but nobody will notice until an actual delivery happens and the crankhasher is consulted.
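
To make that concrete, the crank buffer / crankhasher arrangement is roughly of this shape (a sketch under assumed names, not the real kernel state code; only crypto.createHash and commitCrank() are named above):

```js
import { createHash } from 'crypto';

// Hypothetical crank buffer: writes and deletes are staged in RAM and folded
// into a running hash; both are only flushed/finalized by commitCrank().
function makeCrankBuffer(db) {
  let crankhasher = createHash('sha256');
  const additions = new Map();
  const deletions = new Set();
  return {
    set(key, value) {
      additions.set(key, value);
      deletions.delete(key);
      crankhasher.update(`add\n${key}\n${value}\n`); // every write feeds the hash
    },
    delete(key) {
      additions.delete(key);
      deletions.add(key);
      crankhasher.update(`delete\n${key}\n`); // deletes feed it too
    },
    commitCrank() {
      for (const [key, value] of additions) db.set(key, value);
      for (const key of deletions) db.delete(key);
      additions.clear();
      deletions.clear();
      const crankhash = crankhasher.digest('hex'); // finalized only here
      crankhasher = createHash('sha256');
      return crankhash; // folded into the activityhash by the caller
    },
  };
}
```

If the process is bounced before commitCrank() runs, everything staged here (and everything already fed to the crankhasher) is simply gone, which is exactly the divergence described above.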

The fix is going to be to commit more often. In addition to the incrementCrankNumber reordering described above, we need to make sure that we do a commitCrank() as we exit c.run(). Ideally we'd do a commit after allowing the device input function to run (this would be a fine addition to the #720 input queue). The invariant to maintain is that when swingset returns control to the application, its state is internally consistent and committed (up to the block buffer).
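
A sketch of that exit-path commit (hedged: only commitCrank(), step(), and run() are named above; the rest of this structure is assumed):

```js
// Hypothetical run() exit path: commit on every exit, even when the run-queue
// was empty (e.g. a timer poll() that woke nothing), so device state changes
// cannot linger in RAM across a host-level commit.
async function run(kernelKeeper, processQueueMessage) {
  let count = 0;
  try {
    while (!kernelKeeper.isRunQueueEmpty()) {
      await processQueueMessage(kernelKeeper.getNextMsg());
      count += 1;
    }
  } finally {
    kernelKeeper.commitCrank(); // invariant holds before the host regains control
  }
  return count;
}
```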

warner added a commit that referenced this issue Aug 18, 2021
We observed a consensus failure on the chain that was triggered by a
validator getting restarted. The root cause seems to be that device
calls (specifically `poll()` on the timer device) cause state changes, which
are accumulated in the crankhasher and linger in the crank buffer until
something causes them to be committed. This commit doesn't happen until
processQueueMessage() finishes, which only happens if the run-queue had some
work to do. This occurs in a subset of the times that `kernel.run()` is
invoked.

On the chain, each block causes a `poll()` followed by a `kernel.run()`. If
the `poll()` did not cause any timers to fire, the timer device will not
enqueue any messages to the timer vat, the run-queue remains empty, and
`run()` has no work to do. Therefore the crank buffer is not committed.

Imagine a chain in which some delivery causes everything to be committed in
block 1, then blocks 2-9 have a `poll()` but no actual deliveries, then
something happens to make block 10 have a delivery (causing a commit).

Now follow validator A (which is not restarted): when the block 10 delivery
happens, the crank buffer (and crankhasher) has accumulated all the state
changes from blocks 2-10, including all the timer device state updates from
the `poll()` calls.

Suppose validator B is restarted after block 5. When it resumes, the crank
buffer is empty, and the crankhasher is at the starting state. Blocks 6-10
add changes to both, and at block 10 the `commit()` causes the crankhasher to
be finalized. The total data in the crankhasher is different (the input is
smaller) on B than on A, and the crankhash thus diverges. This causes the
activityhash to diverge, which causes a consensus failure.

The invariant we must maintain is that the crank buffer and crankhasher must
be empty at any time the host application might commit the block buffer.
There must be no lingering changes in the crank buffer, because that buffer
is ephemeral (it lives only in RAM).

The simplest way to achieve this is to add a `commitCrank()` to the exit
paths of `kernel.step()` and `kernel.run()`, so that the invariant is
established before the host application regains control.

We must also rearrange the way that `incrementCrankNumber` is called.
Previously, it was called at the end of `start()`, leaving the "current"
crank number in the crank buffer (but not committed to the block buffer yet).
Then, at end-of-crank, the sequence was `commitCrank()` followed by
`incrementCrankNumber()`. In this approach, the crank buffer was never empty:
except for the brief moment between these two calls, it always had a pending
update.

In the new approach, we remove the call from `start()`, and we increment the
crank number just *before* the end-of-crank commit. This way, the crank
buffer remains empty between cranks.

One additional piece remains: we should really commit the crank buffer after
every device call. The concern is that an e.g. timer device `poll()` causes
some state changes but no delivery, and then the next delivery fails (i.e. it
causes the target vat to terminate). We don't want the unrelated device
changes to be discarded along with the vat's half-completed activity. This
would be a good job for the #720 kernel input queue.

refs #3720
@dckc changed the title from "CONSENSUS FAILURE at block=1294" to "consensus failure due to incorrect crankhash synchronization" Aug 18, 2021
@dckc
Member Author

dckc commented Aug 18, 2021

I promoted the diagnosis to the title.

This might be a bit premature, but it seems pretty likely.

warner added a commit that referenced this issue Aug 18, 2021
@warner
Member

warner commented Aug 18, 2021

Ok, here's an attempt to explain the issue:

The current/old code looks like this:
[diagram: 3720-old-works]

The empty/white boxes are actions that the host application takes: calling start(), adding something to the mailbox, calling the timer-device poll() function, etc. The green boxes are state changes that get added to the crank buffer. The red box is the crank buffer, which gets committed (and hashed, and emptied) when commitCrank() is called at the end of a delivery. The blue box is when the host calls controller.run() (e.g. kernel.run()). After the host calls run(), the host commits its own state, because it believes swingset has committed its own state at that point.

In this first diagram, we can survive the save/reload cycle because the mailbox deliveries both cause a push to the run-queue, which means the subsequent run() will do a commit. The crank number is incremented during start() and after the commitCrank(). The contents of the red boxes are the same in both the original and the reloaded process, so their crankhashes (and therefore activityhashes) will be the same.

But here's what happens when one of those mailbox deliveries is replaced with a timer-device poll that doesn't cause a run-queue event:

[diagram: 3720-old-fails]

Here, "timer poll 1" causes timer-device state changes but not a run-queue event. The subsequent k.run() does not call commitCrank(), so the changes remain on the crankbuffer, and the activityhash is not updated. The host commits, but there's still ephemeral data in swingset. Later, "timer poll 2" happens, which does cause run-queue activity, so the k.run() does commit, and finalizes the crankhash. But in the reloaded process, the timer state changes from "timer poll 1" are lost, and the contents of the last red crank-buffer box are different.

@warner
Member

warner commented Aug 18, 2021

The fixed version looks like this:

[diagram: 3720-new-1]

Now we always commit() in k.run(), even if there was nothing on the run-queue. And we move the incrementCrankNumber to be before the commit, and remove it from start(). The fencepost is moved, and the crank buffer is strictly empty by the time the host does a commit.

When this approach is presented with the timer-device events:

[diagram: 3720-new-2]

we commit the "timer poll 1" device state changes during the subsequent k.run(), even though it doesn't cause any deliveries to happen. The reloaded process sees the same crankbuffer inputs to "timer poll 2" as the original process, so the crankhashes (and activityhashes) should be the same.
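
From the host's point of view, each block then looks roughly like this (a hypothetical sketch; doBlock and commitBlockBuffer are made-up names, not the real cosmic-swingset API):

```js
// After the fix: by the time the host commits its own block buffer, swingset
// has already committed (and hashed) every crank-buffer change.
async function doBlock(controller, timerDevice, commitBlockBuffer, blockTime) {
  timerDevice.poll(blockTime); // may change device state without causing deliveries
  await controller.run();      // always ends with a commitCrank() now
  commitBlockBuffer();         // host commit sees a consistent, empty crank buffer
}
```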

@warner
Member

warner commented Aug 18, 2021

#3723 implements this fixed ordering

warner added a commit that referenced this issue Aug 18, 2021
@warner
Member

warner commented Aug 18, 2021

Fixed by #3723, although #3726 is a second failure with a separate solution.

@warner closed this as completed Aug 18, 2021