Skip to content

Commit

Permalink
fix(swingset): commit crank-buffer after every c.step/c.run
Browse files Browse the repository at this point in the history
We observed a consensus failure on the chain that was triggered by a
validator getting restarted. The root cause seems to be that device
calls (specifically `poll()` on the timer device) cause state changes, which
are accumulated in the crankhasher and linger in the crank buffer until
something causes them to be committed. This commit doesn't happen until
processQueueMessage() finishes, which only happens if the run-queue had some
work to do. This occurs in a subset of the times that `kernel.run()` is
invoked.

On the chain, each block causes a `poll()` followed by a `kernel.run()`. If
the `poll()` did not cause any timers to fire, the timer device will not
enqueue any messages to the timer vat, the run-queue remains empty, and
`run()` has no work to do. Therefore the crank buffer is not committed.

Imagine a chain in which some delivery causes everything to be committed in
block 1, then blocks 2-9 have a `poll()` but no actual deliveries, then
something happens to make block 10 have a delivery (causing a commit).

Now follow validator A (which is not restarted): when the block 10 delivery
happens, the crank buffer (and crankhasher) has accumulated all the state
changes from blocks 2-10, including all the timer device state udpates from
the `poll()` calls.

Suppose validator B is restarted after block 5. When it resumes, the crank
buffer is empty, and the crankhasher is at the starting state. Blocks 6-10
add changes to both, and at block 10 the `commit()` causes the crankhasher to
be finalized. The total data in the crankhasher is different (the input is
smaller) on B than on A, and the crankhash thus diverges. This causes the
activityhash to diverge, which causes a consensus failure.

The invariant we must maintain is that the crank buffer and crankhasher must
be empty at any time the host application might commit the block buffer.
There must be no lingering changes in the crank buffer, because that buffer
is ephemeral (it lives only in RAM).

The simplest way to achieve this is to add a `commitCrank()` to the exit
paths of `kernel.step()` and `kernel.run()`, so that the invariant is
established before the host application regains control.

We must also rearrange the way that `incrementCrankNumber` is called.
Previously, it was called at the end of `start()`, leaving the "current"
crank number in the crank buffer (but not committed to the block buffer yet).
Then, at end-of-crank, the sequence was `commitCrank()` followed by
`incrementCrankNumber()`. In this approach, the crank buffer was never empty:
except for the brief moment between these two calls, it always had a pending
update.

In the new approach, we remove the call from `start()`, and we increment the
crank number just *before* the end-of-crank commit. This way, the crank
buffer remains empty between cranks.

One additional piece remains: we should really commit the crank buffer after
every device call. The concern is that an e.g. timer device `poll()` causes
some state changes but no delivery, and then the next delivery fails (i.e. it
causes the target vat to terminate). We don't want the unrelated device
changes to be discarded along with the vat's half-completed activity. This
would be a good job for the #720 kernel input queue.

refs #3720
  • Loading branch information
warner committed Aug 18, 2021
1 parent d395e3f commit 90c2b85
Show file tree
Hide file tree
Showing 3 changed files with 99 additions and 3 deletions.
8 changes: 5 additions & 3 deletions packages/SwingSet/src/kernel/kernel.js
Original file line number Diff line number Diff line change
Expand Up @@ -801,15 +801,15 @@ export default function buildKernel(
}
kernelKeeper.processRefcounts();
kernelKeeper.saveStats();
const { crankhash, activityhash } = kernelKeeper.commitCrank();
const crankNum = kernelKeeper.getCrankNumber();
kernelKeeper.incrementCrankNumber();
const { crankhash, activityhash } = kernelKeeper.commitCrank();
kernelSlog.write({
type: 'crank-finish',
crankNum,
crankhash,
activityhash,
});
kernelKeeper.incrementCrankNumber();
} finally {
processQueueRunning = undefined;
}
Expand Down Expand Up @@ -1066,7 +1066,6 @@ export default function buildKernel(
}

kernelKeeper.loadStats();
kernelKeeper.incrementCrankNumber();
}

function getNextMessage() {
Expand Down Expand Up @@ -1094,8 +1093,10 @@ export default function buildKernel(
if (kernelPanic) {
throw kernelPanic;
}
kernelKeeper.commitCrank();
return 1;
} else {
kernelKeeper.commitCrank();
return 0;
}
}
Expand Down Expand Up @@ -1150,6 +1151,7 @@ export default function buildKernel(
return count;
}
}
kernelKeeper.commitCrank();
return count;
}

Expand Down
88 changes: 88 additions & 0 deletions packages/SwingSet/test/test-activityhash-vs-start.js
Original file line number Diff line number Diff line change
@@ -0,0 +1,88 @@
// eslint-disable-next-line import/order
import { test } from '../tools/prepare-test-env-ava.js';

// eslint-disable-next-line import/order
import { getAllState, setAllState } from '@agoric/swing-store-simple';
import { provideHostStorage } from '../src/hostStorage.js';
import { initializeSwingset, makeSwingsetController } from '../src/index.js';
import { capargs } from './util.js';
import { buildTimer } from '../src/devices/timer.js';

const TimerSrc = new URL('../src/devices/timer-src.js', import.meta.url)
.pathname;

test('restarting kernel does not change activityhash', async t => {
const sourceSpec = new URL('vat-empty-setup.js', import.meta.url).pathname;
const config = {
bootstrap: 'bootstrap',
vats: {
bootstrap: {
sourceSpec,
creationOptions: {
enableSetup: true,
},
},
},
devices: {
timer: {
sourceSpec: TimerSrc,
},
},
};
const timer1 = buildTimer();
const deviceEndowments1 = {
timer: { ...timer1.endowments },
};
const hs1 = provideHostStorage();
// console.log(`--c1 build`);
await initializeSwingset(config, [], hs1);
const c1 = await makeSwingsetController(hs1, deviceEndowments1);
c1.pinVatRoot('bootstrap');
// console.log(`--c1 poll1`);
timer1.poll(1);
// console.log(`--c1 run1`);
await c1.run();

// console.log(`--c1 getAllState`);
const state = getAllState(hs1);
// console.log(`ah: ${c1.getActivityhash()}`);

// console.log(`--c1 poll1`);
timer1.poll(2);
// console.log(`--c1 run2`);
await c1.run();

// console.log(`--c1 dummy()`);
c1.queueToVatRoot('bootstrap', 'dummy', capargs([]));
// console.log(`--c1 run3`);
await c1.run();
const c1ah = c1.getActivityhash();
await c1.shutdown();
// console.log(`--c1 shutdown`);

// a kernel restart is loading a new kernel from the same state
const timer2 = buildTimer();
const deviceEndowments2 = {
timer: { ...timer2.endowments },
};
const hs2 = provideHostStorage();
setAllState(hs2, state);
// console.log(`--c2 build`);
const c2 = await makeSwingsetController(hs2, deviceEndowments2);
// console.log(`ah: ${c2.getActivityhash()}`);

// console.log(`--c2 poll1`);
timer2.poll(2);
// console.log(`--c2 run2`);
await c2.run();

// console.log(`--c2 dummy()`);
c2.queueToVatRoot('bootstrap', 'dummy', capargs([]));
// console.log(`--c2 run3`);
await c2.run();

const c2ah = c2.getActivityhash();
await c2.shutdown();

t.is(c1ah, c2ah);
});
6 changes: 6 additions & 0 deletions packages/SwingSet/test/vat-empty-setup.js
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
export default function setup() {
function dispatch(_vatDeliverObject) {
// ignore everything
}
return harden(dispatch);
}

0 comments on commit 90c2b85

Please sign in to comment.