Queue kernel input events #720

Closed · michaelfig opened this issue Mar 18, 2020 · 1 comment
Labels: SwingSet (package: SwingSet)

michaelfig (Member):
@warner and I discovered that there were reentrancy problems with the ag-solo inbound command processing. The general solution to reentrancy is to create a SwingSet input queue, as is hinted at in #710

We may want to adopt a queuing mechanism as part of SwingSet so that other hosts don't have the same problems by construction.
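A minimal sketch of the shape such a queue might take, assuming a hypothetical `makeKernelInputQueue` helper and a `kernel.handleInputEvent` hook (illustrative names only, not the current SwingSet API):

```js
// Hypothetical sketch only: host-facing entry points push events onto a
// kernel-owned queue instead of touching kernel state directly, so a
// re-entrant inbound command cannot interleave with a crank in progress.
function makeKernelInputQueue(kernel) {
  const pending = [];
  return {
    // Safe to call at any time, even while a crank is running.
    enqueue(event) {
      pending.push(event);
    },
    // Called by the kernel only between cranks, after the previous crank's
    // state changes have been committed.
    drain() {
      while (pending.length > 0) {
        kernel.handleInputEvent(pending.shift()); // hypothetical hook
      }
    },
  };
}
```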

warner (Member) commented Feb 4, 2021:

@dckc and I found another need for this today: the xsnap vat manager spawns a child process and controls it with netstrings over a pipe. The initial message ("evaluate SES, load this bundle, tell me when you're ready") is kicked off by the creation of a dynamic vat. The kernel does not wait for this to complete: it assumes it will take a while and continues on to the next crank. At some point in the future, the child finishes its preparations and sends back the "I'm ready" message, causing the VatManager to resolve a Promise which results in the run-queue getting a new message (aimed at the vatAdmin vat).

If this "I'm ready" input occurs in the middle of some existing crank, things could get confused: the run-queue message will be part of the same state delta as the existing crank. If that crank were rolled back for whatever reason, the "dynamic vat is ready" message would be discarded as well. This becomes more likely to happen when all vats are running on xsnap, because they'll all be waiting for IO and thus receptive to IO for unrelated vats/workers. The consequences are not dire so far, because cranks aren't usually abandoned, but we certainly need to fix it.

The solution will be to route this "worker process is ready" event through the same input queue that we want for ag-solo inbound command processing. We don't need to do this for the other IO we do to a vat worker (the ones that start and end a crank, and the ones which process syscalls during a crank) because the kernel knows it is in the middle of a crank when those happen, and the unrelated workers should not be writing anything to the kernel at that time. (Although, now that I think about it, if a worker process spontaneously dies, outside the context of a crank, we need to decide how and when to handle that: do we hide the fact until someone tries to talk to the vat again? do we pretend that it was terminated intentionally? do we try to re-load the vat from the previous state and mask the surprising exit?)
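As a rough illustration (hypothetical names, not the actual xsnap manager code), the manager would hand the notification to the input queue rather than resolving straight into the run-queue:

```js
// Illustrative only: when the xsnap child reports readiness, the vat manager
// enqueues an input event instead of immediately resolving a Promise that
// mutates the run-queue mid-crank. `inputQueue` and `notifyVatAdmin` are
// hypothetical names for this sketch.
function watchWorkerStartup(worker, vatID, inputQueue) {
  worker.once('ready', () => {
    inputQueue.enqueue({ type: 'worker-ready', vatID });
  });
}

// Between cranks, the kernel drains the queue; only then does the
// "dynamic vat is ready" message reach the run-queue, inside its own
// crank-level state delta.
function handleInputEvent(event) {
  if (event.type === 'worker-ready') {
    notifyVatAdmin(event.vatID); // hypothetical: enqueues the run-queue message
  }
}
```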

The other reason we want the kernel to have an input queue is for device inputs. One example is the "mailbox device", whose external API surface includes a function which the host can call when data arrives from a remote machine (perhaps in the POST handler of an HTTP server). The device reacts to this arrival by adding a message to the run-queue, which means changing kernel state, which runs all the risks described above for vat worker startup. Currently, cosmic-swingset has an external queue for this purpose, but we should move it into the kernel proper.

This overlaps with device setup (#1351) because we want the external API surface to be routed through the kernel's input queue. That would be easier if we knew what that surface was. I'm thinking the device construction function should return an object whose properties are API functions, but the controller wraps each function in the input queue interlock before revealing the wrappers to the caller. Then e.g. the webserver could call `mailboxAPI.addMessage(data)` at any moment, without regard to the state the kernel is in, and the kernel input queue would take responsibility for delaying its processing until the kernel is between cranks.
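A sketch of that wrapping, with hypothetical controller internals (the real device-setup API in #1351 may differ):

```js
// Sketch: the device's construction function returns `rawAPI`, an object
// whose properties are plain functions. The controller only reveals wrappers
// that defer each call through the kernel input queue.
function wrapDeviceAPI(rawAPI, inputQueue) {
  const wrapped = {};
  for (const [name, fn] of Object.entries(rawAPI)) {
    wrapped[name] = (...args) => {
      // The host (e.g. a webserver POST handler) may call this at any moment;
      // the underlying device function runs only between cranks.
      inputQueue.enqueue({ deliver: () => fn(...args) });
    };
  }
  return wrapped;
}

// const mailboxAPI = wrapDeviceAPI(rawMailboxAPI, inputQueue);
// mailboxAPI.addMessage(data); // safe regardless of kernel state
```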

That "input handling window" should probably be immediately after each crank is complete, after the state is committed. The window should have its own DB transaction (flush the crankBuffer, but not the blockBuffer, since the IO call still happens within the context of a block, if the host application uses blocks), and should be flushed afterwards to make sure anything added to the run-queue by the input function actually sticks. I think it's safe to drain the entire input queue after each crank, and not worry about scheduling them any slower.

warner added a commit that referenced this issue Aug 18, 2021
We observed a consensus failure on the chain that was triggered by a
validator getting restarted. The root cause seems to be that device
calls (specifically `poll()` on the timer device) cause state changes, which
are accumulated in the crankhasher and linger in the crank buffer until
something causes them to be committed. This commit doesn't happen until a
delivery is successfully made to some vat, which occurs in a subset of the
times that `kernel.run()` is invoked.

On the chain, each block causes a `poll()` followed by a `kernel.run()`. If
the `poll()` did not cause any timers to fire, the timer device will not
enqueue any messages to the timer vat, the run-queue remains empty, and
`run()` has no work to do. Therefore the crank buffer is not committed.

Imagine a chain in which some delivery causes everything to be committed in
block 1, then blocks 2-9 have a `poll()` but no actual deliveries, then
something happens to make block 10 have a delivery (causing a commit).

Now follow validator A (which is not restarted): when the block 10 delivery
happens, the crank buffer (and crankhasher) has accumulated all the state
changes from blocks 2-10, including all the timer device state updates from
the `poll()` calls.

Suppose validator B is restarted after block 5. When it resumes, the crank
buffer is empty, and the crankhasher is at the starting state. Blocks 6-10
add changes to both, and at block 10 the `commit()` causes the crankhasher to
be finalized. The total data in the crankhasher is different (the input is
smaller) on B than on A, and the crankhash thus diverges. This causes the
activityhash to diverge, which causes a consensus failure.

The invariant we must maintain is that the crank buffer and crankhasher must
be empty at any time the host application might commit the block buffer.
There must be no lingering changes in the crank buffer, because that buffer
is ephemeral (it lives only in RAM).

The simplest way to achieve this is to add a `commitCrank()` to the exit
paths of `kernel.step()` and `kernel.run()`, so that the invariant is
established before the host application regains control.

We must also rearrange the way that `incrementCrankNumber` is called.
Previously, it was called at the end of `start()`, leaving the "current"
crank number in the crank buffer (but not committed to the block buffer yet).
Then, at end-of-crank, the sequence was `commitCrank()` followed by
`incrementCrankNumber()`. In this approach, the crank buffer was never empty:
except for the brief moment between these two calls, it always had a pending
update.

In the new approach, we remove the call from `start()`, and we increment the
crank number just *before* the end-of-crank commit. This way, the crank
buffer remains empty between cranks.

One additional piece remains: we should really commit the crank buffer after
every device call. The concern is that an e.g. timer device `poll()` causes
some state changes but no delivery, and then the next delivery fails (i.e. it
causes the target vat to terminate). We don't want the unrelated device
changes to be discarded along with the vat's half-completed activity. This
would be a good job for the #720 kernel input queue.

refs #3720
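
For reference, the ordering change described above amounts to something like this (illustrative function bodies with an assumed `kernelKeeper` receiver, not the actual kernel source):

```js
// Old ordering: the next crank number was written into the crank buffer right
// after the previous commit (originally at the end of start()), so the buffer
// almost always held a pending, uncommitted update.
function endCrankOld(kernelKeeper) {
  kernelKeeper.commitCrank();
  kernelKeeper.incrementCrankNumber(); // leaves lingering state in the buffer
}

// New ordering: increment just before the end-of-crank commit, so the crank
// buffer and crankhasher are empty between cranks, and kernel.step()/run()
// can call commitCrank() on their exit paths without leftover state.
function endCrankNew(kernelKeeper) {
  kernelKeeper.incrementCrankNumber();
  kernelKeeper.commitCrank(); // buffer is empty once this returns
}
```
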
Agoric locked and limited conversation to collaborators May 31, 2022
michaelfig converted this issue into discussion #5474 May 31, 2022
