Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

exclude liveslots deserialization code from metering #3667

Merged
merged 4 commits into from
Aug 13, 2021

Conversation

warner
Copy link
Member

@warner warner commented Aug 12, 2021

We're trying to hedge against XS not necessarily performing
"organic" (non-forced) GC at exactly the same across all members (validators)
of a consensus machine, especially when some members have reloaded a vat from
snapshot (i.e. restarted) recently and others have not. We rely upon
finalizer callbacks only running at explicitly-deterministic times, but we
must still guard against objects becoming collected spontaneously, which will
cause a WeakRef to become "dead" (wr.deref() === undefined) at a random
point in the middle of a turn. Any code which calls wr.deref, or is
conditionally executed/skipped according to the results, is "GC-sensitive".
This includes convertSlotToVal, and therefore m.unserialize.

We cannot allow metering results to diverge between validators, because:

  • 1: it might make the difference between the crank completing successfully,
    and the vat being terminated for a per-crank metering fault
  • 2: it will change the large-scale Meter value, which is reported to
    userspace
  • 3: it might cause the runPolicy to finish the block earlier on one
    validator than on others

all of which would cause a consensus failure.

To prevent this, we run most of the "inbound" side of liveslots without
metering. This includes the first turn of all dispatch.* methods, which
runs entirely within liveslots:

  • dispatch.deliver performs argument deserialization in the first turn,
    then executes user code in the second and subsequent turns
  • dispatch.notify does the same
  • the GC deliveries (dispatch.dropExport, etc) only use one turn

We also disable metering when deserializing the return value from
a (synchronous) device call, and when retiring a promise ID (which touches
slotToVal).

Finally, we disable metering for all turns of the post-crank GC finish()
call. This excludes all invocations of the finalizer callbacks, as well as
all the processDeadSet code which is highly sensitive to the results.

closes #3458

@warner warner added the SwingSet package: SwingSet label Aug 12, 2021
@warner warner added this to the Testnet: Metering Phase milestone Aug 12, 2021
@warner warner requested a review from FUDCo August 12, 2021 19:40
@warner warner self-assigned this Aug 12, 2021
@warner warner force-pushed the 3458-unmetered-liveslots branch 3 times, most recently from c621af7 to dfeff9e Compare August 13, 2021 01:34
Comment on lines 60 to 68
const limit = globalThis.currentMeterLimit();
const before = globalThis.resetMeter(0, 0);
meteringDisabled += 1;
try {
return await thunk();
} finally {
globalThis.resetMeter(limit, before);
meteringDisabled -= 1;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nervous whether this pattern actually works -- can you get away with scoping async operations like this, even if you use await?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just checked with @erights and @michaelfig , and the answer is "yes, but don't do it that way". The finally is guaranteed to run, and in our case this pattern would be fine, but their recommendation was to use this instead:

  async function runWithoutMeteringAsync(thunk) {
    const limit = globalThis.currentMeterLimit();
    const before = globalThis.resetMeter(0, 0);
    meteringDisabled += 1;
    return Promise.resolve()
      .then(() => thunk())
      .finally(() => {
        globalThis.resetMeter(limit, before);
        meteringDisabled -= 1;
      });
  }

The difference is that if thunk() throws (synchronously), in my version (return await thunk()) the finally block will get executed in the same turn, but in theirs (using Promise.prototype.finally) the finalizer always runs in its own turn.

I'll update it to use their recommendation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just checked with @erights and @michaelfig , and the answer is "yes, but don't do it that way". The finally is guaranteed to run, and in our case this pattern would be fine, but their recommendation was to use this instead:

Just for the record, what did I say "yes" to? The fact that the finally may run in the same or another turn, as you correctly explain, for me means that code does not work.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Yes" meant that finally { } always runs. even if there's a return or await in the try { }, which I think was the "does JS work the way you seem to think it works" question that @FUDCo was asking. I think we were both uncertain about whether the JS try/finally pattern continues to work after the introduction of async/await, but it does: the general semantics are the same, modulo the timing subtlety you pointed out.

In this case, my code was not depending upon having a turn boundary between the thunk running and the counter being decremented. runWithoutMeteringAsync is not being granted to adversarial code (this is entirely within liveslots), so there's not an adversary to perform reentrancy. The thunk is also friendly code (if it were adversarial, a trivial Promise.then would let it regain agency after the thunk's Promise had returned). I'm all in favor of using a pattern that's easier to review, and I've updated this PR to do so, but in this case it wouldn't have caused a bug to do it the return await thunk() way.

Copy link
Contributor

@FUDCo FUDCo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well OK then

Comment on lines 466 to 468
// Set.delete() metering seems unaffected by presence/absence, but it
// doesn't matter anyway because deadSet.add only happens when
// finializers run, which happens deterministically
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems to me that this comment is not quite right. Finalizers run non-deterministically -- that's why we use deadSet to protect ourselves. It's the processing of deadSet that's run deterministically.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By "finalizers run deterministically" I mean that, given the way xsnap.c uses XS, finalizer callbacks only run during the gcAndFinalize() window. XS only polls the registries during fxEndJob(), and we only call that after the promise queue is empty (

):

void fxRunLoop(txMachine* the)
{
	c_timeval tv;
	txNumber when;
	txJob* job;
	txJob** address;
	for (;;) {
		while (the->promiseJobs) {
			while (the->promiseJobs) {
				the->promiseJobs = 0;
				fxRunPromiseJobs(the);
			}
			// give finalizers a chance to run after the promise queue is empty
			fxEndJob(the);
			// if that added to the promise queue, start again
		}
		// at this point the promise queue is empty
 // .. the rest of the code checks for timer jobs

That's not mandatory: I think the JS spec allows engines to run the finalizer callbacks any time they want (as long as they happen in their own turn). The tricky part, for us, is metering. If the finalizer could happen at any arbitrary time, then it would get metered. The only thing a finalizer does is to examine the weakref and add non-revived vrefs to the deadset, which doesn't cause e.g. immediate syscalls, but even that would cause metering sensitivity to the state of the weakref:

  function finalizeDroppedImport(vref) {
    const wr = slotToVal.get(vref);
    if (wr && !wr.deref()) {
      // we're in the COLLECTED state, or FINALIZED after a re-introduction
      deadSet.add(vref);
      slotToVal.delete(vref);
      // console.log(`-- adding ${vref} to deadSet`);
    }
  }

So our "metering is insensitive to GC timing" story already depends upon finalizers not running during the userspace phase.

If we were somehow not depending upon that, and finalizers could run at any old time, then a vref might or might not be in the deadSet when this convertValToSlot needed to delete it. We delete it because this might be a re-introduction of the vref: imagine a dispatch.deliver() crank which introduces o-4, then drops it, then performs a device invocation whose return value re-introduces o-4. If the finalizer could run in the middle of the crank, we'd have o-4 in our deadSet when the D() return value was unserialized and convertValToSlot('o-4') creates a new Presence, but o-4 is still in the deadSet. (Now that I think about it, we don't strictly need to delete it from deadSet.. processDeadSet will check the weakref anyways. But I think you and I decided that it would be tidier to delete it as soon as we know it's been re-introduced).

So in that case, deadSet.delete('o-4') would sometimes be deleting something, and sometimes not, depending upon this hypothetical non-deterministic finalizer callback having already run or not. If .delete's metering usage were different in these two situations, then now the metering of convertValToSlot (i.e. m.serialize()) would become sensitive to GC state, which we don't want, because we want serialization to be metered, because (at least) it could invoke user-provided getters, which would allow user code to escape metering.

If that deadSet.delete(slot) were replaced with if (deadSet.has(slot)) { deadSet.delete(slot); }, that'd be obviously metering-affecting. But I wasn't sure about the built-in deadSet.delete().. does it follow differently-metered code paths internally? I did an experiment and convinced myself that, at least right now, it does not: deleting an existing key and deleting a non-existing key both use the same metering.

So we're protected in two ways. 1: deadSet.delete()'s metering doesn't depend upon the presence/absence of the key, and 2: the key won't get added in the middle of a crank (the metered region) because that only happens in the finalizer callback, and our xsnap.c ensures that finalizers only happen at end-of-crank (when we call gcAndFinalize).

I'll update the comment to make it clear that the reason our finalizers run deterministically is because we wrote xsnap.c to make it that way, and that JS in general does not guarantee that in the slightest.

warner added 4 commits August 13, 2021 14:29
The `meterControl` tool will give liveslots a way to temporarily disable
metering, so it can perform GC-sensitive operations without affecting the
vat's metering results.

Each supervisor provides their own version: all are dummy facades except for
XS.

refs #3458
We're trying to hedge against XS not necessarily performing
"organic" (non-forced) GC at exactly the same across all members (validators)
of a consensus machine, especially when some members have reloaded a vat from
snapshot (i.e. restarted) recently and others have not. We rely upon
finalizer callbacks only running at explicitly-deterministic times, but we
must still guard against objects becoming collected spontaneously, which will
cause a WeakRef to become "dead" (`wr.deref() === undefined`) at a random
point in the middle of a turn. Any code which calls `wr.deref`, or is
conditionally executed/skipped according to the results, is "GC-sensitive".
This includes `convertSlotToVal`, and therefore `m.unserialize`.

We cannot allow metering results to diverge between validators, because:

* 1: it might make the difference between the crank completing successfully,
and the vat being terminated for a per-crank metering fault
* 2: it will change the large-scale Meter value, which is reported to
userspace
* 3: it might cause the runPolicy to finish the block earlier on one
validator than on others

all of which would cause a consensus failure.

To prevent this, we run most of the "inbound" side of liveslots without
metering. This includes the first turn of all `dispatch.*` methods, which
runs entirely within liveslots:

* `dispatch.deliver` performs argument deserialization in the first turn,
  then executes user code in the second and subsequent turns
* `dispatch.notify` does the same
* the GC deliveries (`dispatch.dropExport`, etc) only use one turn

We also disable metering when deserializing the return value from
a (synchronous) device call, and when retiring a promise ID (which touches
`slotToVal`).

Finally, we disable metering for all turns of the post-crank GC `finish()`
call. This excludes all invocations of the finalizer callbacks, as well as
all the `processDeadSet` code which is highly sensitive to the results.

closes #3458
For some reason, most of the deserialization tests in test-marshal.js have
been creating a full SwingSet controller and then throwing it away. This is
an ancient leftover from the PlaygroundVat days, and removing it reduces the
test's runtime from about 20 seconds to just 1.
The use of `await` inside a try/finally block would expose a subtle timing
issue (if `thunk()` throws synchronously, the finalizer would run in the same
turn as the thunk, rather than in a subsequent turn). Our recommended style
avoids conditional `await` for just that reason, so this updates
`runWithoutMeteringAsync` to use a better approach.
@warner warner force-pushed the 3458-unmetered-liveslots branch from dfeff9e to 64e4f2f Compare August 13, 2021 21:29
@warner warner merged commit 3335547 into master Aug 13, 2021
@warner warner deleted the 3458-unmetered-liveslots branch August 13, 2021 21:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
SwingSet package: SwingSet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Liveslots should not meter GC-sensitive code paths
3 participants