Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GC syscall sensitivity through reanimate and reanimateDurableKindID #7142

Closed
warner opened this issue Mar 9, 2023 · 9 comments · Fixed by #7333
Closed

GC syscall sensitivity through reanimate and reanimateDurableKindID #7142

warner opened this issue Mar 9, 2023 · 9 comments · Fixed by #7333
Assignees
Labels
bug Something isn't working liveslots requires vat-upgrade to deploy changes SwingSet package: SwingSet vaults_triage DO NOT USE

Comments

@warner
Copy link
Member

warner commented Mar 9, 2023

Describe the bug

I think I found another two ways in which syscalls are sensitive to GC activity. I've got a fast (non-kernel-based) test for them as well.

The first arises when reanimateDurableKindID is used to regenerate a KindHandle (i.e. deserializing one while pulling it out of a virtual or durable MapStore). To build the Representative, we must read the durableKindDescriptor out of the vatstore. If the deserialization is able to re-use an existing Representative, the syscall.vatstoreGet() won't happen:

function reanimateDurableKindID(vobjID) {
const kindID = `${parseVatSlot(vobjID).subid}`;
const raw = syscall.vatstoreGet(`vom.dkind.${kindID}`);
raw || Fail`unknown kind ID ${kindID}`;
const durableKindDescriptor = JSON.parse(raw);

The second arises when reanimate is used to regenerate a normal Kind instance's Representative. Same issue, this time reanimate calls makeRepresentative():

function reanimate(baseRef) {
// kdebug(`vo reanimate ${baseRef}`);
const innerSelf = cache.lookup(baseRef, false);
const [toHold] = makeRepresentative(innerSelf, false);

which calls ensureState():

function ensureState() {
if (innerSelf.rawState) {
cache.refresh(innerSelf);
} else {
innerSelf = cache.lookup(innerSelf.baseRef, true);
}
}

whose cache.lookup(.., true) will cause the LRU cache to do a vatstoreGet to read the state data if it isn't already in the cache.

The issue I'm looking at is that makeRepresentative is called only when GC has happened and deserialization fails to find a Representative in the slotToVal weakref. Imagine a virtual object that is created and stored in baggage during step 1, but is then dropped from userspace (it becomes UNREACHABLE). Then time passes, and we look at two cases. In case A, GC happens, moving it to COLLECTED and FINALIZED, removing it from slotToVal. In case B, GC does not happen, leaving it in UNREACHABLE. Then, in step 2, something reads the value out of baggage again.

In case A, the deserialization in step2 fails to find the vref, so reanimate() is called to build a replacement, which calls makeRepresentative. If the state had fallen out of the LRU cache, this will cause a vatstoreGet to fetch the state data. In case B, renimate() is not called, and no vatstoreGet will happen.

I think our testing might not have managed to evict the state data from the cache, so we missed this case. Also, we don't have a lot of mock-the-GC tests: test-liveslots.js has the necessary mocks for WeakRef and FinalizationRegistry, but they weren't exported or put in a shared library, so no other tests were using those tools.

Incidentally, the presence or absence of the state data in the cache is deterministic function of the sequence of calls to the LRU cache, like lookup and refresh, which are driven by (deterministic) calls to the state object's getters and setters, but also by calls to makeRepresentative. If makeRepresentative is called with GC sensitivity, then the LRU cache contents will be GC-sensitive, adding another variable to the "will vatstore reads happen or not" question. I think we convinced ourselves that the determinism of getters and setters made this safe, without realizing that reanimate itself could be a source of trouble.

I've written a new test to exercise this, on the 7142-gc-sensitivity branch, which compares the syscalls made by case A and case B, and asserts that they are equal. Those tests currently fail, with an extra vatstoreGet appearing in case A.

@warner warner added bug Something isn't working SwingSet package: SwingSet liveslots requires vat-upgrade to deploy changes labels Mar 9, 2023
warner added a commit that referenced this issue Mar 9, 2023
This exercises three cases in which syscalls are sensitive to GC
activity:

* reanimateDurableKindID, when regenerating a KindHandle (#7142)
* reanimate, when regenerating a plain Representative (#7142)
* reanimateCollection, when regenerating a virtual collection (#6360)
@warner
Copy link
Member Author

warner commented Mar 9, 2023

I added a third test to exercise the reanimateCollection() issue described in #6360 and #6760, using the same framework. It also fails, in the expected way.

@mhofman
Copy link
Member

mhofman commented Mar 9, 2023

Do the extra vatstoreGet seem compatible with some of the observations reported in #6784 ?

@warner
Copy link
Member Author

warner commented Mar 9, 2023

I think so. It's hard to tell because we only learn the one diverging syscall, and to be more confident I'd want the replay to continue (instead of being halted) and compare a larger span of calls from both sides of the divergence. But they're touching keys that are plausible.

@warner
Copy link
Member Author

warner commented Mar 9, 2023

So I can think fixes for two of these that are only moderately invasive.

reanimateDurableKindID

Each durable Kind has a durableKindDescriptor record. This is created during makeKindHandle, and initialized to contain two static fields (kindID and tag), and also the nextInstanceID counter which is incremented each time a new instance is created. A little while later, when defineDurableKind is called with the handle, we populate either .facets (with a list of facet names) or .unfaceted = true.

We keep a copy of the descriptor in RAM, in a Map named kindIDToDescriptor, for the lifetime of the incarnation. Every time we modify it, we also save it to the DB, with a vatstoreSet('vom.dkind.${kindID}'), so that a future incarnation can repopulate the RAM copy at startup.

The KindHandle behaves a lot like a virtual object (although not created with defineKind, which would be recursive). The KindHandle has a vref, and is registered in slotToVal/valToSlot just like regular virtual-object Representatives. As a result, you can export it from the vat, or save it in other virtual storage (e.g. baggage), and you can drop your in-RAM copy of the KindHandle. The next time GC happens, your KindHandle representative will be collected, and removed from slotToVal. If someone sends the vref back into the vat, or (more likely) you read the handle out of baggage, the unserialize will need to create a new Representative/KindHandle for you.

The code which does that regeneration is a function named reanimateDurableKindID :

function reanimateDurableKindID(vobjID) {
    const kindID = `${parseVatSlot(vobjID).subid}`;
    const raw = syscall.vatstoreGet(`vom.dkind.${kindID}`);
    raw || Fail`unknown kind ID ${kindID}`;
    const durableKindDescriptor = JSON.parse(raw);
    const kindHandle = Far('kind', {});
    linkToCohort.set(Object.getPrototypeOf(kindHandle), kindHandle);
    unweakable.add(Object.getPrototypeOf(kindHandle));
    kindHandleToID.set(kindHandle, kindID);
    // we load the descriptor (including .nextInstanceID) every time
    // the vat makes a new DurableKindHandle representative (during
    // deserialization). The handle is held weakly and can be dropped,
    // but the KindID-to-descriptor mapping remains in memory.
    kindIDToDescriptor.set(kindID, durableKindDescriptor);
    return kindHandle;
  }

This will get called in two contexts. The first (and probably the only one we thought carefully about) is after a vat upgrade, in a new incarnation, when buildRootObject() is fulfilling its responsibility to reattach behavior to all the kinds created by its predecessor. renimateDurableKindID will be called as userspace pulls the KindHandles out of baggage. In this case, there is a pretty close correspondence between userspace behavior (which is deterministic) and calls to reanimateDurableKindID, because the slotToVal tables start out empty, so deserialization never bypasses the reanimation call.

The second is later, after a KindHandle has been in RAM for a while, but then gets dropped and maybe garbage collected. If userspace then re-fetches the handle from baggage, deserialization may or may not call reanimateDurableKindID, and this is the divergence that we need to address.

If this is the first time we've called reanimateDurableKindID in a given (vat, incarnation) domain, it needs to fetch the descriptor from vatstoreGet and populate the in-RAM kindDToDescriptor map. However, the current implementation repeats this process on subsequent reanimation calls, which causes a divergent (depending on GC state) vatstoreGet syscall.

So, given that we're already spending the RAM on each Kind (we decided that Kinds are low cardinality), a simple fix is to change kindHandleToID from a WeakMap to a strong Map. That would keep the KindHandle in RAM forever (for the remainder of the incarnation), which would prevent it from being GCed, which would inhibit reanimation.

Another approach would be to change reanimateDurableKindID to treat "create the KindHandle" as a separate task from "populate kindIDToDescriptor". It would create the KindHandle each time (allowing them to be collected and regenerated like normal Representatives), but if kindIDToDescriptorwas already populated, it would refrain from doing avatstoreGet`.

The second approach would continue to be testable with the branch I mentioned above, where the kindHandleToID WeakMap would be mocked, and triggered by the testing framework. The first approach would not: we'd rely upon correctness-by-construction to be confident that reanimateDurableKindID would only be called once per incarnation, or we might add some additional safeguards/assertions (a strong table, indexed by KindID, although that starts to look just like the existing tables).

reanimateCollection

When each virtual collection is created, we get a "schemata" (optional keyShape and valueShape Patterns), and a string label. The patterns will be enforced during API operations like init and set, and the label is interpolated into the resulting error messages.

Currently, all three values are passed into summonCollection and thence into summonCollectionInternal, where there are embedded into the error message strings, or closed over by the valueShape-checking serializeValue/unserializeValue functions, or closed over by individual has/get/init methods.

The virtual collection has a Representative (collections are high-cardinality, although we didn't completely expect that, so we still have some RAM cost for inactive/paged-out collections). These Representatives can be dropped from RAM and then regenerated, just like KindHandles. When this happens, and deserialization fails to find an entry in slotToVal for the collection's vref, it will call renimateCollection:

function reanimateCollection(vobjID) {
    const { id, subid } = parseVatSlot(vobjID);
    const kindName = storeKindIDToName.get(`${id}`);
    const rawSchemata = JSON.parse(syscall.vatstoreGet(prefixc(subid, '|schemata')));
    const [keyShape, valueShape] = unserialize(rawSchemata);
    const label = syscall.vatstoreGet(prefixc(subid, '|label'));
    return summonCollection(false, label, subid, kindName, keyShape, valueShape);
  }

The two vatstoreGet calls fetch the schemata and label, so they can be passed into summonCollection and be curried into the collection methods.

As above, the sometimes-happens sometimes-not divergence of calls to reanimateCollection causes the syscall trace to be sensitive to GC of the virtual-collection Representatives.

One moderately invasive fix would be to defer getting the schemata/label data until userspace performed an actual collection API call (init/get/set/has/delete). Instead of passing the shapes and label into summonCollection, we would store them in a #7138 -style Cache which is indexed by collectionID. Then, everwhere inside init/get/etc where it needs one of these values, it calls cache.get(collectionID). That would skip the vatstore operation during reanimate (which is sensitive to GC) and move it to the first userspace API call of each crank (which is not).

So e.g.

    const invalidKeyTypeMsg = `invalid key type for collection ${q(label)}`;

would become:

    const makeInvalidKeyTypeMsg = () => `invalid key type for collection ${q(cache.get(collectionID).label)}`;

and

const serializeValue = value => {
  if (valueShape !== undefined) { ...

becomes

const serializeValue = value => {
  const { valueShape } = cache.get(collectionID);
  if (valueShape !== undefined) { ...

No references to the cache object would be allowed in the top lexical scope of summonCollectionInternal. They may only appear in the API functions, or in functions (like makeInvalidKeyTypeMsg or serializeValue) which are only called from those API functions.

The cache would be built with a readBacking function that knows about the vatstore key to be read, and also knows that a missing keyShape defaults to M.any(). The initial population of the collection data (in makeCollection would be rewritten to perform the vatstore writes of this data with cache.set calls, to ensure that the data is available in the cache right away. This will delay the actual writes of that data until end-of-crank, which might cause some test churn.

@warner
Copy link
Member Author

warner commented Mar 9, 2023

reanimate

For the third issue, the real problem is our need to know the list of state property names too early. reanimate needs to make the Representative, building the Representative requires a state object, and we currently build state out of a batch of getters/setters (one per known state property). We allow each instance of a single Kind to have different state properties, fixed during instance construction (rather than Kind definition), which means knowing the KindID is insufficient to learn the property names, forcing us to do a vatstoreGet to learn them. Doing this syscall during reanimate causes the GC sensitivity.

One (invasive) approach would be to change defineKind to somehow nail down the list of state property names, and then store that list into the Kind descriptor. Kinds (unlike their instances) are low-cardinality, so we're allowed to keep them in RAM. During reanimate, we'd fetch the list of property names from our RAM table, and use to build the getters/setters of the state object, allowing us to defer any vatstoreGet calls until the getter/setter was actually invoked (which is a consequence of userspace behavior, and thus not GC-sensitive).

A differently-invasive approach would retain the "each instance has different properties" quality, but would instead change state to be a Proxy instead of a collection of getters and setters. That would avoid the need to know the property names at reanimate time.

The last time we considered the Proxy approach, I remember getting stuck with the difficulty of how to harden the Proxy without knowing exactly the property names. I'm now wondering if we really need to harden it. Kind definitions have access to state, but they are also supposed to protect it: it's an internal service, not part of their external/defensive API surface. I don't think we'd be increasing their vulnerability by making state non-frozen.

If that works out, then the approach would be:

  • reanimate calls makeRepresentative as usual
  • makeRepresentative creates a Proxy for state
    • all traps which touch properties (get/has/set/deleteProperty/ownKeys/getOwnPropertyDescriptor) would start with a cache.get(baseRef), to grab the instance's state data, from which it could learn the names and/or values of the state
    • the traps would throw if the caller was trying to read or write a missing property name
    • the prototype traps could either use a null prototype, or the normal Object.prototype
    • isExtensible() would report false

Callers would observe that Object.isExtensible would report false, and (independently) attempts to add new properties would fail (because the set trap would throw). Object.isFrozen would report false. I'm guessing Object.isSealed would report false, not sure if that's an issue.

@mhofman
Copy link
Member

mhofman commented Mar 9, 2023

For reference, I had suggested using a Proxy for the state object in #5170

@warner
Copy link
Member Author

warner commented Mar 10, 2023

Ah, yeah, thanks for the pointer, I see you thought through all of the issues I was wondering about above.

I walked through the reanimate issue with @erights , and he suggested a different approach that doesn't need a Proxy. His realization is that the state object is actually delivered to userspace code via an Exo, by virtue of our call to defendPrototype:

contextMapTemplate = new WeakMap();
prototypeTemplate = defendPrototype(
tag,
contextMapTemplate,
behavior,
thisfulMethods,
interfaceGuard,
);

then later we build self like this:

const self = harden({ __proto__: prototypeTemplate });
context.self = self;
contextMapTemplate.set(self, context);

So the userspace Representative is actually a per-instance object (whose this is a key in contextMap), with a prototype that points at a per-Kind object, with one method per behavior method. These last methods can either be written in "exo" style, where userspace does foo(args) { this.state.prop=1; this.self.bar() }, or in "kind" style, where userspace does foo(args, { state, self }) { .. }. The "exo" style methods are built like this:

        method(...args) {
          this || Fail`thisful method ${methodTag} called without 'this' object`;
          const context = getContext(this);
          return apply(behaviorMethod, context, args);
       }

where getContext is defined in the same file, as:

  const getContext = self => {
    const context = contextMap.get(self);
    context || Fail`${q(methodTag)} may only be applied to a valid instance: ${self}`;
    return context;
  };

So each time someone invokes a method on a Representative, it follows the prototype chain up to the per-Kind object, whose method starts by looking up this (the Representative, or facet) with getContext(), which looks it up in the contextMap provided by the VOM. The value in this map is the "context" object, which holds { state, self } or { state, facets }.

So.. we change defendPrototype to accept getContext directly, instead of accepting contextMap and wrapping it. Then we change the VOM to provide a getContext which looks up it's argument (a facet / Representative) in valToSlot to find the vref, convert that facet vref into a baseRef for the cohort (remember we have one state per cohort, shared among all facets), then do stateCache.get(baseRef) to retrieve the state record.

getContext is only called when someone invokes a method on a Representative. This is a deterministic function of userspace, and happens after any reanimate call. So the invocation of cache.get will not be GC-sensitive.

As a bonus, if someone is only using the Representative for it's identity, and doesn't actually invoke any methods, then we don't need state at all, and we won't do the vatstoreGet at all. For marker/rights-amplification/Brand objects, that could be a meaningful performance improvement.

warner added a commit that referenced this issue Mar 11, 2023
This exercises three cases in which syscalls are sensitive to GC
activity:

* reanimateDurableKindID, when regenerating a KindHandle (#7142)
* reanimate, when regenerating a plain Representative (#7142)
* reanimateCollection, when regenerating a virtual collection (#6360)
warner added a commit that referenced this issue Mar 11, 2023
This exercises three cases in which syscalls are sensitive to GC
activity:

* reanimateDurableKindID, when regenerating a KindHandle (#7142)
* reanimate, when regenerating a plain Representative (#7142)
* reanimateCollection, when regenerating a virtual collection (#6360)
warner added a commit that referenced this issue Mar 17, 2023
This exercises three cases in which syscalls are sensitive to GC
activity:

* reanimateDurableKindID, when regenerating a KindHandle (#7142)
* reanimate, when regenerating a plain Representative (#7142)
* reanimateCollection, when regenerating a virtual collection (#6360)
@ivanlei ivanlei added the vaults_triage DO NOT USE label Mar 20, 2023
@warner warner self-assigned this Mar 27, 2023
@warner
Copy link
Member Author

warner commented Apr 7, 2023

This ticket is about sensitivity via three pathways:

@warner
Copy link
Member Author

warner commented Apr 7, 2023

Oh, in an earlier comment I said:

So, given that we're already spending the RAM on each Kind (we decided that Kinds are low cardinality), a simple fix is to change kindHandleToID from a WeakMap to a strong Map. That would keep the KindHandle in RAM forever (for the remainder of the incarnation), which would prevent it from being GCed, which would inhibit reanimation.

and apparently I did exactly that in 18f6a1d , which landed as part of PR #7138 . So this part is already fixed.

@mergify mergify bot closed this as completed in #7333 Apr 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working liveslots requires vat-upgrade to deploy changes SwingSet package: SwingSet vaults_triage DO NOT USE
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants