Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rewrite VOM LRU cache #7138

Merged
merged 11 commits into from
Apr 5, 2023
Merged

rewrite VOM LRU cache #7138

merged 11 commits into from
Apr 5, 2023

Conversation

warner
Copy link
Member

@warner warner commented Mar 9, 2023

rewrite virtual object manager to use the new cache

This overhauls the VOM, to drastically simplify the creation of
virtual-object Representatives, and remove the syscall sensitivity to
GC.

The old code created state (with accessors for each state property)
and context (which holds both state and self or the facets
cohort) each time it needed to create a Representative, which is every
time a virtual/durable vref is deserialized but a Representative was
not present in the slotToVal table (thus sensitive to GC). This was
partially a holdover from the time when Representatives were behavior
methods closing over state.

Since then, we moved to a model where Representatives are empty
objects, with a special prototype that holds the behavior methods. We
use defendPrototype from exo-tools.js to build this prototype,
which wraps each behavior method with code that looks the
Representative up in a "context map" to find the context
object. This reduces memory usage: we share the prototype object among
all instances of the Kind, and the Representatives have no own
properties.

It also enables the key improvement of this commit: we can defer
creating context until userspace actually invokes a method on the
Representative. We replace the context map with a contextProvider: a
function that takes the Representative and looks up (or creates) a
context. Since userspace invocation of methods is unaffected by GC,
calls to the contextProvider are too, making it a safe place to
perform syscalls (such as the vatstoreGet needed to learn the state
properties and create the state accessors).

The cache ensures that we create the context at most once per
delivery.

We use two instances of the cache: one for context (including
state), and a second for the actual state data.

We remove virtualObjectCacheSize from vat options, since the new
cache does not have a configurable size limit.

@warner
Copy link
Member Author

warner commented Mar 9, 2023

cc @FUDCo

@datadog-full-agoric
Copy link

datadog-full-agoric bot commented Mar 9, 2023

Datadog Report

Branch report: 6693-vo-lru-cache
Commit report: 1c96aa8

agoric-sdk: 384 Failed (0 Known Flaky), 0 New Flaky, 982 Passed, 31 Skipped, 22m 48.56s Wall Time

❌ Failed Tests (384)

This report shows up to 5 failed tests.

  • make › make and exec - agoric.cosmic-swingset - Details

    Expand for error
     ---
         name: Error
         message: Exited because no new tests completed within the last 1200000ms of inactivity
         at: >-
           Error: Exited because no new tests completed within the last 1200000ms of
           inactivity
               at TapReporter.writeTimeout (file:///home/runner/work/agoric-sdk/agoric-sdk/node_modules/ava/lib/reporters/tap.js:148:17)
               at TapReporter.consumeStateChange (file:///home/runner/work/agoric-sdk/agoric-sdk/node_modules/ava/lib/reporters/tap.js:218:10)
               at file:///home/runner/work/agoric-sdk/agoric-sdk/node_modules/ava/lib/reporters/tap.js:72:45
               at file:///home/runner/work/agoric-sdk/agoric-sdk/node_modules/emittery/index.js:365:13
     ...
    
  • attestation › attestation › before hook - agoric.inter-protocol - Details

    Expand for error
     ---
         name: AssertionError
         message: Rejected promise returned by test
         values:
           'Rejected promise returned by test. Reason:': |-
             Error {
               message: 'illegal cross-facet access',
             }
         at: >-
           Object.get (packages/swingset-liveslots/src/virtualObjectManager.js:174:34)
     ...
    
  • attestation › attestation › before hook - agoric.inter-protocol - Details

    Expand for error
     ---
         name: AssertionError
         message: Rejected promise returned by test
         values:
           'Rejected promise returned by test. Reason:': |-
             Error {
               message: 'illegal cross-facet access',
             }
         at: >-
           Object.get (packages/swingset-liveslots/src/virtualObjectManager.js:174:34)
     ...
    
  • attestation › userFlow › before hook - agoric.inter-protocol - Details

    Expand for error
     ---
         name: AssertionError
         message: Rejected promise returned by test
         values:
           'Rejected promise returned by test. Reason:': |-
             Error {
               message: 'illegal cross-facet access',
             }
         at: >-
           Object.get (packages/swingset-liveslots/src/virtualObjectManager.js:174:34)
     ...
    
  • attestation › userFlow › before hook - agoric.inter-protocol - Details

    Expand for error
     ---
         name: AssertionError
         message: Rejected promise returned by test
         values:
           'Rejected promise returned by test. Reason:': |-
             Error {
               message: 'illegal cross-facet access',
             }
         at: >-
           Object.get (packages/swingset-liveslots/src/virtualObjectManager.js:174:34)
     ...
    

@warner
Copy link
Member Author

warner commented Mar 9, 2023

Yeah, I think I reintroduced the main GC sensitivity (the one on reanimating Representatives). I want to fix this by makine state into a Proxy instead of enumerated getters/setters, so we can defer looking up the list of property names until someone looks for them (since lookup/get/set are userspace behaviors, we're allowed to react to them, but reanimate is GC-sensitive). But I think the last time we considered this, the limitations of a harden()ed Proxy came up.

@warner
Copy link
Member Author

warner commented Mar 9, 2023

Actually, I believe that sensitivity was pre-existing, see #7142. If so, this branch is just as problematic, but at least it's a lot simpler.

@warner warner force-pushed the 6693-vo-lru-cache branch 2 times, most recently from 0a29ba3 to 9264d6b Compare March 13, 2023 16:21
@warner warner force-pushed the 6693-vo-lru-cache branch 2 times, most recently from 7d562a0 to ef539c1 Compare March 24, 2023 07:42
@codecov
Copy link

codecov bot commented Mar 24, 2023

Codecov Report

Merging #7138 (c645b4a) into master (2e05748) will decrease coverage by 0.28%.
The diff coverage is 94.52%.

❗ Current head c645b4a differs from pull request most recent head 54797ea. Consider uploading reports for the commit 54797ea to get more accurate results

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #7138      +/-   ##
==========================================
- Coverage   71.00%   70.72%   -0.28%     
==========================================
  Files         450      449       -1     
  Lines       86390    85575     -815     
  Branches        3        3              
==========================================
- Hits        61338    60526     -812     
+ Misses      24984    24983       -1     
+ Partials       68       66       -2     
Impacted Files Coverage Δ
...ckages/SwingSet/src/controller/initializeKernel.js 99.10% <ø> (-0.01%) ⬇️
.../SwingSet/src/kernel/vat-loader/manager-factory.js 73.10% <ø> (-0.23%) ⬇️
...kages/SwingSet/src/kernel/vat-loader/vat-loader.js 62.19% <ø> (-0.10%) ⬇️
packages/SwingSet/src/types-external.js 100.00% <ø> (ø)
...ges/swingset-liveslots/src/virtualObjectManager.js 92.50% <94.08%> (+3.41%) ⬆️
packages/store/src/patterns/exo-makers.js 97.10% <100.00%> (+0.06%) ⬆️
packages/store/src/patterns/exo-tools.js 82.01% <100.00%> (+0.26%) ⬆️
packages/swingset-liveslots/src/liveslots.js 92.69% <100.00%> (+0.62%) ⬆️
...ingset-liveslots/tools/fakeVirtualObjectManager.js 100.00% <100.00%> (ø)
...ges/swingset-liveslots/tools/fakeVirtualSupport.js 96.17% <100.00%> (-0.54%) ⬇️

... and 24 files with indirect coverage changes

Impacted file tree graph

@warner
Copy link
Member Author

warner commented Mar 24, 2023

Note: this changes the VOM, but not the collection manager (which still has two sources of GC sensitivity). I'll address those in a separate PR.

Also note that for some reason, this change increases the sensitivity of the real-GC tests to Node.js and AVA being weird (#5575), specifically in test/virtual-objects/test-virtualObjectGC.js. I increased the number of engineGC() calls to make them pass consistently again, at the cost of adding one or two seconds to the swingset-liveslots tests. I still have no idea what's happening, and I'm hoping that the extra calls will be enough to avoid new (or more frequent) intermittent failures.

@erights Please note that I'm changing exo-tools.js from a contextMap to a contextProvider function. I considered sticking with a Map-like interface ("an object with a .get method"), but changing the types to allow a non-Map was just as much surgery as switching to a proper function. We'll have to figure out how to upstream this into Endo eventually.. for now I think it's ok for us to keep pulling defendPrototype/etc from packages/store.

@datadog-full-agoric
Copy link

datadog-full-agoric bot commented Mar 24, 2023

Datadog Report

Branch report: 6693-vo-lru-cache
Commit report: db193b6

agoric-sdk: 6 Failed (0 Known Flaky), 0 New Flaky, 3458 Passed, 55 Skipped, 21m 25.16s Wall Time

❌ Failed Tests (6)

This report shows up to 5 failed tests.

  • virtualObjects › representatives › exercise cache - agoric.SwingSet - Details

    Expand for error
     ---
         name: AssertionError
         assertion: deepEqual
         values:
           'Difference (- actual, + expected):': |2-
               [
             -   [
             -     'get',
             -     'v1.vs.vom.rc.o+v10/2',
             -     undefined,
     ...
    
  • virtualObjects › representatives › strongly held cohort retains representative - agoric.SwingSet - Details

    Expand for error
     ---
         name: AssertionError
         assertion: deepEqual
         values:
           'Difference (- actual, + expected):': |2-
               [
             -   'compare old === new : false',
             +   'compare old === new : true',
               ]
         at: >-
     ...
    
  • virtualObjects › representatives › strongly held state retains representative - agoric.SwingSet - Details

    Expand for error
     ---
         name: AssertionError
         assertion: deepEqual
         values:
           'Difference (- actual, + expected):': |2-
               [
             -   'compare old === new : false',
             +   'compare old === new : true',
               ]
         at: >-
     ...
    
  • virtualObjects › representatives › weakly held cohort retains representative - agoric.SwingSet - Details

    Expand for error
     ---
         name: AssertionError
         assertion: deepEqual
         values:
           'Difference (- actual, + expected):': |2-
               [
             -   'compare old === new : false',
             +   'compare old === new : true',
               ]
         at: >-
     ...
    
  • virtualObjects › representatives › weakly held state retains representative - agoric.SwingSet - Details

    Expand for error
     ---
         name: AssertionError
         assertion: deepEqual
         values:
           'Difference (- actual, + expected):': |2-
               [
             -   'compare old === new : false',
             +   'compare old === new : true',
               ]
         at: >-
     ...
    

@warner warner force-pushed the 6693-vo-lru-cache branch from ef539c1 to 2ed0d9e Compare March 24, 2023 08:03
@warner warner force-pushed the 6693-vo-lru-cache branch from 2ed0d9e to 0a1c34e Compare March 25, 2023 04:30
@warner warner changed the base branch from master to 7240-forbid-durable-root March 25, 2023 04:30
@warner warner force-pushed the 7240-forbid-durable-root branch from 03c779e to b01ac26 Compare March 25, 2023 20:34
@warner warner force-pushed the 6693-vo-lru-cache branch from b24d868 to 4a83063 Compare March 25, 2023 20:36
Copy link
Member

@gibson042 gibson042 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Partial review, but the three files still to go are the ones with all the substance:

  • packages/swingset-liveslots/test/test-liveslots-mock-gc.js
  • packages/swingset-liveslots/test/test-liveslots-real-gc.js
  • packages/swingset-liveslots/src/virtualObjectManager.js

Comment on lines 109 to 119
/**
* @typedef { (representative: any) => any } ContextProvider
*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about narrowing this?

Suggested change
/**
* @typedef { (representative: any) => any } ContextProvider
*/
/**
* @typedef {{self: any, state: any}} ClassContext
* @typedef {{self: any, facets: Record<string, any>}} ClassKitContext
* @typedef {(representative: any) => ClassContext} ContextProvider
* @typedef {Record<facetName: string, (facet: any) => ClassKitContext>} ContextProviderKit
*/

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, that changes some work that @erights just did to synchronize the agoric-sdk copy of exo-tools with the Endo one

@erights what do you think about this? If you prefer this, let's do another round of (fix in agoric-sdk, fix in Endo), and I'll rebase this branch to include those changes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, also note: the change this comment is referring to has been landed already, as part of #7304 , so further improvements to the type shouldn't affect this PR (unless swingset is doing something wrong)

Comment on lines +59 to +60
* This cache is part of the virtual object manager and is not intended to be
* used independently; it is exported only for the benefit of test code.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really not already have a suitable preexisting implementation of this? I wonder if it merits a ticket/TODO for e.g. merging with makeReadCachingStorage.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm really hesitant to share an implementation with other customers whose needs may diverge, or which might provoke the package that hosts the cache to add new dependencies (I'd want to put the cache in its own package, with no additional dependencies). Also I think the "hold until flushed" behavior might not be useful in as many other contexts as the more traditional LRU.

That said, makeReadCachingStorage (in cosmic-swingset) looks like an 80% match. The overhead of a new package feels like it would outweigh the savings made by sharing this with the other cache, but maybe it'd make sense.

packages/swingset-liveslots/src/cache.js Outdated Show resolved Hide resolved
packages/swingset-liveslots/test/test-cache.js Outdated Show resolved Hide resolved
@warner warner force-pushed the 7240-forbid-durable-root branch from ae15780 to 9c3b8f5 Compare March 28, 2023 01:24
@warner warner force-pushed the 6693-vo-lru-cache branch from 4a83063 to c645b4a Compare March 28, 2023 01:25
Base automatically changed from 7240-forbid-durable-root to master March 28, 2023 02:16
@warner warner force-pushed the 6693-vo-lru-cache branch from c645b4a to 54797ea Compare March 30, 2023 23:30
@warner
Copy link
Member Author

warner commented Mar 30, 2023

two known tests still failing:

  • (Swingset): yarn test test/virtualObjects/test-representatives.js -m "weakly held cohort retains representative" (and others in the same file)
  • (zoe): yarn test test/swingsetTests/upgradeCoveredCall/test-coveredCall-service-upgrade.js
    • somehow it's not reattaching a Kind during upgrade

@erights
Copy link
Member

erights commented Mar 31, 2023

Be aware of #7286 , which I think is a complementary optimization. But I'm not sure.

Copy link
Member

@gibson042 gibson042 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The overall structure of virtualObjectManager.js makes sense and seems good, but I would like some more complete commentary. Everything else is just suggested tweaks to code organization and/or messaging.

I haven't yet reviewed test-liveslots-{mock,real}-gc.js, but am pausing this for now to work on liveslots export/representative visibility and #7083.

packages/swingset-liveslots/src/virtualObjectManager.js Outdated Show resolved Hide resolved
packages/swingset-liveslots/src/virtualObjectManager.js Outdated Show resolved Hide resolved
packages/swingset-liveslots/src/virtualObjectManager.js Outdated Show resolved Hide resolved
packages/swingset-liveslots/src/virtualObjectManager.js Outdated Show resolved Hide resolved
packages/swingset-liveslots/src/virtualObjectManager.js Outdated Show resolved Hide resolved
packages/swingset-liveslots/src/virtualObjectManager.js Outdated Show resolved Hide resolved
packages/swingset-liveslots/src/virtualObjectManager.js Outdated Show resolved Hide resolved
packages/swingset-liveslots/src/virtualObjectManager.js Outdated Show resolved Hide resolved
Copy link
Contributor

@FUDCo FUDCo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I got nuthin' but nit picks.

packages/swingset-liveslots/src/liveslots.js Show resolved Hide resolved
@@ -26,12 +26,9 @@ test('cache', t => {

// new reads pass through immediately to backing store
t.is(c.get('key0'), 'value0');
t.deepEqual(log.splice(0), [['read', 'key0']]);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

splice is a confusing operation that I have to look up the docs for each time I use it. I'm wondering if it would be clearer to have a helper function that consumes the head of the log and returns it (which of course would be written with splice, but only in one place). Seems like we have something like that in other tests.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heh, see #7138 (comment)

I like splice for supporting the ability to make assertions about a batch of entries rather than one at a time, and really this should be an external test helper where the implementation is explained inline but never really considered at call sites because it Just Works as expected (above suggestion updated with that in mind).

/**
 * Clear the contents of a log array and assert that they match expectations.
 *
 * @param {import('ava').ExecutionContext} t
 * @param {unknown[]} log
 * @param {unknown[]} expected
 * @returns {void}
 */
const clearAndAssertLog = (t, log, expected) => {
  // `log.splice(0)` removes elements starting at index 0 (i.e., every element)
  // and returns them in an array.
  const entries = log.splice(0);
  t.deepEqual(entries, expected);
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, two votes for a helper, I'll start using one in the future

@warner warner force-pushed the improve-swingset-tests branch 2 times, most recently from 4513d7e to 7ca461c Compare April 4, 2023 23:40
@warner warner force-pushed the 6693-vo-lru-cache branch 2 times, most recently from 608857b to 14af542 Compare April 5, 2023 00:56
Base automatically changed from improve-swingset-tests to master April 5, 2023 01:14
@warner warner force-pushed the 6693-vo-lru-cache branch from 14af542 to 9beef4a Compare April 5, 2023 01:19
@warner warner self-assigned this Apr 5, 2023
@warner warner added SwingSet package: SwingSet liveslots requires vat-upgrade to deploy changes labels Apr 5, 2023
warner added 11 commits April 4, 2023 21:55
This cache accumulates all data until an explicit flush, at which
point it writes back all dirty state, and forgets all state.

This avoids the somewhat unpredictable behavior of an LRU cache, which
caused us problems with GC sensitivity. It suffers from the potential
for larger RAM usage between flushes, but is quite simple, and yields
easy-to-debug syscall behavior.

We have other caches in our platform, but liveslots needs something
with these specific properties, and I don't know what the needs in the
rest of the platform are, so I'm fine with not sharing this
implementation with other packages, at least for now.

refs #6693
This overhauls the VOM, to drastically simplify the creation of
virtual-object Representatives, and remove the syscall sensitivity to
GC.

The old code created `state` (with accessors for each state property)
and `context` (which holds both `state` and `self` or the `facets`
cohort) each time it needed to create a Representative, which is every
time a virtual/durable vref is deserialized but a Representative was
not present in the slotToVal table (thus sensitive to GC). This was
partially a holdover from the time when Representatives were behavior
methods closing over `state`.

Since then, we moved to a model where Representatives are empty
objects, with a special prototype that holds the behavior methods. We
use `defendPrototype` from `exo-tools.js` to build this prototype,
which wraps each behavior method with code that looks the
Representative up in a "context map" to find the `context`
object. This reduces memory usage: we share the prototype object among
all instances of the Kind, and the Representatives have no own
properties.

It also enables the key improvement of this commit: we can defer
creating `context` until userspace actually invokes a method on the
Representative. We replace the context map with a `contextProvider`: a
function that takes the Representative and looks up (or creates) a
context. Since userspace invocation of methods is unaffected by GC,
calls to the contextProvider are too, making it a safe place to
perform syscalls (such as the vatstoreGet needed to learn the state
properties and create the `state` accessors).

The cache ensures that we create the context at most once per
delivery.

We use two instances of the cache: one for `context` (including
`state`), and a second for the actual state data.

For each baseref, the state data cache holds the full `capdatas`
record (separate capdata for each property name). It also holds a
sparsely/lazily-populated `valueMap`, a Map of unserialized
values. These are made available to `state` getters. When modifying
a property on `state`, we immediately serialize it and hold it in the
cache. We also perform refcount delta management at that time. Later,
during flush(), we reassemble the capdatas (some modified, some
pristine) into the value of a `syscall.vatstoreSet`.

We defer unserialization until a getter is called, which reduces the
constraints on userspace behavior, and making virtual objects behave
more like virtual collections.

We remove `virtualObjectCacheSize` from vat options, since the new
cache size is unlimited, and thus not configurable.

This changes the `slotToVal` mapping to hold the cohort *record*,
where previously it held an Array. This is for multi-facet Kinds,
whose list of named facets is learned during `defineKindMulti`. Each
Kind has a fixed list of facet names, so each facet has both a name
and an index.

Userspace receives these as a record, with one property per
facet.

Internally, we need the index to encode into the vref, e.g. `o+11/5:1`
means the second (0-indexed) facet of the 5th (1-indexed) instance of
the Kind with KindID 11. We also use the index when looking up the
export status, which is recorded in one vatstore key per baseref, with
one character per facet, so 'rrn' means the first two facets are
reachable, and the third is unreachable.

Previously, during construction or reanimation, we also created an
*Array* of facets, and recorded them in the slotToVal Map (under the
baseref as a key). When doing `convertSlotToVal()`, we'd extract the
baseref from the vref, use it to fetch the array, then use the integer
facet index to extract the specific facet.

This complicated the construction process, because we needed to create
two separate containers for the set of facets, which could have
separate lifetimes. We needed a third object (named `keepAlive`) to
help maintain the cyclic GC-reference subgraph containing all facets
and the user-visible facet record.

With this commit, we discard this extra Array, and store the facet
record itself in slotToVal. `convertSlotToVal()` asks the VRM to
convert the facet index (extracted from the vref) into a facet name,
and then looks it up in the slotToVal record. When registering a
cohort, `registerValue` iterates through the facet names to make sure
we register each separate facet in the reverse-pointing valToSlot
table.

This gets rid of `toRegister`, `toExpose`, and `keepAlive`.
* remove virtualObjectCacheSize from options
* change expectations about when vatstoreSet calls happen
* add some tests
* fix test-cross-facet to expect a different error message
for some reason, these recent changes have caused 20% of 'yarn test'
runs to fail, in test-virtualObjectGC.js, in varying test
functions. This happens even when I run just test-virtualObjectGC.js,
with --serial, --no-worker-threads, --concurrency 1. The symptom is
consistent with GC not successfully releasing the target object.

We never figured out why one engineGC() call seemed to help, but
apparently three calls work even better.
* remove cacheSize / virtualObjectCacheSize from options
* update convertSlotToVal to expect facets object, not array
This affects both static-vat creation options in
`config.vats.NAME.creationOptions.virtualObjectCacheSize`, the
dynamic-vat options to `E(vatAdminSvc).createVat()`, and the vatAdmin
`changeOptions()` facility.
I saw a test failure here, "historical inaccuracy in replay", which
might have been due to GC happening one way in the original, and a
different way in the replay (when running under Node.js). This feels
like an aspect of #5575, and this test isn't trying to exercise
anything about GC, so I'm just going to set defaultReapInterval to
'never' to inhibit BOYD, which should remove the problem.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
automerge:rebase Automatically rebase updates, then merge liveslots requires vat-upgrade to deploy changes SwingSet package: SwingSet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants