
Convert swingstore from LMDB to Sqlite, phase 1 #6561

Merged · 11 commits · Dec 23, 2022
Conversation

@FUDCo (Contributor) commented Nov 13, 2022

This PR is the first phase of the conversion from LMDB to Sqlite (in partial satisfaction of #3087).

This PR includes the replacement of the underlying database engine and the downstream changes that flow from it. This is done in a series of four commits (individually reviewable for clarity), each of which realizes a different step of the transformation.

Step 1 - convert from LMDB to Sqlite, using Sqlite to realize a dumb key-value store that replicates the former LMDB semantics exactly, retaining all of the supporting implementation that assumes a dumb key-value store with LMDB's transaction model.

Step 2 - eliminate the ephemeralSwingStore implementation, used for testing, that realized the dumb key-value store in memory using a JavaScript Map object, replacing it with Sqlite's :memory: pseudo-file which does essentially the same thing (i.e., stores everything ephemerally in RAM) but does it using Sqlite directly.

Step 3 - use Sqlite savepoints for crank-level commits (rather than an in-memory change buffer), with full Sqlite transactions for block-level commits. This eliminates a couple of layers of wrapper objects. A necessary consequence of this is that the crank activity hash is now computed as part of the swingstore itself.

Step 4 - eliminate the rest of the storage wrappers, making the swingstore essentially self contained.
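As a rough illustration of the Step 1/Step 2 shape, the following is a minimal sketch of a dumb key-value store backed by better-sqlite3, with `:memory:` standing in for the old Map-based ephemeral store. This is not the actual swingstore code; the table layout and function names are made up for the example.

```js
import sqlite3 from 'better-sqlite3';

// Sketch only: a dumb key-value store on SQLite that mimics the old LMDB
// semantics. Passing ':memory:' gives the ephemeral (RAM-only) variant.
export function makeSimpleKVStore(dbPath = ':memory:') {
  const db = sqlite3(dbPath);
  db.exec(`CREATE TABLE IF NOT EXISTS kvStore (key TEXT PRIMARY KEY, value TEXT)`);
  const sqlGet = db.prepare('SELECT value FROM kvStore WHERE key = ?').pluck(true);
  const sqlSet = db.prepare(
    'INSERT INTO kvStore (key, value) VALUES (?, ?) ON CONFLICT(key) DO UPDATE SET value = excluded.value',
  );
  const sqlDelete = db.prepare('DELETE FROM kvStore WHERE key = ?');
  return {
    has: key => sqlGet.get(key) !== undefined,
    get: key => sqlGet.get(key),
    set: (key, value) => { sqlSet.run(key, value); },
    delete: key => { sqlDelete.run(key); },
  };
}
```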

The system that results after each step is fully functional in a way that is compatible with the prior behavior of SwingSet.
If this PR is merged into the master branch, the result will be entirely usable.

I am pushing this PR now in order to get review started. However, there is more to be done. In the broadest possible strokes, this remaining work falls into two categories:

  • Short-term changes to:

    • (a) rationalize the transaction model to account more gracefully for crank activity that is pre- or post-delivery, since we now have the ability to do more selective rollback. In particular, we believe there is a potential for improper commitment or non-commitment of pre-crank data mutations. Note that this risk is not new -- it's present with the LMDB-based store as well -- but the switch to Sqlite presents us with the opportunity to ensure that the cases we are concerned about are actually handled correctly.
    • (b) remove the use of iterators that require either the retention of a database cursor during iteration (which we can't tolerate) or fetching the entirety of a SELECT result into RAM before using it (which the implementation in this PR does, but which exposes us to the risk of unbounded memory consumption driven by adversarial vat code), replacing them with something more like the vatstore's getAfter machinery (see the sketch after this list).
  • Long-term changes that exploit the capabilities of the SQL query mechanism in various ways. These include but are not limited to bulk operations for data deletion, improving GC of on-disk data, and removing the vatstore from the consensus SwingSet state. Most of these will require SwingStore API changes or additions, which argues for taking some care in planning them before proceeding, hence their absence from this PR. In addition, we expect further exploration and experimentation to identify additional optimization opportunities.
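As an illustration of what the cursor-free replacement in (b) might look like, keyed pagination re-issues a small bounded query on each call instead of holding a cursor open or materializing the whole result set. This is a sketch only; the table name and helper function are hypothetical, not the API this PR or the follow-up work defines.

```js
// Sketch of cursor-free iteration via keyed pagination (illustrative only).
// Each call runs a fresh bounded SELECT, so no database cursor outlives the
// call and the full result set never has to be held in RAM.
function makeGetNextKey(db) {
  const sqlNext = db
    .prepare('SELECT key FROM kvStore WHERE key > ? ORDER BY key LIMIT 1')
    .pluck(true);
  // Returns the smallest key strictly greater than `previousKey`, or
  // undefined when the range is exhausted.
  return previousKey => sqlNext.get(previousKey);
}
```

A caller walks a range by feeding each returned key back in until the result passes the end of the range, much like the vatstore's existing getAfter/getKeyAfter pattern.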

@FUDCo FUDCo added the SwingSet package: SwingSet label Nov 13, 2022
@FUDCo FUDCo requested a review from warner November 13, 2022 09:35
@FUDCo FUDCo self-assigned this Nov 13, 2022
@mhofman (Member) commented Nov 14, 2022

The system that results after each step is fully functional in a way that is compatible with the prior behavior of SwingSet.

I'm wondering, does that mean that the behavior of a node with the PR applied and one without would be 100% consistent (long-term deterministic)? Even without this, how confident would we be in applying this change to release-pismo? Which raises the question: do we believe there may be a way to upgrade from the current DBs to the new SQLite one? That is, export the content of LMDB into a new SQLite DB and restart from there (either as a manual script or as a built-in upgrade sequence).

@FUDCo (Contributor, Author) commented Nov 14, 2022

I'm wondering, does that mean that the behavior of node with the PR applied, and one without would be 100% consistent (long term deterministic)?

There are three answers to this, which are, respectively: yes, no, and it doesn't matter.

Yes, I've verified that the sequence of database operations is the same and that the action hashes match.

No, there is an exception to the action hashes matching, which is when there is an action in the crank containing a bundle reference and the bundle hash has changed because the code has changed. And of course changing the database engine is a code change. On the other hand, I observed this with tests that themselves had changes, and it's possible that, since the database code changes are all in the kernel, we'll be fine as long as the kernel bundle does not appear in any action hash (which, long term, it must not if we want kernel upgrade to be possible). @warner?

And it doesn't matter (I think) because IIUC we'd be switching database engines as part of the bulldozer upgrade.

@mhofman (Member) commented Nov 14, 2022

which is when there is an action in the crank containing a bundle reference and the bundle hash has changed because the code has changed

Right! This is an interesting case, since this change has an impact on the kernel, mostly due to the swingstore API refactoring from what I gather. I thought, however, that we had removed the kernel bundle from being saved in the DB? Or did this somehow impact the liveslots/supervisor bundle?

And it doesn't matter (I think) because IIUC we'd be switching database engines as part of the bulldozer upgrade.

I was wondering in the context of potentially having to deal with state-sync before a bulldozer upgrade. It sounds like, with some effort, we might be able to swap the DB implementation for release-pismo. In that case we'd need to do other surgery, e.g. extract local XS snapshot hashes and move them into a different section of the DB (after verifying they're deterministic, of course), so it'd definitely not be for the faint of heart.

@FUDCo force-pushed the 3087-sql-swingstore branch from 85fd98a to 4ef54e2 on November 14, 2022 23:58
@warner (Member) commented Nov 19, 2022

Short take: this looks great. I'll start reviewing properly now.

We no longer store the kernel bundle anywhere. We still need to build a kernel bundle (because the kernel runs in its own Compartment, and import-bundle is the only tool we have for loading more than a single eval-ed string into a Compartment), but we now do that on every boot, rather than bundling once during initializeSwingset and storing the result for all subsequent reboots. This makes a kernel upgrade easier, because you just run a new version of the host application (which imports a new version of @agoric/swingset-vat), but of course exports the "only run compatible versions" requirement onto the author of the host app.
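(For context, the boot-time bundling described above looks roughly like the following sketch; the module path, package entry points, and export name are assumptions for illustration, not a quote of the actual controller code.)

```js
import bundleSource from '@endo/bundle-source';
import { importBundle } from '@endo/import-bundle';

// Sketch: rebuild and load the kernel bundle on every boot instead of reading
// a bundle stored by initializeSwingset. Paths and export names are illustrative.
async function loadKernel(kernelEndowments) {
  const kernelBundle = await bundleSource(
    new URL('./kernel/kernel.js', import.meta.url).pathname,
  );
  // import-bundle evaluates the multi-module bundle inside its own Compartment.
  const kernelNS = await importBundle(kernelBundle, {
    endowments: kernelEndowments,
  });
  return kernelNS; // e.g. a buildKernel export; the name is assumed here
}
```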

I'll think about whether this could cause visible behavioral changes as I review. My intention was that we make this switch without worrying about such changes, especially because of my goal to replace vatstore.getKeyAfter with a simpler API, more like getNext, and that would entail a change to liveslots.

@FUDCo and I walked through what the crankBuffer commit/abort replacement will be (which can happen after this lands). We concluded that we will:

  • decouple crankHash from what used to be the crankBuffer, so there's a kernelStorage.getCrankHash() that returns a hash of everything since the last call, and provide some way to fold that into the activityHash (or maybe that's the host's responsibility, not sure yet)
  • the "crank cycle" is the thing that defines crank 1, crank 2, etc
  • we pick the point on this cycle just before we pop something off the run-queue as the start, which is also the state we're in just before controller.run() is called, and also the state we're in just after controller.run() finishes
  • we define two SQLite SAVEPOINTs, the first is at this "idle" point, the second is just after the delivery/etc has been popped off the run-queue (so we need two swing-store APIs, and the kernel will call each of them, once per cycle)
  • then there are three possible outcomes:
    • 1: the delivery finishes normally, and the kernel wants to commit all changes. We call some swing-store API that says "delete all SAVEPOINTs, I won't use them"
    • 2: the delivery goes weird, and the kernel decides to discard the changes, and the delivery/run-queue item was consumed (so the kernel does not want to process it ever again), e.g. a createVat or upgradeVat failing. The kernel calls a swing-store API which rewinds to the second SAVEPOINT (the one taken after the pop). Then the kernel does cleanup work like pushing vat-admin messages about the failure. The swing-store API will also delete the first SAVEPOINT, since we won't be using it, and we don't want to let it clutter up RAM.
    • 3: the delivery goes weird, the kernel decides to discard the changes, but the kernel wants the delivery to be re-attempted later (e.g. a delivery killed the vat, and a re-delivery is the simplest way to provoke the right VAT_TERMINATED error). For this case, the kernel calls a third swing-store API which rewinds to the first SAVEPOINT (taken before the pop), and deletes the second SAVEPOINT (again to avoid clutter). Then the kernel does cleanup work like vat-admin messages.
  • Note that all kvstore changes, committed or abandoned, get rolled into the crankhash. Previously I think we only included committed changes, but we decided/concluded that they're all part of the deterministic behavior. The commit-vs-abort is part of consensus, so the reasons for choosing one vs the other must be within-consensus too.

And @FUDCo points out:

  • Note that in case 3, the rollback to the first SAVEPOINT clears out the second SAVEPOINT as it goes by with no additional action necessary. Also, in case 2, clearing the first SAVEPOINT is not just a matter of memory savings, it’s a backstop against bugs in later code hitting it in a later rollback.
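A minimal sketch of how the two-savepoint scheme above could map onto SQLite statements follows; the savepoint names and wrapper functions are hypothetical, not the swing-store API that eventually landed.

```js
// Sketch of the two-savepoint crank cycle described above (illustrative only).
// SQLite savepoints nest, so rolling back to the outer one also discards the
// inner one (case 3), and releasing the outer one drops both (cases 1 and 2).
function makeCrankSavepoints(db) {
  return {
    // taken at the "idle" point, just before popping the run-queue
    startCrank: () => db.exec('SAVEPOINT crankStart'),
    // taken just after the delivery has been popped off the run-queue
    startDelivery: () => db.exec('SAVEPOINT deliveryStart'),
    // case 1: keep everything; releasing the outer savepoint drops both
    commitCrank: () => db.exec('RELEASE SAVEPOINT crankStart'),
    // case 2: discard the delivery's changes but keep the pop, then release
    // the outer savepoint as a backstop against later rollbacks hitting it
    rollbackDelivery: () => {
      db.exec('ROLLBACK TO SAVEPOINT deliveryStart');
      db.exec('RELEASE SAVEPOINT crankStart');
    },
    // case 3: discard everything since the idle point so the run-queue item
    // can be re-delivered later; the rollback also clears deliveryStart, and
    // releasing crankStart afterwards is one plausible way to return to idle
    rollbackCrank: () => {
      db.exec('ROLLBACK TO SAVEPOINT crankStart');
      db.exec('RELEASE SAVEPOINT crankStart');
    },
  };
}
```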

@warner (Member) left a comment:


Phew, ok sorry that took a while. Some small changes to make, some items for me to investigate further, and a few things to discuss.

packages/swing-store/src/sqlStreamStore.js (resolved)
packages/swing-store/package.json (outdated, resolved)
packages/swing-store/src/swingStore.js (outdated, resolved)
packages/swing-store/src/hasher.js (resolved)
packages/swing-store/test/test-hasher.js (outdated, resolved)
packages/SwingSet/src/kernel/kernel.js (resolved)
packages/SwingSet/src/kernel/kernel.js (resolved)
packages/SwingSet/test/vat-admin/test-create-vat.js (outdated, resolved)
packages/SwingSet/test/test-state.js (resolved)
packages/SwingSet/test/test-state.js (resolved)
@warner (Member) commented Dec 20, 2022

Oh, an update on the WAL-mode and "checkpoint" operations. I just re-read https://sqlite.org/wal.html#performance_considerations , and realized there's a difference between the fsync() durability action and the "checkpoint" consolidate-the-WAL performance action.

It says that if PRAGMA synchronous is set to NORMAL (the default), the DB only does an fsync() during a checkpoint operation. That improves performance, but is not safe against a power failure: you could lose data that was written (and committed), until a checkpoint happens. These checkpoints happen automatically once the WAL file reaches 1000 pages, or when initiated manually, but I don't think we're comfortable with a durability vulnerability window like that.

If we set PRAGMA synchronous to FULL, then it performs an fsync() on every commit(), which sounds like what we want. We only do a real commit() once per block (no faster than once every 5 seconds), so the performance should be fine.

Once we do that, we can stick to automatic checkpoints. These will be opportunistic: if a reader has a transaction open (e.g. to copy data out of the DB, maybe for state-sync purposes), the checkpoint/WAL-compaction might have to stop, but it will pick up where it left off at the next opportunity, and the application doesn't need to know about it.

So:

  • when creating the DB, set PRAGMA synchronous to FULL, right next to where we set PRAGMA journal_mode=WAL
  • when doing a commit, don't force a checkpoint
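In better-sqlite3 terms, that setup might look like the following; the exact placement within swingStore.js is assumed for illustration.

```js
import sqlite3 from 'better-sqlite3';

// Sketch of the pragma setup described above (illustrative placement).
const db = sqlite3('swingstore.sqlite');
db.pragma('journal_mode = WAL'); // WAL mode, as the store already sets
db.pragma('synchronous = FULL'); // fsync() on every commit, not just at checkpoints
// No manual wal_checkpoint at commit time: automatic checkpoints (triggered
// around 1000 WAL pages by default) consolidate the WAL opportunistically.
```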

@FUDCo (Contributor, Author) commented Dec 22, 2022

@warner ready for re-review pending your final analyses of the test changes.

@warner (Member) commented Dec 23, 2022

Ok, that looks good. Maybe rebase one last time. Thanks!

FUDCo and others added 5 commits December 23, 2022 01:12
The refcount changes in "createVat holds refcount" were correct, but
the comments gave the wrong reason. The original version (on trunk)
was wrong, but happened to work because of a failure-to-commit
bug/omission in kpResolution().

After using `c.kpResolution(kpid1)`, the refcount is indeed 3: one
from v1-bootstrap, one from the pin added by kpResolution, and one
from the kpid1 resolution value. Note that kpid1 itself has a zero
refcount by this point, but has not yet been collected, because we
only call processRefcounts() at the end of deliveries, not after the
decrefs performed by kpResolution. This is arguably a bug, but
fortunately we only call kpResolution() from tests.

The original version thought the refcount was 2, and the test only
passed because the incref performed by kpResolution was still sitting
in the crankbuffer, and the test code looked directly at the
kvStore (not the crankbuffer-wrapped version that kernelKeeper
uses). The crankbuffer had refcount=3, the DB had refcount=2, and the
test asserted that it was "2". The test comments didn't take into
account the reference from kpid1 (and probably assumed that kpid1 was
retired by that point). The subsequent delivery allowed the
crankbuffer to be flushed (incrementing), but also allowed
processRefcounts() to run, which removes the kpid1 resolution value
reference (decrementing), making it look like there was no net
refcount change.

After switching to SQLite and removing the crankbuffer, the test is
correctly seeing the incref added by kpResolution, so it must assert
that the count is 3. But the extra refcount didn't come from the
'getHeld' call: v1-bootstrap only has a single c-list entry for
'held', not two as the comments implied (even if multiple objects or
Promises within v1-bootstrap held a reference, they all share a single
valToSlot and c-list entry).
@FUDCo force-pushed the 3087-sql-swingstore branch from 47f1aba to 59de3ab on December 23, 2022 09:12
@FUDCo FUDCo added the automerge:rebase Automatically rebase updates, then merge label Dec 23, 2022
@mergify mergify bot merged commit 0d0f8f3 into master Dec 23, 2022
@mergify mergify bot deleted the 3087-sql-swingstore branch December 23, 2022 10:29
Comment on lines -369 to +539
- closeStreamStore();
- await doCommit(true);
- await db.close();
+ commit();
+ db.close();
A Member commented:
This changes the behavior on close from abort to commit.

A Member replied:
Oh, huh, yeah, the previous true in doCommit(true) triggered an LMDB abort. I must have missed this during review (also, we don't have any tests of abort-on-close(), and we never use it that way; applications just crash instead of calling close). Yeah, we should make this abort.
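One way the abort-on-close behavior could be restored against better-sqlite3 is sketched below; this is illustrative only, not necessarily how the follow-up fix was written.

```js
// Sketch: discard any uncommitted block-level transaction on close instead of
// committing it (assumes the transaction was opened with an explicit BEGIN).
function close(db) {
  if (db.inTransaction) {
    db.exec('ROLLBACK');
  }
  db.close();
}
```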

txnFinish(abort ? lmdbAbort : undefined);
return Promise.resolve(txnDone)
.then(() => {
trace(`${abort ? 'abort' : 'commit'}-tx`);
A Member commented:
This trace was not kept in the change.
