support swingstore rollback, to match 'agd rollback' #7963

Open
warner opened this issue Jun 21, 2023 · 3 comments
Labels
enhancement (New feature or request), SwingSet (package: SwingSet)

Comments

@warner
Member

warner commented Jun 21, 2023

What is the Problem Being Solved?

If a node executes a block badly (e.g. it performed a non-deterministic computation and landed on the losing side of the divergence), it will commit a divergent state, with an AppHash that doesn't match the consensus. The node will not realize this until the next block arrives. A validator will reject all proposed blocks until it sees the weight of the chain is against it (+2/3 votes), at which point it will announce a CONSENSUS FAILURE and stop attempting to process blocks. A follower will reach the same state as soon as it sees the next finalized block and the mismatched AppHash.

To recover a node in this state, we usually need to change the code somehow (e.g. rebuild with a different compiler, or address whatever triggered the divergence), then rewind the state to the previous block, then resume execution. Native cosmos-sdk chains keep all of their state in IAVL, which is a multi-version database, so they can use the rollback command to simply forget about the last block.

For our chain, SwingSet keeps its state in the swing-store SQLite database, not IAVL. The agd rollback command still exists, but it only reverts the IAVL state, which would leave the two databases unsynchronized and probably explode horribly.

It would be great if the swingstore had the ability to rewind at least one block, so that agd rollback could invoke this at the same time.

@mhofman points out that it might be nice to rewind multiple blocks:

If we keep an undo log, or an ability to restore a previous height, we may want to keep more than one block. One issue we still have is that the cosmos DB is not flushed after every commit (for non-signing nodes), so swingstore can end up a few blocks ahead of it after a hardware failure. Here is the issue: #6736

Description of the Design

https://www.sqlite.org/undoredo.html is a SQLite design pattern for creating an "undo buffer": a set of DB triggers which react to any insert/update/delete in a table by inserting, into a different table (the undo table), a SQL statement that will reverse the effect. To rewind the changes, you execute all the statements in that undo table, most recent first, then empty the undo table.

https://github.com/Ocead/sqlite-undo is a C extension which makes the pattern a bit more convenient. (However, getting it built and incorporated into our better-sqlite3 JS binding package might prove to be a net inconvenience.)
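A minimal sketch of the trigger pattern, written in Python with the stdlib sqlite3 module only so that it runs on its own (the real thing would live in swingStore.js on top of better-sqlite3). The `undoLog` table, the trigger names, and the two-column `kvStore` schema are assumptions made for illustration, not the actual swing-store schema.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE kvStore (key TEXT PRIMARY KEY, value TEXT);
-- Hypothetical undo table: each row is one SQL statement that reverses one change.
CREATE TABLE undoLog (seq INTEGER PRIMARY KEY AUTOINCREMENT, sql TEXT NOT NULL);

CREATE TRIGGER kv_undo_insert AFTER INSERT ON kvStore BEGIN
  INSERT INTO undoLog (sql)
  VALUES ('DELETE FROM kvStore WHERE key=' || quote(new.key));
END;
CREATE TRIGGER kv_undo_update AFTER UPDATE ON kvStore BEGIN
  INSERT INTO undoLog (sql)
  VALUES ('UPDATE kvStore SET key=' || quote(old.key) || ', value=' || quote(old.value)
          || ' WHERE key=' || quote(new.key));
END;
CREATE TRIGGER kv_undo_delete AFTER DELETE ON kvStore BEGIN
  INSERT INTO undoLog (sql)
  VALUES ('INSERT INTO kvStore (key, value) VALUES ('
          || quote(old.key) || ', ' || quote(old.value) || ')');
END;
""")


def undo_all(db):
    """Replay the undo statements newest-first, then empty the undo table."""
    rows = db.execute("SELECT sql FROM undoLog ORDER BY seq DESC").fetchall()
    for (stmt,) in rows:
        db.execute(stmt)
    # The replay itself fires the triggers, so this also drops the rows it generated.
    db.execute("DELETE FROM undoLog")
    db.commit()


# Baseline state that we want to get back to.
db.execute("INSERT INTO kvStore VALUES ('a', '1')")
db.execute("DELETE FROM undoLog")
db.commit()

# Changes made by the "bad block".
db.execute("UPDATE kvStore SET value = '2' WHERE key = 'a'")
db.execute("INSERT INTO kvStore VALUES ('b', '9')")
db.execute("DELETE FROM kvStore WHERE key = 'a'")
db.commit()

undo_all(db)
state = db.execute("SELECT key, value FROM kvStore ORDER BY key").fetchall()
print(state)
```

One caveat for this approach: if any table is written with INSERT OR REPLACE, SQLite only fires the delete triggers for the replaced rows when recursive triggers are enabled, so the undo log could miss the old value.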

The basic approach would be:

  • teach swingstore about a "commit number", so hostStorage.commit() either returns or accepts it
    • these will really be block numbers, but swingstore doesn't know about blocks
    • initializing a swingstore from a tendermint state-sync snapshot will start at a non-zero commit number
    • the swingstore should include the current commit number in a separate table somewhere, so we can always tell how up-to-date the swingstore is, independently from the IAVL data
  • add the undo table, indexed by commit number
  • add the triggers, adding entries to the undo table under the commit number
    • I think we'll need separate triggers for each swingstore table: kvStore, snapstore, transcriptStore, bundlestore, and any added in the future
    • we should prune the rollback information after a handful of blocks, to avoid consuming too much space
      • hostStorage.commit() should delete the old rows before doing the commit
    • the undo table should not be included in the export-data, nor in a state-sync dump
  • add a hostStorage.rollback(commitNumber) API
    • this will execute the statements from the rollback buffer, then delete them
    • it should do a commit at the end of the process
    • it must be run outside of any other commit (ideally nobody holds the kernelStorage facet while a rollback is happening)
  • change the agd rollback command to somehow create a JS context, build a swingstore around the right directory, and invoke hostStorage.rollback()
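The steps above could fit together roughly as follows. This is a sketch under stated assumptions, in Python with the stdlib sqlite3 module rather than the real swingStore.js: `commitMeta`, `undoLog`, `KEEP_COMMITS`, and the `commit()`/`rollback()` functions are hypothetical stand-ins for the proposed hostStorage methods, only a two-column `kvStore` is covered, and commit numbers are assumed to be contiguous.

```python
import sqlite3

KEEP_COMMITS = 3  # hypothetical rewind limit, in commits

SCHEMA = """
CREATE TABLE kvStore (key TEXT PRIMARY KEY, value TEXT);
-- Single-row table recording how up-to-date the store is.
CREATE TABLE commitMeta (lastCommit INTEGER NOT NULL, oldestCommit INTEGER NOT NULL);
CREATE TABLE undoLog (
  seq INTEGER PRIMARY KEY AUTOINCREMENT,
  commitNumber INTEGER NOT NULL,  -- the commit this change will become part of
  sql TEXT NOT NULL               -- statement that reverses one change
);
CREATE TRIGGER kv_ins AFTER INSERT ON kvStore BEGIN
  INSERT INTO undoLog (commitNumber, sql) VALUES (
    (SELECT lastCommit + 1 FROM commitMeta),
    'DELETE FROM kvStore WHERE key=' || quote(new.key));
END;
CREATE TRIGGER kv_upd AFTER UPDATE ON kvStore BEGIN
  INSERT INTO undoLog (commitNumber, sql) VALUES (
    (SELECT lastCommit + 1 FROM commitMeta),
    'UPDATE kvStore SET key=' || quote(old.key) || ', value=' || quote(old.value)
      || ' WHERE key=' || quote(new.key));
END;
CREATE TRIGGER kv_del AFTER DELETE ON kvStore BEGIN
  INSERT INTO undoLog (commitNumber, sql) VALUES (
    (SELECT lastCommit + 1 FROM commitMeta),
    'INSERT INTO kvStore (key, value) VALUES ('
      || quote(old.key) || ', ' || quote(old.value) || ')');
END;
"""


def open_store(initial_commit=0):
    """A store restored from a state-sync snapshot would pass a non-zero start."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    db.execute("INSERT INTO commitMeta VALUES (?, ?)", (initial_commit, initial_commit))
    db.commit()
    return db


def commit(db, n):
    last, oldest = db.execute("SELECT lastCommit, oldestCommit FROM commitMeta").fetchone()
    if n != last + 1:
        raise ValueError(f"expected commit {last + 1}, got {n}")
    floor = max(oldest, n - KEEP_COMMITS)  # oldest commit we can still rewind to
    db.execute("DELETE FROM undoLog WHERE commitNumber <= ?", (floor,))  # prune first
    db.execute("UPDATE commitMeta SET lastCommit = ?, oldestCommit = ?", (n, floor))
    db.commit()


def rollback(db, target):
    last, oldest = db.execute("SELECT lastCommit, oldestCommit FROM commitMeta").fetchone()
    if target == last:
        return False  # already there: a NOP
    if not oldest <= target < last:
        raise ValueError(f"cannot rewind to {target}: history covers {oldest}..{last}")
    rows = db.execute(
        "SELECT sql FROM undoLog WHERE commitNumber > ? ORDER BY seq DESC", (target,)
    ).fetchall()
    for (stmt,) in rows:
        db.execute(stmt)
    # Also removes the entries that the replay itself generated via the triggers.
    db.execute("DELETE FROM undoLog WHERE commitNumber > ?", (target,))
    db.execute("UPDATE commitMeta SET lastCommit = ?", (target,))
    db.commit()
    return True


db = open_store()
db.execute("INSERT INTO kvStore VALUES ('a', '1')")
commit(db, 1)
db.execute("UPDATE kvStore SET value = '2' WHERE key = 'a'")
db.execute("INSERT INTO kvStore VALUES ('b', '1')")
commit(db, 2)
db.execute("DELETE FROM kvStore WHERE key = 'a'")
commit(db, 3)

rolled = rollback(db, 1)
state = db.execute("SELECT key, value FROM kvStore ORDER BY key").fetchall()
print(rolled, state)
```

Recording `oldestCommit` separately matters because a block with no swingstore changes leaves no undo rows, so the presence of rows alone cannot tell us whether the history reaches back far enough.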

We should think about what happens if agd rollback gets interrupted between the two DB commit points. If the swingstore rollback is a NOP when the requested block height already matches the recorded commit number, and the same is true for IAVL, then rollbacks are relatively idempotent. Interrupting the pair will cause a temporary mismatch (and an unusable .agoric/state/ directory), but re-running the agd rollback will finish rewinding the second DB and yield a functioning state vector once again.
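To make the interruption argument concrete, here is a toy model in Python. `FakeStore` and `rollback_pair` are hypothetical stand-ins that track only a committed height, and the ordering (swingstore first) is just one possible choice; nothing here reflects how agd is actually wired up.

```python
class FakeStore:
    """Stand-in for either database: remembers only its committed height."""

    def __init__(self, name, height):
        self.name = name
        self.height = height

    def rollback(self, target):
        if target == self.height:
            return False  # already at the requested height: a NOP
        if target > self.height:
            raise ValueError(f"{self.name} is at {self.height}, cannot go to {target}")
        self.height = target
        return True


def rollback_pair(swingstore, iavl, target, interrupt_between=False):
    """Rewind both stores; optionally simulate a crash between the two commit points."""
    swingstore.rollback(target)
    if interrupt_between:
        raise KeyboardInterrupt("simulated interruption between the two DB commits")
    iavl.rollback(target)


swingstore = FakeStore("swingstore", 10)
iavl = FakeStore("iavl", 10)

try:
    rollback_pair(swingstore, iavl, 9, interrupt_between=True)
except KeyboardInterrupt:
    pass
mismatch = (swingstore.height, iavl.height)  # temporarily out of sync

# Re-running the same command is a NOP for the first store and finishes the second.
rollback_pair(swingstore, iavl, 9)
final = (swingstore.height, iavl.height)
print(mismatch, final)
```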

Security Considerations

None: this is only reachable from outside the node/validator, where the operator running agd already has complete control over the state of their node.

Scaling Considerations

The undo buffer probably won't grow too large if we cap the rewind depth at a few blocks. We certainly need enough indexes (CREATE INDEX) to let us delete the old rows (those past the rewind limit) efficiently.
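As a rough illustration of the indexing point, the sketch below (Python, stdlib sqlite3) builds a hypothetical `undoLog` table with an index on its commit number, prints the query plan SQLite picks for the pruning DELETE, and then prunes. The table layout, index name, and `REWIND_LIMIT` are assumptions for the example.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE undoLog (
  seq INTEGER PRIMARY KEY AUTOINCREMENT,
  commitNumber INTEGER NOT NULL,
  sql TEXT NOT NULL
);
-- Lets the prune find old rows without scanning the whole table.
CREATE INDEX undoLog_commitNumber ON undoLog (commitNumber);
""")

# 100 placeholder undo rows for each of commits 1..10.
db.executemany(
    "INSERT INTO undoLog (commitNumber, sql) VALUES (?, ?)",
    [(n, "-- placeholder undo statement") for n in range(1, 11) for _ in range(100)],
)
db.commit()

REWIND_LIMIT = 3  # hypothetical: keep undo data for the last 3 commits
current = 10
prune_sql = "DELETE FROM undoLog WHERE commitNumber <= ?"

# Show which access path the planner chooses for the prune.
for row in db.execute("EXPLAIN QUERY PLAN " + prune_sql, (current - REWIND_LIMIT,)):
    print(row[-1])

db.execute(prune_sql, (current - REWIND_LIMIT,))
db.commit()
remaining = [r[0] for r in db.execute(
    "SELECT DISTINCT commitNumber FROM undoLog ORDER BY commitNumber")]
print(remaining)
```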

Deployment Considerations

The first time a swingstore is created with the new version (containing this feature), swingStore.js will create the undo tables and the triggers. They should not be visible to cosmic-swingset (neither during normal runtime, e.g. export-data or activityhash, nor during a state-sync export), so they should not impact consensus on the running chain. They should also not appear in the state-sync export, so should not affect consensus of the subsequent state-sync importers.

If the agd rollback command is run on a swingstore that, for whatever reason, does not yet have an undo table, or that lacks sufficient undo entries to reach the requested block height, the rollback command should fail. It must not change the IAVL state in this case, which is probably an argument for doing the swingstore rewind first, and only doing the IAVL rewind if it succeeds. (Imagine someone upgrading their code, then attempting to roll back, so the DB lacks the historical data needed to fulfill the request.)
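One way to get the "fail before touching IAVL" behavior is a pre-flight check on the swingstore. The sketch below is in Python with the stdlib sqlite3 module; the `undoLog` and `commitMeta` tables and the `agd_rollback` wrapper with its two callbacks are hypothetical, chosen only to show the ordering.

```python
import sqlite3


def check_can_rollback(db, target):
    """Raise before anything is modified if the swingstore cannot reach `target`."""
    has_undo = db.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = 'undoLog'"
    ).fetchone()
    if has_undo is None:
        raise RuntimeError("swingstore has no undo table; refusing to roll back")
    last, oldest = db.execute(
        "SELECT lastCommit, oldestCommit FROM commitMeta"
    ).fetchone()
    if not oldest <= target <= last:
        raise RuntimeError(
            f"undo history covers {oldest}..{last}, cannot reach {target}")


def agd_rollback(db, target, rollback_swingstore, rollback_iavl):
    check_can_rollback(db, target)  # fails here, before either DB changes
    rollback_swingstore(target)     # swingstore first
    rollback_iavl(target)           # IAVL only if the swingstore rewind succeeded


# A swingstore created before the feature existed: no undo table at all.
old_db = sqlite3.connect(":memory:")
old_db.execute("CREATE TABLE kvStore (key TEXT PRIMARY KEY, value TEXT)")

calls = []
failure = None
try:
    agd_rollback(
        old_db, 9,
        lambda t: calls.append(("swingstore", t)),
        lambda t: calls.append(("iavl", t)),
    )
except RuntimeError as err:
    failure = str(err)
print(failure, calls)
```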

Test Plan

  • unit tests in packages/swing-store
  • some kind of agd test (I'm not sure how to build it) that executes a block, halts at a given height, runs agd rollback, inspects both the swingstore and the IAVL state to make sure they are correct, then resumes the chain
    • the test must make sure to cause swingstore changes in the block being rewound: it should not be empty

warner added the enhancement (New feature or request) and SwingSet (package: SwingSet) labels on Jun 21, 2023
@mhofman
Member

mhofman commented Jun 21, 2023

I had opened #7951 a couple of days ago.

@mhofman
Member

mhofman commented Jul 10, 2023

As mentioned in #7951, it may also be possible to achieve this through WAL abuses, possibly coupled with a VFS.

We also need to think through the impact on future parallel vat execution schemes (#6447)

@mhofman
Member

mhofman commented Sep 27, 2024

Cloudflare is another example of a system that uses the WAL file plus a VFS, in their case to manage replication of a SQLite DB, with compaction when the log grows too large. I'm not convinced it's worth the cost in our case, as an undo log would work just as well.
