Off the WAL, on the FLOOR #116
-
I implemented this in a private branch, and it's working! I'm calling it FLOOR: Fragmented Logging On OPFS Realized. 😀
This was a bit overconfident. I certainly wouldn't describe building it as easy. Part of it was starting with a design that mimicked the SQLite C design too much, when I really needed to rethink the whole approach for the web platform. Another part was just the general trickiness of how a VFS figures out what SQLite is doing at a high level based on the stream of low-level method calls - e.g. using xLock/xUnlock to infer transaction boundaries doesn't work. And finally, Chrome Dev Tools crashed a lot, I'm guessing because of JSPI, which made debugging difficult.

My revised design uses IndexedDB for write-ahead log metadata, while the write-ahead log page data goes into an OPFS file. This works really well. IndexedDB is atomic, consistent, and ordered, which simplifies the logic and reduces blocking.

IndexedDB doesn't require fixed-size records, so FLOOR can work with transaction indices instead of WAL frame indices, and that gives FLOOR a feature that WAL doesn't have: no checkpointing needed in most cases. For typical usage, FLOOR readers should never block a writer, and a writer should never block a reader. Real WAL is the same most of the time, but to keep the write-ahead log from growing forever, a full checkpoint occasionally needs to block everyone in order to reset the log.

Because FLOOR retains transaction knowledge in its WAL lookup data structures, it doesn't need to write the log file sequentially, so it can reuse the space no longer needed by any connection - hence the F for "fragmented" in FLOOR. The FLOOR log should not grow unbounded, except in some degenerate cases (e.g. a connection keeps its read transaction open forever) which applications can hopefully avoid (an explicit checkpoint could be used to recover from those).

I don't think using IndexedDB will be a huge drag on performance, though actual performance measurements are still TBD. FLOOR doesn't need to sync on IndexedDB writes - it only relies on ordering - so those should be fast. IndexedDB reads are incremental and intelligent: only the metadata for transactions committed by other connections since the previous lock is fetched, and the transaction metadata isn't large, so I would hope it can fit in the IndexedDB cache for most applications.

This isn't ready for anyone to use - it's a proof of concept that only works on Chrome Canary with the right flags enabled - so I don't know what I'm going to do with it. I do find it pretty interesting and entertaining how WAL-like a VFS can be without actually supporting the library's WAL, including no external rollback journal, fewer flushes, and better concurrency. When all the pieces are mature, I think some variation of this will be the best solution for SQLite persistence, unless you need to run in a context without FileSystemSyncAccessHandle (i.e. anything except a dedicated Worker).
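To make the commit path concrete, here's a minimal sketch of how a FLOOR-style commit could look. This is illustrative, not FLOOR's actual code: the names (logHandle, metaDb, freeOffsets) and the metadata schema are hypothetical, and it assumes a dedicated Worker with an open FileSystemSyncAccessHandle on the OPFS log file plus an IndexedDB object store keyed by txId.

```js
// Hypothetical commit path: page data goes to the OPFS log file (reusing
// freed slots, hence "fragmented"), then the transaction metadata commits
// atomically in IndexedDB.
function commitTransaction(txId, pages /* Map<pgno, Uint8Array> */) {
  // 1. Write page data into the log. The log need not be sequential because
  //    the metadata below records where every page landed.
  const locations = {};
  for (const [pgno, data] of pages) {
    const offset = freeOffsets.pop() ?? logHandle.getSize();
    logHandle.write(data, { at: offset });
    locations[pgno] = offset;
  }
  logHandle.flush();  // the one sync per transaction (skip for relaxed durability)

  // 2. Commit the metadata. IndexedDB transactions are atomic and ordered,
  //    so no explicit sync is needed: a reader that sees this record also
  //    sees every earlier one.
  return new Promise((resolve, reject) => {
    const tx = metaDb.transaction('transactions', 'readwrite');
    tx.objectStore('transactions').put({ txId, locations });
    tx.oncomplete = resolve;
    tx.onerror = () => reject(tx.error);
  });
}
```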
-
Update 1/31/24: Some of the information in this post is out of date. I reimplemented FLOOR from scratch, and a new demo is here with Asyncify or here with JSPI.

Online FLOOR proof-of-concept demo

IMPORTANT! This demo runs on Chrome 120 (currently in the Canary channel) with these Chrome flags enabled:
These browser features are in development and are unstable (Aw Snap! crashes in garbage collection are common at Chrome 120.0.6079.0), but the demo page is online here.

What to try

Multiple browser tabs can be open at once, and you can step through multi-statement transactions (e.g. beginning with BEGIN). If you open the Dev Tools console, every VFS method call is logged there at the "Info" level.
PRAGMA settings

FLOOR intercepts some of the pre-existing SQLite PRAGMA statements.
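For context on how a VFS can do this: SQLite passes every PRAGMA through the xFileControl method with the SQLITE_FCNTL_PRAGMA opcode before applying its built-in handling. A rough sketch of the shape (decodePragma is a hypothetical helper; this is not FLOOR's actual code):

```js
const SQLITE_OK = 0;
const SQLITE_NOTFOUND = 12;
const SQLITE_FCNTL_PRAGMA = 14;  // real sqlite3.h value

// Sketch only: intercept one PRAGMA in the VFS, let everything else through.
function xFileControl(fileId, op, pArg) {
  if (op === SQLITE_FCNTL_PRAGMA) {
    const [name, value] = decodePragma(pArg);  // hypothetical string decoding helper
    if (name === 'journal_mode') {
      // Claim this PRAGMA so SQLite's own journaling doesn't conflict with
      // the VFS's write-ahead scheme.
      return SQLITE_OK;
    }
  }
  return SQLITE_NOTFOUND;  // default SQLite behavior for everything else
}
```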
Reset

To remove all persistent state and reset the page completely, close all tabs and load the page with the URL search parameter "clear", i.e.: https://rhashimoto.github.io/floor-demo/demo/?clear

This will let you start completely over with an empty file system and IndexedDB. Note that this will likely hang or fail if any other tabs are open - if that happens, close the other tabs and retry the clear.
-
this is such badass work -- thanks for your relentless exploration! way above my skill level. I tested https://rhashimoto.github.io/floor-demo/demo/benchmarks on:
*Android had much better stability, yet it would crash consistently on test 9.

On macOS Canary, it never made it past the second test, e.g.:
-
Chrome Canary's JSPI implementation is still too unstable to run benchmarks without crashing the page. Some of the numbers it did produce before crashing looked terrible, so I built FLOOR for Asyncify to see whether those flashes of poor performance were a FLOOR problem or a Chrome problem. So far it looks like the performance issues are in Chrome's JSPI, as the Asyncify runs look reasonable. Here's a screenshot of Asyncify FLOOR performance compared with other contenders:

Note that I tweaked the PRAGMA configuration just a bit:
Some notes about these results:
-
FLOOR was a great start, but OPFSPermutedVFS is looking like a better performer on the same workloads. OPFSPermutedVFS takes key ideas from FLOOR - using IndexedDB for write-ahead log metadata, Web Locks to track the view state of other connections, and non-sequential logging - and adds the lazy locking of OPFSAdaptiveVFS plus the fundamental idea of writing ahead directly to the database file.

OPFSPermutedVFS is more concurrent than FLOOR because read transactions don't access IndexedDB at all (otherwise reads could be temporarily blocked waiting for an IndexedDB write transaction to commit). In addition, OPFSPermutedVFS writes database pages only once, while FLOOR writes to the log and later copies pages from the log to the database file on a checkpoint.

FLOOR also currently has a race condition bug that appears under heavy stress testing. I haven't yet been able to figure that out, and the ascendance of OPFSPermutedVFS makes fixing it much less important, so I have removed FLOOR from the repo, at least for now.
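To illustrate the "permuted" idea as described above, here's a toy sketch under my reading of the design - the names and details are mine, not the actual OPFSPermutedVFS code: each page is written once to whatever slot in the database file is free, and a page map records the permutation so reads can find it.

```js
const PAGE_SIZE = 4096;
const pageMap = new Map();  // pgno -> slot index in the database file

// Reads consult the map instead of assuming page n sits at offset (n-1)*PAGE_SIZE.
function readPage(handle, pgno, buffer) {
  const slot = pageMap.get(pgno) ?? (pgno - 1);  // unmoved pages are in place
  handle.read(buffer, { at: slot * PAGE_SIZE });
}

// Writes go once, directly into a free slot of the database file; the updated
// map would be published to other connections via the IndexedDB metadata.
function writePage(handle, pgno, data, freeSlots) {
  const slot = freeSlots.pop() ?? handle.getSize() / PAGE_SIZE;
  handle.write(data, { at: slot * PAGE_SIZE });
  pageMap.set(pgno, slot);
}
```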
-
For anyone unfamiliar with the wordplay in this post's subject ("Off the WAL" as in "off the wall"), here's the backstory:
In discussing a prototype OPFS VFS using experimental browser features in Chrome, I speculated:
I was hoping that actual shared-address-space memory was not required, and data could just be merged in/out whenever the lock/unlock/barrier methods were called. After reading through the SQLite WAL source, however, I'm thinking that might not be sufficient. SQLite uses atomic loads and stores on the shared memory buffer that bypass locking, and without understanding this fairly complicated piece of code a lot better than I do, I'm not confident that waiting to synchronize until a method is called is safe.
The web platform does support sharing WebAssembly memory, so that is one implementation path. But going that route would also incur restrictive and annoying security constraints (cross-origin isolation) that I really want to avoid.
How do we work around that? Well, we could patch the SQLite source to make sure it calls us to synchronize when doing those atomic loads and stores. I think that's not as difficult as it sounds, but it's still not something I'm excited about. So instead of digging deeper into workarounds to support WAL, I'm thinking of an approach that is like WAL but can be implemented entirely in a VFS: using batch atomic writes to a companion file.
The idea is to tell SQLite to use batch atomic writes for write transactions, and have the VFS divert those writes to a separate file (essentially a WAL file, possibly with exactly the same header and envelope just to avoid reinventing the wheel). Then the VFS would implement the WAL logic, except for the part using shared memory.
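The hooks for this are real SQLite interfaces: a VFS opts in by reporting SQLITE_IOCAP_BATCH_ATOMIC from xDeviceCharacteristics, and SQLite then brackets eligible write transactions with BEGIN/COMMIT/ROLLBACK_ATOMIC_WRITE file controls. A sketch of the diversion (the constants are real sqlite3.h values; the buffering and appendToLog logic is my own illustration):

```js
const SQLITE_IOCAP_BATCH_ATOMIC = 0x00004000;
const SQLITE_FCNTL_BEGIN_ATOMIC_WRITE = 31;
const SQLITE_FCNTL_COMMIT_ATOMIC_WRITE = 32;
const SQLITE_FCNTL_ROLLBACK_ATOMIC_WRITE = 33;
const SQLITE_OK = 0;
const SQLITE_NOTFOUND = 12;

class BatchAtomicFile {
  batch = null;

  xDeviceCharacteristics() {
    return SQLITE_IOCAP_BATCH_ATOMIC;  // invites SQLite to use batch atomic commit
  }

  xFileControl(op, pArg) {
    switch (op) {
      case SQLITE_FCNTL_BEGIN_ATOMIC_WRITE:
        this.batch = [];  // start collecting the transaction's writes
        return SQLITE_OK;
      case SQLITE_FCNTL_COMMIT_ATOMIC_WRITE:
        this.appendToLog(this.batch);  // hypothetical: append to the companion file
        this.batch = null;
        return SQLITE_OK;
      case SQLITE_FCNTL_ROLLBACK_ATOMIC_WRITE:
        this.batch = null;  // just drop the buffered writes
        return SQLITE_OK;
    }
    return SQLITE_NOTFOUND;
  }

  xWrite(data, offset) {
    if (this.batch) {
      this.batch.push({ offset, data: data.slice() });  // divert; don't touch the database file
      return SQLITE_OK;
    }
    return this.writeThrough(data, offset);  // hypothetical non-batched path
  }
}
```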
What replaces the shared memory part? WAL uses shared memory for two things:
1. providing a hash table for a fast database page locator, and
2. tracking the oldest transaction a reader considers current (which must not be overwritten by a checkpoint).

For page location, I believe that the VFS can manage a separate page locator for each database connection, with updates delivered on each transaction via BroadcastChannel. For tracking which transactions readers consider current, I believe that each reader can acquire a Web Locks API shared lock whose name includes the WAL frame index, and then LockManager.query() can be used to scan for the earliest such lock still held, as sketched below.
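Here's a sketch of that read-mark idea using the Web Locks API (illustrative only; the lock name prefix and the single-mark bookkeeping are mine):

```js
const PREFIX = 'floor-read-mark-';  // hypothetical lock name prefix

// Each reader holds a shared lock whose name encodes the WAL frame index it
// considers current. One mark at a time, for simplicity.
let releaseMark;
function acquireReadMark(frameIndex) {
  return new Promise(acquired => {
    navigator.locks.request(`${PREFIX}${frameIndex}`, { mode: 'shared' }, () => {
      acquired();
      // Hold the lock until releaseMark() is called at end of transaction.
      return new Promise(release => { releaseMark = release; });
    });
  });
}

// A checkpointer scans held locks for the earliest mark; frames at or after
// it must be preserved, while older space can be reused.
async function oldestReadMark() {
  const { held } = await navigator.locks.query();
  const indices = held
    .filter(lock => lock.name.startsWith(PREFIX))
    .map(lock => Number(lock.name.slice(PREFIX.length)));
  return indices.length ? Math.min(...indices) : Infinity;
}
```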
There will be a few corner cases to handle, especially what happens for writes outside a batch atomic write transaction (an easy solution is to perform a full checkpoint first), but those are the basics of the idea. Unless I'm missing something, it doesn't really seem that difficult.
How is this not the same as WAL? One difference is that WAL doesn't use a journal, while a batch atomic transaction does keep an in-memory journal in the page cache (unnecessarily for our purposes), so you won't get the WAL-like behavior if the cache is too small.
What WAL advantages does this retain? OPFS with a rollback journal has higher write transaction overhead, including writing journal data and issuing 4 syncs per transaction. The outlined approach doesn't write a journal and requires 1 sync per transaction (or 0 if reduced durability is acceptable), plus a data copy and 2 syncs per checkpoint. In typical usage this should be faster even with a single connection. With multiple connections, readers don't block a writer and a writer doesn't block readers, except for some checkpointing operations.
Apparently, just when we get new browser features to make SQLite persistent storage easy, I need to find a way to make it complicated again. 😛