Off the WAL, on the FLOOR #116
-
I implemented this in a private branch, and it's working! I'm calling it FLOOR: Fragmented Logging On OPFS Realized. 😀
This was a bit overconfident. I certainly wouldn't describe building it as easy. Part of it was starting with a design that mimicked the SQLite C design too much, when I really needed to rethink the whole approach for the web platform. Another part was just the general trickiness of how a VFS figures out what SQLite is doing at a high level based on the stream of low-level method calls - e.g. using xLock/xUnlock to infer transaction boundaries doesn't work. And finally, Chrome Dev Tools crashed a lot, I'm guessing because of JSPI, which made debugging difficult.

My revised design uses IndexedDB for write-ahead log metadata, while the write-ahead log page data goes into an OPFS file. This works really well. IndexedDB is atomic, consistent, and ordered, which simplifies the logic and reduces blocking.

IndexedDB doesn't require fixed-size records, so FLOOR can work with transaction indices instead of WAL frame indices, and that gives FLOOR a feature that WAL doesn't have: no checkpointing needed in most cases. For typical usage, FLOOR readers should never block a writer, and a writer should never block a reader. Real WAL is the same most of the time, but to keep the write-ahead log from growing forever, a full checkpoint occasionally needs to block everyone in order to reset the log.

Because FLOOR retains transaction knowledge in its WAL lookup data structures, it doesn't need to write the log file sequentially, so it can reuse the space no longer needed by any connection - hence the F for "fragmented" in FLOOR. The FLOOR log should not grow unbounded, except in some degenerate cases (e.g. a connection keeps its read transaction open forever) which applications can hopefully avoid (an explicit checkpoint could be used to recover from those).

I don't think using IndexedDB will be a huge drag on performance, though actual performance measurements are still TBD. FLOOR doesn't need to sync on IndexedDB writes - it only relies on ordering - so those should be fast. IndexedDB reads are incremental and intelligent: only the metadata for transactions committed by other connections since the previous lock is fetched, and the transaction metadata isn't large, so I would hope it can fit in the IndexedDB cache for most applications.

This isn't ready for anyone to use - it's a proof of concept that only works on Chrome Canary with the right flags enabled - so I don't know what I'm going to do with it. I do find it pretty interesting and entertaining how WAL-like a VFS can be without actually supporting the library's WAL, including no external rollback journal, fewer flushes, and better concurrency. When all the pieces are mature, I think some variation of this will be the best solution for SQLite persistence, unless you need to run in a context without FileSystemSyncAccessHandle (i.e. anything except a dedicated Worker).
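To make the commit path concrete, here's a minimal sketch of how a FLOOR-style commit could look. This is illustrative, not FLOOR's actual code: the names (logHandle, metaDb, freeOffsets) and the metadata schema are hypothetical, and it assumes a dedicated Worker with an open FileSystemSyncAccessHandle on the OPFS log file plus an IndexedDB object store keyed by txId.

```js
// Hypothetical commit path: page data goes to the OPFS log file (reusing
// freed slots, hence "fragmented"), then the transaction metadata commits
// atomically in IndexedDB.
function commitTransaction(txId, pages /* Map<pgno, Uint8Array> */) {
  // 1. Write page data into the log. The log need not be sequential because
  //    the metadata below records where every page landed.
  const locations = {};
  for (const [pgno, data] of pages) {
    const offset = freeOffsets.pop() ?? logHandle.getSize();
    logHandle.write(data, { at: offset });
    locations[pgno] = offset;
  }
  logHandle.flush();  // the one sync per transaction (skip for relaxed durability)

  // 2. Commit the metadata. IndexedDB transactions are atomic and ordered,
  //    so no explicit sync is needed: a reader that sees this record also
  //    sees every earlier one.
  return new Promise((resolve, reject) => {
    const tx = metaDb.transaction('transactions', 'readwrite');
    tx.objectStore('transactions').put({ txId, locations });
    tx.oncomplete = resolve;
    tx.onerror = () => reject(tx.error);
  });
}
```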
-
Update 1/31/24: Some of the information in this post is out of date. I reimplemented FLOOR from scratch, and a new demo is here with Asyncify or here with JSPI.

Online FLOOR proof-of-concept demo

IMPORTANT! This demo runs on Chrome 120 (currently in the Canary channel) with these Chrome flags enabled:
These browser features are in development and are unstable (Aw Snap! crashes in garbage collection are common at Chrome 120.0.6079.0), but the demo page is online here.

What to try

Multiple browser tabs can be open at once, and you can step through multi-statement transactions (e.g. beginning with BEGIN). If you open the Dev Tools console, every VFS method call is logged there at the "Info" level.
PRAGMA settings

FLOOR intercepts some of the pre-existing SQLite PRAGMA statements.
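For context on how a VFS can do this: SQLite passes every PRAGMA through the xFileControl method with the SQLITE_FCNTL_PRAGMA opcode before applying its built-in handling. A rough sketch of the shape (decodePragma is a hypothetical helper; this is not FLOOR's actual code):

```js
const SQLITE_OK = 0;
const SQLITE_NOTFOUND = 12;
const SQLITE_FCNTL_PRAGMA = 14;  // real sqlite3.h value

// Sketch only: intercept one PRAGMA in the VFS, let everything else through.
function xFileControl(fileId, op, pArg) {
  if (op === SQLITE_FCNTL_PRAGMA) {
    const [name, value] = decodePragma(pArg);  // hypothetical string decoding helper
    if (name === 'journal_mode') {
      // Claim this PRAGMA so SQLite's own journaling doesn't conflict with
      // the VFS's write-ahead scheme.
      return SQLITE_OK;
    }
  }
  return SQLITE_NOTFOUND;  // default SQLite behavior for everything else
}
```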
Reset

To remove all persistent state and reset the page completely, close all tabs and load the page with the URL search parameter "clear", i.e.: https://rhashimoto.github.io/floor-demo/demo/?clear

This will let you start completely over with an empty file system and IndexedDB. Note that this will likely hang or fail if any other tabs are open - if that happens, close the other tabs and retry the clear.
-
this is such badass work -- thanks for your relentless exploration! way above my skill level. I tested https://rhashimoto.github.io/floor-demo/demo/benchmarks on:
*Android had much better stability, yet it would crash consistently on test 9.

On macOS Canary, it never made it past the second test, e.g.:
-
Chrome Canary's JSPI implementation is still too unstable to run benchmarks without crashing the page. Some of the numbers it did produce before crashing looked terrible, so I built FLOOR for Asyncify to see whether those flashes of poor performance were a FLOOR problem or a Chrome problem. So far it looks like the performance issues are in Chrome's JSPI, as the Asyncify runs look reasonable. Here's a screenshot of Asyncify FLOOR performance compared with other contenders:

Note that I tweaked the PRAGMA configuration just a bit:
Some notes about these results:
-
FLOOR was a great start, but OPFSPermutedVFS is looking like a better performer on the same workloads. OPFSPermutedVFS takes key ideas from FLOOR - using IndexedDB for write-ahead log metadata, Web Locks to track the view state of other connections, and non-sequential logging - and adds the lazy locking of OPFSAdaptiveVFS plus the fundamental idea of writing ahead directly to the database file.

OPFSPermutedVFS is more concurrent than FLOOR because read transactions don't access IndexedDB at all (otherwise reads could be temporarily blocked waiting for an IndexedDB write transaction to commit). In addition, OPFSPermutedVFS writes database pages only once, while FLOOR writes to the log and later copies pages from the log to the database file on a checkpoint.

FLOOR also currently has a race condition bug that appears under heavy stress testing. I haven't yet been able to figure that out, and the ascendance of OPFSPermutedVFS makes fixing it much less important, so I have removed FLOOR from the repo, at least for now.
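To illustrate the "permuted" idea as described above, here's a toy sketch under my reading of the design - the names and details are mine, not the actual OPFSPermutedVFS code: each page is written once to whatever slot in the database file is free, and a page map records the permutation so reads can find it.

```js
const PAGE_SIZE = 4096;
const pageMap = new Map();  // pgno -> slot index in the database file

// Reads consult the map instead of assuming page n sits at offset (n-1)*PAGE_SIZE.
function readPage(handle, pgno, buffer) {
  const slot = pageMap.get(pgno) ?? (pgno - 1);  // unmoved pages are in place
  handle.read(buffer, { at: slot * PAGE_SIZE });
}

// Writes go once, directly into a free slot of the database file; the updated
// map would be published to other connections via the IndexedDB metadata.
function writePage(handle, pgno, data, freeSlots) {
  const slot = freeSlots.pop() ?? handle.getSize() / PAGE_SIZE;
  handle.write(data, { at: slot * PAGE_SIZE });
  pageMap.set(pgno, slot);
}
```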
-
For anyone unfamiliar with the wordplay in this post's subject ("Off the WAL" as in "off the wall"), here's the backstory:
In discussing a prototype OPFS VFS using experimental browser features in Chrome, I speculated:
I was hoping that actual shared-address-space memory was not required, and data could just be merged in/out whenever the lock/unlock/barrier methods were called. After reading through the SQLite WAL source, however, I'm thinking that might not be sufficient. SQLite uses atomic loads and stores on the shared memory buffer that bypass locking, and without understanding this fairly complicated piece of code a lot better than I do, I'm not confident that waiting to synchronize until a method is called is safe.
The web platform does support sharing WebAssembly memory, so that is one implementation path. But going that route would also incur restrictive and annoying security constraints (cross-origin isolation) that I really want to avoid.
How do we work around that? Well, we could patch the SQLite source to make sure it calls us to synchronize when doing those atomic loads and stores. I think that's not as difficult as it sounds, but it's still not something I'm excited about. So instead of digging deeper into workarounds to support WAL, I'm thinking of an approach that is like WAL but can be implemented entirely in a VFS: using batch atomic writes to a companion file.
The idea is to tell SQLite to use batch atomic writes for write transactions, and have the VFS divert those writes to a separate file (essentially a WAL file, possibly with exactly the same header and envelope just to avoid reinventing the wheel). Then the VFS would implement the WAL logic, except for the part using shared memory.
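The hooks for this are real SQLite interfaces: a VFS opts in by reporting SQLITE_IOCAP_BATCH_ATOMIC from xDeviceCharacteristics, and SQLite then brackets eligible write transactions with BEGIN/COMMIT/ROLLBACK_ATOMIC_WRITE file controls. A sketch of the diversion (the constants are real sqlite3.h values; the buffering and appendToLog logic is my own illustration):

```js
const SQLITE_IOCAP_BATCH_ATOMIC = 0x00004000;
const SQLITE_FCNTL_BEGIN_ATOMIC_WRITE = 31;
const SQLITE_FCNTL_COMMIT_ATOMIC_WRITE = 32;
const SQLITE_FCNTL_ROLLBACK_ATOMIC_WRITE = 33;
const SQLITE_OK = 0;
const SQLITE_NOTFOUND = 12;

class BatchAtomicFile {
  batch = null;

  xDeviceCharacteristics() {
    return SQLITE_IOCAP_BATCH_ATOMIC;  // invites SQLite to use batch atomic commit
  }

  xFileControl(op, pArg) {
    switch (op) {
      case SQLITE_FCNTL_BEGIN_ATOMIC_WRITE:
        this.batch = [];  // start collecting the transaction's writes
        return SQLITE_OK;
      case SQLITE_FCNTL_COMMIT_ATOMIC_WRITE:
        this.appendToLog(this.batch);  // hypothetical: append to the companion file
        this.batch = null;
        return SQLITE_OK;
      case SQLITE_FCNTL_ROLLBACK_ATOMIC_WRITE:
        this.batch = null;  // just drop the buffered writes
        return SQLITE_OK;
    }
    return SQLITE_NOTFOUND;
  }

  xWrite(data, offset) {
    if (this.batch) {
      this.batch.push({ offset, data: data.slice() });  // divert; don't touch the database file
      return SQLITE_OK;
    }
    return this.writeThrough(data, offset);  // hypothetical non-batched path
  }
}
```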
What replaces the shared memory part? WAL uses shared memory for two things:
1. providing a hash table for a fast database page locator, and
2. tracking the oldest transaction a reader considers current (which must not be overwritten by a checkpoint).

For page location, I believe that the VFS can manage a separate page locator for each database connection, with updates delivered on each transaction via BroadcastChannel. For tracking which transactions readers consider current, I believe that each reader can acquire a Web Locks API shared lock whose name includes the WAL frame index, and then LockManager.query() can be used to scan for the earliest such lock still held, as sketched below.
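Here's a sketch of that read-mark idea using the Web Locks API (illustrative only; the lock name prefix and the single-mark bookkeeping are mine):

```js
const PREFIX = 'floor-read-mark-';  // hypothetical lock name prefix

// Each reader holds a shared lock whose name encodes the WAL frame index it
// considers current. One mark at a time, for simplicity.
let releaseMark;
function acquireReadMark(frameIndex) {
  return new Promise(acquired => {
    navigator.locks.request(`${PREFIX}${frameIndex}`, { mode: 'shared' }, () => {
      acquired();
      // Hold the lock until releaseMark() is called at end of transaction.
      return new Promise(release => { releaseMark = release; });
    });
  });
}

// A checkpointer scans held locks for the earliest mark; frames at or after
// it must be preserved, while older space can be reused.
async function oldestReadMark() {
  const { held } = await navigator.locks.query();
  const indices = held
    .filter(lock => lock.name.startsWith(PREFIX))
    .map(lock => Number(lock.name.slice(PREFIX.length)));
  return indices.length ? Math.min(...indices) : Infinity;
}
```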
There will be a few corner cases to handle, especially what happens for writes outside a batch atomic write transaction (an easy solution is to perform a full checkpoint first), but those are the basics of the idea. Unless I'm missing something, it doesn't really seem that difficult.
How is this not the same as WAL? One difference is that WAL doesn't use a journal, while a batch atomic transaction does keep an in-memory journal in the page cache (unnecessarily for our purposes), so you won't get the WAL-like behavior if the cache is too small.
What WAL advantages does this retain? OPFS with a rollback journal has higher write transaction overhead, including writing journal data and issuing 4 syncs per transaction. The outlined approach doesn't write a journal and requires 1 sync per transaction (or 0 if reduced durability is acceptable), plus a data copy and 2 syncs per checkpoint. In typical usage this should be faster even with a single connection. With multiple connections, readers don't block a writer and a writer doesn't block readers, except for some checkpointing operations.
Apparently, just when we get new browser features to make SQLite persistent storage easy, I need to find a way to make it complicated again. 😛