This repository has been archived by the owner on Jan 3, 2024. It is now read-only.

rgw/sfs: dbconn: add WAL checkpoints, truncate if over 16MB #213

Merged
4 commits merged on Sep 20, 2023

Conversation

@tserong (Member) commented on Sep 19, 2023

By default, sqlite's automatic checkpoints run when the WAL goes over
~1000 pages (~4MB), and the WAL is recycled, i.e. sqlite overwrites
the WAL from the start again. So in normal operation, we should never
see a WAL > ~4MB. Unfortunately this is not the case when we are
under heavy concurrent write load - we have seen WALs explode out to
several tens of gigabytes, because even though checkpoints are
happening, if we get another write while a checkpoint is ongoing, it
will just append to the WAL. Pile on enough write load and this just
keeps happening. Three or four concurrent runs of the "fill" operation
from the probe tool in https://github.com/giubacc/s3gw-ha-research will
do it.
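
For context, the built-in mechanism described above amounts to the following: `sqlite3_wal_autocheckpoint()` sets the threshold, which defaults to 1000 pages, and at the default 4KB page size that is where the ~4MB figure comes from. A minimal illustration (not code from this PR):

```cpp
#include <sqlite3.h>

// SQLite's default behaviour: after any commit that leaves the WAL with at
// least 1000 pages, run a PASSIVE checkpoint. Setting it explicitly would
// look like this.
void use_default_autocheckpoint(sqlite3* db) {
  sqlite3_wal_autocheckpoint(db, 1000);
}
```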

Once the heavy write load stops, the WAL remains enormous because
the default checkpoints never truncate the file, even though it'll
go back to recycling the first 4MB or so when under normal load.
I tried setting `PRAGMA journal_size_limit = 4194304;` (4MB) and that
kinda works, i.e. if the write load drops off, the automatic checkpoint
will later truncate the file to 4MB. But this doesn't help to stop the
initial size explosion, so it's not actually a solution (sustained
heavy write load could still run the system out of disk space).

My fix here is to use `sqlite3_wal_hook()` to add our own callback
in place of the default checkpoint mechanism. This implementation will
try to behave the same as the default, i.e. it will do a passive
checkpoint once the WAL goes over 1000 pages (~4MB), but in addition
if the WAL goes over 4000 pages (~16MB), it will attempt a truncating
checkpoint to bring the WAL back to journal_size_limit (4MB).
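
For illustration, a minimal sketch of the approach, assuming thresholds matching the numbers above (this is not the actual dbconn.cc code, just the shape of it):

```cpp
#include <sqlite3.h>

namespace {

constexpr int kPassivePages = 1000;   // ~4MB at the default 4KB page size
constexpr int kTruncatePages = 4000;  // ~16MB

// Called by SQLite after each transaction commits to the WAL, with the
// current WAL size in pages.
int wal_hook(void* /*arg*/, sqlite3* db, const char* db_name, int wal_pages) {
  if (wal_pages >= kTruncatePages) {
    // Attempt a truncating checkpoint to shrink the WAL back down. This can
    // fail with SQLITE_BUSY if a writer holds the database locked, in which
    // case we just try again after a later commit.
    sqlite3_wal_checkpoint_v2(
        db, db_name, SQLITE_CHECKPOINT_TRUNCATE, nullptr, nullptr);
  } else if (wal_pages >= kPassivePages) {
    // Mirror SQLite's default behaviour: passive checkpoint past ~1000 pages.
    sqlite3_wal_checkpoint_v2(
        db, db_name, SQLITE_CHECKPOINT_PASSIVE, nullptr, nullptr);
  }
  return SQLITE_OK;
}

}  // namespace

void install_wal_hook(sqlite3* db) {
  // Registering our own hook replaces (and disables) the built-in
  // auto-checkpoint mechanism.
  sqlite3_wal_hook(db, wal_hook, nullptr);
  // Cap what a checkpoint leaves behind in the WAL file (4MB).
  sqlite3_exec(db, "PRAGMA journal_size_limit = 4194304;",
               nullptr, nullptr, nullptr);
}
```

Note that registering a callback with `sqlite3_wal_hook()` disables the automatic checkpoint mechanism, which is why the passive branch has to re-create the default behaviour itself.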

Under test with 4-8 concurrent writers, sometimes the truncating
checkpoints succeed, and sometimes they fail because the database is
locked by a writer. But, they succeed often enough that the WAL
seems to reliably float between 4 and 16MB (as observed by running
`watch ls -lh s3gw.db*`) rather than getting crazy big.

The limits I've chosen here are somewhat arbitrary. The 4MB for
passive checkpoints is because that's what sqlite's default checkpoint
does. The 16MB for truncating checkpoints is just so we have some
significantly larger point at which we do the more expensive truncate.
Possibly these values should be made configurable, which would allow
benchmarking to be done to test different values under different
types of load (see https://github.com/aquarist-labs/s3gw/issues/681).

I should also mention I came across "wal2" mode, which solves the
problem a different way, by having two WAL files:

https://www.sqlite.org/cgi/src/doc/wal2/doc/wal2.md

Alas, this is not in mainline sqlite, and I have no idea when it will
be, so we can't use it. The most recent forum post I could find on
this topic just says it's "currently limited to various non-trunk
branches of the source tree":

https://sqlite.org/forum/forumpost/6f787ade450364bdc5f7515a50d9afc2007edf45fe305ea0d0c9420d7bafa5e6

We should probably look at wal2 in future though once it lands.

(NOTE: this PR also moves the DBConn constructor from dbconn.h to dbconn.cc, because I got irritated by compile times while messing with this stuff)

I'm doing this because every time I tweak something in the DBConn
constructor when fiddling with SQLite, I have to recompile a whole
lot of unrelated stuff because dbconn.h changed.  This is very
annoying.

Signed-off-by: Tim Serong <[email protected]>
@tserong (Member, Author) commented on Sep 19, 2023

One other thing to consider: I've been testing a change to have s3gw use only one connection to the SQLite database, rather than one per worker thread like we do now (#209). With that change applied instead of this one, the WAL never goes over ~5MB even under heavy write load.

@giubacc commented on Sep 19, 2023

Thanks for the explanation of the patch, it's really clear.
The patch looks good to me; my only suggestion is to add parameters to leave a bit of flexibility over the values involved (I'd add those directly in this PR).
Can we unit-test this patch somehow, maybe with a tight loop calling the SFS functions to put objects?

@tserong (Member, Author) commented on Sep 19, 2023

> Thanks for the explanation of the patch, it's really clear. The patch looks good to me; my only suggestion is to add parameters to leave a bit of flexibility over the values involved (I'd add those directly in this PR).

Cool, will do.

> Can we unit-test this patch somehow, maybe with a tight loop calling the SFS functions to put objects?

Probably. I'll see what I can come up with :-)
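
For illustration, roughly the shape such a test could take; everything here is hypothetical (the table name, the threshold, and the `install_wal_hook()` helper from the sketch above), and the tests that actually landed in this PR are different in that they exercise the real SFS checkpoint mechanism:

```cpp
#include <cstdint>
#include <filesystem>
#include <string>
#include <system_error>

#include <gtest/gtest.h>
#include <sqlite3.h>

namespace {

std::uintmax_t wal_size(const std::string& db_path) {
  std::error_code ec;
  const std::uintmax_t size = std::filesystem::file_size(db_path + "-wal", ec);
  return ec ? 0 : size;
}

}  // namespace

TEST(WALCheckpointSketch, WALStaysBounded) {
  sqlite3* db = nullptr;
  ASSERT_EQ(SQLITE_OK, sqlite3_open("sketch.db", &db));
  sqlite3_exec(db,
               "PRAGMA journal_mode=WAL;"
               "CREATE TABLE IF NOT EXISTS objects(id INTEGER, data BLOB);",
               nullptr, nullptr, nullptr);
  // install_wal_hook(db);  // hypothetical helper from the earlier sketch

  // Tight write loop standing in for put-object calls (the real WAL
  // explosion needs several concurrent writers, so a fully faithful test
  // would spawn multiple writer threads here).
  for (int i = 0; i < 10000; ++i) {
    sqlite3_exec(db, "INSERT INTO objects VALUES (1, randomblob(4096));",
                 nullptr, nullptr, nullptr);
  }

  // With the truncating checkpoint installed, the WAL should stay far below
  // the multi-gigabyte sizes seen with unbounded growth.
  EXPECT_LT(wal_size("sketch.db"), std::uintmax_t(32) * 1024 * 1024);

  sqlite3_close(db);
}
```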

@tserong force-pushed the wip-wal-checkpoint branch 2 times, most recently from bca20ff to 6c6f5c5 on September 20, 2023 07:05
@giubacc left a review comment

LGTM, just a small suggestion in `rgw.yaml.in`

From the new test added in this PR:

```cpp
// If this test ever *fails* it means the WAL growth problem
// has been unexpectedly fixed by some other change that
// doesn't involve our SFS checkpoint mechanism.
TEST_F(TestSFSWALCheckpoint, confirm_wal_explosion) {
```
smart test ;) 👍

@tserong (Member, Author):

thanks :-)

From `rgw.yaml.in`:

```
is 4194304, i.e. 4MB. Set this to -1 for no limit (which is SQLite's
default).
service:
- rgw
```
suggestion: add a blank line at the end :)

This adds two tests, one to confirm that we have a problem with WAL
growth, and the second to verify that we've fixed it.

Signed-off-by: Tim Serong <[email protected]>
@0xavi0 left a review comment

LGTM
