This repository has been archived by the owner on Jan 3, 2024. It is now read-only.

rgw/sfs: dbconn: add WAL checkpoints, truncate if over 16MB #213

Merged
4 commits merged on Sep 20, 2023

Conversation

@tserong (Member) commented on Sep 19, 2023

By default, sqlite's automatic checkpoints run when the WAL goes over
~1000 pages (~4MB), and the WAL is recycled, i.e. sqlite overwrites
the WAL from the start again. So in normal operation, we should never
see a WAL > ~4MB. Unfortunately this is not the case when we are
under heavy concurrent write load - we have seen WALs explode out to
several tens of gigabytes, because even though checkpoints are
happening, if we get another write while a checkpoint is ongoing, it
will just append to the WAL. Pile on enough write load and this just
keeps happening. Three or four concurrent runs of the "fill" operation
from the probe tool in https://github.com/giubacc/s3gw-ha-research will
do it.
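
For context, the built-in mechanism described above amounts to the following: `sqlite3_wal_autocheckpoint()` sets the threshold, which defaults to 1000 pages, and at the default 4KB page size that is where the ~4MB figure comes from. A minimal illustration (not code from this PR):

```cpp
#include <sqlite3.h>

// SQLite's default behaviour: after any commit that leaves the WAL with at
// least 1000 pages, run a PASSIVE checkpoint. Setting it explicitly would
// look like this.
void use_default_autocheckpoint(sqlite3* db) {
  sqlite3_wal_autocheckpoint(db, 1000);
}
```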

Once the heavy write load stops, the WAL remains enormous because
the default checkpoints never truncate the file, even though it'll
go back to recycling the first 4MB or so when under normal load.
I tried setting `PRAGMA journal_size_limit = 4194304;` (4MB) and that
kinda works, i.e. if the write load drops off, the automatic checkpoint
will later truncate the file to 4MB. But this doesn't help to stop the
initial size explosion, so it's not actually a solution (sustained
heavy write load could still run the system out of disk space).

My fix here is to use `sqlite3_wal_hook()` to add our own callback
in place of the default checkpoint mechanism. This implementation will
try to behave the same as the default, i.e. it will do a passive
checkpoint once the WAL goes over 1000 pages (~4MB), but in addition
if the WAL goes over 4000 pages (~16MB), it will attempt a truncating
checkpoint to bring the WAL back to journal_size_limit (4MB).
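
For illustration, a minimal sketch of the approach, assuming thresholds matching the numbers above (this is not the actual dbconn.cc code, just the shape of it):

```cpp
#include <sqlite3.h>

namespace {

constexpr int kPassivePages = 1000;   // ~4MB at the default 4KB page size
constexpr int kTruncatePages = 4000;  // ~16MB

// Called by SQLite after each transaction commits to the WAL, with the
// current WAL size in pages.
int wal_hook(void* /*arg*/, sqlite3* db, const char* db_name, int wal_pages) {
  if (wal_pages >= kTruncatePages) {
    // Attempt a truncating checkpoint to shrink the WAL back down. This can
    // fail with SQLITE_BUSY if a writer holds the database locked, in which
    // case we just try again after a later commit.
    sqlite3_wal_checkpoint_v2(
        db, db_name, SQLITE_CHECKPOINT_TRUNCATE, nullptr, nullptr);
  } else if (wal_pages >= kPassivePages) {
    // Mirror SQLite's default behaviour: passive checkpoint past ~1000 pages.
    sqlite3_wal_checkpoint_v2(
        db, db_name, SQLITE_CHECKPOINT_PASSIVE, nullptr, nullptr);
  }
  return SQLITE_OK;
}

}  // namespace

void install_wal_hook(sqlite3* db) {
  // Registering our own hook replaces (and disables) the built-in
  // auto-checkpoint mechanism.
  sqlite3_wal_hook(db, wal_hook, nullptr);
  // Cap what a checkpoint leaves behind in the WAL file (4MB).
  sqlite3_exec(db, "PRAGMA journal_size_limit = 4194304;",
               nullptr, nullptr, nullptr);
}
```

Note that registering a callback with `sqlite3_wal_hook()` disables the automatic checkpoint mechanism, which is why the passive branch has to re-create the default behaviour itself.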

Under test with 4-8 concurrent writers, sometimes the truncating
checkpoints succeed, and sometimes they fail because the database is
locked by a writer. But, they succeed often enough that the WAL
seems to reliably float between 4 and 16MB (as observed by running
`watch ls -lh s3gw.db*`) rather than getting crazy big.

The limits I've chosen here are somewhat arbitrary. The 4MB for
passive checkpoints is because that's what sqlite's default checkpoint
does. The 16MB for truncating checkpoints is just so we have some
significantly larger point at which we do the more expensive truncate.
Possibly these values should be made configurable, which would allow
benchmarking to be done to test different values under different
types of load (see https://github.com/aquarist-labs/s3gw/issues/681).

I should also mention I came across "wal2" mode, which solves the
problem a different way, by having two WAL files:

https://www.sqlite.org/cgi/src/doc/wal2/doc/wal2.md

Alas, this is not in mainline sqlite, and I have no idea when it will
be, so we can't use it. The most recent forum post I could find on
this topic just says it's "currently limited to various non-trunk
branches of the source tree":

https://sqlite.org/forum/forumpost/6f787ade450364bdc5f7515a50d9afc2007edf45fe305ea0d0c9420d7bafa5e6

We should probably look at wal2 in future though once it lands.

(NOTE: this PR also moves the DBConn constructor from dbconn.h to dbconn.cc, because I got irritated by compile times while messing with this stuff)

I'm doing this because every time I tweak something in the DBConn
constructor when fiddling with SQLite, I have to recompile a whole
lot of unrelated stuff because dbconn.h changed.  This is very
annoying.

Signed-off-by: Tim Serong <[email protected]>
@tserong (Member, Author) commented on Sep 19, 2023

One other thing to consider: I've been testing a change to have s3gw use only one connection to the SQLite database, rather than one per worker thread like we do now (#209). With that change applied instead of this one, the WAL never goes over ~5MB even under heavy write load.

@giubacc commented on Sep 19, 2023

Thanks for the explanation of the patch, it's really clear.
The patch looks good to me; my only suggestion is to add parameters to leave a bit of flexibility over the values involved (I'd add those directly in this PR).
Can we unit-test this patch somehow, maybe with a tight loop calling the SFS functions to put objects?

@tserong (Member, Author) commented on Sep 19, 2023

> Thanks for the explanation of the patch, it's really clear. The patch looks good to me; my only suggestion is to add parameters to leave a bit of flexibility over the values involved (I'd add those directly in this PR).

Cool, will do.

> Can we unit-test this patch somehow, maybe with a tight loop calling the SFS functions to put objects?

Probably. I'll see what I can come up with :-)
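
For illustration, roughly the shape such a test could take; everything here is hypothetical (the table name, the threshold, and the `install_wal_hook()` helper from the sketch above), and the tests that actually landed in this PR are different in that they exercise the real SFS checkpoint mechanism:

```cpp
#include <cstdint>
#include <filesystem>
#include <string>
#include <system_error>

#include <gtest/gtest.h>
#include <sqlite3.h>

namespace {

std::uintmax_t wal_size(const std::string& db_path) {
  std::error_code ec;
  const std::uintmax_t size = std::filesystem::file_size(db_path + "-wal", ec);
  return ec ? 0 : size;
}

}  // namespace

TEST(WALCheckpointSketch, WALStaysBounded) {
  sqlite3* db = nullptr;
  ASSERT_EQ(SQLITE_OK, sqlite3_open("sketch.db", &db));
  sqlite3_exec(db,
               "PRAGMA journal_mode=WAL;"
               "CREATE TABLE IF NOT EXISTS objects(id INTEGER, data BLOB);",
               nullptr, nullptr, nullptr);
  // install_wal_hook(db);  // hypothetical helper from the earlier sketch

  // Tight write loop standing in for put-object calls (the real WAL
  // explosion needs several concurrent writers, so a fully faithful test
  // would spawn multiple writer threads here).
  for (int i = 0; i < 10000; ++i) {
    sqlite3_exec(db, "INSERT INTO objects VALUES (1, randomblob(4096));",
                 nullptr, nullptr, nullptr);
  }

  // With the truncating checkpoint installed, the WAL should stay far below
  // the multi-gigabyte sizes seen with unbounded growth.
  EXPECT_LT(wal_size("sketch.db"), std::uintmax_t(32) * 1024 * 1024);

  sqlite3_close(db);
}
```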

@tserong force-pushed the wip-wal-checkpoint branch 2 times, most recently from bca20ff to 6c6f5c5 on September 20, 2023 07:05
@giubacc left a review comment

LGTM, just a small suggestion in `rgw.yaml.in`

From the new test added in this PR:

```cpp
// If this test ever *fails* it means the WAL growth problem
// has been unexpectedly fixed by some other change that
// doesn't involve our SFS checkpoint mechanism.
TEST_F(TestSFSWALCheckpoint, confirm_wal_explosion) {
```
smart test ;) 👍

@tserong (Member, Author):

thanks :-)

From `rgw.yaml.in`:

```
is 4194304, i.e. 4MB. Set this to -1 for no limit (which is SQLite's
default).
service:
- rgw
```
suggestion: add a blank line at the end :)

This adds two tests, one to confirm that we have a problem with WAL
growth, and the second to verify that we've fixed it.

Signed-off-by: Tim Serong <[email protected]>
@0xavi0 left a review comment

LGTM
