-
Notifications
You must be signed in to change notification settings - Fork 10
rgw/sfs: dbconn: add WAL checkpoints, truncate if over 16MB #213
Conversation
I'm doing this because every time I tweak something in the DBConn constructor when fiddling with SQLite, I have to recompile a whole lot of unrelated stuff because dbconn.h changed. This is very annoying. Signed-off-by: Tim Serong <[email protected]>
By default, sqlite's automatic checkpoints run when the WAL goes over ~1000 pages (~4MB), and the WAL is recycled, i.e. sqlite overwrites the WAL from the start again. So in normal operation, we should never see a WAL > ~4MB. Unfortunately this is not the case when we are under heavy concurrent write load - we have seen WALs explode out to several tens of gigabytes, because even though checkpoints are happening, if we get another write while a checkpoint is ongoing, it will just append to the WAL. Pile on enough write load and this just keeps happening. Three or four concurrent runs of the "fill" operation from the probe tool in https://github.com/giubacc/s3gw-ha-research will do it. Once the heavy write load stops, the WAL remains enormous because the default checkpoints never truncate the file, even though it'll go back to recycling the first 4MB or so when under normal load. I tried setting `PRAGMA journal_size_limit = 4194304;` (4MB) and that kinda works, i.e. if the write load drops off, the automatic checkpoint will later trunate the file to 4MB. But this doesn't help to stop the initial size explosion, so it's not actually a solution (sustained heavy write load could still run the system out of disk space). My fix here is to use sqlite3_wal_hook() to add our own callback in place of the default checkpoint mechanism. This implementation will try to behave the same as the default, i.e. it will do a passive checkpoint once the WAL goes over 1000 pages (~4MB), but in addition if the WAL goes over 4000 pages (~16MB), it will attempt a truncating checkpoint to bring the WAL back to journal_size_limit (4MB). Under test with 4-8 concurrent writers, sometimes the truncating checkpoints succeed, and sometimes they fail because the database is locked by a writer. But, they succeed often enough that the WAL seems to reliably float between 4 and 16MB (as observed by running `watch ls -lh s3gw.db*`) rather than getting crazy big. The limits I've chosen here are somewhat arbitrary. The 4MB for passive checkpoints is because that what sqlite's default checkpoint does. The 16MB for truncating checkpoints is just so we have some significantly larger point at which we do the more expensive truncate. Possibly these values should be made configurable, which would allow benchmarking to be done to test different values under different types of load (see https://github.com/aquarist-labs/s3gw/issues/681). I should also mention I came across "wal2" mode, which solves the problem a different way, by having two WAL files: https://www.sqlite.org/cgi/src/doc/wal2/doc/wal2.md Alas, this is not in mainline sqlite, and I have no idea when it will be, so we can't use it. The most recent forum post I could find on this topic just says it's "currently limited to various non-trunk branches of the source tree": https://sqlite.org/forum/forumpost/6f787ade450364bdc5f7515a50d9afc2007edf45fe305ea0d0c9420d7bafa5e6 We should probably look at wal2 in future though once it lands. Signed-off-by: Tim Serong <[email protected]>
f4e31fb
to
2687437
Compare
One other thing to consider: I've been testing a change to have s3gw use only one connection to the SQLite database, rather than one per worker thread like we do now (#209). With that change applied instead of this one, the WAL never goes over ~5MB even under heavy write load. |
Thanks for the patch's explanation, really clear. |
Cool, will do.
Probably. I'll see what I can come up with :-) |
bca20ff
to
6c6f5c5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, just a small suggestion in rgw.yaml.in
// If this test ever *fails* it means the WAL growth problem | ||
// has been unexpectedly fixed by some other change that | ||
// doesn't involve our SFS checkpoint mechanism. | ||
TEST_F(TestSFSWALCheckpoint, confirm_wal_explosion) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
smart test ;) 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks :-)
src/common/options/rgw.yaml.in
Outdated
is 4194304, i.e. 4MB. Set this to -1 for no limit (which is SQLite's | ||
default). | ||
service: | ||
- rgw |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: add a blank line at the end :)
Signed-off-by: Tim Serong <[email protected]>
This adds two tests, one to confirm that we have a problem with WAL growth, and the second to verify that we've fixed it. Signed-off-by: Tim Serong <[email protected]>
6c6f5c5
to
9e3096e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
By default, sqlite's automatic checkpoints run when the WAL goes over
~1000 pages (~4MB), and the WAL is recycled, i.e. sqlite overwrites
the WAL from the start again. So in normal operation, we should never
see a WAL > ~4MB. Unfortunately this is not the case when we are
under heavy concurrent write load - we have seen WALs explode out to
several tens of gigabytes, because even though checkpoints are
happening, if we get another write while a checkpoint is ongoing, it
will just append to the WAL. Pile on enough write load and this just
keeps happening. Three or four concurrent runs of the "fill" operation
from the probe tool in https://github.com/giubacc/s3gw-ha-research will
do it.
Once the heavy write load stops, the WAL remains enormous because
the default checkpoints never truncate the file, even though it'll
go back to recycling the first 4MB or so when under normal load.
I tried setting
PRAGMA journal_size_limit = 4194304;
(4MB) and thatkinda works, i.e. if the write load drops off, the automatic checkpoint
will later trunate the file to 4MB. But this doesn't help to stop the
initial size explosion, so it's not actually a solution (sustained
heavy write load could still run the system out of disk space).
My fix here is to use sqlite3_wal_hook() to add our own callback
in place of the default checkpoint mechanism. This implementation will
try to behave the same as the default, i.e. it will do a passive
checkpoint once the WAL goes over 1000 pages (~4MB), but in addition
if the WAL goes over 4000 pages (~16MB), it will attempt a truncating
checkpoint to bring the WAL back to journal_size_limit (4MB).
Under test with 4-8 concurrent writers, sometimes the truncating
checkpoints succeed, and sometimes they fail because the database is
locked by a writer. But, they succeed often enough that the WAL
seems to reliably float between 4 and 16MB (as observed by running
watch ls -lh s3gw.db*
) rather than getting crazy big.The limits I've chosen here are somewhat arbitrary. The 4MB for
passive checkpoints is because that what sqlite's default checkpoint
does. The 16MB for truncating checkpoints is just so we have some
significantly larger point at which we do the more expensive truncate.
Possibly these values should be made configurable, which would allow
benchmarking to be done to test different values under different
types of load (see https://github.com/aquarist-labs/s3gw/issues/681).
I should also mention I came across "wal2" mode, which solves the
problem a different way, by having two WAL files:
https://www.sqlite.org/cgi/src/doc/wal2/doc/wal2.md
Alas, this is not in mainline sqlite, and I have no idea when it will
be, so we can't use it. The most recent forum post I could find on
this topic just says it's "currently limited to various non-trunk
branches of the source tree":
https://sqlite.org/forum/forumpost/6f787ade450364bdc5f7515a50d9afc2007edf45fe305ea0d0c9420d7bafa5e6
We should probably look at wal2 in future though once it lands.
(NOTE: this PR also moves the DBConn constructor from dbconn.h to dbconn.cc, because I got irritated by compile times while messing with this stuff)