-
Notifications
You must be signed in to change notification settings - Fork 21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rgw/sfs: abstract how we obtain connections, potentially change between threading models #727
Comments
In the interest of testing that, I applied Marcel's microbenchmark code (irq0/ceph@54ce922#diff-521007125cbd4106a8fc8043757557c75c013b766dee299df2df34824eb98b2a) to the current s3gw branch, and ran this on my (relatively old) 8 core desktop:
I also did the same with the single connection:
And the pool:
Observations:
I assume I'm not getting as good a result with the pool as Marcel, given I've got 8 cores and he's got 32. Also interesting to note that of the above three, the single connection test is the only one that doesn't see any retries or failures. Then I scaled the microbenchmarks down from one million to one hundred thousand queries, to cause my system to suffer a bit less. Here's what I got:
Note the fail and retry counts are lower here, and I'm starting to see the advantage of the pool over the single connection. TL;DR: the current implementation with it's gadzillion open/close calls really is noticeably slower than both the single connection and the pool. So I think this is definitely worth pursuing further. @irq0, if you get a chance, would you mind running those microbenchmarks against an otherwise unmodified s3gw v0.21 to see how it behaves with your 32 cores? I'm curious to see if you have a similar experience. (I should also mention, when I ran all the above, the sqlite database was on a tmpfs, to avoid getting disk bound on my crappy old SSD) |
Both modes differ in a per connection mutex. So if we go pooled we have a mostly (? - maybe async request handling comes in here) uncontended mutex. Not sure that accounts for the 8% or just margin of error. I think serialized is a good start that we can iterate on. |
The results are all for the create_new_object test? Can you check the get_random_object one? Database writes are serialized and the single threaded result data is dominated by queue time. In the copy case the copies likely add up. The get_random_object is reads only and exercises all the read concurrency we can get from SQLite. Curious about the data from your system. |
Here we go. get_random_object with 100K queries:
I also tried with 1 million queries, and the rates remained it the same ballpark, it just all took 10x longer (s3gw v0.21: 260.238837/s, single connection: 1470.839143/s, pool: 6457.070169/s). |
Related to aquarist-labs/ceph#209 |
Tentatively pushing to v0.23.0 |
Currently, `DBConn` keeps an instance of `Storage`, which is created by `sqlite_orm::make_storage()`. That first instance of `Storage` is long-lived (the `DBConn` c'tor calls `storage->open_forever()`) so the sqlite database is open for the entire life of the program, but this first `Storage` object and its associated sqlite database connection pointer are largely not used for anything much after initialization. The exception is the SFS status page, which also uses this connection to report some sqlite statistics. All the other threads (the workers, the garbage collector, ...) call `DBConn::get_storage()`, which returns a copy of the first `Storage` object. These copies don't have `open_forever()` called on them, which means every time they're used for queries we get a pair of calls to `sqlite3_open()` and `sqlite3_close()` at the start end end of the query. These calls don't open the main database file again (it's already open) but they do open and close the WAL. There's a couple of problems with this. One is that the SFS status page only sees the main thread (which is largely boring), and can't track any of the worker threads. The other problem is that something about not keeping the connection open on the worker threads is relatively expensive. If we keep connections open rather than opening and closing with every query, we can get something like a 20x speed increase on read queries, and at least 2x on writes. This new implementation gives one `Storage` object per thread, created on demand as a copy of the first `Storage` object created in the `DBConn` constructor. Fixes: https://github.com/aquarist-labs/s3gw/issues/727 Signed-off-by: Tim Serong <[email protected]>
Currently, `DBConn` keeps an instance of `Storage`, which is created by `sqlite_orm::make_storage()`. That first instance of `Storage` is long-lived (the `DBConn` c'tor calls `storage->open_forever()`) so the sqlite database is open for the entire life of the program, but this first `Storage` object and its associated sqlite database connection pointer are largely not used for anything much after initialization. The exception is the SFS status page, which also uses this connection to report some sqlite statistics. All the other threads (the workers, the garbage collector, ...) call `DBConn::get_storage()`, which returns a copy of the first `Storage` object. These copies don't have `open_forever()` called on them, which means every time they're used for queries we get a pair of calls to `sqlite3_open()` and `sqlite3_close()` at the start end end of the query. These calls don't open the main database file again (it's already open) but they do open and close the WAL. There's a couple of problems with this. One is that the SFS status page only sees the main thread (which is largely boring), and can't track any of the worker threads. The other problem is that something about not keeping the connection open on the worker threads is relatively expensive. If we keep connections open rather than opening and closing with every query, we can get something like a 20x speed increase on read queries, and at least 2x on writes. This new implementation gives one `Storage` object per thread, created on demand as a copy of the first `Storage` object created in the `DBConn` constructor. Fixes: https://github.com/aquarist-labs/s3gw/issues/727 Signed-off-by: Tim Serong <[email protected]>
Currently, `DBConn` keeps an instance of `Storage`, which is created by `sqlite_orm::make_storage()`. That first instance of `Storage` is long-lived (the `DBConn` c'tor calls `storage->open_forever()`) so the sqlite database is open for the entire life of the program, but this first `Storage` object and its associated sqlite database connection pointer are largely not used for anything much after initialization. The exception is the SFS status page, which also uses this connection to report some sqlite statistics. All the other threads (the workers, the garbage collector, ...) call `DBConn::get_storage()`, which returns a copy of the first `Storage` object. These copies don't have `open_forever()` called on them, which means every time they're used for queries we get a pair of calls to `sqlite3_open()` and `sqlite3_close()` at the start end end of the query. These calls don't open the main database file again (it's already open) but they do open and close the WAL. There's a couple of problems with this. One is that the SFS status page only sees the main thread (which is largely boring), and can't track any of the worker threads. The other problem is that something about not keeping the connection open on the worker threads is relatively expensive. If we keep connections open rather than opening and closing with every query, we can get something like a 20x speed increase on read queries, and at least 2x on writes. This new implementation gives one `Storage` object per thread, created on demand as a copy of the first `Storage` object created in the `DBConn` constructor. Fixes: https://github.com/aquarist-labs/s3gw/issues/727 Signed-off-by: Tim Serong <[email protected]>
Currently, `DBConn` keeps an instance of `Storage`, which is created by `sqlite_orm::make_storage()`. That first instance of `Storage` is long-lived (the `DBConn` c'tor calls `storage->open_forever()`) so the sqlite database is open for the entire life of the program, but this first `Storage` object and its associated sqlite database connection pointer are largely not used for anything much after initialization. The exception is the SFS status page, which also uses this connection to report some sqlite statistics. All the other threads (the workers, the garbage collector, ...) call `DBConn::get_storage()`, which returns a copy of the first `Storage` object. These copies don't have `open_forever()` called on them, which means every time they're used for queries we get a pair of calls to `sqlite3_open()` and `sqlite3_close()` at the start end end of the query. These calls don't open the main database file again (it's already open) but they do open and close the WAL. There's a couple of problems with this. One is that the SFS status page only sees the main thread (which is largely boring), and can't track any of the worker threads. The other problem is that something about not keeping the connection open on the worker threads is relatively expensive. If we keep connections open rather than opening and closing with every query, we can get something like a 20x speed increase on read queries, and at least 2x on writes. This new implementation gives one `Storage` object per thread, created on demand as a copy of the first `Storage` object created in the `DBConn` constructor. Fixes: https://github.com/aquarist-labs/s3gw/issues/727 Signed-off-by: Tim Serong <[email protected]>
Currently,
DBConn
keeps an instance ofStorage
, which is created bysqlite_orm::make_storage()
. That first instance ofStorage
is long-lived (the DBConn c'tor callsstorage->open_forever()
) so the sqlite database is open for the entire life of the program, but this firstStorage
object and its associated sqlite database connection pointer are largely not used for anything much after initialization. The exception is the SFS status page, which also uses this connection to report some sqlite statistics.All the other threads (the workers, the garbage collector, ...) call
DBConn::get_storage()
, which returns a copy of the firstStorage
object. These copies don't haveopen_forever()
called on them, which means every time they're used for queries we get a pair of calls tosqlite3_open()
andsqlite3_close()
at the start end end of the query. These calls don't open the main database file again (it's already open) but they do open and close the WAL. An easy way to demonstrate this is:--debug-rgw 20 --rgw-sfs-sqlite-profile
, and tail its log, grepping for "SQLITE PROFILE".strace -p $(pgrep radosgw) -f 2>&1 | grep 'openat(\|close('
s3cmd mb -P s3://foo
strace
shows matching pairs ofopenat()
andclose()
fors3gw.db-wal
, one per query.(Note: I have no idea if there's any noticeable performance impact from all those opens and closes - they might be fine, or they might not.)
This all mostly works just fine, but I (Tim) thought it was weird that we had all these connections opening and closing all the time. I later discovered there are two problems with the current implementation:
sqlite3_db_status()
, to which we pass the pointer to the first database connection. This means we don't see any of the activity on any of the worker threads when looking at stats like cache hits and misses and malloc counts. Maybe we don't actually care about these stats (I don't know) but in any case they're wrong/misleading given they don't pick up any of the worker activity.One way to fix the above two problems is to switch to using one database connection which is shared by all the threads (see ramblings in aquarist-labs/ceph#209). This makes the stats accurate, and causes the WAL growth problem to not happen at all. Unfortunately @irq0 found that a single connection can be about 20x slower than multiple connections when he did some microbenchmarks:
So we really do want multiple connections.
SQLite has two threading modes that we care about, multithreaded and serialized (see https://www.sqlite.org/threadsafe.html). The default is serialized, but from the above it looks like multithreaded might be about 8% faster, so is probably worth experimenting with.
We've already effectively dealt with the WAL growth, so it's no problem to continue to use multiple conections from that perspective. That leaves stats gathering (assuming we care about this). If we switch to using a connection pool, e.g. one
Storage
object per thread, and callopen_forever()
on each of those (and keep copies of their connection pointers) we could later interrogate each one to get reliable stats.Note that multithreaded mode requires that "no single database connection nor any object derived from database connection, such as a prepared statement, is used in two or more threads at the same time". A connection pool as suggested above should ensure this is met (although I suspect that our current implementation probably doesn't run afoul of this restriction given each thread has its own separate copy of
Storage
).By default, radosgw has 512 worker threads, which means that in the connection pool scenario outlined above we pretty quickly end up with 512
Storage
objects, even under light but repeated load (e.g. a single client doing a bunch of reads). This means we easily go over 1024 open file descriptors as each thread has the DB file and the WAL open. I doubt this is a problem in an actual k8s environment (testing on k3s shows an FD limit of 1048576) but you will trip over it in manual testing on baremetal whereulimit -n
is probably still 1024.We have a couple of ways forward here:
Get rid of the per-db-connection stats reporting (because it's currently lies), and do nothing else. Don't bother with a connection pool, don't change threading models. Everything else already works fine, or is fine enough.
Rejig
DBConn::get_storage()
to enable some sort of (possibly configurable) connection pool along the lines of the above, which would allow accurate stats, and possibly safer experimentation with sqlite's multithreaded mode. We've already done a bunch of work in this direction, and I (Tim) am happy to continue if it seems useful, but don't mind dropping it if it's ultimately not going to be worthwhile.The text was updated successfully, but these errors were encountered: