This repository has been archived by the owner on Jan 3, 2024. It is now read-only.

[experimental] rgw/sfs: use only one sqlite database connection #209

Closed

Conversation

tserong
Member

@tserong tserong commented Sep 13, 2023

Currently there is one long-lived connection to the sqlite database, but many other threads also use their own connections (for some analysis of this, see #201).

This PR changes all the copies of the storage object to references, which means we're actually only using one db connection now. It's a bit irritating to do this, because it's way too easy to accidentally make a copy if you leave the '&' off :-/ I'd really like to somehow disable copy construction of the Storage object, but I haven't figured out how to do that yet.

One interesting effect of this change is that, prior to this commit, the SFS status page only showed SQLite stats for the connection from the status thread, which is not overly helpful. With this commit (all threads using the same connection), the figures from the SQLite stats will actually change over time while s3gw is being used.

Note that as we're building with -DSQLITE_THREADSAFE=1, i.e. we're using Serialized mode, it's totally cool to have one connection shared by multiple threads (see https://www.sqlite.org/threadsafe.html).

@tserong tserong marked this pull request as draft September 13, 2023 07:28
@tserong tserong added the ci/build-s3gw-image Build and push a container image label Sep 13, 2023
@github-actions

📦 quay.io/s3gw/s3gw:pr-6002dde8ea5e72212a3253ded2e53afce53bc394-6169337814-1 https://quay.io/repository/s3gw/s3gw?tab=tags&tag=pr-6002dde8ea5e72212a3253ded2e53afce53bc394-6169337814-1

@tserong
Member Author

tserong commented Sep 13, 2023

Yeah, OK, this ain't gonna work (I know you've already been through this @0xavi0, but I was in the area and wanted to experiment a bit for the sake of my own understanding of how all the pieces fit).

TL;DR: if everything uses the same connection, we get a bunch of [SQLITE] (1) statement aborts at 1: [BEGIN TRANSACTION] cannot start a transaction within a transaction. IIUC this is not a problem with sqlite and threads per se, but rather with using transactions from multiple threads: if those threads share the same connection, you can hit races on transactions. I guess this could maybe be solved with a mutex or something around the transactions. Not sure if that's actually worth investigating.

@tserong
Member Author

tserong commented Sep 14, 2023

I guess this could maybe be solved with a mutex or something around the transactions. Not sure if that's actually worth investigating.

Eh, while I'm here, why not...? ;-)

@tserong
Member Author

tserong commented Sep 14, 2023

OK, that's actually a lot better with the mutex! No more [SQLITE] (1) statement aborts at 1: [BEGIN TRANSACTION] cannot start a transaction within a transaction. Now all we have is a few file unlinked while open: /tmp/FH912-6h-OWZfi1GEgqhMdE/s3gw.db, which I believe is an artifact of the way the test is written.

@tserong tserong added ci/build-s3gw-image Build and push a container image and removed ci/build-s3gw-image Build and push a container image labels Sep 14, 2023
@github-actions

📦 quay.io/s3gw/s3gw:pr-2cf997b980ca03ee54b87ec46241ebbdac036ed9-6181182460-1 https://quay.io/repository/s3gw/s3gw?tab=tags&tag=pr-2cf997b980ca03ee54b87ec46241ebbdac036ed9-6181182460-1

@tserong
Member Author

tserong commented Sep 14, 2023

Now all we have is a few file unlinked while open: /tmp/FH912-6h-OWZfi1GEgqhMdE/s3gw.db, which I believe is an artifact of the way the test is written.

Yeah, that file unlinked while open business is present even without this change. I'll fix that separately.

@0xavi0

0xavi0 commented Sep 14, 2023

Yeah, OK, this ain't gonna work (I know you've already been through this @0xavi0, but I was in the area and wanted to experiment a bit for the sake of my own understanding of how all the pieces fit).

Yeah, but take into account that I looked at this when the s3gw was a POC and we had no time to get too deep into things.
I saw it was giving us issues, so we went the multi-connection way, but it's nice that somebody is spending time investigating further.

We also had a mutex at the beginning, for example, before WAL was enabled.

I found this funny answer in the SQLite FAQ :)

@tserong
Member Author

tserong commented Sep 14, 2023

I found this funny answer in the FAQ of SQLite :)

Oh I love that :-)

@tserong
Member Author

tserong commented Sep 14, 2023

Yeah, that file unlinked while open business is present even without this change. I'll fix that separately.

#210 should take care of that.

@tserong
Member Author

tserong commented Sep 14, 2023

There's something still not quite right here:

2023-09-14T08:37:07.155+0000 7fd150697f80 -1 [SQLITE] (21) misuse at line 171122 of [4635f4a69c]
2023-09-14T08:37:07.223+0000 7fd150697f80 -1 [SQLITE] (21) misuse at line 171122 of [4635f4a69c]
2023-09-14T08:37:07.247+0000 7fd140651700 -1 [SQLITE] (21) API call with NULL database connection pointer
2023-09-14T08:37:07.247+0000 7fd140651700 -1 [SQLITE] (21) misuse at line 135360 of [4635f4a69c]
terminate called after throwing an instance of 'std::system_error'
  what():  out of memory: out of memory

A bit more tracing and testing locally shows that sqlite_orm is somehow still internally using two or three different sqlite3* pointers, which shouldn't be possible with the changes I've made to have only one connection. Of course I must be missing some detail, I just haven't figured out what that detail is yet...

@tserong
Member Author

tserong commented Sep 15, 2023

As for these other failures...

2023-09-14T08:37:07.155+0000 7fd150697f80 -1 [SQLITE] (21) misuse at line 171122 of [4635f4a69c]
2023-09-14T08:37:07.223+0000 7fd150697f80 -1 [SQLITE] (21) misuse at line 171122 of [4635f4a69c]
2023-09-14T08:37:07.247+0000 7fd140651700 -1 [SQLITE] (21) API call with NULL database connection pointer
2023-09-14T08:37:07.247+0000 7fd140651700 -1 [SQLITE] (21) misuse at line 135360 of [4635f4a69c]

...it turns out they're due to sqlite_orm's retain_count not being thread safe. Happily, this is fixed in fnc12/sqlite_orm#1054, but unfortunately we're still using sqlite_orm v1.7.1, which doesn't include that fix. We need to either update @jecluis's package (https://build.opensuse.org/package/show/home:jluis/libsqliteorm) to v1.8+, or bring sqlite_orm in as a submodule as @irq0 suggested in https://github.com/aquarist-labs/s3gw/issues/683 to pick up this fix.

@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@github-actions github-actions bot added the needs-rebase Changes need a rebase label Sep 15, 2023
@tserong tserong force-pushed the wip-only-use-one-db-connection branch from e6c1ca4 to 5ddc2ba Compare September 19, 2023 12:14
@github-actions github-actions bot removed the needs-rebase Changes need a rebase label Sep 19, 2023
@tserong tserong force-pushed the wip-only-use-one-db-connection branch from 5ddc2ba to dbff419 Compare September 19, 2023 12:16
@tserong
Member Author

tserong commented Sep 19, 2023

Hrm. TestSFSConcurrency is still failing even after the update to sqlite_orm v1.8.2. Given it works fine when I run that test locally, I'm going to assume for the moment that the build is picking up our older libsqliteorm package, which is still installed system-wide in the build environment. I guess this'll have to wait for #212 and https://github.com/aquarist-labs/s3gw/pull/713 for further testing.

@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@github-actions github-actions bot added the needs-rebase Changes need a rebase label Sep 20, 2023
@tserong tserong force-pushed the wip-only-use-one-db-connection branch from dbff419 to ec52a78 Compare September 20, 2023 11:09
@github-actions github-actions bot removed the needs-rebase Changes need a rebase label Sep 20, 2023
@tserong tserong force-pushed the wip-only-use-one-db-connection branch from ec52a78 to a88394f Compare September 20, 2023 12:39
@tserong tserong added ci/build-s3gw-image Build and push a container image and removed ci/build-s3gw-image Build and push a container image labels Sep 20, 2023
@github-actions

📦 quay.io/s3gw/s3gw:pr-c6699302f5ccace6467b85d20484f292028b38b7-6248808399-2 https://quay.io/repository/s3gw/s3gw?tab=tags&tag=pr-c6699302f5ccace6467b85d20484f292028b38b7-6248808399-2

This ensures we only have one storage object, and everything that needs to
use it does so via StorageRef, which is a std::shared_ptr.  Storage has been
renamed StorageImpl, to make sure I caught all the previous uses of Storage
elsewhere.  If we want to change this to a different type in future, it
should be possible to do so just by changing the type of StorageRef in
dbconn.h and dbconn.cc - so long as the new type behaves like a pointer,
everything Should Just Work(TM).

Signed-off-by: Tim Serong <[email protected]>
@tserong
Member Author

tserong commented Sep 28, 2023

I'm almost confident about this now, except for one niggling little problem - when running the unit tests locally (specifically TestSFSWALCheckpoint), I've occasionally hit this:

/home/tserong/src/github/aquarist-labs/ceph/src/rgw/driver/sfs/types.cc: In function 'bool rgw::sal::sfs::Object::metadata_finish(rgw::sal::SFStore*, bool) const' thread 7f19db3806c0 time 2023-09-28T19:24:47.011554+1000
/home/tserong/src/github/aquarist-labs/ceph/src/rgw/driver/sfs/types.cc: 230: FAILED ceph_assert(db_versioned_object.has_value())
 ceph version s3gw-v0.21.0-rc1-12-gd5b918ca695 (d5b918ca6958e01cf2b70242452332b910d60c26) reef (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x125) [0x7f19e0586637]
 2: (ceph::register_assert_context(ceph::common::CephContext*)+0) [0x7f19e058683e]
 3: (rgw::sal::sfs::Object::metadata_finish(rgw::sal::SFStore*, bool) const+0x1ee) [0x5628b9e7496c]
 4: (TestSFSWALCheckpoint::multithread_object_create(unsigned long, unsigned long)::{lambda()#1}::operator()() const+0x196) [0x5628b9836fee]
 5: (std::thread::_State_impl<std::thread::_Invoker<std::tuple<TestSFSWALCheckpoint::multithread_object_create(unsigned long, unsigned long)::{lambda()#1}> > >::_M_run()+0xd) [0x5628b98370e5]
 6: /lib64/libstdc++.so.6(+0xe89c3) [0x7f19dece89c3]
 7: /lib64/libc.so.6(+0x8ffa4) [0x7f19de88ffa4]
 8: /lib64/libc.so.6(+0x1187fc) [0x7f19de9187fc]
Aborted (core dumped)

I'd still like to see if I can get a reliable reproducer for that, to verify if it's something stupid in my test, or if there's some underlying issue that needs to be dealt with.

@jecluis
Member

jecluis commented Sep 28, 2023

do you have a log file for that? Might be interesting to see what's happening. Because that smells a lot like a bug.

@tserong
Member Author

tserong commented Sep 28, 2023

No, no log file, just that snippet that was spat out while running the unit tests, which is very little to go on :-/

@irq0
Member

irq0 commented Sep 29, 2023

I'm almost confident about this now, except for one niggling little problem - when running the unit tests locally (specifically TestSFSWALCheckpoint), I've occasionally hit this:

/home/tserong/src/github/aquarist-labs/ceph/src/rgw/driver/sfs/types.cc: In function 'bool rgw::sal::sfs::Object::metadata_finish(rgw::sal::SFStore*, bool) const' thread 7f19db3806c0 time 2023-09-28T19:24:47.011554+1000
/home/tserong/src/github/aquarist-labs/ceph/src/rgw/driver/sfs/types.cc: 230: FAILED ceph_assert(db_versioned_object.has_value())
 ceph version s3gw-v0.21.0-rc1-12-gd5b918ca695 (d5b918ca6958e01cf2b70242452332b910d60c26) reef (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x125) [0x7f19e0586637]
 2: (ceph::register_assert_context(ceph::common::CephContext*)+0) [0x7f19e058683e]
 3: (rgw::sal::sfs::Object::metadata_finish(rgw::sal::SFStore*, bool) const+0x1ee) [0x5628b9e7496c]
 4: (TestSFSWALCheckpoint::multithread_object_create(unsigned long, unsigned long)::{lambda()#1}::operator()() const+0x196) [0x5628b9836fee]
 5: (std::thread::_State_impl<std::thread::_Invoker<std::tuple<TestSFSWALCheckpoint::multithread_object_create(unsigned long, unsigned long)::{lambda()#1}> > >::_M_run()+0xd) [0x5628b98370e5]
 6: /lib64/libstdc++.so.6(+0xe89c3) [0x7f19dece89c3]
 7: /lib64/libc.so.6(+0x8ffa4) [0x7f19de88ffa4]
 8: /lib64/libc.so.6(+0x1187fc) [0x7f19de9187fc]
Aborted (core dumped)

I'd still like to see if I can get a reliable reproducer for that, to verify if it's something stupid in my test, or if there's some underlying issue that needs to be dealt with.

Just a hunch, but going single connection means we lose some isolation.

https://www.sqlite.org/isolation.html#:~:text=No%20Isolation%20Between%20Operations%20On%20The%20Same%20Database%20Connection

@irq0
Member

irq0 commented Sep 29, 2023

This PR doesn't seem to have any noticeable effect on performance: perf-6321780797.pdf

My read of SQLite is that going single connection means we serialize everything on that connection: with serialized thread mode there is a database-level mutex around operations, including sqlite3_step.

Microbenchmarks confirm this, and explain why the system benchmarks don't show any difference.

Setup: 100000 distinct objects. 1M queries spread over $nproc threads (nproc=32). Measuring query start to finish time (execution + busy timeout)

multiple connections + sqlite multithreaded: avg rate: 11273.957159/s - all CPUs 100%
single connection + sqlite serialized: avg rate: 534.600408/s - roughly 200 % combined CPU
multiple connections + sqlite serialized: avg rate: 10387.883573/s - all CPUs 100%

The system benchmark reports about 130 S3 OP/s. There is usually just a handful of SQLite queries per S3 OP.
Therefore I doubt the benchmark saturates SQLite and we get any signal there.

@irq0
Member

irq0 commented Sep 29, 2023

I'm pretty sure the only thing I need to do now to really solidify this is come up with a way of forcing the Storage object to be noncopyable to avoid potential mistakes in coding where we might otherwise accidentally make a copy rather than taking a reference. Maybe having DBConn::get_storage() return a shared pointer instead...?

My choice would be returning a pointer and having DBConn own all Storage objects. It already has the same lifetime as the SFS backend. So, we don't need refcounting.

I think pooling the Storage object would be worthwhile. Maybe with a map of thread id -> Storage, lazy initialized Storage on first access. This would be bounded by the limited number of beast worker threads we have. Always handing out the same connection per thread should mix well with what we have right now. Should we ever decide on going async we might run into trouble with isolation though.

@tserong
Member Author

tserong commented Oct 4, 2023

Just a hunch, but going single connection means we lose some isolation.

https://www.sqlite.org/isolation.html#:~:text=No%20Isolation%20Between%20Operations%20On%20The%20Same%20Database%20Connection

That's interesting, although in the case of the intermittent assert I'm seeing, AFAICT from the code the writes should all be done (obj = bucket->create_version(...)) before the read happens (db_object = dbobjs.get_object(path.get_uuid()) in Object::metadata_finish()), which matches "Within a single database connection X, a SELECT statement always sees all changes to the database that are completed prior to the start of the SELECT statement, whether committed or uncommitted.":

std::vector<std::thread> threads;
for (size_t i = 0; i < num_threads; ++i) {
  std::thread t([&, i]() {
    for (size_t j = 0; j < num_objects; ++j) {
      ObjectRef obj;
      while (!obj) {
        obj = bucket->create_version(rgw_obj_key(
            "object-" + std::to_string(i) + "-" + std::to_string(j)
        ));
      }
      obj->metadata_finish(store.get(), false);
      update_maximum(max_wal_size, fs::file_size(wal));
    }
  });
  threads.push_back(std::move(t));
}

I must be missing some detail somewhere...

@tserong
Member Author

tserong commented Oct 4, 2023

multiple connections + sqlite multithreaded: avg rate: 11273.957159/s - all CPUs 100%
single connection + sqlite serialized: avg rate: 534.600408/s - roughly 200 % combined CPU
multiple connections + sqlite serialized: avg rate: 10387.883573/s - all CPUs 100%

Does that mean single connection + sqlite serialized is 10x faster than the other two options, but uses twice as much CPU?

No, that was just me misreading the data :-/ It means 534 queries/second with a single serialized connection, vs. ~10K queries/second in the other two cases.

There's no point using a shared_ptr for the Storage object as its lifetime is
bound to the DBConn, so just return a regular pointer.

Signed-off-by: Tim Serong <[email protected]>
@tserong
Member Author

tserong commented Oct 4, 2023

My choice would be returning a pointer and having DBConn own all Storage objects. It already has the same lifetime as the SFS backend. So, we don't need refcounting.

Right, makes sense.

I think pooling the Storage object would be worthwhile. Maybe with a map of thread id -> Storage, lazy initialized Storage on first access. This would be bounded by the limited number of beast worker threads we have. Always handing out the same connection per thread should mix well with what we have right now. Should we ever decide on going async we might run into trouble with isolation tough.

Alright, I'll see what I can do :-)

This gives one Storage object per thread, created on demand as a copy of
the first Storage object created in the DBConn constructor.  I'm not sure
whether we need the lock in get_storage(), but suspect we probably do given
we're potentially modifying the storage_pool map.

Right now this is just a std::map<pthread_t, StorageImpl>, but I reckon I
might be able to change the latter to a struct or a pair or something that
_also_ included a pointer to each StorageImpl's raw sqlite3 connection,
via some trickery in the storage->on_open callback.  This would potentially
allow calling sqlite3_db_status() on each connection to get runtime stats.

We might also be able to get rid of the "sfs_db_lock" mutex I added earlier
if we're back to using multiple connections, because multiple connections
should avoid the "transaction inside transaction" problem I had earlier.

Signed-off-by: Tim Serong <[email protected]>
This shouldn't be necessary now that we've got separate connections per
thread again.  I've left it there, but renamed to DBConn::sfs_db_lock
just in case we do need it again in future.

Signed-off-by: Tim Serong <[email protected]>
@tserong
Member Author

tserong commented Oct 4, 2023

...and with the connection pool, we're back to potential WAL explosion, so I'll have to change those tests back again.

This mostly reverts a570e9c, because
now that we're using multiple connections again (albeit pooled), the
WAL growth happens again.

Signed-off-by: Tim Serong <[email protected]>
Since adding storage_pool, the WAL doesn't explode quite as much as
it used to with 10 threads and 1000 objects each (previously with
multiple unpooled connections it'd reliably go over 500MB, but I've
just seen a unit test where it "only" got to ~450MB, so let's drop
the value we test against a bit such that we still confirm the
problem, but have more wiggle room).

Signed-off-by: Tim Serong <[email protected]>
This changes storage_pool to std::unordered_map (faster) and switches to
using a std::shared_mutex in DBConn::get_storage() which is better for
concurrency.  Credit goes to Marcel Lauhoff for this implementation.

Signed-off-by: Tim Serong <[email protected]>
@github-actions github-actions bot added the needs-rebase Changes need a rebase label Oct 11, 2023
@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@tserong tserong removed the ci/build-s3gw-image Build and push a container image label Oct 19, 2023
@tserong tserong mentioned this pull request Oct 19, 2023
@tserong
Member Author

tserong commented Oct 19, 2023

Closing in favour of #233

@tserong tserong closed this Oct 19, 2023
Labels
needs-rebase Changes need a rebase
4 participants