Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use a single-striped connection pool for each database layer instead of a single shared connection #2416

Merged
merged 21 commits into from
Mar 10, 2021

Conversation

KtorZ
Copy link
Member

@KtorZ KtorZ commented Dec 31, 2020

Issue Number

ADP-586

Overview

  • 73df15f
    📍 use a single-striped connection pool for each database layer

    It is a rather common practice to use a pool of database connection
    when dealing with databases. So far, we've been using a single shared
    connection per wallet worker with, in front of each connection a lock
    preventing concurrent access to the database. The lock is only
    necessary because of the way persistent handles query statements
    internally, in principle, SQLite handles concurrent database accesses
    just well.

    For basic wallets, this is a relatively useless change. But for larger
    wallets like those manipulated by exchanges, we've observed very slow
    response time due to concurrent access of the database lock. Indeed,
    some requests may grab the lock for 10 or 20 seconds, preventing any
    requests from going throug. However, most requests are read-only
    requests and could be executed in parallel, at the discretion of
    the SQLite engine. I hope that the introduction of a connection pool
    will improve the overall experience for large wallets by better
    serving concurrent requests on the database. Finger crossed.

Comments

@KtorZ KtorZ added the IMPROVEMENT Mark a PR as an improvement, for auto-generated CHANGELOG label Dec 31, 2020
@KtorZ KtorZ requested a review from rvl December 31, 2020 15:52
@KtorZ KtorZ self-assigned this Dec 31, 2020
let createConnection = do
let info = mkSqliteConnectionInfo connStr
conn <- Sqlite.open connStr
executeManualMigration migration conn
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that I originally executed the migration outside of the createConnection with a bracket open / close / executeMigration ... and it was mostly fine except for database views in the DB.Pool. The database property tests would fail with a no such table: <name of the view> as if the migration had not been executed.

I have looked a bit in the documentation but didn't find anything that suggested that views are scoped to a single connection but it seems to be the case. At least, it should be safe running migration on every connection anyway... but still strange.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. It seems preferable for the migrations to be executed before other pool connections are allowed. Perhaps there needs to be a transaction commit or something?

@KtorZ KtorZ force-pushed the KtorZ/ADP-586/database-connection-pool branch from 7619e4c to e559e43 Compare December 31, 2020 16:05
@KtorZ
Copy link
Member Author

KtorZ commented Dec 31, 2020

bors try

iohk-bors bot added a commit that referenced this pull request Dec 31, 2020
@iohk-bors
Copy link
Contributor

iohk-bors bot commented Dec 31, 2020

try

Build failed:

  test/unit/Cardano/Wallet/DB/StateMachine.hs:1333:28:
  1) Cardano.Wallet.DB.Sqlite, Validate generators & shrinkers, Shrinker for CreateWallet
       Timed out.

#2393

Copy link
Contributor

@rvl rvl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! This will help a lot. I don't know why we didn't do this earlier, since resource-pool is so easy to use. I actually assumed that the API server got to use a different DB connection from the restore thread - oops, wrong.

Comment on lines 204 to 283
-- unmasked with the acquired resource. If an asynchronous exception occurs,
-- the resource is NOT placed back in the pool.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docs say that an exception of any type means that the resource is not placed back in the pool.
This will be quite helpful for error recovery I think.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope.

let createConnection = do
let info = mkSqliteConnectionInfo connStr
conn <- Sqlite.open connStr
executeManualMigration migration conn
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. It seems preferable for the migrations to be executed before other pool connections are allowed. Perhaps there needs to be a transaction commit or something?

createConnection
destroyConnection
numberOfStripes
timeToLive
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the pool thread is cancelled (i.e. wallet is deleted or server needs to exit), those lingering pool connections are going to get cleaned up promptly, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I swapped the destroyDBLayer function with:

destroyDBLayer :: Tracer IO DBLog -> SqliteContext -> IO ()
destroyDBLayer tr SqliteContext{connectionPool,dbFile} = do
    traceWith tr (MsgDestroyConnectionPool dbFile)
    destroyAllResources connectionPool

So this is done as part of the withDBLayer bracket. The same quirks as before with regards to closing a single SQLite connection still applies but destroyConnection now does exactly what destroyDBLayer was doing, that is, retry for some cherry-picked exceptions and handle already closed connections as well.

lib/core/src/Cardano/DB/Sqlite.hs Outdated Show resolved Hide resolved
@KtorZ
Copy link
Member Author

KtorZ commented Jan 4, 2021

I don't know why we didn't do this earlier

Yes. Sometimes we forget the basics.

@KtorZ
Copy link
Member Author

KtorZ commented Jan 4, 2021

bors merge

iohk-bors bot added a commit that referenced this pull request Jan 4, 2021
2416: Use a single-striped connection pool for each database layer instead of a single shared connection r=KtorZ a=KtorZ

# Issue Number

<!-- Put here a reference to the issue that this PR relates to and which requirements it tackles. Jira issues of the form ADP- will be auto-linked. -->

ADP-586

# Overview

<!-- Detail in a few bullet points the work accomplished in this PR -->

- 73df15f
  📍 **use a single-striped connection pool for each database layer**

  It is a rather common practice to use a pool of database connection
  when dealing with databases. So far, we've been using a single shared
  connection per wallet worker with, in front of each connection a lock
  preventing concurrent access to the database. The lock is only
  necessary because of the way persistent handles query statements
  internally, in principle, SQLite handles concurrent database accesses
  just well.

  For basic wallets, this is a relatively useless change. But for larger
  wallets like those manipulated by exchanges, we've observed very slow
  response time due to concurrent access of the database lock. Indeed,
  some requests may grab the lock for 10 or 20 seconds, preventing any
  requests from going throug. However, most requests are read-only
  requests and could be executed in parallel, at the discretion of
  the SQLite engine. I hope that the introduction of a connection pool
  will improve the overall experience for large wallets by better
  serving concurrent requests on the database. Finger crossed.


# Comments

<!-- Additional comments or screenshots to attach if any -->

<!--
Don't forget to:

 ✓ Self-review your changes to make sure nothing unexpected slipped through
 ✓ Assign yourself to the PR
 ✓ Assign one or several reviewer(s)
 ✓ Jira will detect and link to this PR once created, but you can also link this PR in the description of the corresponding ticket
 ✓ Acknowledge any changes required to the Wiki
 ✓ Finally, in the PR description delete any empty sections and all text commented in <!--, so that this text does not appear in merge commit messages.
-->


Co-authored-by: KtorZ <[email protected]>
Co-authored-by: IOHK <[email protected]>
@iohk-bors
Copy link
Contributor

iohk-bors bot commented Jan 4, 2021

Build failed:

#expected

@KtorZ
Copy link
Member Author

KtorZ commented Jan 4, 2021

☝️ looks pretty serious, it's full chaos in the integration cluster. My guess is the database reference counter not counting properly since the mutable counter only count DBLayer but not individual connections so we are back with race-condition with regards to database deletion.

-- ^ A handle to the Persistent SQL backend.
, isDatabaseActive :: TVar Bool
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A link to https://www.sqlite.org/wal.html#busy might help.

@KtorZ KtorZ force-pushed the KtorZ/ADP-586/database-connection-pool branch 3 times, most recently from 224f675 to 1b088a4 Compare January 5, 2021 11:32
@KtorZ
Copy link
Member Author

KtorZ commented Jan 5, 2021

bors merge

iohk-bors bot added a commit that referenced this pull request Jan 5, 2021
2416: Use a single-striped connection pool for each database layer instead of a single shared connection r=KtorZ a=KtorZ

# Issue Number

<!-- Put here a reference to the issue that this PR relates to and which requirements it tackles. Jira issues of the form ADP- will be auto-linked. -->

ADP-586

# Overview

<!-- Detail in a few bullet points the work accomplished in this PR -->

- 73df15f
  📍 **use a single-striped connection pool for each database layer**

  It is a rather common practice to use a pool of database connection
  when dealing with databases. So far, we've been using a single shared
  connection per wallet worker with, in front of each connection a lock
  preventing concurrent access to the database. The lock is only
  necessary because of the way persistent handles query statements
  internally, in principle, SQLite handles concurrent database accesses
  just well.

  For basic wallets, this is a relatively useless change. But for larger
  wallets like those manipulated by exchanges, we've observed very slow
  response time due to concurrent access of the database lock. Indeed,
  some requests may grab the lock for 10 or 20 seconds, preventing any
  requests from going throug. However, most requests are read-only
  requests and could be executed in parallel, at the discretion of
  the SQLite engine. I hope that the introduction of a connection pool
  will improve the overall experience for large wallets by better
  serving concurrent requests on the database. Finger crossed.


# Comments

<!-- Additional comments or screenshots to attach if any -->

<!--
Don't forget to:

 ✓ Self-review your changes to make sure nothing unexpected slipped through
 ✓ Assign yourself to the PR
 ✓ Assign one or several reviewer(s)
 ✓ Jira will detect and link to this PR once created, but you can also link this PR in the description of the corresponding ticket
 ✓ Acknowledge any changes required to the Wiki
 ✓ Finally, in the PR description delete any empty sections and all text commented in <!--, so that this text does not appear in merge commit messages.
-->


Co-authored-by: KtorZ <[email protected]>
Co-authored-by: IOHK <[email protected]>
@KtorZ
Copy link
Member Author

KtorZ commented Jan 5, 2021

bors r-

@iohk-bors
Copy link
Contributor

iohk-bors bot commented Jan 5, 2021

Canceled.

@KtorZ
Copy link
Member Author

KtorZ commented Jan 5, 2021

Forgot to push the latest change :|

@KtorZ KtorZ force-pushed the KtorZ/ADP-586/database-connection-pool branch from 1b088a4 to 2f5246c Compare January 5, 2021 17:24
@KtorZ
Copy link
Member Author

KtorZ commented Jan 5, 2021

bors try

iohk-bors bot added a commit that referenced this pull request Jan 5, 2021
@iohk-bors
Copy link
Contributor

iohk-bors bot commented Jan 5, 2021

try

Build failed:

#expected: extra dependencies detected by weeder

@rvl rvl force-pushed the KtorZ/ADP-586/database-connection-pool branch from 2f5246c to d8f3c74 Compare January 6, 2021 06:10
Copy link
Contributor

@rvl rvl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent, thanks for this!

If I run the integration tests with NO_CLEANUP=1, then ctrl-c them, there are no longer -shm and -wal files left behind in the tests temporary directory. 🎉

Even better, wallets which were supposed to be deleted, actually seem to have been deleted.

Comment on lines 243 to 246
-- runSqlConn is guarded with a lock because it's not threadsafe in
-- general.It is also masked, so that the SqlBackend state is not
-- corrupted if a thread gets cancelled while running a query.
-- See: https://github.com/yesodweb/persistent/issues/981
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps in future we could drop the lock and mask_?

  1. If we are confident that any pool connection will only be used by one thread at a time, then the lock isn't needed.
  2. If the thread is cancelled during a database operation, then so be it. SQLite will rollback the uncommitted transaction. The connection won't be returned to the pool.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(1) is only true if we can start the database in full_mutex I guess, which persistent doesn't really allow us to set that flag (although, we've forked it so it's easily added). This will not prevent retries however in case two writing threads are trying to write something at the same time, but at least it should give more concurrency to readers 🤔 Also "if we are confident that any pool connection will only be used by one thread at a time" ==> I have no such confidence at this stage unfortunately :(

(2) Hmmm. That's right. Sounds scary though 😱

retryOnBusy :: Tracer IO DBLog -> IO a -> IO a
retryOnBusy tr action =
recovering policy (mkRetryHandler isBusy) $ \RetryStatus{rsIterNumber} -> do
when (rsIterNumber > 0) $ traceWith tr (MsgRetryOnBusy rsIterNumber)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that there's logging. With the resource pool improvements. it will be interesting to see if this error ever happens in tests.

@rvl
Copy link
Contributor

rvl commented Jan 6, 2021

Fixed a weeder error and rebased.

bors r+

KtorZ added 3 commits March 9, 2021 13:40
  It is a rather common practice to use a pool of database connection
  when dealing with databases. So far, we've been using a single shared
  connection per wallet worker with, in front of each connection a lock
  preventing concurrent access to the database. The lock is only
  necessary because of the way persistent handles query statements
  internally, in principle, SQLite handles concurrent database accesses
  just well.

  For basic wallets, this is a relatively useless change. But for larger
  wallets like those manipulated by exchanges, we've observed very slow
  response time due to concurrent access of the database lock. Indeed,
  some requests may grab the lock for 10 or 20 seconds, preventing any
  requests from going throug. However, most requests are read-only
  requests and could be executed in parallel, at the discretion of
  the SQLite engine. I hope that the introduction of a connection pool
  will improve the overall experience for large wallets by better
  serving concurrent requests on the database. Finger crossed.
  I ran into quite a few issues with the integration tests since the
  unliftio merge and rebase (I think, as I am pretty I did observe unit
  and integration tests doing just fine with the resource pool at least
  once). I've been investigating this for most of the day, and found a few
  interesting cases:

  (a) SQLite may return 'SQLITE_BUSY' on pretty much any requests if two
      concurrent write queries hit the engine; though we
      currently only catch this kind of exception when we try closing the
      database so I generalized a bit our error handling here.

  (b) It seems that calling destroyAllResources from resource-pool does not
      prevent new threads from acquiring new resources. And there's no way
      with the resource-pool library itself to prevent the creation of
      new resources after a certain point. So it may happen that while the
      database layer is being destroyed, new database connections are created
      and start causing conflicts between each others.
  This avoids the need for an extra 'TVar Bool' to guard the connection
  pool from threads whishing to acquire new resources. Instead, we can
  wrap the pool acquisition in a bracket: `bracket createPool destroyAllResources`
  so that the pool is cleaned up when done and we are sure that no
  thread will attempt to acquire a new resource while
  destroyAllResources is called.

  This sole change wasn't as straightforward as I wanted because it
  moves the control of the `SqliteContext` up in the stack and therefore
  requires reviewing many more parts of both the pool and wallet db
  layers. I think it's for a greater good in the end and make them both
  slightly better / robust. In the end, it is still a bit "awkward" that
  we have constructors / functions in those modules that are solely used
  by the test code and not by the actual application (this is the case
  of 'withDBLayer' for instance...).

  To not over-complicate things, I ended up handling the in-memory and
  in-file SqliteContext setup a bit differently. Incidentally I realized
  later that we run most of our unit-tests on the 'in-memory' version;
  which means that we aren't testing the resource pool in the context of
  the unit tests. I am not sure whether this is a good thing or not: it
  makes the unit tests a bit more focus on testing the actual business
  logic, and we still have the system-level integration tests to put the
  resource pool under great stress.
@rvl rvl force-pushed the KtorZ/ADP-586/database-connection-pool branch from b3059d0 to 0cec0b5 Compare March 9, 2021 08:28
rvl added 18 commits March 9, 2021 18:52
This function will be useful in criterion benchmarks.

It also fixes handling of exceptions while allocating resources.
If the thread is cancelled, we want the query stopped immediately.

Anything uncommitted will be rolled back by persistent.
The deleteWallet handler opens a database connection then tries to
remove the database before closing the connection. But the sqlite
removeDatabase function waits for all connections to be closed first.

The livelock is only broken after the 1 minute timeout in
removeDatabase.
It was bothering me.
And add logging of database checkpoint cache
@rvl rvl force-pushed the KtorZ/ADP-586/database-connection-pool branch from 0cec0b5 to 7729917 Compare March 9, 2021 08:54
@rvl
Copy link
Contributor

rvl commented Mar 9, 2021

Rebased and fixed merge conflicts again....

bors r+

iohk-bors bot added a commit that referenced this pull request Mar 9, 2021
2416: Use a single-striped connection pool for each database layer instead of a single shared connection r=rvl a=KtorZ

# Issue Number

<!-- Put here a reference to the issue that this PR relates to and which requirements it tackles. Jira issues of the form ADP- will be auto-linked. -->

ADP-586

# Overview

<!-- Detail in a few bullet points the work accomplished in this PR -->

- 73df15f
  📍 **use a single-striped connection pool for each database layer**

  It is a rather common practice to use a pool of database connection
  when dealing with databases. So far, we've been using a single shared
  connection per wallet worker with, in front of each connection a lock
  preventing concurrent access to the database. The lock is only
  necessary because of the way persistent handles query statements
  internally, in principle, SQLite handles concurrent database accesses
  just well.

  For basic wallets, this is a relatively useless change. But for larger
  wallets like those manipulated by exchanges, we've observed very slow
  response time due to concurrent access of the database lock. Indeed,
  some requests may grab the lock for 10 or 20 seconds, preventing any
  requests from going throug. However, most requests are read-only
  requests and could be executed in parallel, at the discretion of
  the SQLite engine. I hope that the introduction of a connection pool
  will improve the overall experience for large wallets by better
  serving concurrent requests on the database. Finger crossed.


# Comments

<!-- Additional comments or screenshots to attach if any -->

<!--
Don't forget to:

 ✓ Self-review your changes to make sure nothing unexpected slipped through
 ✓ Assign yourself to the PR
 ✓ Assign one or several reviewer(s)
 ✓ Jira will detect and link to this PR once created, but you can also link this PR in the description of the corresponding ticket
 ✓ Acknowledge any changes required to the Wiki
 ✓ Finally, in the PR description delete any empty sections and all text commented in <!--, so that this text does not appear in merge commit messages.
-->


Co-authored-by: KtorZ <[email protected]>
Co-authored-by: Rodney Lorrimar <[email protected]>
@iohk-bors
Copy link
Contributor

iohk-bors bot commented Mar 9, 2021

Build failed:

Failures:

  src/Test/Integration/Scenario/API/Shelley/Transactions.hs:2140:27:
  1) API Specifications, SHELLEY_TRANSACTIONS, TRANS_LIST_02,03x - Can limit/order results with start, end and order
       While verifying (Status {statusCode = 200, statusMessage = "OK"},Right [])
       expected: 1
        but got: 0

  To rerun use: --match "/API Specifications/SHELLEY_TRANSACTIONS/TRANS_LIST_02,03x - Can limit/order results with start, end and order/"

Randomized with seed 64164505

Finished in 1651.8216 seconds
756 examples, 1 failure, 40 pending

@rvl
Copy link
Contributor

rvl commented Mar 10, 2021

Please

bors r+

@iohk-bors
Copy link
Contributor

iohk-bors bot commented Mar 10, 2021

Build succeeded:

@iohk-bors iohk-bors bot merged commit 5379123 into master Mar 10, 2021
@iohk-bors iohk-bors bot deleted the KtorZ/ADP-586/database-connection-pool branch March 10, 2021 03:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IMPROVEMENT Mark a PR as an improvement, for auto-generated CHANGELOG
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants