Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: use WAL mode and set pragmas correctly #98

Merged
merged 7 commits into from
Sep 18, 2024
Merged

Conversation

moio
Copy link
Contributor

@moio moio commented Aug 30, 2024

This

  • sets PRAGMAs correctly, before they were ignored as they used a wrong syntax
  • removes the shared=cache option as sql: drop the attachdriver mechanism #97 removed the need for it
  • removes the mechanism retrying on SQLITE_BUSY, which is then unnecessary
  • adds code to issue BEGIN on read-only transactions and BEGIN IMMEDIATE for transactions expected to write

The last point is necessary in order to avoid SQLITE_BUSY (5) errors in presence of multiple concurrent writing transactions. Without it busy_timeout would not be sufficient, see discussion below.

best reviewed commit-by-commit

Benchmark notes

Results from a quick and rough benchmark on my dev environment look good - performance is on par or superior compared to the situation without this PR.

This test lists the first page of ConfigMaps with a varying number of ConfigMaps: 0 up to 6000. Also, a number of ConfigMaps are changed (added and deleted) every second: 0 up to 400.

Results are the p(95) of the whole request time:

image

(Grey is without this PR, green is with this PR. Lower is better)

Note 1: the test at 6000 ConfigMaps failed with a SQLite error after 100 changes per second: this PR also fixes that bug 😇

Note 2: results are also robustly better than running with Vai off (no SQLite pagination):

image

(Grey is without this PR, green is with this PR, red is with Vai off. Lower is better)

To replicate use:
https://gist.github.com/moio/20a43fa406432cec5fc8aee0394d3e1f

With these scripts from the dartboard project:

https://github.com/rancher/dartboard/blob/main/k6/api_benchmark.js
https://github.com/rancher/dartboard/blob/main/k6/create_k8s_resources.js
https://github.com/rancher/dartboard/blob/main/k6/change_config_maps.js

@moio moio requested a review from a team as a code owner August 30, 2024 15:04
@moio moio requested a review from tomleb August 30, 2024 15:05
@moio moio changed the title Use WAL mode and set pragmas correctly sql: use WAL mode and set pragmas correctly Aug 30, 2024
tomleb
tomleb previously approved these changes Sep 4, 2024
Copy link
Contributor

@tomleb tomleb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

A long comment that is potentially in the wrong issue follows..

busy_timeout doesn't always work as we expect with deferred transactions (the default).

Here's a few examples that all start with the same setup. The prefix X: is the DB connection X:

setup

A: create table T(t text);
A: create table R(r text);
A: insert into T values ('a'), ('b');
A: insert into R values ('a'), ('b');
A: begin;
A: update T set t='x' where t='a';
B: PRAGMA busy_timeout=5000;

select same table

Succeeds (as expected because WAL allows concurrent read+writes).

B: begin;
B: select * from T;
a
b
B: commit;

update different table

Blocks 5 seconds (busy_timeout) and returns an error.

B: begin;
B: update R set r='x' where r='a';
# Blocks for 5 seconds
Runtime error: database is locked (5)

select+update different table

Immediately returns an error. Doesn't respect the busy_timeout set. I can see this case happening with the Replace() method (we're doing both read+write). eg: New Pods event comes in and at the same time, Namespaces resyncs.

Note: Retrying the update will keep returning an error even when A's transaction is committed. Yes, even if A updated a different table.

B: begin;
B: select * from R;
a
b
B: update R set r='x' where r='a';
Runtime error: database is locked (5)
B: update R set r='x' where r='a';
Runtime error: database is locked (5)
A: commit;
B: update R set r='x' where r='a';
Runtime error: database is locked (5)

select+update with immediate

Contrast the above scenario with immediate mode transactions. It basically tries to get a lock immediately instead of on the first write command. The rest of the transaction is then guaranteed to not get database is locked error.

B: begin immediate;
# Blocks for 5 seconds waiting for commit
Runtime error: database is locked (5)
A: commit;
B: begin immediate;
B: select * from R;
a
b
B: update R set r='x' where r='a';
B: commit;

Should we move the retry logic elsewhere rather than removing it OR use immediate transactions OR use a single connection specifically for writes?

@moio
Copy link
Contributor Author

moio commented Sep 12, 2024

@tomleb I am very glad you brought this up as it gave me the opportunity to learn things I did not know.

"select+update different table" behavior seemed obscure/broken to me for a few days, until I found a tangentially related article that made it clear why it behaves that way:

  • as it is well-known, SQLite only supports one active "writer" at any point in time. I find "writer" a bit of nebulous shorthand for "writing transaction" (which is definitely a SQLite-specific concept)
  • in SQLite a transaction is "writing" if it is either started with BEGIN IMMEDIATE or if it already issued a DML statement (INSERT/UPDATE/DELETE). Note that normal (non-IMMEDIATE) transactions start their life as "non-writing" and can be promoted to "writing" at a later point
  • in any situation, if a "writing" transaction starts (either because it BEGINs IMMEDIATEly or because it got promoted) while another is still in progress, then the SQLITE_BUSY (5) error is produced, that transaction gets into an error state and, like any other errored transaction, it can't be recovered. The only way to proceed is to ROLLBACK and then start with a fresh transaction
  • because everybody hates rolling back, as a convenience, SQLite implements the busy_timeout PRAGMA. That implements code that conceptually catches SQLITE_BUSY errors, issues ROLLBACKs, waits and BEGINs again up to a specified timeout - all transparently! Pretty convenient.
  • problem is that busy_timeout cannot work in all cases. Conceptually, rolling back and retrying a transaction is perfectly OK if that transaction only issued BEGIN IMMEDIATE and nothing else (because literally nothing else happened, there is nothing to roll back). busy_timeout is also nice enough to handle transactions that started as non-writer but did nothing but a DML command which immediately ran into SQLITE_BUSY (5) - effectively, those transactions did nothing either. But if a transaction started as "non-writer", SELECTed something, returned data back to application code, and after that it became a writer and got SQLITE_BUSY (5), then there is no safe way a ROLLBACK can possibly be handled from inside SQLite - it has to bubble up to application code, which is why busy_timeout is "not respected".

Fundamentally busy_timeout has to be seen as a mechanism that saves you from the pain of retrying transactions in your code in easy cases, not as a general parameter that is guaranteed to be always respected 😢.

Back to our options:

  1. using immediate transactions exclusively: will not work. Specifically: it will be functional, but there will be no reader/reader or reader/writer parallelism, as we would be effectively limited to one active transaction at any point in time
  2. using a single connection for writes: it will work, but you will need to take care when starting a transaction to use the right connection depending on whether you plan to use it to write or not, and synchronize access, queuing up as necessary. Might be a bit complicated
  3. moving retry logic elsewhere: it will also work - provided we retry whole transactions and not commands inside of a transaction (that is doomed: once a transaction is in an error state, it can't get out of it). Might also not be super-straightforward, but has the advantage of not needing to think when opening a transaction whether it will be read-only or read-write
  4. using BEGIN IMMEDIATE for all writing transactions - potentially the least intrusive fix, and allows to forego all retrying code. Requires to think before BEGINning whether the transaction will write or not.

Unless you disagree or see flaws in the reasoning above, I'd start by trying to go with option 3 in a separate PR and see how it goes, as it seems to minimize the complexity of the finished thing.

@tomleb
Copy link
Contributor

tomleb commented Sep 13, 2024

@moio thanks for letting me know about this article, it's got really good information!

I'm okay with you trying out option 3.

I'm tempted to try option 2 & 4 together myself and see how that turns out. (Inspired by the rqlite codebase) Naively, I'm thinking the interface is already defined for us so we know what needs read-only and read-write. Though of course as code changes.. I suspect some non-mock test would be able to find issues here such as trying to write to a database using a read-only connection..

@moio
Copy link
Contributor Author

moio commented Sep 13, 2024

🤦 I wrote option 3. while meaning option 4. - as I added a point mid editing 😮‍💨

Are you still OK with that?

Unrelated: I absolutely do not oppose you going down the option 2. route. In my eyes, once that is done and correctly enqueues write transactions, adding 4. on top is not even needed. Let me know what you decide in the end.

@tomleb
Copy link
Contributor

tomleb commented Sep 13, 2024

Are you still OK with that?

Yes, definitely.

adding 4. on top is not even needed

Heh you're right, I hadn't thought of that. Once we limit writing to a single connection, then deferred mode cannot fail with database is locked.

Played a bit with option 4

I'm seeing some weird behavior when opening the DB with default transaction mode immediate. BeginTx will not respect busy_timeout although it should. Running BEGIN IMMEDIATE manually does work. So if you're going with option 4 I suggest keeping the default transaction mode of deferred and using BEGIN IMMEDIATE manually. Unless you had a better idea. Here's the snippet describing this:

package main

import (
	"context"
	"database/sql"
	"log"
	"os"
	"os/signal"
	"syscall"
	"time"

	_ "modernc.org/sqlite"
)

func main() {
	if err := mainErr(); err != nil {
		log.Fatal(err)
	}
}

func mainErr() error {
	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
	defer stop()

	// cache=shared is necessary for in-memory DB
	rwDB, err := sql.Open("sqlite", "file::memory:?cache=shared&_pragma=journal_mode=wal&_txlock=immediate&_pragma=busy_timeout=5000")
	if err != nil {
		return err
	}
	rwDB.SetMaxOpenConns(2)

	conn, _ := rwDB.Conn(ctx)
	tx, err := conn.BeginTx(ctx, nil)
	if err != nil {
		return err
	}

	go func() {
		conn2, _ := rwDB.Conn(ctx)
		// BeginTx will fail with database is locked error even though
		// we set immediate transaction as default
		_, err2 := conn2.BeginTx(ctx, nil)
		// Running BEGIN IMMEDIATE manually gives us the behavior we
		// want: busy_timeout is respected
		//_, err2 := conn2.ExecContext(ctx, "BEGIN IMMEDIATE;")
		if err2 != nil {
			log.Fatal(err2)
		}
	}()
	time.Sleep(10 * time.Second)

	err = tx.Commit()
	if err != nil {
		return err
	}

	return nil
}

modernc's implementation of SQLite uses a slightly different syntax

Signed-off-by: Silvio Moioli <[email protected]>
120 seconds, or two consecutive minutes of uninterrupted contention,
should be more than sufficient and anything exceeding this value
should be reported as error.

Signed-off-by: Silvio Moioli <[email protected]>
With busy_timeout properly set, SQLite will take care of
the error internally.

Moreover with WAL mode properly set, the error should
almost never happen to begin with.

https://www.sqlite.org/wal.html#sometimes_queries_return_sqlite_busy_in_wal_mode

Signed-off-by: Silvio Moioli <[email protected]>
tomleb
tomleb previously approved these changes Sep 13, 2024
Copy link
Contributor

@tomleb tomleb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

I'm still not sure whether I prefer having 2 separate connection pool (1 read-write, 1 read-only) or using begin immediate. Immediate transaction mode has a footgun (imo) with modernc.org/sqlite (and github.com/mattn/go-sqlite3): busy_timeout is NOT respected when using cache=shared and executing BeginTx, but it IS respected when using cache=shared and executing ExecContext("begin immediate"). We're not using cache=shared though, but if we ever want to use :memory: databases for tests, we will have to use it, which will lead to very different behavior between tests and production.

As I understand it, 2 separate connection pools would avoid this problem since the read-write connection would be blocked by the application instead.

@moio
Copy link
Contributor Author

moio commented Sep 17, 2024

busy_timeout is NOT respected when using cache=shared and executing BeginTx, but it IS respected when using cache=shared and executing ExecContext("begin immediate")

I am not exactly sure this is actually the case - as BeginTx with the change in https://github.com/rancher/lasso/pull/98/files#diff-28f84babc5eeebb6a449367653bf97f8670e2816ffd8f898eb592dd3819285b0R236 actually calls BEGIN IMMEDIATE so the two should be equivalent.

In any case, I would not recommend going down the :memory: lane for tests - there are enough subtle differences in terms of locking and performance with the file-based approach, and the latter is anyway so fast, I would recommend running integration tests in files for the sake of uniformity and debuggability anyway.

@tomleb
Copy link
Contributor

tomleb commented Sep 17, 2024

busy_timeout is NOT respected when using cache=shared and executing BeginTx, but it IS respected when using cache=shared and executing ExecContext("begin immediate")

I am not exactly sure this is actually the case - as BeginTx with the change in https://github.com/rancher/lasso/pull/98/files#diff-28f84babc5eeebb6a449367653bf97f8670e2816ffd8f898eb592dd3819285b0R236 actually calls BEGIN IMMEDIATE so the two should be equivalent.

Yes, should but once you actually test it they end up not being equivalent. BeginTx even with _txlock set will call BEGIN IMMEDIATE differently (eg: with different sqlite functions) than calling ExecContext("BEGIN IMMEDIATE"). (And to be clear, I did a few tests and that's the behavior I'm seeing using both modernc.org and mattn/go-sqlite3 libraries)

In any case, I would not recommend going down the :memory: lane for tests - there are enough subtle differences in terms of locking and performance with the file-based approach, and the latter is anyway so fast, I would recommend running integration tests in files for the sake of uniformity and debuggability anyway.

Fair, though we might end up having to be able to set the path to the DB to avoid having multiple tests.

Anyway, nothing to change here, I'm good with the current solution 👍

Copy link
Contributor

@MbolotSuse MbolotSuse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving since this has the two required approvers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants