Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pgx TestTxCommitSerializationFailure seems to get stuck waiting on a lock #60754

Closed
rafiss opened this issue Feb 18, 2021 · 13 comments
Closed

pgx TestTxCommitSerializationFailure seems to get stuck waiting on a lock #60754

rafiss opened this issue Feb 18, 2021 · 13 comments
Labels
C-investigation Further steps needed to qualify. C-label will change. no-issue-activity X-stale

Comments

@rafiss
Copy link
Collaborator

rafiss commented Feb 18, 2021

The pgx driver has a test that verifies that a 40001 error is returned on serialization failure. See https://github.com/jackc/pgx/blob/927a15124e6a579327ea56b478e4ba7dac6e45a6/tx_test.go#L146

When running against CockroachDB, however, the database gets stuck executing a query instead of returning the 40001 error code.

Confusingly, when I try to reproduce this by running the commands from two different CLIs, the 40001 error is returned as expected.

For this issue, we should determine why the pgx test gets stuck.

Here are the statements executed by the test:

Connection 1 Connection 2
drop table if exists tx_serializable_sums;
create table tx_serializable_sums(num integer);
begin isolation level serializable;
begin isolation level serializable;
insert into tx_serializable_sums(num) select sum(num)::int from tx_serializable_sums;
insert into tx_serializable_sums(num) select sum(num)::int from tx_serializable_sums;
commit;
commit; -- serialization error is returned here, as expected

And here is the test code, which gets stuck at the second INSERT. you can run SHOW CLUSTER QUERIES after invoking the test and see that the query is running.

import (
	"context"
	"testing"

	"github.com/jackc/pgconn"
	"github.com/jackc/pgx/v4"
	"github.com/stretchr/testify/require"
)

func mustConnectString(t testing.TB, connString string) *pgx.Conn {
	conn, err := pgx.Connect(context.Background(), connString)
	if err != nil {
		t.Fatalf("Unable to establish connection: %v", err)
	}
	return conn
}

func closeConn(t testing.TB, conn *pgx.Conn) {
	err := conn.Close(context.Background())
	if err != nil {
		t.Fatalf("conn.Close unexpectedly failed: %v", err)
	}
}

func TestTxCommitSerializationFailure(t *testing.T) {
	c1 := mustConnectString(t, "postgresql://root:@localhost:26257/defaultdb")
	defer closeConn(t, c1)

	c2 := mustConnectString(t, "postgresql://root:@localhost:26257/defaultdb")
	defer closeConn(t, c2)

	c1.Exec(context.Background(), `drop table if exists tx_serializable_sums`)
	_, err := c1.Exec(context.Background(), `create table tx_serializable_sums(num integer);`)
	if err != nil {
		t.Fatalf("Unable to create temporary table: %v", err)
	}
	defer c1.Exec(context.Background(), `drop table tx_serializable_sums`)

	tx1, err := c1.BeginTx(context.Background(), pgx.TxOptions{IsoLevel: pgx.Serializable})
	if err != nil {
		t.Fatalf("Begin failed: %v", err)
	}
	defer tx1.Rollback(context.Background())

	tx2, err := c2.BeginTx(context.Background(), pgx.TxOptions{IsoLevel: pgx.Serializable})
	if err != nil {
		t.Fatalf("Begin failed: %v", err)
	}
	defer tx2.Rollback(context.Background())

	_, err = tx1.Exec(context.Background(), `insert into tx_serializable_sums(num) select sum(num)::int from tx_serializable_sums`)
	if err != nil {
		t.Fatalf("Exec failed: %v", err)
	}

        // STATEMENT BELOW NEVER FINISHES
	_, err = tx2.Exec(context.Background(), `insert into tx_serializable_sums(num) select sum(num)::int from tx_serializable_sums`)
	if err != nil {
		t.Fatalf("Exec failed: %v", err)
	}

	err = tx1.Commit(context.Background())
	if err != nil {
		t.Fatalf("Commit failed: %v", err)
	}

	err = tx2.Commit(context.Background())
	if pgErr, ok := err.(*pgconn.PgError); !ok || pgErr.Code != "40001" {
		t.Fatalf("Expected serialization error 40001, got %#v", err)
	}

Jira issue: CRDB-3144

@rafiss rafiss added the C-investigation Further steps needed to qualify. C-label will change. label Feb 18, 2021
@tbg
Copy link
Member

tbg commented Feb 19, 2021

The second insert will block on the first insert and can only commence once tx1 commits. Think of it this way:

var mu sync.Mutex

mu.Lock() // tx1
mu.Lock() // tx2
mu.Unlock // tx1
mu.Unlock // tx2

Obviously this ordering is not going to complete as we deadlock on tx2's Lock() invocation.

@rafiss
Copy link
Collaborator Author

rafiss commented Feb 19, 2021

@tbg that makes sense to me, but then why does it not block when I run the same commands from the CLI as described above? am I missing something obvious?

also, running the same test against Postgres in SERIALIZABLE mode does not block. do we implement SERIALIZABLE differently?

@rafiss
Copy link
Collaborator Author

rafiss commented Feb 19, 2021

OK, so if I run the below commands in two CLI shells repeatedly, I do eventually repro the behavior where the second connection is stuck waiting.

Connection 1 Connection 2
begin isolation level serializable;
begin isolation level serializable;
insert into tx_serializable_sums(num) select sum(num)::int from tx_serializable_sums;
insert into tx_serializable_sums(num) select sum(num)::int from tx_serializable_sums;
commit;
commit;

So questions:

  • Why does the PGX test get stuck right away, whereas I have to run it a few times from the CLI to see that behavior? (is the difference something to do with the table being empty?)
  • More specifically, what can I look at to figure out why for myself?
  • Is it expected that CockroachDB would deadlock here and PostgreSQL (with SERIALIZABLE txns) does not?

@ajwerner
Copy link
Contributor

Cockroach should never deadlock. Any deadlock is a serious bug. Does connection 1 actually send that commit? Stack traces and logs from a a repro will help.

@rafiss
Copy link
Collaborator Author

rafiss commented Feb 19, 2021

I don't think it's a deadlock. As I say above, when running that test against CockroachDB, connection 1 does not send that commit.

        // STATEMENT BELOW NEVER FINISHES
	_, err = tx2.Exec(context.Background(), `insert into tx_serializable_sums(num) select sum(num)::int from tx_serializable_sums`)
	if err != nil {
		t.Fatalf("Exec failed: %v", err)
	}

The test does not continue past that point.

To reiterate, my questions are:

  • Why does connection 2 not get stuck at the same point when I run the same things from the CLI?
  • Why does PostgresSQL not get stuck at all, even when running the same test code at the SERIALIZABLE isolation level?

Is this all expected? Should we modify that PGX test so that it runs the commands in separate goroutines?

@rafiss
Copy link
Collaborator Author

rafiss commented Feb 19, 2021

Here's a debug zip though. debug.zip

@ajwerner
Copy link
Contributor

Postgres's locking behavior is different from cockroach's. Postgres figures out serialization violations later, at commit time. We're more pessimistic IIUC. We can't assume that they'll behave exactly the same. For example, I don't think pg ever blocks on reads.

@rafiss
Copy link
Collaborator Author

rafiss commented Feb 20, 2021

OK, that part is more clear to me now, thanks!

I'd still like to understand why running the test against against CRDB blocks, while running the same commands against CRDB manually from the CLI does not block.

@tbg
Copy link
Member

tbg commented Feb 22, 2021

I can't reproduce the behavior you're seeing @rafiss. I used this logic test:

# LogicTest: 5node-pretend59315IsFixed
#
# Verify that ContentionEvents are emitted. This is mostly a sanity check - look
# for the datadriven tests in `pkg/kv/kvserver/concurrency` for the actual events
# that do get emitted in various contention scenarios.

statement ok
GRANT ADMIN TO testuser

statement ok
create table tx_serializable_sums(num integer)

statement ok
BEGIN

statement ok
insert into tx_serializable_sums(num) select sum(num)::int from tx_serializable_sums

user testuser

statement ok
BEGIN

statement error due to statement timeout
SET statement_timeout = '3s';
insert into tx_serializable_sums(num) select sum(num)::int from tx_serializable_sums

statement ok
ROLLBACK

user root

statement ok
COMMIT


I can only imagine that when you're reproing manually, the order isn't the same (though I also think the txns are simple enough for it to not be screwed up?)

Also, isn't that SQL in the original test invalid? What is insert into tx_serializable_sums(num)::int? That certainly doesn't work in the logic test (I fixed it above).

@rafiss
Copy link
Collaborator Author

rafiss commented Feb 22, 2021

Oops, yes, I put the cast on the wrong expression in my code sample. The commands I put in the table in the issue description has the cast in the right place. (Editing the post now to fix my code sample too.)

Though @tbg , one difference in your logic test repro and my manual CLI repro is that the logic test sends the INSERT immediately after the BEGIN. In my CLI repro, I do (1) BEGIN; (2) BEGIN; (1) INSERT..., (2) INSERT..., (1) COMMIT, (2) COMMIT

Does interleaving BEGINs like this make any difference? Maybe not, since it looks like the pgx test interleaves them as well.

@tbg
Copy link
Member

tbg commented Feb 23, 2021

It should not matter. What matters is that (2) INSERT happens after (1) INSERT but before (1) COMMIT. That way you will always have the second write block on the first one, and since the first one will never disappear on its own it's a deadlock.

@tbg
Copy link
Member

tbg commented Feb 23, 2021

(More details: (2) INSERT contains a full table scan, and it's really that read that blocks on (1) INSERT). But a read can't ignore an intent at a lower timestamp, and the intent would definitely be at a lower timestamp since it was laid down earlier.

@github-actions
Copy link

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-investigation Further steps needed to qualify. C-label will change. no-issue-activity X-stale
Projects
None yet
Development

No branches or pull requests

3 participants