Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: aborted transactions can end up being used for prepare #80764

Closed
fqazi opened this issue Apr 29, 2022 · 17 comments
Closed

sql: aborted transactions can end up being used for prepare #80764

fqazi opened this issue Apr 29, 2022 · 17 comments
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions)

Comments

@fqazi
Copy link
Collaborator

fqazi commented Apr 29, 2022

Describe the problem

Please describe the issue you observed, and any steps we can take to reproduce it:

An assertion or other weird behaviour can be encountered because the Prepare command no longer starts a new implicit transaction if the current transaction is in an aborted state. As a result, users may notice regressions in certain scenarios, such as the example below.

To Reproduce
This was originally observed on the randomized schema change workload, and the exact sequence isn't known. A synthetic repro can be done with the following Go executable and modifying the code to intentionally close the transaction inside: execStmtInAbortedState (though I'm not sure if that is fully valid).

package main

import (
	"context"
	"fmt"
	"log"
	"os"

	"github.com/jackc/pgx/v4"
)

func main() {
	// Read in connection string
	conn, err := pgx.Connect(context.Background(), os.Getenv("DATABASE_URL"))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close(context.Background())

	conn2, err := pgx.Connect(context.Background(), os.Getenv("DATABASE_URL"))
	if err != nil {
		log.Fatal(err)
	}
	defer conn2.Close(context.Background())

	ctx := context.Background()
	tx2, err := conn2.Begin(ctx)
	_, err = tx2.Exec(ctx, "select * from t1;")
	fmt.Printf("%v\n", err)

	tx, err := conn.Begin(ctx)
	_, err = tx.Exec(ctx, "select * from t1;")
	fmt.Printf("%v\n", err)

	_, err = tx2.Exec(ctx, "insert into t1 values(5)")
	fmt.Printf("%v\n", err)

	_, err = tx.Exec(ctx, "select * from t1;")
	fmt.Printf("===%v\n", err)

	_, err = tx.Exec(ctx, "select * from t1;")
	fmt.Printf("===%v\n", err)

	_, err = conn.Prepare(ctx, "stmt1", "select * from [show regions]")
	fmt.Printf("%v\n", err)
}

Jira issue: CRDB-15526

@fqazi fqazi added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) branch-release-22.1 Used to mark GA and release blockers, technical advisories, and bugs for 22.1 labels Apr 29, 2022
@fqazi fqazi closed this as completed Apr 29, 2022
@fqazi fqazi reopened this Apr 29, 2022
@otan
Copy link
Contributor

otan commented May 1, 2022

An assertion or other weird behaviour can be encountered because the Prepare command no longer starts a new implicit transaction if the current transaction is in an aborted state

It's strange to me that this is a relied upon behaviour - do we want PREPARE to work in a failed txn? in PG, we definitely do not allow this to work:

otan=# begin;
BEGIN
otan=*# select * from t3333;
ERROR:  relation "t3333" does not exist
LINE 1: select * from t3333;
                      ^
otan=!# prepare e as select 1;
ERROR:  current transaction is aborted, commands ignored until end of transaction block
otan=!#

As a result, users may notice regressions in certain scenarios, such as the example below.

I'm not sure what the "error" that surprises you is, can you elaborate


I'm reading the commit message from dda2fa9, and to me it seems "deliberate" we no longer start an implicit txn. i'm kind of confused about why we wanted this to work in the first place.

overall, i'm not sure if i believe this is an issue - @fqazi can you elaborate based on the questions above? cheers!

@fqazi
Copy link
Collaborator Author

fqazi commented May 2, 2022

That error is acceptable, I noticed a case where we decided to proceed with a closed transaction and hit an assertion planning (checking authorization). Unfortunately, I haven't had luck reproducing, since it was in a randomized workload.

@otan
Copy link
Contributor

otan commented May 2, 2022

are we sure it is related to that commit? if so, did it come down to a bisect?

I noticed a case where we decided to proceed with a closed transaction and hit an assertion planning

is there a stack trace or anything more specific (e.g. precise error message) you could provide?

@otan
Copy link
Contributor

otan commented May 2, 2022

i'm also tempted to remove this as a release blocker if it's difficult to reproduce. it sounds like the assertion failure is a "safe case" to hit (i.e. won't cause data corruption or unavailability of the database as a whole), unless i am mistaken.

@fqazi
Copy link
Collaborator Author

fqazi commented May 2, 2022

We weren't able to confirm the commit, and even on randomized, I haven't hit it again. I'm okay with removing the release blocker from this, since it's not easy to hit (and probably some weird scenario). I only have a stale stack trace that is incomplete from when it was hit:

stack trace: github.com/cockroachdb/cockroach/pkg/sql/authorization.go:129: CheckPrivilegeForUser() github.com/cockroachdb/cockroach/pkg/sql/authorization.go:168: CheckPrivilege() github.com/cockroachdb/cockroach/pkg/sql/opt_catalog.go:301: CheckPrivilege() github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder/util.go:649: checkPrivilege() github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder/util.go:611: resolveDataSource() github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder/select.go:104: buildDataSource() github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder/select.go:60: buildDataSource() github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder/select.go:1155: buildFromTablesRightDeep() github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder/select.go:1132: buildFromTables() github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder/select.go:1059: buildFrom() github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder/select.go:980: buildSelectClause() github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder/select.go:928: buildSelectStmtWithoutParens() github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder/select.go:901: func1() github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder/with.go:115: processWiths() github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder/select.go:900: buildSelect() github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder/select.go:835: buildSelectStmt() github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder/select.go:148: buildDataSource() github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder/select.go:60: buildDataSource() github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder/join.go:36: buildJoin() github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder/select.go:72: buildDataSource() github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder/join.go:36: buildJoin() github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder/select.go:72: buildDataSource() github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder/select.go:1155: buildFromTablesRightDeep() github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder/select.go:1132: buildFromTables() github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder/select.go:1059: buildFrom() github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder/select.go:980: buildSelectClause() github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder/select.go:928: buildSelectStmtWithoutParens() github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder/select.go:901: func1() github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder/with.go:115: processWiths() github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder/select.go:900: buildSelect() github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder/builder.go:281: buildStmt() github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder/builder.go:241: buildStmtAtRoot()

@otan otan removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label May 2, 2022
@otan
Copy link
Contributor

otan commented May 2, 2022

hmm, that stack trace doesn't seem related to the prepare path. are we sure that's related?

@otan otan removed branch-release-22.1 Used to mark GA and release blockers, technical advisories, and bugs for 22.1 deprecated-branch-release-22.1.0 labels May 2, 2022
@fqazi
Copy link
Collaborator Author

fqazi commented May 2, 2022

It's partial, the bottom part was in Prepare. Pretty much it was a schema changer workload query with SHOW REGIONS being prepared

@fqazi
Copy link
Collaborator Author

fqazi commented May 5, 2022

I ran into this again, and the full-stack below, on a experimental branch with the schemachanger workload with simple selects enabled (there is an application bug here too) :

I220505 18:18:51.993656 6250 9@util/log/event_log.go:32  [n1,client=127.0.0.1:64763,user=testuser] 4942 ={"Timestamp":1651774731826830000,"EventType":"query_execute","Statement":"SELECT * FROM \"\".public.table36 AS t0","Tag":"SELECT","User":"testuser","ApplicationName":"schemachange","ExecMode":"exec","SQLSTATE":"40001","ErrorText":"TransactionRetryWithProtoRefreshError: TransactionAbortedError(ABORT_REASON_PUSHER_ABORTED): \"sql txn\" meta={id=eac6f725 key=/Table/24/1/4/123/0 pri=0.00840388 epo=0 ts=1651774731.827449000,1 min=1651774731.826830000,0 seq=23} lock=true stat=ABORTED rts=1651774731.826830000,0 wto=false gul=1651774732.326830000,0","Age":101.409,"FullTableScan":true,"TxnCounter":10}
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943  encountered internal error:
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +cannot use CheckPrivilege without a txn
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +(1) assertion failure
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +Wraps: (2) attached stack trace
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +  -- stack trace:
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +  | github.com/cockroachdb/cockroach/pkg/sql.(*planner).CheckPrivilegeForUser
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +  | 	/Users/faizanq/go/src/github.com/cockroachdb/cockroach/pkg/sql/authorization.go:129
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +  | github.com/cockroachdb/cockroach/pkg/sql.(*planner).CheckPrivilege
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +  | 	/Users/faizanq/go/src/github.com/cockroachdb/cockroach/pkg/sql/authorization.go:168
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +  | github.com/cockroachdb/cockroach/pkg/sql.(*optCatalog).CheckPrivilege
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +  | 	/Users/faizanq/go/src/github.com/cockroachdb/cockroach/pkg/sql/opt_catalog.go:301
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +  | github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder.(*Builder).checkPrivilege
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +  | 	/Users/faizanq/go/src/github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder/util.go:652
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +  | github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder.(*Builder).resolveDataSource
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +  | 	/Users/faizanq/go/src/github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder/util.go:614
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +  | github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder.(*Builder).buildDataSource
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +  | 	/Users/faizanq/go/src/github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder/select.go:105
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +  | github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder.(*Builder).buildDataSource
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +  | 	/Users/faizanq/go/src/github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder/select.go:61
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +  | github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder.(*Builder).buildFromTablesRightDeep
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +  | 	/Users/faizanq/go/src/github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder/select.go:1156
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +  | github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder.(*Builder).buildFromTables
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +  | 	/Users/faizanq/go/src/github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder/select.go:1133
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +  | github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder.(*Builder).buildFrom
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +  | 	/Users/faizanq/go/src/github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder/select.go:1060
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +  | github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder.(*Builder).buildSelectClause
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +  | 	/Users/faizanq/go/src/github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder/select.go:981
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +  | github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder.(*Builder).buildSelectStmtWithoutParens
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +  | 	/Users/faizanq/go/src/github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder/select.go:929
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +  | github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder.(*Builder).buildSelect.func1
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +  | 	/Users/faizanq/go/src/github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder/select.go:902
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +  | github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder.(*Builder).processWiths
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +  | 	/Users/faizanq/go/src/github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder/with.go:115
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +  | github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder.(*Builder).buildSelect
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +  | 	/Users/faizanq/go/src/github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder/select.go:901
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +  | github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder.(*Builder).buildStmt
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +  | 	/Users/faizanq/go/src/github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder/builder.go:282
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +  | github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder.(*Builder).buildStmt
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +  | 	/Users/faizanq/go/src/github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder/builder.go:374
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +  | github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder.(*Builder).buildDataSource
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +  | 	/Users/faizanq/go/src/github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder/select.go:164
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +  | github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder.(*Builder).buildDataSource
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +  | 	/Users/faizanq/go/src/github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder/select.go:61
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +  | github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder.(*Builder).buildFromTablesRightDeep
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +  | 	/Users/faizanq/go/src/github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder/select.go:1156
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +  | github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder.(*Builder).buildFromTables
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +  | 	/Users/faizanq/go/src/github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder/select.go:1133
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +  | github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder.(*Builder).buildFrom
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +  | 	/Users/faizanq/go/src/github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder/select.go:1060
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +  | github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder.(*Builder).buildSelectClause
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +  | 	/Users/faizanq/go/src/github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder/select.go:981
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +  | github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder.(*Builder).buildSelectStmtWithoutParens
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +  | 	/Users/faizanq/go/src/github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder/select.go:929
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +  | github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder.(*Builder).buildSelect.func1
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +  | 	/Users/faizanq/go/src/github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder/select.go:902
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +  | github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder.(*Builder).processWiths
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +  | 	/Users/faizanq/go/src/github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder/with.go:115
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +  | github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder.(*Builder).buildSelect
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +  | 	/Users/faizanq/go/src/github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder/select.go:901
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +  | github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder.(*Builder).buildStmt
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +  | 	/Users/faizanq/go/src/github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder/builder.go:282
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +  | github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder.(*Builder).buildStmtAtRoot
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +  | 	/Users/faizanq/go/src/github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder/builder.go:242
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +  | github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder.(*Builder).Build
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +  | 	/Users/faizanq/go/src/github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder/builder.go:216
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +  | github.com/cockroachdb/cockroach/pkg/sql.(*optPlanningCtx).buildReusableMemo
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +  | 	/Users/faizanq/go/src/github.com/cockroachdb/cockroach/pkg/sql/plan_opt.go:400
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +  | github.com/cockroachdb/cockroach/pkg/sql.(*planner).prepareUsingOptimizer
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +  | 	/Users/faizanq/go/src/github.com/cockroachdb/cockroach/pkg/sql/plan_opt.go:143
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +Wraps: (3) cannot use CheckPrivilege without a txn
E220505 18:18:51.994246 6250 sql/sqltelemetry/report.go:57  [n1,client=127.0.0.1:64763,user=testuser] 4943 +Error types: (1) *assert.withAssertionFailure (2) *withstack.withStack (3) *errutil.leafError
I220505 18:18:51.994794 6145 workload/pgx_helpers.go:79  [-] 4944  pgx logger [error]: Query logParams=map[args:[] err:ERROR: internal error: cannot use CheckPrivilege without a txn (SQLSTATE XX000) pid:2035951 sql:
I220505 18:18:51.994794 6145 workload/pgx_helpers.go:79  [-] 4944 +   SELECT sequence_schema, sequence_name
I220505 18:18:51.994794 6145 workload/pgx_helpers.go:79  [-] 4944 +     FROM [SHOW SEQUENCES]
I220505 18:18:51.994794 6145 workload/pgx_helpers.go:79  [-] 4944 +    WHERE sequence_name LIKE 'seq%%'
I220505 18:18:51.994794 6145 workload/pgx_helpers.go:79  [-] 4944 + ORDER BY random()
I220505 18:18:51.994794 6145 workload/pgx_helpers.go:79  [-] 4944 +		LIMIT 1;
I220505 18:18:51.994794 6145 workload/pgx_helpers.go:79  [-] 4944 +		]

@fqazi
Copy link
Collaborator Author

fqazi commented Aug 5, 2022

At least one of the causes is #85677, where we are swallowing transaction errors, so not really SQL's fault. I'll see what else shakes out with debug code

@rafiss
Copy link
Collaborator

rafiss commented Aug 11, 2022

Nice @fqazi. Can you please share your method for how you found that? This bug has been hard to track down, and the same problem could exist in more places, so it would be good if it was more widely understood how we can find these other places.

@fqazi
Copy link
Collaborator Author

fqazi commented Aug 11, 2022

We added tracing inside the workload, which gets dumped when any failure gets hit. This allows us to see where the transaction retry error was hit using heuristics (i.e. last query on the connection before the retry is hit). It's not perfect and specific to the workload, but maybe some lightweight tracing of internal executor uses that's always collected might be worth it for a similar family of errors.

@rafiss
Copy link
Collaborator

rafiss commented Aug 11, 2022

Are your workload tracing changes merged in, or custom? Can you please provide full instructions so I can do this myself?

@fqazi
Copy link
Collaborator Author

fqazi commented Aug 11, 2022

They aren't merged in yet, I'll get a PR for them but they are specific for the randomized schema changer workload (i.e. the client side code for the workload does this).

@rafiss
Copy link
Collaborator

rafiss commented Aug 11, 2022

You can just post a git diff here of the changes you made, then give me the commands to run.

@fqazi
Copy link
Collaborator Author

fqazi commented Aug 11, 2022

So, patch the workload with:

diff --git a/pkg/workload/schemachange/schemachange.go b/pkg/workload/schemachange/schemachange.go
index 93f3d1240e..e39ea75f91 100644
--- a/pkg/workload/schemachange/schemachange.go
+++ b/pkg/workload/schemachange/schemachange.go
@@ -397,10 +397,19 @@ func (w *schemaChangeWorker) runInTxn(ctx context.Context, tx pgx.Tx) error {
 }

 func (w *schemaChangeWorker) run(ctx context.Context) error {
-	tx, err := w.pool.Get().Begin(ctx)
+	conn, err := w.pool.Get().Acquire(ctx)
 	if err != nil {
 		return errors.Wrap(err, "cannot get a connection and begin a txn")
 	}
+	defer conn.Release()
+	_, err = conn.Exec(ctx, "SET tracing=off; SET tracing= kv,on")
+	if err != nil {
+		return errors.Wrap(err, "cannot enable trace")
+	}
+	tx, err := conn.Begin(ctx)
+	if err != nil {
+		return errors.Wrap(err, "cannot begin a txn")
+	}

 	// Release log entry locks if holding all.
 	defer w.releaseLocksIfHeld()
@@ -429,6 +438,18 @@ func (w *schemaChangeWorker) run(ctx context.Context) error {
 		switch {
 		case errors.Is(err, errRunInTxnFatalSentinel):
 			w.preErrorHook()
+			rows, qerr := conn.Query(ctx, "SELECT message from [SHOW TRACE FOR SESSION]")
+			if qerr == nil {
+				for rows.Next() {
+					var message string
+					qerr := rows.Scan(&message)
+					if qerr != nil {
+						panic(qerr)
+					}
+					fmt.Printf("KV TRACE : %s\n", message)
+				}
+				rows.Close()
+			}
 			return err
 		case errors.Is(err, errRunInTxnRbkSentinel):
 			// Rollbacks are acceptable because all unexpected errors will be
@@ -474,6 +495,19 @@ func (w *schemaChangeWorker) run(ctx context.Context) error {
 		// Check for any expected errors.
 		if !w.opGen.expectedCommitErrors.contains(pgcode.MakeCode(pgErr.Code)) &&
 			!w.opGen.potentialCommitErrors.contains(pgcode.MakeCode(pgErr.Code)) {
+			_ = tx.Rollback(ctx)
+			rows, qerr := conn.Query(ctx, "SELECT message from [SHOW TRACE FOR SESSION]")
+			if qerr == nil {
+				for rows.Next() {
+					var message string
+					qerr := rows.Scan(&message)
+					if qerr != nil {
+						panic(qerr)
+					}
+					fmt.Printf("KV TRACE : %s\n", message)
+				}
+				rows.Close()
+			}
 			err = errors.Mark(
 				errors.Wrapf(err, "***UNEXPECTED COMMIT ERROR; Received an unexpected commit error %s", w.getErrorState()),
 				errRunInTxnFatalSentinel,

You can execute the workload using: COCKROACH_DEV_LICENSE=$(curl -f "https://register.cockroachdb.com/api/prodtest") bin/roachtest run schemachange/random-load --count 8

@rafiss
Copy link
Collaborator

rafiss commented Aug 11, 2022

Great, thanks. I have a theory that #85996 will actually give us more helpful error messages. I'll try using your repro to find out.

craig bot pushed a commit that referenced this issue Aug 12, 2022
85996: sql: don't panic on auth check with unopen txn r=ajwerner,RichardJCai a=rafiss

Relates to #80764 and #82034

Ever since c00ea84 was merged, the KV
layer has disallowed use of an aborted txn. Therefore, the checks here
are no longer necessary.

This should actually help with debugging, since now if the aborted txn
is used, we should get back an error that has the reason for the abort
(or restart), instead of a panic that does not have that info.

Release note: None

Co-authored-by: Rafi Shamim <[email protected]>
craig bot pushed a commit that referenced this issue Sep 9, 2022
87614: sql/opt/norm: propagate kv errors from cast folding r=ajwerner a=ajwerner

Swallowing KV errors here leads to incorrect results. Writes can be missed and serializability can be silently violated. This comes up in the context of the randomized schema change testing.

May deal with #85677
relates to #80764

Release note: None

87646: backupccl: deflake backup tests r=adityamaru a=adityamaru

See individual commits.

Release note: None

87667: server: clean up some logging r=ajwerner a=ajwerner

For one, this fixes a format directive issue in the first line. It also stops rendering some arguments to strings directly so that redaction works correctly.

Release note: None

87702: kvserver: explain our use of (*raftGroup).ReportSnapshot r=andrewbaptist a=tbg

Closes #87581.

Release note: None


87713: ci: skip "integration"-style tests in testrace r=knz a=rickystewart

Closes #87700.
Release note: None

Co-authored-by: Andrew Werner <[email protected]>
Co-authored-by: adityamaru <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
@rafiss
Copy link
Collaborator

rafiss commented Oct 12, 2022

I'm going to close this as per the previous comment

@rafiss rafiss closed this as completed Oct 12, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions)
Projects
None yet
Development

No branches or pull requests

5 participants