roachtest: schemachange/index/tpcc/w=1000 failed #68958

Closed
cockroach-teamcity opened this issue Aug 14, 2021 · 3 comments
Labels: C-test-failure, O-roachtest, O-robot, release-blocker, T-sql-foundations

Comments

@cockroach-teamcity
Member

roachtest.schemachange/index/tpcc/w=1000 failed with artifacts on release-21.1 @ 22dad757f6f5ba0d0a10ce3ccdf9712e54cf1a56:

		  |   | main.main
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1889
		  |   | runtime.main
		  |   | 	/usr/local/go/src/runtime/proc.go:225
		  |   | runtime.goexit
		  |   | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		  | Wraps: (2) 3: dead (exit status 134)
		  | Error types: (1) *withstack.withStack (2) *errutil.leafError
		Wraps: (3) secondary error attachment
		  | 4: dead (exit status 134)
		  | (1) attached stack trace
		  |   -- stack trace:
		  |   | main.glob..func14
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1147
		  |   | main.wrap.func1
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:271
		  |   | github.com/spf13/cobra.(*Command).execute
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:830
		  |   | github.com/spf13/cobra.(*Command).ExecuteC
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:914
		  |   | github.com/spf13/cobra.(*Command).Execute
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:864
		  |   | main.main
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1889
		  |   | runtime.main
		  |   | 	/usr/local/go/src/runtime/proc.go:225
		  |   | runtime.goexit
		  |   | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		  | Wraps: (2) 4: dead (exit status 134)
		  | Error types: (1) *withstack.withStack (2) *errutil.leafError
		Wraps: (4) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1147
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:271
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:830
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:914
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:864
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1889
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:225
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		Wraps: (5) 2: dead (exit status 134)
		Error types: (1) errors.Unclassified (2) *secondary.withSecondaryError (3) *secondary.withSecondaryError (4) *withstack.withStack (5) *errutil.leafError
Reproduce

To reproduce, try:

# From https://go.crdb.dev/p/roachstress, perhaps edited lightly.
caffeinate ./roachstress.sh schemachange/index/tpcc/w=1000

Same failure on other branches

/cc @cockroachdb/sql-schema


cockroach-teamcity added the branch-release-21.1, C-test-failure, O-roachtest, O-robot, and release-blocker labels on Aug 14, 2021
cockroach-teamcity added this to the 21.1 milestone on Aug 14, 2021
blathers-crl bot added the T-sql-schema-deprecated label on Aug 14, 2021
@ajwerner
Contributor

Seems like the same deadlock as #68951.

ajwerner self-assigned this on Aug 17, 2021
craig bot pushed a commit that referenced this issue on Aug 19, 2021
69040: sql: fix deadlock when updating backfill progress r=ajwerner a=ajwerner

The root cause is that we acquired the mutex inside a transaction that also
laid down intents. This was not a problem in earlier iterations of this code
because the FOR UPDATE logic would, at least in theory, order the
transactions so that the first one to acquire the mutex was also the first to
lay down an intent, which avoided the deadlock by ordering the acquisitions.
That changed in #68244, which removed the FOR UPDATE.

What we see now is a transaction doing the progress update that hits a
restart after it has already laid down an intent. A second transaction, doing
a details update, then starts and acquires the mutex but blocks on the first
transaction's intent. The first transaction is now blocked on the mutex, and
we have a deadlock.
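
The cycle described above can be modeled as a small waits-for graph. This is only an illustrative sketch: the node names and the `hasCycle` helper are hypothetical and are not taken from the CockroachDB code. An edge means "is blocked on something held by".

```go
package main

import "fmt"

// hasCycle follows waits-for edges starting at start and reports whether
// they lead back to start. Each waiter waits on at most one holder here.
func hasCycle(waitsFor map[string]string, start string) bool {
	seen := map[string]bool{}
	for cur := start; ; {
		next, ok := waitsFor[cur]
		if !ok {
			return false // chain ends: cur is not blocked on anyone
		}
		if next == start {
			return true // walked back to where we began
		}
		if seen[next] {
			return false // a loop exists, but it does not include start
		}
		seen[next] = true
		cur = next
	}
}

func main() {
	// Hypothetical labels for the two transactions in the commit message.
	const progress = "progress-update txn"
	const details = "details-update txn"

	// The progress update left an intent and now waits for the mutex held
	// by the details update; the details update waits on that intent.
	deadlocked := map[string]string{
		progress: details, // blocked on the mutex
		details:  progress, // blocked on the intent
	}
	fmt.Println("mutex taken inside txn, cycle:", hasCycle(deadlocked, progress))

	// If no transaction ever waits on the mutex, the first edge disappears.
	noMutexInTxn := map[string]string{
		details: progress, // still waits on the intent, which resolves
	}
	fmt.Println("mutex not taken inside txn, cycle:", hasCycle(noMutexInTxn, details))
}
```

Removing either edge breaks the cycle; the sketch drops the mutex edge because the transactions no longer wait on the mutex once it is not acquired inside them.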

The solution is to not acquire the mutex inside these transactions.
Instead, the code copies out the relevant state before issuing the
transaction. The cost should be minimal, and staleness in the face of
retries is the least of my concerns.

No release note because the code in #68244 has never been released.

Touches #68951, #68958.

Release note: None

Co-authored-by: Andrew Werner <[email protected]>
@cockroach-teamcity
Member Author

roachtest.schemachange/index/tpcc/w=1000 failed with artifacts on release-21.1 @ c425111e138297bcacd1370cbe40263dd00e64ac:

The test failed on branch=release-21.1, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/schemachange/index/tpcc/w=1000/run_1
	test_runner.go:792: test timed out (6h0m0s)

	schemachange.go:478,schemachange.go:308,cluster.go:2666,errgroup.go:57: dial tcp 35.192.186.120:26257: connect: connection refused

	cluster.go:2688,tpcc.go:162,schemachange.go:302,test_runner.go:733: monitor failure: monitor task failed: t.Fatal() was called
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitor).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2676
		  | main.(*monitor).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2684
		  | main.runTPCC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:162
		  | main.makeIndexAddTpccTest.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/schemachange.go:302
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:733
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitor).wait.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2732
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | main.init
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2646
		  | runtime.doInit
		  | 	/usr/local/go/src/runtime/proc.go:6309
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:208
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		Wraps: (6) t.Fatal() was called
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError
Reproduce

To reproduce, try:

# From https://go.crdb.dev/p/roachstress, perhaps edited lightly.
caffeinate ./roachstress.sh schemachange/index/tpcc/w=1000

Same failure on other branches

/cc @cockroachdb/sql-schema


@ajwerner
Contributor

Should be fixed by #69130.

healthy-pod added the T-sql-foundations label and removed the T-sql-schema-deprecated label on May 17, 2023