
roachtest: kv50/enc=false/nodes=4/cpu=96/batch=64 failed [SendError bubbles up] #69419

Closed · cockroach-teamcity opened this issue Aug 26, 2021 · 25 comments · Fixed by #69961

Labels: branch-master (Failures and bugs on the master branch), C-test-failure (Broken test (automatically or manually discovered)), GA-blocker, O-roachtest, O-robot (Originated from a bot), T-sql-queries (SQL Queries Team)

@cockroach-teamcity

roachtest.kv50/enc=false/nodes=4/cpu=96/batch=64 failed with artifacts on master @ ab1fc343c9a1140191f96353995258e609a84d02:

		  | I210826 14:26:22.303165 1 workload/cli/run.go:367  [-] 1  DEPRECATION: the --init flag on "workload run" will no longer be supported after 19.2
		  | I210826 14:26:22.332804 1 workload/workloadsql/workloadsql.go:113  [-] 2  starting 1000 splits
		  | I210826 14:26:31.980697 643 workload/workloadsql/workloadsql.go:179  [-] 3  finished 1000 of 1000 splits
		  | I210826 14:26:31.981413 1 workload/cli/run.go:408  [-] 4  creating load generator...
		  | I210826 14:26:32.069848 1 workload/cli/run.go:439  [-] 5  creating load generator... done (took 88.434076ms)
		  | Error: ERROR: failed to send RPC: sending to all replicas failed; last error: routing information detected to be stale; lastErr: [NotLeaseHolderError] lease held by different store; r315: replica (n4,s4):3 not lease holder; current lease is repl=(n3,s3):1 seq=4 start=1629987984.434423504,48 epo=1 pro=1629987984.434423504,45 (SQLSTATE XXUUU)
		  | Error: COMMAND_PROBLEM: exit status 1
		  | (1) COMMAND_PROBLEM
		  | Wraps: (2) Node 5. Command with error:
		  |   | ``````
		  |   | ./workload run kv --init --histograms=perf/stats.json --concurrency=256 --splits=1000 --duration=10m0s --read-percent=50 --batch=64 {pgurl:1-4}
		  |   | ``````
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		Wraps: (4) exit status 20
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) *exec.ExitError

	monitor.go:128,kv.go:137,kv.go:246,test_runner.go:777: monitor failure: monitor task failed: t.Fatal() was called
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:116
		  | main.(*monitorImpl).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:124
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerKV.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/kv.go:137
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerKV.func3
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/kv.go:246
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:777
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:172
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | main.init
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:81
		  | runtime.doInit
		  | 	/usr/local/go/src/runtime/proc.go:6309
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:208
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		Wraps: (6) t.Fatal() was called
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError
Reproduce

See: roachtest README

/cc @cockroachdb/kv-triage


@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Aug 26, 2021
@tbg tbg removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Aug 26, 2021
@tbg tbg changed the title roachtest: kv50/enc=false/nodes=4/cpu=96/batch=64 failed roachtest: kv50/enc=false/nodes=4/cpu=96/batch=64 failed [SendError bubbles up] Aug 26, 2021
@cockroach-teamcity

roachtest.kv50/enc=false/nodes=4/cpu=96/batch=64 failed with artifacts on master @ 8cae60f603ccc4d83137167b3b31cab09be9d41a:

		  | I210827 11:06:56.083448 1 workload/cli/run.go:367  [-] 1  DEPRECATION: the --init flag on "workload run" will no longer be supported after 19.2
		  | I210827 11:06:56.110666 1 workload/workloadsql/workloadsql.go:113  [-] 2  starting 1000 splits
		  | I210827 11:07:04.125043 669 workload/workloadsql/workloadsql.go:179  [-] 3  finished 1000 of 1000 splits
		  | I210827 11:07:04.125758 1 workload/cli/run.go:408  [-] 4  creating load generator...
		  | I210827 11:07:04.217656 1 workload/cli/run.go:439  [-] 5  creating load generator... done (took 91.896941ms)
		  | Error: ERROR: failed to send RPC: sending to all replicas failed; last error: routing information detected to be stale; lastErr: [NotLeaseHolderError] lease held by different store; r212: replica (n1,s1):1 not lease holder; current lease is repl=(n2,s2):3 seq=5 start=1630062418.179889391,0 epo=1 pro=1630062418.179841008,0 (SQLSTATE XXUUU)
		  | Error: COMMAND_PROBLEM: exit status 1
		  | (1) COMMAND_PROBLEM
		  | Wraps: (2) Node 5. Command with error:
		  |   | ``````
		  |   | ./workload run kv --init --histograms=perf/stats.json --concurrency=256 --splits=1000 --duration=10m0s --read-percent=50 --batch=64 {pgurl:1-4}
		  |   | ``````
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		Wraps: (4) exit status 20
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) *exec.ExitError

	monitor.go:128,kv.go:137,kv.go:246,test_runner.go:777: monitor failure: monitor task failed: t.Fatal() was called
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:116
		  | main.(*monitorImpl).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:124
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerKV.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/kv.go:137
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerKV.func3
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/kv.go:246
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:777
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:172
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | main.init
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:81
		  | runtime.doInit
		  | 	/usr/local/go/src/runtime/proc.go:6309
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:208
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		Wraps: (6) t.Fatal() was called
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError
Reproduce

See: roachtest README

/cc @cockroachdb/kv-triage


@cockroach-teamcity

roachtest.kv50/enc=false/nodes=4/cpu=96/batch=64 failed with artifacts on master @ 44ea1fa0eba8fc78544700ef4afded62ab98a021:

		  | I210828 11:29:24.947587 1 workload/cli/run.go:367  [-] 1  DEPRECATION: the --init flag on "workload run" will no longer be supported after 19.2
		  | I210828 11:29:24.976838 1 workload/workloadsql/workloadsql.go:113  [-] 2  starting 1000 splits
		  | I210828 11:29:33.529682 653 workload/workloadsql/workloadsql.go:179  [-] 3  finished 1000 of 1000 splits
		  | I210828 11:29:33.530441 1 workload/cli/run.go:408  [-] 4  creating load generator...
		  | I210828 11:29:33.609252 1 workload/cli/run.go:439  [-] 5  creating load generator... done (took 78.810482ms)
		  | Error: ERROR: failed to send RPC: sending to all replicas failed; last error: routing information detected to be stale; lastErr: [NotLeaseHolderError] lease held by different store; r507: replica (n1,s1):1 not lease holder; current lease is repl=(n4,s4):3 seq=4 start=1630150168.631056505,0 epo=1 pro=1630150168.630986488,0 (SQLSTATE XXUUU)
		  | Error: COMMAND_PROBLEM: exit status 1
		  | (1) COMMAND_PROBLEM
		  | Wraps: (2) Node 5. Command with error:
		  |   | ``````
		  |   | ./workload run kv --init --histograms=perf/stats.json --concurrency=256 --splits=1000 --duration=10m0s --read-percent=50 --batch=64 {pgurl:1-4}
		  |   | ``````
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		Wraps: (4) exit status 20
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) *exec.ExitError

	monitor.go:128,kv.go:137,kv.go:246,test_runner.go:777: monitor failure: monitor task failed: t.Fatal() was called
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:116
		  | main.(*monitorImpl).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:124
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerKV.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/kv.go:137
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerKV.func3
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/kv.go:246
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:777
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:172
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | main.init
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:81
		  | runtime.doInit
		  | 	/usr/local/go/src/runtime/proc.go:6309
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:208
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		Wraps: (6) t.Fatal() was called
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError
Reproduce

See: roachtest README

/cc @cockroachdb/kv-triage


@tbg

tbg commented Aug 28, 2021

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x28f1142]

goroutine 34401 [running]:
panic(0x4ab4b60, 0x8ebc610)
	/usr/local/go/src/runtime/panic.go:1065 +0x565 fp=0xc00b231c18 sp=0xc00b231b50 pc=0x48b265
runtime.panicmem()
	/usr/local/go/src/runtime/panic.go:212 +0x5b fp=0xc00b231c38 sp=0xc00b231c18 pc=0x4892bb
runtime.sigpanic()
	/usr/local/go/src/runtime/signal_unix.go:734 +0x173 fp=0xc00b231c70 sp=0xc00b231c38 pc=0x4a2fb3
github.com/cockroachdb/cockroach/pkg/sql/colexecop.Closers.CloseAndLogOnErr(0xc00b987c80, 0x1, 0x1, 0x69031d8, 0xc00c18e1b0, 0xc010a11500, 0x26)
	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/sql/colexecop/operator.go:157 +0xc2 fp=0xc00b231d08 sp=0xc00b231c70 pc=0x28f1142
github.com/cockroachdb/cockroach/pkg/sql/colexec.(*ParallelUnorderedSynchronizer).init.func1.1(0xc00c18e1e0, 0xc003bfc300, 0x7f30d567e4f0, 0xc002de6340, 0xc00b987cc0, 0x1, 0x1, 0xc00b987c60, 0x1, 0x1, ...)
	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/sql/colexec/parallel_unordered_synchronizer.go:227 +0x9e fp=0xc00b231d50 sp=0xc00b231d08 pc=0x2cddb9e
github.com/cockroachdb/cockroach/pkg/sql/colexec.(*ParallelUnorderedSynchronizer).init.func1(0xc003bfc300, 0x7f30d567e4f0, 0xc002de6340, 0xc00b987cc0, 0x1, 0x1, 0xc00b987c60, 0x1, 0x1, 0xc00b987c80, ...)
	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/sql/colexec/parallel_unordered_synchronizer.go:278 +0xc19 fp=0xc00b231f78 sp=0xc00b231d50 pc=0x2cde8b9
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:1371 +0x1 fp=0xc00b231f80 sp=0xc00b231f78 pc=0x4c5cc1
created by github.com/cockroachdb/cockroach/pkg/sql/colexec.(*ParallelUnorderedSynchronizer).init
	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/sql/colexec/parallel_unordered_synchronizer.go:216 +0x149

@blathers-crl blathers-crl bot added the T-sql-queries SQL Queries Team label Aug 28, 2021
@tbg

tbg commented Aug 28, 2021

I do wonder why this wasn't picked up by the dead node detector:

teardown: 14:26:32 cluster.go:519: > /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-3359107-1629959017-96-n5cpu96 --oneshot --ignore-empty-nodes

@cockroach-teamcity

roachtest.kv50/enc=false/nodes=4/cpu=96/batch=64 failed with artifacts on master @ 6700cf65e82a029a1ff76a75250c709aa9637170:

		  | I210829 13:31:16.325701 1 workload/cli/run.go:367  [-] 1  DEPRECATION: the --init flag on "workload run" will no longer be supported after 19.2
		  | I210829 13:31:16.350689 1 workload/workloadsql/workloadsql.go:113  [-] 2  starting 1000 splits
		  | I210829 13:31:25.114599 643 workload/workloadsql/workloadsql.go:179  [-] 3  finished 1000 of 1000 splits
		  | I210829 13:31:25.115458 1 workload/cli/run.go:408  [-] 4  creating load generator...
		  | I210829 13:31:25.197668 1 workload/cli/run.go:439  [-] 5  creating load generator... done (took 82.208305ms)
		  | Error: ERROR: failed to send RPC: sending to all replicas failed; last error: routing information detected to be stale; lastErr: [NotLeaseHolderError] lease held by different store; r1038: replica (n3,s3):1 not lease holder; current lease is repl=(n1,s1):4 seq=6 start=1630243884.173452106,0 epo=1 pro=1630243884.173374808,0 (SQLSTATE XXUUU)
		  | Error: COMMAND_PROBLEM: exit status 1
		  | (1) COMMAND_PROBLEM
		  | Wraps: (2) Node 5. Command with error:
		  |   | ``````
		  |   | ./workload run kv --init --histograms=perf/stats.json --concurrency=256 --splits=1000 --duration=10m0s --read-percent=50 --batch=64 {pgurl:1-4}
		  |   | ``````
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		Wraps: (4) exit status 20
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) *exec.ExitError

	monitor.go:128,kv.go:137,kv.go:246,test_runner.go:777: monitor failure: monitor task failed: t.Fatal() was called
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:116
		  | main.(*monitorImpl).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:124
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerKV.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/kv.go:137
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerKV.func3
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/kv.go:246
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:777
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:172
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | main.init
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:81
		  | runtime.doInit
		  | 	/usr/local/go/src/runtime/proc.go:6309
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:208
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		Wraps: (6) t.Fatal() was called
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError
Reproduce

See: roachtest README

/cc @cockroachdb/kv-triage


@yuzefovich

yuzefovich commented Aug 30, 2021

After fixing the issue with not waiting for concurrent goroutines to exit in the vectorized engine, I now see the following when running this roachtest:

| I210830 21:22:15.612671 1 workload/workloadsql/workloadsql.go:113  [-] 2  starting 1000 splits
| I210830 21:22:23.505297 642 workload/workloadsql/workloadsql.go:179  [-] 3  finished 1000 of 1000 splits
| I210830 21:22:23.505986 1 workload/cli/run.go:408  [-] 4  creating load generator...
| I210830 21:22:23.599384 1 workload/cli/run.go:439  [-] 5  creating load generator... done (took 93.397537ms)
| Error: ERROR: failed to send RPC: sending to all replicas failed; last error: routing information detected to be stale; lastErr: [NotLeaseHolderError] lease held by different store; r338: replica (n3,s3):1 not lease holder; current lease is repl=(n4,s4):2 seq=5 start=1630358538.207335498,0 epo=1 pro=1630358538.207254835,0 (SQLSTATE XXUUU)

As a bit of background, in the recently merged #68679 we introduced parallelism of TableReaders when the flow is local. The fact that this roachtest uses batches satisfies the conditions for triggering that parallelism. Under the hood, we use DistSQLPlanner.PartitionSpans to split up the spans to scan, with the help of the RangeIterator coming from the DistSender. This error seems to occur very quickly, probably when the range cache hasn't been populated yet. So I'm wondering whether we need to disable this TableReader parallelism until the range cache is populated, or something like that? Is there a way to check that easily? @andreimatei maybe you would know?

@yuzefovich

Hm, #69612 in combination with the following diff

diff --git a/pkg/workload/workloadsql/workloadsql.go b/pkg/workload/workloadsql/workloadsql.go
index ab57432d4f..2d22140488 100644
--- a/pkg/workload/workloadsql/workloadsql.go
+++ b/pkg/workload/workloadsql/workloadsql.go
@@ -183,7 +183,9 @@ func Split(ctx context.Context, db *gosql.DB, table workload.Table, concurrency
                        }
                }
                close(splitCh)
-               return nil
+               log.Info(ctx, "populating the range cache")
+               _, err := db.Exec(fmt.Sprintf(`SELECT count(*) FROM %s`, table.Name))
+               return err
        })
        return g.Wait()
 }

makes the roachtest pass.

@yuzefovich

Never mind, two commits in #69612 seem to be sufficient.

@cockroach-teamcity

roachtest.kv50/enc=false/nodes=4/cpu=96/batch=64 failed with artifacts on master @ c1ef81f5f435b3cc5bdf8b218532e0779f03a6bf:

		  | I210831 14:31:16.514176 1 workload/cli/run.go:367  [-] 1  DEPRECATION: the --init flag on "workload run" will no longer be supported after 19.2
		  | I210831 14:31:16.544727 1 workload/workloadsql/workloadsql.go:113  [-] 2  starting 1000 splits
		  | I210831 14:31:24.446707 626 workload/workloadsql/workloadsql.go:179  [-] 3  finished 1000 of 1000 splits
		  | I210831 14:31:24.447385 1 workload/cli/run.go:408  [-] 4  creating load generator...
		  | I210831 14:31:24.537475 1 workload/cli/run.go:439  [-] 5  creating load generator... done (took 90.088872ms)
		  | Error: ERROR: failed to send RPC: sending to all replicas failed; last error: routing information detected to be stale; lastErr: [NotLeaseHolderError] lease held by different store; r166: replica (n4,s4):1 not lease holder; current lease is repl=(n3,s3):3 seq=4 start=1630420277.366149094,0 epo=1 pro=1630420277.366087670,0 (SQLSTATE XXUUU)
		  | Error: COMMAND_PROBLEM: exit status 1
		  | (1) COMMAND_PROBLEM
		  | Wraps: (2) Node 5. Command with error:
		  |   | ``````
		  |   | ./workload run kv --init --histograms=perf/stats.json --concurrency=256 --splits=1000 --duration=10m0s --read-percent=50 --batch=64 {pgurl:1-4}
		  |   | ``````
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		Wraps: (4) exit status 20
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) *exec.ExitError

	monitor.go:128,kv.go:137,kv.go:246,test_runner.go:777: monitor failure: monitor task failed: t.Fatal() was called
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:116
		  | main.(*monitorImpl).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:124
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerKV.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/kv.go:137
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerKV.func3
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/kv.go:246
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:777
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:172
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | main.init
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:81
		  | runtime.doInit
		  | 	/usr/local/go/src/runtime/proc.go:6309
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:208
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		Wraps: (6) t.Fatal() was called
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError
Reproduce

See: roachtest README

/cc @cockroachdb/kv-triage


@yuzefovich

Never mind again, #69612 doesn't address the NotLeaseHolderErrors even after explicitly populating the range cache. I also ran into ERROR: result is ambiguous (context canceled) (SQLSTATE 40003).

I'm pretty sure that #68679 is to blame for these issues, so I'm adding a GA-blocker label.

@yuzefovich

My current guess is that because we're canceling the context of DistSender.Send once we no longer need the results of that scan, the context cancelation is surfaced back up as these different KV layer errors, but we swallow only the context cancellations at the parallel unordered synchronizer level. As a result, we incorrectly propagate the errors as query errors while we should have swallowed them.
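To make the suspected mechanism concrete, here is a minimal, self-contained Go sketch of the pattern I'm describing (the names are made up for illustration and are not the actual colexec code): an input's context is canceled eagerly, and the resulting error is swallowed only if it still looks like a context cancellation.

package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// runInput stands in for one synchronizer input goroutine: it blocks until
// its context is canceled and then returns the context's error.
func runInput(ctx context.Context) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-time.After(time.Second):
		return nil
	}
}

func main() {
	flowCtx := context.Background()                      // the synchronizer's main context
	inputCtx, cancelInput := context.WithCancel(flowCtx) // per-input context we may cancel eagerly

	// The synchronizer transitions to draining and eagerly cancels the input
	// because its results are no longer needed.
	cancelInput()
	err := runInput(inputCtx)

	// Swallow the error only if it plausibly resulted from our own eager
	// cancellation: the main context is still alive and the error is a plain
	// context cancellation. Anything else is propagated as a query error.
	if err != nil && flowCtx.Err() == nil && errors.Is(err, context.Canceled) {
		fmt.Println("swallowed:", err)
		return
	}
	fmt.Println("propagated as query error:", err)
}

The failure mode, then, is that the KV layer can translate our cancellation into a different error (here, the SendError carrying a NotLeaseHolderError), which fails the errors.Is check and therefore leaks out as a query error.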

craig bot pushed a commit that referenced this issue Aug 31, 2021
68983: backupccl: stop including historical databases in cluster backup Descs r=adityamaru a=pbardea

A previous commit attempted to fix a bug where cluster backup would not
include tables in dropped databases between incremental backups. That
fix aimed to find dropped databases and add them to the set of
descriptors. However, this causes issues when a database is recreated
with the same name.

Rather than adding the dropped DBs to the Descriptors field on the
backup manifest, this commit updates how DescriptorChanges are populated
for cluster backups with revision history. Now, the initial scan of
descriptors as of the start time will look for all descriptors in the
cluster rather than just those that were resolved as of the end time of
the backup.

Release note (bug fix): Fix a bug where cluster revision-history backups
may have included dropped descriptors in the "current" snapshot of
descriptors on the cluster.

Release justification: bug fix. Fix a bug where cluster revision-history backups
may have included dropped descriptors in the "current" snapshot of
descriptors on the cluster.

69378: bazel,ci: propagate `TEST_TMPDIR` down to go tests and capture artifacts r=jlinder a=rickystewart

This fulfills a long-standing request to capture left-over artifacts in
`TMPDIR` (see #59045 (comment)).

Bazel sets the `TEST_TMPDIR` environment variable for the temporary
directory and expects all tests to write temporary files to that
directory. In our Go tests, however, we consult the `TMPDIR` environment
variable to find that directory. So we pull in a custom change to
`rules_go` to copy `TEST_TMPDIR` to `TMPDIR`. Update `.bazelrc` to use
`/artifacts/tmp` as the `TEST_TMPDIR`.

Closes #59045.
Closes #69372.

Release justification: Non-production code change
Release note: None

69612: colflow: propagate concurrency info from vectorized to FlowBase r=yuzefovich a=yuzefovich

**colflow: propagate concurrency info from vectorized to FlowBase**

We've recently merged a change to introduce concurrency in the local
flows. Those new concurrent goroutines are started by the vectorized
parallel unordered synchronizer, and `FlowBase` isn't aware of them; as
a result, `FlowBase.Wait` currently might not wait for all goroutines to
exit (which is an optimization when there are no concurrent goroutines).
This commit fixes the problem by propagating the information from the
vectorized flow to the FlowBase.

Addresses: #69419.

Release note: None (no stable release with this bug)

Release justification: bug fix to new functionality.

**sql: loosen up the physical planning of parallel scans**

This commit makes it so that if we try to plan parallel TableReaders but
encounter an error during planning, we fall back to a single TableReader.

Release note: None

Release justification: bug fix to new functionality.

Co-authored-by: Paul Bardea <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
@yuzefovich

Adding this diff:

diff --git a/pkg/sql/colexec/parallel_unordered_synchronizer.go b/pkg/sql/colexec/parallel_unordered_synchronizer.go
index 8cfb01c1d7..208679d1af 100644
--- a/pkg/sql/colexec/parallel_unordered_synchronizer.go
+++ b/pkg/sql/colexec/parallel_unordered_synchronizer.go
@@ -13,6 +13,7 @@ package colexec
 import (
        "context"
        "fmt"
+       "strings"
        "sync"
        "sync/atomic"
 
@@ -250,7 +251,9 @@ func (s *ParallelUnorderedSynchronizer) init() {
                                case parallelUnorderedSynchronizerStateRunning:
                                        if err := colexecerror.CatchVectorizedRuntimeError(s.nextBatch[inputIdx]); err != nil {
                                                if s.Ctx.Err() == nil && s.cancelLocalInput[inputIdx] != nil {
-                                                       if errors.Is(err, context.Canceled) || errors.Is(err, cancelchecker.QueryCanceledError) {
+                                                       if strings.Contains(err.Error(), context.Canceled.Error()) ||
+                                                               errors.Is(err, cancelchecker.QueryCanceledError) ||
+                                                               strings.Contains(err.Error(), "routing information detected to be stale") {
                                                                // The input context has been canceled, yet the
                                                                // main context of the synchronizer has not.
                                                                // This indicates that the synchronizer has

seems to make the roachtest pass, but this approach feels wrong.
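As a side note on why string matching ends up being needed at all: errors.Is only works while the cancellation is propagated via error wrapping; once some component flattens it into a message or converts it into a different error type, the identity is lost. A standalone illustration (plain Go, not CockroachDB code):

package main

import (
	"context"
	"errors"
	"fmt"
	"strings"
)

func main() {
	// An error that wraps context.Canceled keeps its identity...
	wrapped := fmt.Errorf("failed to send RPC: %w", context.Canceled)
	fmt.Println(errors.Is(wrapped, context.Canceled)) // true

	// ...but once the cancellation is flattened into a new error's message
	// (or replaced by a different error entirely, as with the SendError
	// above), the wrapping chain is broken and only the text, if that, remains.
	flattened := fmt.Errorf("failed to send RPC: %s", context.Canceled)
	fmt.Println(errors.Is(flattened, context.Canceled))                        // false
	fmt.Println(strings.Contains(flattened.Error(), context.Canceled.Error())) // true
}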

@andreimatei @nvanbenschoten do you think it's possible that canceling the context passed into DistSender.Send would result in a routing information detected to be stale; lastErr: [NotLeaseHolderError] lease held by different store error?

@cockroach-teamcity

roachtest.kv50/enc=false/nodes=4/cpu=96/batch=64 failed with artifacts on master @ 15b773c71f92d643795e34c922717fde0447f9cd:

		  | I210901 15:40:29.912244 1 workload/cli/run.go:367  [-] 1  DEPRECATION: the --init flag on "workload run" will no longer be supported after 19.2
		  | I210901 15:40:29.937446 1 workload/workloadsql/workloadsql.go:113  [-] 2  starting 1000 splits
		  | I210901 15:40:38.492532 629 workload/workloadsql/workloadsql.go:179  [-] 3  finished 1000 of 1000 splits
		  | I210901 15:40:38.493199 1 workload/cli/run.go:408  [-] 4  creating load generator...
		  | I210901 15:40:38.599071 1 workload/cli/run.go:439  [-] 5  creating load generator... done (took 105.87075ms)
		  | Error: ERROR: failed to send RPC: sending to all replicas failed; last error: routing information detected to be stale; lastErr: [NotLeaseHolderError] lease held by different store; r468: replica (n4,s4):2 not lease holder; current lease is repl=(n3,s3):1 seq=6 start=1630510833.262891633,0 epo=1 pro=1630510833.262821066,0 (SQLSTATE XXUUU)
		  | Error: COMMAND_PROBLEM: exit status 1
		  | (1) COMMAND_PROBLEM
		  | Wraps: (2) Node 5. Command with error:
		  |   | ``````
		  |   | ./workload run kv --init --histograms=perf/stats.json --concurrency=256 --splits=1000 --duration=10m0s --read-percent=50 --batch=64 {pgurl:1-4}
		  |   | ``````
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		Wraps: (4) exit status 20
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) *exec.ExitError

	monitor.go:128,kv.go:137,kv.go:246,test_runner.go:777: monitor failure: monitor task failed: t.Fatal() was called
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:116
		  | main.(*monitorImpl).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:124
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerKV.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/kv.go:137
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerKV.func3
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/kv.go:246
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:777
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:172
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | main.init
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:81
		  | runtime.doInit
		  | 	/usr/local/go/src/runtime/proc.go:6309
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:208
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		Wraps: (6) t.Fatal() was called
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError
Reproduce

See: roachtest README

/cc @cockroachdb/kv-triage



@cockroach-teamcity

roachtest.kv50/enc=false/nodes=4/cpu=96/batch=64 failed with artifacts on master @ 42e5f9492d0d8d93638241303bca984fe78baae3:

		  | I210903 14:51:57.080951 1 workload/workloadsql/workloadsql.go:113  [-] 2  starting 1000 splits
		  | I210903 14:52:05.094220 643 workload/workloadsql/workloadsql.go:179  [-] 3  finished 1000 of 1000 splits
		  | I210903 14:52:05.094951 1 workload/cli/run.go:408  [-] 4  creating load generator...
		  | I210903 14:52:05.208300 1 workload/cli/run.go:439  [-] 5  creating load generator... done (took 113.347772ms)
		  | I210903 14:52:05.395935 2299 workload/pgx_helpers.go:61  [-] 6  pgx logger [error]: Query logParams=map[args:[-8109847698309837236 5058209292518520260 -5530147469381035334 1747033619382914294 7017418739005512875 -2358014717237621985 -8302341603210547287 -2028468248873085622 -1895163067150467034 -8013597603978460706 8801482429265706073 6351514490197923581 -2637915090719830676 -1045026769944542557 -613598276713862968 7723899643837246137 2776783857895378806 3600706713457807552 4447144560849393602 -6512640214809618088 1888151937232397451 3841920450183009781 8051861797983870891 2079515034499768626 3610234603118268281 -481136611672676824 3262028307754469298 -2025140229427276246 8855883982716194614 -2630471628884419953 772833911832898420 6364432287464617598 5587612854538332705 1127145598566479692 7722713035770714410 -4216484208988152898 8925321881446340715 -1298631130249111985 2852204313533531139 8310639806364282907 1407488118307745822 -2084406318820195014 4204800570268765929 2788087087473709443 6959989646779182367 5495351232278076469 6841592627956384857 -4759154306198077991 -7663701253196757745 6186919594556665391 -3949180708444757075 -906541736525164864 1145222059585816693 -5694780525921356753 -6416118184453982139 -1240723466460358343 -4987063972369326684 -1865991297969911880 -5072606873133083385 4743593246572236074 -6572867610188178040 -5039019332693993983 5295881919699410254 -8721997438799952762] err:ERROR: failed to send RPC: sending to all replicas failed; last error: routing information detected to be stale; lastErr: [NotLeaseHolderError] lease held by different store; r448: replica (n3,s3):1 not lease holder; current lease is repl=(n2,s2):3 seq=7 start=1630680720.327323796,0 epo=1 pro=1630680720.327256939,0 (SQLSTATE XXUUU) sql:SELECT k, v FROM kv WHERE k IN ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, $15, $16, $17, $18, $19, $20, $21, $22, $23, $24, $25, $26, $27, $28, $29, $30, $31, $32, $33, $34, $35, $36, $37, $38, $39, $40, $41, $42, $43, $44, $45, $46, $47, $48, $49, $50, $51, $52, $53, $54, $55, $56, $57, $58, $59, $60, $61, $62, $63, $64)]
		  | Error: ERROR: failed to send RPC: sending to all replicas failed; last error: routing information detected to be stale; lastErr: [NotLeaseHolderError] lease held by different store; r448: replica (n3,s3):1 not lease holder; current lease is repl=(n2,s2):3 seq=7 start=1630680720.327323796,0 epo=1 pro=1630680720.327256939,0 (SQLSTATE XXUUU)
		  | Error: COMMAND_PROBLEM: exit status 1
		  | (1) COMMAND_PROBLEM
		  | Wraps: (2) Node 5. Command with error:
		  |   | ``````
		  |   | ./workload run kv --init --histograms=perf/stats.json --concurrency=256 --splits=1000 --duration=10m0s --read-percent=50 --batch=64 {pgurl:1-4}
		  |   | ``````
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		Wraps: (4) exit status 20
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) *exec.ExitError

	monitor.go:128,kv.go:137,kv.go:246,test_runner.go:777: monitor failure: monitor task failed: t.Fatal() was called
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:116
		  | main.(*monitorImpl).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:124
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerKV.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/kv.go:137
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerKV.func3
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/kv.go:246
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:777
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:172
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | main.init
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:81
		  | runtime.doInit
		  | 	/usr/local/go/src/runtime/proc.go:6309
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:208
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		Wraps: (6) t.Fatal() was called
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError
Reproduce

See: roachtest README

/cc @cockroachdb/kv-triage


@cockroach-teamcity

roachtest.kv50/enc=false/nodes=4/cpu=96/batch=64 failed with artifacts on master @ 44c95054cfe0108df872d2cda7952fc92259426b:

		Wraps: (3) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-3405042-1630737365-89-n5cpu96:5 -- ./workload run kv --init --histograms=perf/stats.json --concurrency=256 --splits=1000 --duration=10m0s --read-percent=50 --batch=64 {pgurl:1-4} returned
		  | stderr:
		  | <... some data truncated by circular buffer; go to artifacts for details ...>
		  | 733820865 4046944035560929828 6070908395623657654 2942196773074556590 8773041021791496999 -5493691655833192555 5962480598186018088 1102668627022765224 4292575886556322331 377068516752503928 -3272524856573513081 3223092315987265925 1102668627022765224 6802633057167832302 8925806572076168886 3502653446166586085 -4034917596375771333 8396930428156549772 1620364530033403653 -2159962942925931739 1227531418922043807 6544273765834556728 8546687735915910491 -7209343669337677990 -630465615343209823 5089803903920779410 -4979155856875104342 -5632183279726515516 -2835301272358696229 -5787296814828132664 -8649381443682749401 2375599334895466311 -273830766452389596 -5498239006426617644 -7242230885120580315 -889002919094793525 3431085442219188202 5889335988326965496 -3420969055647742775 -963557543536388234 6642769424582831316 -1743817264288016029 7982584732665993784 -1009326739606503453 2066883589368775255 -7525823887027989948 -8497325453888236975 102327407294776684 7628499505306977905 5257282432382128057 7194227082981248111 -3954668184493435737 3279825110980478220] err:ERROR: failed to send RPC: sending to all replicas failed; last error: routing information detected to be stale; lastErr: [NotLeaseHolderError] lease held by different store; r111: replica (n1,s1):3 not lease holder; current lease is repl=(n2,s2):2 seq=4 start=1630765375.452377755,0 epo=1 pro=1630765375.452325159,0 (SQLSTATE XXUUU) sql:SELECT k, v FROM kv WHERE k IN ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, $15, $16, $17, $18, $19, $20, $21, $22, $23, $24, $25, $26, $27, $28, $29, $30, $31, $32, $33, $34, $35, $36, $37, $38, $39, $40, $41, $42, $43, $44, $45, $46, $47, $48, $49, $50, $51, $52, $53, $54, $55, $56, $57, $58, $59, $60, $61, $62, $63, $64)]
		  | Error: ERROR: failed to send RPC: sending to all replicas failed; last error: routing information detected to be stale; lastErr: [NotLeaseHolderError] lease held by different store; r111: replica (n1,s1):3 not lease holder; current lease is repl=(n2,s2):2 seq=4 start=1630765375.452377755,0 epo=1 pro=1630765375.452325159,0 (SQLSTATE XXUUU)
		  | I210904 14:23:02.637708 2371 workload/pgx_helpers.go:61  [-] 7  pgx logger [error]: Exec logParams=map[args:[2561073701416371627 90 4394430385590768920 d1 2038786149488949527 4a 7444685391567129595 ba 5731955079637502568 d5 2987128428354613825 53 -6853836693695608360 10 -1755062787442207086 8f -7174753624286148914 b2 -3416362768297508362 e7 4885831952695097172 71 -7662193081186707917 c9 4232917050777471882 c2 6402228316180964499 da -6340660242868254583 9a -1471260237201352016 c6 -566488498119967953 c1 2381569473086943737 8a -3443352120482987687 1b -5082079551409648511 e0 467673683334578805 c6 8182266221166879202 f1 -2170282077710179897 e2 2219193035334649005 75 -877878578583281887 ef 5497888786842805695 cd 6111857553154093413 9c 6832231271761834081 0e -6273144411940669536 c7 6221935957347956465 4e 7838440216712476111 27 -2854059193622990065 b4 -745929879395053740 85 8792555902456196405 cc -441765554037163994 2f -7592572776128115763 08 9175574604365100051 c8 7166709946407824376 a3 -6597833247781245289 75 -1074705031797818057 5d -795925976360375207 7a -4833238135999364466 9a -2650526187185895540 7c -7211202963562638898 69 5715208387658537354 d4 7724273414095420673 25 -45005500217657165 dd 2806986467104644461 64 -8329214110808503016 04 3494341979509874429 a0 -7740402916908212866 40 -7883403306776991654 76 3221750883529035384 02 31496448019748172 2b 3120776656738584027 7f 2973204647534974588 1f 3087193455591742131 95 -4109659345614502861 a7 8246912348201669694 26 4853315007547851290 51 4508258004581146854 2d 4369420957432840006 1c -3689250316088258980 30 3752429975688659990 b7] err:timeout: context canceled sql:kv-2]
		  | Error: COMMAND_PROBLEM: exit status 1
		  | (1) COMMAND_PROBLEM
		  | Wraps: (2) Node 5. Command with error:
		  |   | ``````
		  |   | ./workload run kv --init --histograms=perf/stats.json --concurrency=256 --splits=1000 --duration=10m0s --read-percent=50 --batch=64 {pgurl:1-4}
		  |   | ``````
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		Wraps: (4) exit status 20
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) *exec.ExitError

	monitor.go:128,kv.go:137,kv.go:246,test_runner.go:777: monitor failure: monitor task failed: t.Fatal() was called
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:116
		  | main.(*monitorImpl).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:124
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerKV.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/kv.go:137
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerKV.func3
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/kv.go:246
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:777
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:172
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | main.init
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:81
		  | runtime.doInit
		  | 	/usr/local/go/src/runtime/proc.go:6309
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:208
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		Wraps: (6) t.Fatal() was called
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError
Reproduce

See: roachtest README

/cc @cockroachdb/kv-triage


@cockroach-teamcity

roachtest.kv50/enc=false/nodes=4/cpu=96/batch=64 failed with artifacts on master @ e369d86b3bac2b9f40d760c3e6e49f55b7e15abe:

		  | I210907 14:45:00.144369 1 workload/workloadsql/workloadsql.go:113  [-] 2  starting 1000 splits
		  | I210907 14:45:08.235040 660 workload/workloadsql/workloadsql.go:179  [-] 3  finished 1000 of 1000 splits
		  | I210907 14:45:08.235732 1 workload/cli/run.go:408  [-] 4  creating load generator...
		  | I210907 14:45:08.326980 1 workload/cli/run.go:439  [-] 5  creating load generator... done (took 91.246993ms)
		  | I210907 14:45:08.607567 2110 workload/pgx_helpers.go:61  [-] 6  pgx logger [error]: Query logParams=map[args:[-8343173776155126556 -914322950454871517 -685836765764345716 -7209343669337677990 2066883589368775255 3048245980912135216 -8319071614984614704 532981470037740769 5574232536423016768 -5090769401851757963 5313218355915302311 -6639483040728201979 -1009326739606503453 -5369387090647377696 2901961959851777777 3788960330020833741 -2018691486369830587 6211422866502778167 1177158258470849245 -5933828244096768983 -2275635875191238122 2283197493219629322 4127383475252594038 2966967207177351361 -4216484208988152898 7579154030635040905 7581892944513666311 -2490905978748825596 5451347624681530517 -8010102184843434886 -5363918375503941648 -8349707963029555597 -4407905743406431893 3752934507509607716 919082864081731136 -8056551791411552895 -5970401121588204437 2140940003149428359 2501556887624921142 -4236990650138812717 -5503333907021059451 6158057181624196422 4741184920310447324 -1623807323551530983 -4928511084171520986 -5365126547866111147 -5054786067318612342 -728903186720601223 -1789585662423979344 -2592815725797093114 3819459575849458158 -6912569259695542572 9202532695327585148 9112723449915371134 7126266934615051019 -8617536109860315868 -7095912189861582381 1749632190011704816 -2387497166026445323 2680711358950630419 2486984917766768371 -693198230315013616 2521535196169704509 -3160639275118474288] err:ERROR: failed to send RPC: sending to all replicas failed; last error: routing information detected to be stale; lastErr: [NotLeaseHolderError] lease held by different store; r366: replica (n2,s2):2 not lease holder; current lease is repl=(n4,s4):3 seq=5 start=1631025902.759767859,0 epo=1 pro=1631025902.759711679,0 (SQLSTATE XXUUU) sql:SELECT k, v FROM kv WHERE k IN ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, $15, $16, $17, $18, $19, $20, $21, $22, $23, $24, $25, $26, $27, $28, $29, $30, $31, $32, $33, $34, $35, $36, $37, $38, $39, $40, $41, $42, $43, $44, $45, $46, $47, $48, $49, $50, $51, $52, $53, $54, $55, $56, $57, $58, $59, $60, $61, $62, $63, $64)]
		  | Error: ERROR: failed to send RPC: sending to all replicas failed; last error: routing information detected to be stale; lastErr: [NotLeaseHolderError] lease held by different store; r366: replica (n2,s2):2 not lease holder; current lease is repl=(n4,s4):3 seq=5 start=1631025902.759767859,0 epo=1 pro=1631025902.759711679,0 (SQLSTATE XXUUU)
		  | Error: COMMAND_PROBLEM: exit status 1
		  | (1) COMMAND_PROBLEM
		  | Wraps: (2) Node 5. Command with error:
		  |   | ``````
		  |   | ./workload run kv --init --histograms=perf/stats.json --concurrency=256 --splits=1000 --duration=10m0s --read-percent=50 --batch=64 {pgurl:1-4}
		  |   | ``````
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		Wraps: (4) exit status 20
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) *exec.ExitError

	monitor.go:128,kv.go:137,kv.go:246,test_runner.go:777: monitor failure: monitor task failed: t.Fatal() was called
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:116
		  | main.(*monitorImpl).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:124
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerKV.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/kv.go:137
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerKV.func3
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/kv.go:246
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:777
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:172
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | main.init
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:81
		  | runtime.doInit
		  | 	/usr/local/go/src/runtime/proc.go:6309
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:208
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		Wraps: (6) t.Fatal() was called
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError
Reproduce

See: roachtest README

/cc @cockroachdb/kv-triage


@cockroach-teamcity

roachtest.kv50/enc=false/nodes=4/cpu=96/batch=64 failed with artifacts on master @ 642e44afbe6098f25022618be76c6f7b6b97df45:

		  | I210908 11:55:15.909071 1 workload/workloadsql/workloadsql.go:113  [-] 2  starting 1000 splits
		  | I210908 11:55:23.816193 659 workload/workloadsql/workloadsql.go:179  [-] 3  finished 1000 of 1000 splits
		  | I210908 11:55:23.816918 1 workload/cli/run.go:408  [-] 4  creating load generator...
		  | I210908 11:55:23.901769 1 workload/cli/run.go:439  [-] 5  creating load generator... done (took 84.850564ms)
		  | I210908 11:55:24.082719 2086 workload/pgx_helpers.go:61  [-] 6  pgx logger [error]: Query logParams=map[args:[5576443193026923825 6299206555795082188 9157166580729214064 8646777393617720288 -5724883969894799675 -6897578880386467823 -6606375318306027013 -485870624885464806 7917266868945422353 5732879094913020916 -2190717052833672554 3011082493187145961 -3599523743411521019 -3219693044233996185 4298256374823820196 -6235906329266157420 -7681599993107674829 -6797145899829604906 1188031473152778240 5855548288912887416 2117934561670787669 -4742679533635683975 7128934072758675299 3023527810157456407 -6656763920844384398 -8729100882102463148 -6254855190543380724 -7112966269537054025 3476377908817320898 -2427793111266264477 6556128952054453056 726051506204224501 6575645057459341882 2442388724663567514 2781390435253487427 -8351262565230795732 -5749086669974027420 8249328495278473348 4798294684824285396 8378602000272484849 -4850957848834304902 7855571730243500859 -8057288572247355917 -344049463124163735 3726332353230092189 3053229696471589359 -2991827891992307820 807688452188690152 -4821760302956611142 -7517243810638472637 3921778827010504274 5593207286469826406 -4407905743406431893 -211894188292247813 -9086973824871479595 -1080491562993894710 -7380762123636204811 -6431087920246951008 6590873196518367421 -6815829807047622788 5460400427895041357 -2485856532151755454 -4150653389614699183 1764642074488156281] err:ERROR: failed to send RPC: sending to all replicas failed; last error: routing information detected to be stale; lastErr: [NotLeaseHolderError] lease held by different store; r103: replica (n3,s3):1 not lease holder; current lease is repl=(n4,s4):2 seq=5 start=1631102117.432882067,0 epo=1 pro=1631102117.432805916,0 (SQLSTATE XXUUU) sql:SELECT k, v FROM kv WHERE k IN ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, $15, $16, $17, $18, $19, $20, $21, $22, $23, $24, $25, $26, $27, $28, $29, $30, $31, $32, $33, $34, $35, $36, $37, $38, $39, $40, $41, $42, $43, $44, $45, $46, $47, $48, $49, $50, $51, $52, $53, $54, $55, $56, $57, $58, $59, $60, $61, $62, $63, $64)]
		  | Error: ERROR: failed to send RPC: sending to all replicas failed; last error: routing information detected to be stale; lastErr: [NotLeaseHolderError] lease held by different store; r103: replica (n3,s3):1 not lease holder; current lease is repl=(n4,s4):2 seq=5 start=1631102117.432882067,0 epo=1 pro=1631102117.432805916,0 (SQLSTATE XXUUU)
		  | Error: COMMAND_PROBLEM: exit status 1
		  | (1) COMMAND_PROBLEM
		  | Wraps: (2) Node 5. Command with error:
		  |   | ``````
		  |   | ./workload run kv --init --histograms=perf/stats.json --concurrency=256 --splits=1000 --duration=10m0s --read-percent=50 --batch=64 {pgurl:1-4}
		  |   | ``````
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		Wraps: (4) exit status 20
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) *exec.ExitError

	monitor.go:128,kv.go:137,kv.go:246,test_runner.go:777: monitor failure: monitor task failed: t.Fatal() was called
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:116
		  | main.(*monitorImpl).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:124
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerKV.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/kv.go:137
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerKV.func3
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/kv.go:246
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:777
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:172
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | main.init
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:81
		  | runtime.doInit
		  | 	/usr/local/go/src/runtime/proc.go:6309
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:208
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		Wraps: (6) t.Fatal() was called
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError
Reproduce

See: roachtest README

/cc @cockroachdb/kv-triage


@andreimatei

I think I see the problem. The way one of the DistSender replica loops is written, it's possible that, as a result of a ctx cancellation, we return the error we got when trying a previous replica instead of returning the ctx error. Which is not good. I'll improve it.
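Schematically, the loop I have in mind looks something like this (a simplified sketch, not the actual DistSender code; replica, send, and errNotLeaseHolder are stand-ins): the fix is to prefer ctx.Err() over the error remembered from a previously tried replica.

package main

import (
	"context"
	"errors"
	"fmt"
)

type replica struct{ id int }

// errNotLeaseHolder stands in for a NotLeaseHolderError returned by a replica
// that we contacted based on stale routing information.
var errNotLeaseHolder = errors.New("NotLeaseHolderError: lease held by different store")

// send is a stub RPC; in this sketch every replica rejects the request.
func send(ctx context.Context, r replica) error {
	return errNotLeaseHolder
}

// sendToReplicas sketches the retry-over-replicas loop. The important part is
// checking ctx.Err() before reusing the last per-replica error, so that a
// caller-initiated cancellation is reported as a cancellation rather than as
// a stale NotLeaseHolderError.
func sendToReplicas(ctx context.Context, replicas []replica) error {
	var lastErr error
	for _, r := range replicas {
		if err := ctx.Err(); err != nil {
			return err // prefer the cancellation over the stale per-replica error
		}
		if err := send(ctx, r); err != nil {
			lastErr = err
			continue
		}
		return nil
	}
	return fmt.Errorf("sending to all replicas failed; last error: %w", lastErr)
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // the caller gives up mid-request
	fmt.Println(sendToReplicas(ctx, []replica{{1}, {2}, {3}})) // prints "context canceled"
}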

But, still, let's discuss the downstream effects, because I'm not sure it's a good idea for people to rely on a particular contract from the DistSender about which error is returned when the ctx is canceled.

My current guess is that because we're canceling the context of DistSender.Send once we no longer need the results of that scan, the context cancelation is surfaced back up as these different KV layer errors, but we swallow only the context cancellations at the parallel unordered synchronizer level. As a result, we incorrectly propagate the errors as query errors while we should have swallowed them.

I've read the code, but can you distill these lines a bit more for me, please? Looks like the synchronizer is trying to figure out if it's draining and, if it is, whether it should swallow an error. First of all, it seems to me that it'd be better if the synchronizer looked at its own state to see if it's draining instead of inferring it from the errors. No?

Second, I'd like to understand the decision about swallowing errors. The thing we don't want, I think, is for LeafTxnFinalStates to be missed because some stream coming into the synchronizer dropped and we swallowed the respective error. So, how come swallowing context.Canceled is not dangerous? Can it not hide the fact that we're missing some metadata?

@yuzefovich
Member

Looks like the synchronizer is trying to figure out if it's draining and, if it is, whether it should swallow an error.

Yes, that's essentially what we're trying to do.

First of all, it seems to me that it'd be better if the synchronizer looked at its own state to see if it's draining instead of inferring it from the errors. No?

Yeah, we could do that, but that wouldn't be sufficient for our purposes (note that we do assert that the synchronizer is in the draining state in crdb_test builds).

The goal of that code is as follows: we have encountered an error, and the question is - did this error occur precisely because we eagerly canceled the work once the synchronizer moved into the draining state? If that's the case, we want to swallow the error so that it doesn't propagate out.

So checking that the state is, in fact, draining is a good point (I'll make that change), but we also want to make sure not to swallow errors that occurred legitimately (i.e. not because of the eager cancellation triggered in synchronizer.DrainMeta). We need to somehow determine whether the error occurred because of us or because of other "natural" causes, and I don't know of any way other than comparing the error against the context.Canceled error (as well as making sure that the synchronizer's main context hasn't been canceled).
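To make that concrete, here is a hedged sketch of the decision being described - hypothetical helper and parameter names, not the actual parallel unordered synchronizer code - where an error is swallowed only if it plausibly resulted from the synchronizer's own eager cancellation:

package sketch

import (
	"context"
	"errors"
)

// shouldSwallow reports whether an error coming from one of the inputs can be
// attributed to the synchronizer's own eager cancellation and thus dropped.
func shouldSwallow(err error, flowCtx context.Context, draining bool) bool {
	if !draining {
		// Inputs are only canceled as part of draining; any error seen
		// before that is legitimate and must be propagated.
		return false
	}
	if flowCtx.Err() != nil {
		// The flow's own context was canceled (e.g. the client went away);
		// that cancellation did not originate here, so don't hide it.
		return false
	}
	// Attribute the error to the cancellation we triggered. The real code
	// reportedly relies on (fragile) "context canceled" string matching;
	// errors.Is is shown here as one alternative, and it is only as good as
	// the wrapping discipline of the layers below.
	return errors.Is(err, context.Canceled)
}

This keeps most of the check local to the synchronizer - the draining flag and the flow context are its own state - and only the final comparison has to say anything about how the error was produced.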

Second, I'd like to understand the decision about swallowing errors. The thing we don't want, I think, is for LeafTxnFinalStates to be missed because some stream coming into the synchronizer dropped and we swallowed the respective error. So, how come swallowing context.Canceled is not dangerous? Can it not hide the fact that we're missing some metadata?

I believe we are covered on this front. We use this eager context cancellation only for fully local plans (i.e. there are no inboxes, so we won't break any gRPC streams with it). Next, because we propagate metadata differently in the vectorized engine (namely, all metadata is buffered and returned in DrainMeta calls), there is no metadata flowing during the Operator.Next loop - which is what the cancellation just interrupted - and we let each of the synchronizer's inputs proceed to fully drain its metadata sources. All LeafTxnFinalState metadata will be created in DrainMeta calls after each input has been interrupted.

Does this make sense?

I know that this error/string matching is not great, and the implied contract - that no internal component may convert the context cancellation into an error that doesn't contain "context canceled" when stringified - is fragile, but it seems to work as is. Do you have any other suggestions for swallowing errors that occurred because of canceling the in-flight work? We will need something similar for the Streamer too, I think.

@cockroach-teamcity

This comment has been minimized.

@andreimatei
Contributor

Second, I'd like to understand the decision about swallowing errors. The thing we don't want, I think, is for LeafTxnFinalStates to be missed because some stream coming into the synchronizer dropped and we swallowed the respective error. So, how come swallowing context.Canceled is not dangerous? Can it not hide the fact that we're missing some metadata?

I believe we are covered on this front. We use this eager context cancellation only for fully local plans (i.e. there are no inboxes, so we won't break any gRPC streams with it). Next, because we propagate metadata differently in the vectorized engine (namely, all metadata is buffered and returned in DrainMeta calls), there is no metadata flowing during the Operator.Next loop - which is what the cancellation just interrupted - and we let each of the synchronizer's inputs proceed to fully drain its metadata sources. All LeafTxnFinalState metadata will be created in DrainMeta calls after each input has been interrupted.

Does this make sense?

It makes some sense. I was thinking that some distinction between "fully local plans" and the more general case is probably part of the answer.
Still:

but we also want to make sure not to swallow errors that occurred legitimately (i.e. not because of the eager cancellation triggered in synchronizer.DrainMeta). We need to somehow determine whether the error occurred because of us or because of other "natural" causes, and I don't know of any way other than comparing the error against the context.Canceled error (as well as making sure that the synchronizer's main context hasn't been canceled).

I wonder why exactly this is - why do we need a distinction between errors caused by the cancellation and other errors? It seems to me that, as long as we can get all the metadata, any error received after the synchronizer started draining can be ignored. And if everything is local, we can always get the metadata. Right?
What helps here is that DistSQL only deals with reads. For writes, things are more complicated. The client state needs to be kept in sync with the distributed transaction state, and an interrupted DistSender RPC can break that. In particular, if a request is writing intents, the client needs to know the highest timestamp that an intent was written at. That's why the TxnCoordSender distinguishes two categories of errors - errors for which this information is known, and errors for which it isn't. The former are "recoverable" - i.e. the txn can still be used afterwards (generally, to roll back to a savepoint). But, again, for reads I think we don't need to worry about that. All errors are recoverable.

This is all in theory, though. In practice, the TxnCoordSender blocks further requests after most errors (including ctx cancelled):

// Most errors cause the transaction to not accept further requests (except a
// rollback), but some errors are safe to allow continuing (in particular
// ConditionFailedError). In particular, SQL can recover by rolling back to a
// savepoint.
if roachpb.ErrPriority(pErr.GoError()) != roachpb.ErrorScoreUnambiguousError {
	tc.mu.txnState = txnError
	tc.mu.storedErr = roachpb.NewError(&roachpb.TxnAlreadyEncounteredErrorError{
		PrevError: pErr.String(),
	})
}

So, I guess my question is - how does your swallowing of ctx cancelled errors in the synchronizer even work given that the TxnCoordSender remembers the error state? Do we perhaps only do it when using a LeafTxn?

@yuzefovich
Member

I wonder why exactly this is - why do we need a distinction between errors caused by the cancellation and other errors? It seems to me that, as long as we can get all the metadata, any error received after the synchronizer started draining can be ignored.

I was also thinking about this when I originally added this code, and I came up with the following scenario: imagine we have TABLE t (c INT) with rows 0 and 1, and we're performing SELECT 1 / c FROM t LIMIT 1 with the parallel unordered sync on top of two concurrent TableReaders. If row 1 is read first, the limit is satisfied, so the sync transitions into draining, and the question is: do we want to swallow the "division by zero" error that might occur if we read row 0 and perform the division? My thinking was that it is "safer" to not swallow the error and to propagate it back to the client. I do, however, see your point that if the synchronizer is draining, then any work can be abandoned because the caller of the synchronizer is only interested in the metadata from now on.

Another scenario is when we have a "race" between the input goroutine doing work and the synchronizer transitioning into the draining state. Imagine that the input goroutine has performed some work and encountered an error that should be returned to the client, but before colexecerror.InternalError is executed, the goroutine is put to sleep. Now the main synchronizer goroutine wakes up and transitions into the draining state, and the question is the same - do we want to swallow that error or not?

And if everything is local, we can always get the metadata. Right?

Yep, that's right.

Now that we're discussing this, your suggestion of swallowing all errors once the synchronizer has transitioned into draining and performed the context cancellation of the inputs seems like it might actually be the safer choice - it would allow us not to require any invariants from the internal components about how they handle context cancellation errors. I guess the scenarios I mentioned above could go either way anyway.
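Sketched below, under the same caveats (hypothetical names, not the colexec implementation), is what that simpler policy could look like: once draining has started and the inputs have been canceled, every error from an input goroutine is dropped without being inspected:

package sketch

import (
	"context"
	"sync"
)

// drainAwareSync is a minimal sketch of a synchronizer that swallows input
// errors based purely on its own draining state.
type drainAwareSync struct {
	mu struct {
		sync.Mutex
		draining bool
	}
	cancelInputs context.CancelFunc // cancels the contexts of the inputs
}

// startDraining flips the state and eagerly interrupts the inputs.
func (s *drainAwareSync) startDraining() {
	s.mu.Lock()
	s.mu.draining = true
	s.mu.Unlock()
	s.cancelInputs()
}

// onInputError decides whether an error from an input goroutine is propagated
// to the client. While draining, only metadata matters, so every error is
// swallowed.
func (s *drainAwareSync) onInputError(err error) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.mu.draining {
		return nil
	}
	return err
}

The trade-off is the one discussed above: a legitimate error that races with the transition to draining gets silently dropped, which appears acceptable for read-only, fully local plans.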

So, I guess my question is - how does your swallowing of ctx cancelled errors in the synchronizer even work given that the TxnCoordSender remembers the error state? Do we perhaps only do it when using a LeafTxn?

We are using this eager cancellation functionality only in a very limited set of circumstances - the query doesn't have any mutations, is executed locally, and meets a few other requirements - so we aren't concerned about writes. And we do use a LeafTxn if we plan multiple parallel processors for a single stage of the physical plan.

@andreimatei
Contributor

In my opinion, for both scenarios you describe, swallowing errors is fine. When an error races with a drain, I think either outcome is fine. In the SELECT 1 / c FROM t LIMIT 1 case, receiving or not receiving the error is non-deterministic because the ordering of the results is non-deterministic. If you want determinism, you have to specify an ordering. In the 2nd example, infrastructure errors are simply non-deterministic by nature. The cancellation race perhaps makes them slightly more non-deterministic, but they were non-deterministic to begin with.

We are using this eager cancellation functionality only in a very limited set of circumstances - the query doesn't have any mutations, is executed locally, and meets a few other requirements - so we aren't concerned about writes. And we do use a LeafTxn if we plan multiple parallel processors for a single stage of the physical plan.

So we only use a synchronizer if we have parallelism, and if we have parallelism, we're using a LeafTxn for all parallel branches? We don't use the root txn on any of the branches? Because if we were, and it was the request on the root that was getting canceled, you wouldn't be able to swallow that error (unfortunately). Figuring out the root txn behavior after an error is a can of worms that we keep kicking down the road; the current story is not very good.

Now that we're discussing this, your suggestion of swallowing all errors once the synchronizer has transitioned into draining and performed the context cancellation of the inputs seems like it might actually be the safer choice - it would allow us not to require any invariants from the internal components about how they handle context cancellation errors. I guess the scenarios I mentioned above could go either way anyway.

Yeah; if the root is not in play here, then this is what I'd do.
I'm looking into improving things in the TxnCoordSender because it's a reasonable request to have the ctx cancelation be on the returned error chain as much as possible, but I don't think I'll be able to make any guarantees about it. There are at least two problems:

  1. If the ctx cancelation interrupts the evaluation of a request, then that request will return some error (which hopefully Is() a ctx cancelled, but who knows). I'm not going to override that error.
  2. If the DistSender has gotten an ambiguous error in the past (i.e. it might have committed the transaction), and the ctx cancelation comes while the DistSender is trying to resolve the ambiguity of that error, I want to return the ambiguous error, not override it. There's a related PR roachpb: improve AmbiguousResultError #65714 that flirts with making AmbiguousResultError implement causer so that we can put the ctx cancellation as a cause, but that still doesn't cover everything.
    This 2nd case probably doesn't affect you because you're not committing anything, but all of this is to say that I wouldn't rely on the DistSender always identifying the ctx cancellation.

@yuzefovich
Member

So we only use a synchronizer if we have parallelism, and if we have parallelism, we're using a LeafTxn for all parallel branches? We don't use the root txn on any of the branches?

Yes, we will use only the LeafTxn for the whole flow if the flow has any concurrency (which we can determine during the physical planning):

if localState.HasConcurrency {
	var err error
	leafTxn, err = makeLeaf(req)
	if err != nil {
		return nil, nil, nil, err
	}
}

Thanks for the context, it makes sense. I think #69961 without any other changes is sufficient to fix this issue.

@craig craig bot closed this as completed in 955f398 Sep 9, 2021
andreimatei added a commit to andreimatei/cockroach that referenced this issue Sep 13, 2021
This patch changes the error returned by the DistSender on a cancelled
ctx. Depending on exactly who detects the ctx as cancelled, there
are a number of different possibilities - too many to enumerate here.
Generally, the client is supposed to get a context.Canceled error,
wrapped in different layers. The code path that this patch changes is
about the cancellation being detected by sendPartialBatch() without it
having been previously detected by sendToReplicas(). This path is
unusual (sendToReplicas() generally detects the ctx cancelled). It's
also hard to test.
Depending on the exact cancellation timing, it's possible though for
sendPartialBatch() to detect it instead. In this case, this patch makes
it so that, if sendToReplicas() returned a sendError (indicating that
a suitable replica could not be reached and that the higher layer is
expected to continue trying other replicas), the error returned to the
client is a cancellation error instead of the sendError.

Touches cockroachdb#69419

Release note: None
andreimatei added a commit to andreimatei/cockroach that referenced this issue Sep 24, 2021
andreimatei added a commit to andreimatei/cockroach that referenced this issue Jan 23, 2022
craig bot pushed a commit that referenced this issue Jan 28, 2022
70159: kvclient: switch error in ctx cancel edge case r=andreimatei a=andreimatei

This patch changes the error returned by the DistSender on a cancelled
ctx. Depending on exactly who detects the ctx as cancelled, there
are a number of different possibilities - too many to enumerate here.
Generally, the client is supposed to get a context.Canceled error,
wrapped in different layers. The code path that this patch changes is
about the cancellation being detected by sendPartialBatch() without it
having been previously detected by sendToReplicas(). This path is
unusual (sendToReplicas() generally detects the ctx cancelled).
Depending on the exact cancellation timing, it's possible though for
sendPartialBatch() to detect it instead. In this case, this patch makes
it so that, if sendToReplicas() returned a sendError (indicating that
a suitable replica could not be reached and that the higher layer is
expected to continue trying other replicas), the error returned to the
client is a cancellation error instead of the sendError.

Touches #69419

Release note: None

75632: execstats: fix flakey TestTraceAnalyzer r=yuzefovich a=adityamaru

Fixes: #75546

Release note: None

Co-authored-by: Andrei Matei <[email protected]>
Co-authored-by: Aditya Maru <[email protected]>
andreimatei added a commit to andreimatei/cockroach that referenced this issue Jan 29, 2022
@mgartner moved this to Done in SQL Queries Jul 24, 2023