roachtest: (triaged) restore/nodeShutdown/worker failed [from transient admin split error] #84635

cockroach-teamcity · 2022-07-19T05:29:28Z

roachtest.restore/nodeShutdown/worker failed with artifacts on release-22.1 @ 7b257ecbb2c0bd9842e33a816b0907ad64a89787:

The test failed on branch=release-22.1, cloud=gce:
test artifacts and logs in: /artifacts/restore/nodeShutdown/worker/run_1
	monitor.go:127,jobs.go:154,restore.go:308,test_runner.go:883: monitor failure: monitor task failed: unexpectedly found job 780331013390860290 in state failed
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	main/pkg/cmd/roachtest/monitor.go:115
		  | main.(*monitorImpl).Wait
		  | 	main/pkg/cmd/roachtest/monitor.go:123
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.jobSurvivesNodeShutdown
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/jobs.go:154
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerRestoreNodeShutdown.func2
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/restore.go:308
		  | main.(*testRunner).runTest.func2
		  | 	main/pkg/cmd/roachtest/test_runner.go:883
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func2
		  | 	main/pkg/cmd/roachtest/monitor.go:171
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.jobSurvivesNodeShutdown.func1
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/jobs.go:95
		  | main.(*monitorImpl).Go.func1
		  | 	main/pkg/cmd/roachtest/monitor.go:105
		  | golang.org/x/sync/errgroup.(*Group).Go.func1
		  | 	golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:57
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1581
		Wraps: (6) unexpectedly found job 780331013390860290 in state failed
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError

Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

roachtest: restore/nodeShutdown/worker failed #84148 roachtest: restore/nodeShutdown/worker failed [C-test-failure O-roachtest O-robot T-kv branch-master release-blocker]
roachtest: restore/nodeShutdown/worker failed #80821 roachtest: restore/nodeShutdown/worker failed [C-test-failure O-roachtest O-robot T-bulkio branch-release-21.1]

/cc @cockroachdb/bulk-io _{This test on roachdash | Improve this report!

Jira issue: CRDB-17771}

The text was updated successfully, but these errors were encountered:

msbutler · 2022-07-19T21:43:02Z

The restore job failed on node 2. I see the following error in the jobs table. This smells like KV, but i will look a bit more at the logs in the morning.

importing 41 ranges: splitting key /Table/106/1/3983196: change replicas of r86 failed: received invalid ChangeReplicasTrigger LEAVE_JOINT: after=[(n1,s1):1 (n4,s4):2 (n2,s2):3LEARNER (n3,s3):4] next=5 to remove self (leaseholder)

In node 2's logs, I see:

I220719 05:28:55.014336 3478 kv/kvserver/pkg/kv/kvserver/replica_raft.go:335 ⋮ [n2,s2,r86/3:‹/Table/106/1/{3983196-4272459}›] 267  proposing LEAVE_JOINT: after=[(n1,s1):1 (n4,s4):2 (n2,s2):3LEARNER (n3,s3):4] next=5
406 E220719 05:28:55.014441 3478 kv/kvserver/pkg/kv/kvserver/replica_raft.go:396 ⋮ [n2,s2,r86/3:‹/Table/106/1/{3983196-4272459}›] 268  received invalid ChangeReplicasTrigger LEAVE_JOINT: after=[(n1,s1):1 (n4,s4):2 (n2,s2):3LEARNER (n3,s3):4] next=5 to remove self (leaseholder); lhRemovalAllowed: true; proposed descr    iptor: r86:‹/Table/106/1/{3983196-4272459}› [(n1,s1):1, (n4,s4):2, (n2,s2):3LEARNER, (n3,s3):4, next=5, gen=18, sticky=1658212127.482398094,1]: replica cannot hold lease

I'm removing the release blocker because this error seems to be a retryable error:
https://github.com/msbutler/cockroach/blob/3c09757d884e77e8b8dfc2a6c587d3caf70d3ae5/pkg/kv/test_utils.go#L51

msbutler · 2022-07-20T15:55:31Z

Bulk should certainly have retried the RESTORE due to this retryable error, which is easily checked via IsRetriableReplicationChangeError. We could just add to the white list in IsPermanentBulkJobError; however, it would be much more performant if this error (and other temporal kv errors) lead to retries on the request level.

I wonder if it's worth wrapping every bulk kv request with a function that handles these temperamental kv request errors. Will add a discussion item for next week.

Informs cockroachdb#84635,cockroachdb#84162 Release note: none

…shot Informs cockroachdb#84635 Informs cockroachdb#84162 This commit considers the `LeaseTransferRejectedBecauseTargetMayNeedSnapshot` status to be a form of a retriable replication change error. It then hooks `Replica.executeAdminCommandWithDescriptor` up to consult this status in its retry loop. This avoids spurious errors when a split gets blocked behind a lateral replica move like we see in the following situation: 1. issue AdminSplit 2. range in joint config, first needs to leave (maybeLeaveAtomicChangeReplicas) 3. to leave, needs to transfer lease from voter_outgoing to voter_incoming 4. can’t transfer lease because doing so is deemed to be potentially unsafe Release note: None

cockroach-teamcity · 2022-08-03T16:43:04Z

roachtest.restore/nodeShutdown/worker failed with artifacts on release-22.1 @ da1c029daa85869e16b19a73c228546c98830379:

The test failed on branch=release-22.1, cloud=gce:
test artifacts and logs in: /artifacts/restore/nodeShutdown/worker/run_1
	monitor.go:127,jobs.go:154,restore.go:308,test_runner.go:883: monitor failure: monitor task failed: unexpectedly found job 784710145872134146 in state failed
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	main/pkg/cmd/roachtest/monitor.go:115
		  | main.(*monitorImpl).Wait
		  | 	main/pkg/cmd/roachtest/monitor.go:123
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.jobSurvivesNodeShutdown
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/jobs.go:154
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerRestoreNodeShutdown.func2
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/restore.go:308
		  | main.(*testRunner).runTest.func2
		  | 	main/pkg/cmd/roachtest/test_runner.go:883
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func2
		  | 	main/pkg/cmd/roachtest/monitor.go:171
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.jobSurvivesNodeShutdown.func1
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/jobs.go:95
		  | main.(*monitorImpl).Go.func1
		  | 	main/pkg/cmd/roachtest/monitor.go:105
		  | golang.org/x/sync/errgroup.(*Group).Go.func1
		  | 	golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:57
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1581
		Wraps: (6) unexpectedly found job 784710145872134146 in state failed
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError

Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

roachtest: restore/nodeShutdown/worker failed #84148 roachtest: restore/nodeShutdown/worker failed [C-test-failure O-roachtest O-robot T-kv branch-master release-blocker]
roachtest: restore/nodeShutdown/worker failed #80821 roachtest: restore/nodeShutdown/worker failed [C-test-failure O-roachtest O-robot T-bulkio branch-release-21.1]

_{This test on roachdash | Improve this report!}

cockroach-teamcity · 2022-08-18T05:27:39Z

roachtest.restore/nodeShutdown/worker failed with artifacts on release-22.1 @ 760a8253ae6478d69da0330133e3efec8e950e4e:

The test failed on branch=release-22.1, cloud=gce:
test artifacts and logs in: /artifacts/restore/nodeShutdown/worker/run_1
	monitor.go:127,jobs.go:154,restore.go:308,test_runner.go:883: monitor failure: monitor task failed: unexpectedly found job 788824123504820226 in state failed
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	main/pkg/cmd/roachtest/monitor.go:115
		  | main.(*monitorImpl).Wait
		  | 	main/pkg/cmd/roachtest/monitor.go:123
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.jobSurvivesNodeShutdown
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/jobs.go:154
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerRestoreNodeShutdown.func2
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/restore.go:308
		  | main.(*testRunner).runTest.func2
		  | 	main/pkg/cmd/roachtest/test_runner.go:883
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func2
		  | 	main/pkg/cmd/roachtest/monitor.go:171
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.jobSurvivesNodeShutdown.func1
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/jobs.go:95
		  | main.(*monitorImpl).Go.func1
		  | 	main/pkg/cmd/roachtest/monitor.go:105
		  | golang.org/x/sync/errgroup.(*Group).Go.func1
		  | 	golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:57
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1581
		Wraps: (6) unexpectedly found job 788824123504820226 in state failed
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError

Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

roachtest: restore/nodeShutdown/worker failed #84148 roachtest: restore/nodeShutdown/worker failed [C-test-failure O-roachtest O-robot T-kv branch-master release-blocker]
roachtest: restore/nodeShutdown/worker failed #80821 roachtest: restore/nodeShutdown/worker failed [C-test-failure O-roachtest O-robot T-bulkio branch-release-21.1]

_{This test on roachdash | Improve this report!}

cockroach-teamcity · 2022-08-25T05:28:32Z

roachtest.restore/nodeShutdown/worker failed with artifacts on release-22.1 @ f81f08fc08acd9cd1e0017890d82606741a744f5:

The test failed on branch=release-22.1, cloud=gce:
test artifacts and logs in: /artifacts/restore/nodeShutdown/worker/run_1
	monitor.go:127,jobs.go:154,restore.go:308,test_runner.go:883: monitor failure: monitor task failed: unexpectedly found job 790805938909347842 in state failed
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	main/pkg/cmd/roachtest/monitor.go:115
		  | main.(*monitorImpl).Wait
		  | 	main/pkg/cmd/roachtest/monitor.go:123
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.jobSurvivesNodeShutdown
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/jobs.go:154
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerRestoreNodeShutdown.func2
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/restore.go:308
		  | main.(*testRunner).runTest.func2
		  | 	main/pkg/cmd/roachtest/test_runner.go:883
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func2
		  | 	main/pkg/cmd/roachtest/monitor.go:171
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.jobSurvivesNodeShutdown.func1
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/jobs.go:95
		  | main.(*monitorImpl).Go.func1
		  | 	main/pkg/cmd/roachtest/monitor.go:105
		  | golang.org/x/sync/errgroup.(*Group).Go.func1
		  | 	golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:57
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1581
		Wraps: (6) unexpectedly found job 790805938909347842 in state failed
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError

Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

roachtest: restore/nodeShutdown/worker failed #84148 roachtest: restore/nodeShutdown/worker failed [C-test-failure O-roachtest O-robot T-kv branch-master release-blocker]
roachtest: restore/nodeShutdown/worker failed #80821 roachtest: restore/nodeShutdown/worker failed [C-test-failure O-roachtest O-robot T-bulkio branch-release-21.1]

_{This test on roachdash | Improve this report!}

cockroach-teamcity · 2022-08-26T05:26:39Z

roachtest.restore/nodeShutdown/worker failed with artifacts on release-22.1 @ fa6b3adbc5f1e3b85e99a88ca71da11213c8b25a:

The test failed on branch=release-22.1, cloud=gce:
test artifacts and logs in: /artifacts/restore/nodeShutdown/worker/run_1
	monitor.go:127,jobs.go:154,restore.go:308,test_runner.go:883: monitor failure: monitor task failed: unexpectedly found job 791088832542539778 in state failed
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	main/pkg/cmd/roachtest/monitor.go:115
		  | main.(*monitorImpl).Wait
		  | 	main/pkg/cmd/roachtest/monitor.go:123
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.jobSurvivesNodeShutdown
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/jobs.go:154
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerRestoreNodeShutdown.func2
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/restore.go:308
		  | main.(*testRunner).runTest.func2
		  | 	main/pkg/cmd/roachtest/test_runner.go:883
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func2
		  | 	main/pkg/cmd/roachtest/monitor.go:171
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.jobSurvivesNodeShutdown.func1
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/jobs.go:95
		  | main.(*monitorImpl).Go.func1
		  | 	main/pkg/cmd/roachtest/monitor.go:105
		  | golang.org/x/sync/errgroup.(*Group).Go.func1
		  | 	golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:57
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1581
		Wraps: (6) unexpectedly found job 791088832542539778 in state failed
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError

Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

roachtest: restore/nodeShutdown/worker failed #84148 roachtest: restore/nodeShutdown/worker failed [C-test-failure O-roachtest O-robot T-kv branch-master release-blocker]
roachtest: restore/nodeShutdown/worker failed #80821 roachtest: restore/nodeShutdown/worker failed [C-test-failure O-roachtest O-robot T-bulkio branch-release-21.1]

_{This test on roachdash | Improve this report!}

…shot Informs cockroachdb#84635 Informs cockroachdb#84162 Fixes cockroachdb#85449. Fixes cockroachdb#83174. This commit considers the `LeaseTransferRejectedBecauseTargetMayNeedSnapshot` status to be a form of a retriable replication change error. It then hooks `Replica.executeAdminCommandWithDescriptor` up to consult this status in its retry loop. This avoids spurious errors when a split gets blocked behind a lateral replica move like we see in the following situation: 1. issue AdminSplit 2. range in joint config, first needs to leave (maybeLeaveAtomicChangeReplicas) 3. to leave, needs to transfer lease from voter_outgoing to voter_incoming 4. can’t transfer lease because doing so is deemed to be potentially unsafe Release note: None Release justification: Low risk.

85405: kv: retry AdminSplit on LeaseTransferRejectedBecauseTargetMayNeedSnapshot r=shralex a=nvanbenschoten Informs #84635. Informs #84162. Fixes #85449. Fixes #83174. This commit considers the `LeaseTransferRejectedBecauseTargetMayNeedSnapshot` status to be a form of a retriable replication change error. It then hooks `Replica.executeAdminCommandWithDescriptor` up to consult this status in its retry loop. This avoids spurious errors when a split gets blocked behind a lateral replica move like we see in the following situation: 1. issue AdminSplit 2. range in joint config, first needs to leave (maybeLeaveAtomicChangeReplicas) 3. to leave, needs to transfer lease from voter_outgoing to voter_incoming 4. can’t transfer lease because doing so is deemed to be potentially unsafe Release note: None Release justification: Low risk, resolves flaky test. 87137: storage: default to TableFormatPebblev1 in backups r=itsbilal,dt a=jbowens If the v22.2 upgrade has not yet been finalized, so we're not permitted to use the new TableFormatPebblev2 sstable format, default to TableFormatPebblev1 which is the format used by v22.1 internally. This change is intended to allow us to remove code for understanding the old RocksDB table format version sooner (eg, v23.1). Release justification: low-risk updates to existing functionality Release note: None 87152: sql: encode either 0 or 1 spans in scan gists r=mgartner a=mgartner #### dev: add rewritable paths for pkg/sql/opt/exec/explain tests This commit adds fixtures in `pkg/sql/opt/testutils/opttester/testfixtures` as rewritable paths for tests in `pkg/sql/opt/exec/explain`. This prevents `dev test pkg/sql/opt/exec/explain` from erring when the `--rewrite` flag is used. Release justification: This is a test-only change. Release note: None #### sql: encode either 0 or 1 spans in scan gists In plan gists, we no longer encode the exact number of spans for scans so that two queries with the same plan but a different number of spans have the same gist. In addition, plan gists are now decoded with the `OnlyShape` flag which prints any non-zero number of spans as "1+ spans" and removes attributes like "missing stats" from scans. Fixes #87138 Release justification: This is a minor change that makes plan gist instrumentation more scalable. Release note (bug fix): The Explain Tab inside the Statement Details page now groups plans that have the same shape but a different number of spans in corresponding scans. 87154: roachtest: stop cockroach gracefully when upgrading nodes r=yuzefovich a=yuzefovich This commit makes it so that we stop cockroach nodes gracefully when upgrading them. Previous abrupt behavior of stopping the nodes during the upgrade could lead to test flakes because the nodes were not being properly drained. Here is one scenario for how one of the flakes (`pq: version mismatch in flow request: 65; this node accepts 69 through 69`, which means that a gateway running an older version asks another node running a newer version to do DistSQL computation, but the versions are not DistSQL compatible): - we are in a state when node 1 is running a newer version when node 2 is running an older version. Importantly, node 1 was upgraded "abruptly" meaning that it wasn't properly drained; in particular, it didn't send DistSQL draining notification through gossip. - newer node has already been started but its DistSQL server hasn't been started yet (although it already can accept incoming RPCs - see comments on `distsql.ServerImpl.Start` for more details). This means that newer node has **not** sent through gossip an update about its DistSQL version. - node 2 acts as the gateway for a query that reads some data that node 1 is the leaseholder for. During the physical planning, older node 2 checks whether newer node 1 is "healthy and compatible", and node 1 is deemed both healthy (because it can accept incoming RPCs) and is compatible (because node 2 hasn't received updated DistSQL version of node 1 since it hasn't been sent yet). As a result, node 2 plans a read on node 1. - when node 1 receives that request, it errors out with "version mismatch" error. This whole problem is solved if we stop nodes gracefully when upgrading them. In particular, this will mean that node 1 would first dissipate its draining notification across the cluster, so during the physical planning it will only be considered IFF node 1 has already communicated its updated DistSQL version, and then it would be deemed DistSQL-incompatible. I verified that this scenario is possible (with manual adjustments of the version upgrade test and cockroach binary to insert a delay) and that it's fixed by this commit. I believe it is likely that other flake types have the same root cause, but I haven't verified it. Fixes: #87104. Release justification: test-only change. Release note: None Co-authored-by: Nathan VanBenschoten <[email protected]> Co-authored-by: Jackson Owens <[email protected]> Co-authored-by: Marcus Gartner <[email protected]> Co-authored-by: Yahor Yuzefovich <[email protected]>

cockroach-teamcity · 2022-09-09T05:31:02Z

roachtest.restore/nodeShutdown/worker failed with artifacts on release-22.1 @ 7c8d6e5034b135d475440c8e93385b229dea512f:

The test failed on branch=release-22.1, cloud=gce:
test artifacts and logs in: /artifacts/restore/nodeShutdown/worker/run_1
	monitor.go:127,jobs.go:154,restore.go:308,test_runner.go:883: monitor failure: monitor task failed: unexpectedly found job 795053346385100802 in state failed
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	main/pkg/cmd/roachtest/monitor.go:115
		  | main.(*monitorImpl).Wait
		  | 	main/pkg/cmd/roachtest/monitor.go:123
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.jobSurvivesNodeShutdown
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/jobs.go:154
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerRestoreNodeShutdown.func2
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/restore.go:308
		  | main.(*testRunner).runTest.func2
		  | 	main/pkg/cmd/roachtest/test_runner.go:883
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func2
		  | 	main/pkg/cmd/roachtest/monitor.go:171
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.jobSurvivesNodeShutdown.func1
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/jobs.go:95
		  | main.(*monitorImpl).Go.func1
		  | 	main/pkg/cmd/roachtest/monitor.go:105
		  | golang.org/x/sync/errgroup.(*Group).Go.func1
		  | 	golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:57
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1581
		Wraps: (6) unexpectedly found job 795053346385100802 in state failed
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError

Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

roachtest: restore/nodeShutdown/worker failed #80821 roachtest: restore/nodeShutdown/worker failed [C-test-failure O-roachtest O-robot T-bulkio branch-release-21.1]

_{This test on roachdash | Improve this report!}

cockroach-teamcity · 2022-09-14T05:27:54Z

roachtest.restore/nodeShutdown/worker failed with artifacts on release-22.1 @ 052a73d5942c460322afc299aa21ca3d772bf96f:

The test failed on branch=release-22.1, cloud=gce:
test artifacts and logs in: /artifacts/restore/nodeShutdown/worker/run_1
	monitor.go:127,jobs.go:154,restore.go:308,test_runner.go:883: monitor failure: monitor task failed: unexpectedly found job 796468277791686658 in state failed
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	main/pkg/cmd/roachtest/monitor.go:115
		  | main.(*monitorImpl).Wait
		  | 	main/pkg/cmd/roachtest/monitor.go:123
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.jobSurvivesNodeShutdown
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/jobs.go:154
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerRestoreNodeShutdown.func2
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/restore.go:308
		  | main.(*testRunner).runTest.func2
		  | 	main/pkg/cmd/roachtest/test_runner.go:883
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func2
		  | 	main/pkg/cmd/roachtest/monitor.go:171
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.jobSurvivesNodeShutdown.func1
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/jobs.go:95
		  | main.(*monitorImpl).Go.func1
		  | 	main/pkg/cmd/roachtest/monitor.go:105
		  | golang.org/x/sync/errgroup.(*Group).Go.func1
		  | 	golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:57
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1581
		Wraps: (6) unexpectedly found job 796468277791686658 in state failed
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError

Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

roachtest: restore/nodeShutdown/worker failed #80821 roachtest: restore/nodeShutdown/worker failed [C-test-failure O-roachtest O-robot T-bulkio branch-release-21.1]

_{This test on roachdash | Improve this report!}

cockroach-teamcity · 2022-09-27T05:29:33Z

roachtest.restore/nodeShutdown/worker failed with artifacts on release-22.1 @ 9eb4da2a351b2fe4706be9aa4af27ee2e60b405e:

The test failed on branch=release-22.1, cloud=gce:
test artifacts and logs in: /artifacts/restore/nodeShutdown/worker/run_1
	monitor.go:127,jobs.go:154,restore.go:308,test_runner.go:883: monitor failure: monitor task failed: unexpectedly found job 800149050038747138 in state failed
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	main/pkg/cmd/roachtest/monitor.go:115
		  | main.(*monitorImpl).Wait
		  | 	main/pkg/cmd/roachtest/monitor.go:123
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.jobSurvivesNodeShutdown
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/jobs.go:154
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerRestoreNodeShutdown.func2
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/restore.go:308
		  | main.(*testRunner).runTest.func2
		  | 	main/pkg/cmd/roachtest/test_runner.go:883
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func2
		  | 	main/pkg/cmd/roachtest/monitor.go:171
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.jobSurvivesNodeShutdown.func1
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/jobs.go:95
		  | main.(*monitorImpl).Go.func1
		  | 	main/pkg/cmd/roachtest/monitor.go:105
		  | golang.org/x/sync/errgroup.(*Group).Go.func1
		  | 	golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:57
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1581
		Wraps: (6) unexpectedly found job 800149050038747138 in state failed
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError

Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

roachtest: restore/nodeShutdown/worker failed #88667 roachtest: restore/nodeShutdown/worker failed [C-test-failure O-roachtest O-robot T-disaster-recovery T-kv blocks-22.2.0-beta.2 branch-master release-blocker]
roachtest: restore/nodeShutdown/worker failed #88469 roachtest: restore/nodeShutdown/worker failed [C-test-failure O-roachtest O-robot T-disaster-recovery T-kv blocks-22.2.0-beta.2 branch-release-22.2 release-blocker]
roachtest: restore/nodeShutdown/worker failed #80821 roachtest: restore/nodeShutdown/worker failed [C-test-failure O-roachtest O-robot T-disaster-recovery branch-release-21.1]

_{This test on roachdash | Improve this report!}

msbutler · 2022-09-27T12:57:55Z

@dt looks like the job was in an unexpected failed state again. Unassigning myself and assigning you as L2 for further investigation.

stevendanna · 2022-10-03T11:58:22Z

Here, the job failed because of:

I220927 05:29:13.587686 2613 jobs/registry.go:1202 â‹® [n2] 324  RESTORE job 800149050038747138: stepping through state reverting with error: importing 41 ranges: splitting key â€¹/Table/106/1/4272459â€º: change replicas of r70 failed: (n1,s1):3VOTER_DEMOTING_LEARNER received invalid ChangeReplicasTrigger LEAVE_JOINT: after=[(n4,s4):1 (n2,s2):2 (n1,s1):3LEARNER (n3,s3):4] next=5 to remove self (leaseholder); lhRemovalAllowed: true; current desc: r70:â€¹/Table/106/1/4{272459-436510}â€º [(n4,s4):1, (n2,s2):2, (n1,s1):3VOTER_DEMOTING_LEARNER, (n3,s3):4VOTER_INCOMING, next=5, gen=25, sticky=1664260112.749697632,0]; proposed desc: r70:â€¹/Table/106/1/4{272459-436510}â€º [(n4,s4):1, (n2,s2):2, (n1,s1):3LEARNER, (n3,s3):4, next=5, gen=26, sticky=1664260112.749697632,0]: replica cannot hold lease

nvanbenschoten · 2022-10-10T05:43:53Z

Fixed by #89611.

cockroach-teamcity added this to the 22.1 milestone Jul 19, 2022

blathers-crl bot added the T-disaster-recovery label Jul 19, 2022

exalate-issue-sync bot assigned msbutler Jul 19, 2022

msbutler removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Jul 19, 2022

msbutler added a commit to msbutler/cockroach that referenced this issue Jul 29, 2022

backupccl: retry restore's AdminSplit on ephemeral errors

6b6cbcc

Informs cockroachdb#84635,cockroachdb#84162 Release note: none

msbutler mentioned this issue Jul 29, 2022

backupccl: retry restore's AdminSplit on ephemeral errors #85333

Closed

cockroach-teamcity mentioned this issue Jul 30, 2022

roachtest: restore/nodeShutdown/worker failed #84148

Closed

nvanbenschoten mentioned this issue Aug 1, 2022

kv: retry AdminSplit on LeaseTransferRejectedBecauseTargetMayNeedSnapshot #85405

Merged

livlobo added the stability-period Intended to be worked on during a stability period. Use with the Milestone field to specify version. label Aug 4, 2022

msbutler changed the title ~~roachtest: restore/nodeShutdown/worker failed~~ roachtest: restore/nodeShutdown/worker failed [from transient admin split error] Sep 9, 2022

msbutler changed the title ~~roachtest: restore/nodeShutdown/worker failed [from transient admin split error]~~ roachtest: (triaged) restore/nodeShutdown/worker failed [from transient admin split error] Sep 9, 2022

cockroach-teamcity mentioned this issue Sep 22, 2022

roachtest: restore/nodeShutdown/worker failed #88469

Closed

cockroach-teamcity mentioned this issue Sep 24, 2022

roachtest: restore/nodeShutdown/worker failed #88667

Closed

msbutler removed their assignment Sep 27, 2022

msbutler assigned dt Sep 27, 2022

msbutler changed the title ~~roachtest: (triaged) restore/nodeShutdown/worker failed [from transient admin split error]~~ roachtest: restore/nodeShutdown/worker failed [unexpected job failure] Sep 27, 2022

msbutler added the X-noreuse Prevent automatic commenting from CI test failures label Oct 3, 2022

exalate-issue-sync bot assigned msbutler and unassigned dt Oct 6, 2022

exalate-issue-sync bot changed the title ~~roachtest: restore/nodeShutdown/worker failed [unexpected job failure]~~ roachtest: (triaged) restore/nodeShutdown/worker failed [from transient admin split error] Oct 6, 2022

exalate-issue-sync bot removed the X-noreuse Prevent automatic commenting from CI test failures label Oct 6, 2022

nvanbenschoten closed this as completed Oct 10, 2022

github-project-automation bot added this to Disaster Recovery Backlog Aug 28, 2024

github-project-automation bot moved this to Done in Disaster Recovery Backlog Aug 28, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

roachtest: (triaged) restore/nodeShutdown/worker failed [from transient admin split error] #84635

roachtest: (triaged) restore/nodeShutdown/worker failed [from transient admin split error] #84635

cockroach-teamcity commented Jul 19, 2022 •

edited by cockroach-jira-scripts

Loading

msbutler commented Jul 19, 2022 •

edited

Loading

msbutler commented Jul 20, 2022

cockroach-teamcity commented Aug 3, 2022

cockroach-teamcity commented Aug 18, 2022

cockroach-teamcity commented Aug 25, 2022

cockroach-teamcity commented Aug 26, 2022

cockroach-teamcity commented Sep 9, 2022

cockroach-teamcity commented Sep 14, 2022

cockroach-teamcity commented Sep 27, 2022

msbutler commented Sep 27, 2022 •

edited

Loading

stevendanna commented Oct 3, 2022

nvanbenschoten commented Oct 10, 2022

roachtest: (triaged) restore/nodeShutdown/worker failed [from transient admin split error] #84635

roachtest: (triaged) restore/nodeShutdown/worker failed [from transient admin split error] #84635

Comments

cockroach-teamcity commented Jul 19, 2022 • edited by cockroach-jira-scripts Loading

msbutler commented Jul 19, 2022 • edited Loading

msbutler commented Jul 20, 2022

cockroach-teamcity commented Aug 3, 2022

cockroach-teamcity commented Aug 18, 2022

cockroach-teamcity commented Aug 25, 2022

cockroach-teamcity commented Aug 26, 2022

cockroach-teamcity commented Sep 9, 2022

cockroach-teamcity commented Sep 14, 2022

cockroach-teamcity commented Sep 27, 2022

msbutler commented Sep 27, 2022 • edited Loading

stevendanna commented Oct 3, 2022

nvanbenschoten commented Oct 10, 2022

cockroach-teamcity commented Jul 19, 2022 •

edited by cockroach-jira-scripts

Loading

msbutler commented Jul 19, 2022 •

edited

Loading

msbutler commented Sep 27, 2022 •

edited

Loading