
roachtest: schemachange/during/kv failed #85449

Closed
cockroach-teamcity opened this issue Aug 2, 2022 · 3 comments · Fixed by #85405
Assignees: nvanbenschoten
Labels: branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv KV Team
Milestone: 22.2

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Aug 2, 2022

roachtest.schemachange/during/kv failed with artifacts on master @ 748b89a3fbef294f4b0f930c9dbdf88294b3deeb:

test artifacts and logs in: /artifacts/schemachange/during/kv/run_1
	schemachange.go:49,monitor.go:105,errgroup.go:74: pq: importing 688 ranges: splitting key /Table/111/1/31834467: change replicas of r105 failed: descriptor changed: [expected] r105:/Table/111/1/{26192423-47979489} [(n2,s2):1VOTER_DEMOTING_LEARNER, (n5,s5):2, (n4,s4):3, (n3,s3):4VOTER_INCOMING, next=5, gen=18, sticky=1659433859.194661639,0] != [actual] r105:/Table/111/1/2{6192423-7379106} [(n3,s3):4VOTER_DEMOTING_LEARNER, (n5,s5):2, (n4,s4):3, (n1,s1):5VOTER_INCOMING, next=6, gen=23, sticky=1659433859.194661639,0]

	monitor.go:127,schemachange.go:53,test_runner.go:896: monitor failure: monitor task failed: t.Fatal() was called
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	main/pkg/cmd/roachtest/monitor.go:115
		  | main.(*monitorImpl).Wait
		  | 	main/pkg/cmd/roachtest/monitor.go:123
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerSchemaChangeDuringKV.func1
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/schemachange.go:53
		  | main.(*testRunner).runTest.func2
		  | 	main/pkg/cmd/roachtest/test_runner.go:896
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func2
		  | 	main/pkg/cmd/roachtest/monitor.go:171
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | main.init
		  | 	main/pkg/cmd/roachtest/monitor.go:80
		  | runtime.doInit
		  | 	GOROOT/src/runtime/proc.go:6222
		  | runtime.main
		  | 	GOROOT/src/runtime/proc.go:233
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1571
		Wraps: (6) t.Fatal() was called
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/sql-schema

This test on roachdash

Jira issue: CRDB-18251

Epic CRDB-19172

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Aug 2, 2022
@cockroach-teamcity cockroach-teamcity added this to the 22.2 milestone Aug 2, 2022
@blathers-crl blathers-crl bot added the T-sql-schema-deprecated Use T-sql-foundations instead label Aug 2, 2022
@blathers-crl blathers-crl bot added the T-kv KV Team label Aug 2, 2022
@ajwerner
Contributor

ajwerner commented Aug 2, 2022

It feels like adding robustness against transient split failures inside of SplitAndScatter falls on @cockroachdb/kv-distribution or some other KV team.

@nvanbenschoten
Member

nvanbenschoten commented Aug 29, 2022

This is surprising because we do have a retry loop in executeAdminCommandWithDescriptor that shouldn't allow such an error to escape.

I think the key is that this is coming from "change replicas", so it must be returned by maybeLeaveAtomicChangeReplicas before running the split. We should retry on those errors as well.
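
For concreteness, here is a minimal sketch of the kind of retry loop being described, under the assumption that only descriptor-changed races are swallowed. It is illustrative, not the actual `executeAdminCommandWithDescriptor` implementation; the `rangeDesc` type, the sentinel error, and the `refresh`/`run` callbacks are placeholders.

```go
package kvretry

import "errors"

// errDescriptorChanged stands in for the "descriptor changed" condition;
// the real code detects it via a ConditionFailedError, not a sentinel.
var errDescriptorChanged = errors.New("descriptor changed")

// rangeDesc is a stand-in for roachpb.RangeDescriptor.
type rangeDesc struct{ generation int }

// runWithRetries re-issues an admin command while the only failure is a
// benign descriptor-changed race, reloading the descriptor between attempts.
func runWithRetries(
	desc rangeDesc,
	refresh func() rangeDesc,
	run func(rangeDesc) error,
) error {
	var lastErr error
	for attempt := 0; attempt < 10; attempt++ {
		if lastErr = run(desc); lastErr == nil {
			return nil
		}
		if !errors.Is(lastErr, errDescriptorChanged) {
			// Not a retriable replication-change race; surface the error.
			return lastErr
		}
		// The descriptor changed underneath us; reload it and try again.
		desc = refresh()
	}
	return lastErr
}
```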

@nvanbenschoten
Member

I think I understand why this is failing to be retried. executeAdminCommandWithDescriptor should be retrying on a ConditionFailedError:

if !errors.HasType(lastErr, (*roachpb.ConditionFailedError)(nil)) &&

To do that, it uses errors.HasType.

However, when we construct this error, we use errors.WithSecondaryError, which deliberately shifts the ConditionFailedError to a "secondary" error position.

if ok, actualDesc := maybeDescriptorChangedError(referenceDesc, err); ok {
	// We do not include the original error as cause in this case -
	// the caller should not observe the cause. We still include it
	// as "secondary payload", in case the error object makes it way
	// to logs or telemetry during a crash.
	err = errors.WithSecondaryError(newDescChangedError(referenceDesc, actualDesc), err)

This is not checked by errors.HasType.
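
To make the distinction concrete, here is a small stand-alone sketch using github.com/cockroachdb/errors; the `conditionFailed` type is a hypothetical stand-in for `roachpb.ConditionFailedError`:

```go
package main

import (
	"fmt"

	"github.com/cockroachdb/errors"
)

// conditionFailed stands in for roachpb.ConditionFailedError.
type conditionFailed struct{}

func (e *conditionFailed) Error() string { return "unexpected value" }

func main() {
	cause := &conditionFailed{}

	// Wrapping keeps the ConditionFailedError on the main cause chain,
	// so errors.HasType can still find it.
	wrapped := errors.Wrap(cause, "change replicas failed")
	fmt.Println(errors.HasType(wrapped, (*conditionFailed)(nil))) // true

	// WithSecondaryError attaches the original error only as a secondary
	// payload; the main chain is now the "descriptor changed" error, so
	// errors.HasType no longer sees the ConditionFailedError.
	shifted := errors.WithSecondaryError(errors.New("descriptor changed"), cause)
	fmt.Println(errors.HasType(shifted, (*conditionFailed)(nil))) // false
}
```

Because the secondary payload is not part of the cause chain that errors.HasType walks, the retry check in executeAdminCommandWithDescriptor never fires, which matches the behavior described above.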

I don't yet understand the comment here, but this would be fixed by #85405.

@nvanbenschoten nvanbenschoten self-assigned this Aug 30, 2022
@exalate-issue-sync exalate-issue-sync bot added T-kv KV Team and removed T-kv KV Team T-sql-schema-deprecated Use T-sql-foundations instead labels Aug 30, 2022
craig bot pushed a commit that referenced this issue Aug 31, 2022
85405: kv: retry AdminSplit on LeaseTransferRejectedBecauseTargetMayNeedSnapshot r=shralex a=nvanbenschoten

Informs #84635.
Informs #84162.
Fixes #85449.
Fixes #83174.

This commit considers the `LeaseTransferRejectedBecauseTargetMayNeedSnapshot`
status to be a form of a retriable replication change error. It then hooks
`Replica.executeAdminCommandWithDescriptor` up to consult this status in its
retry loop.

This avoids spurious errors when a split gets blocked behind a lateral replica
move like we see in the following situation:
1. issue AdminSplit
2. range in joint config, first needs to leave (maybeLeaveAtomicChangeReplicas)
3. to leave, needs to transfer lease from voter_outgoing to voter_incoming
4. can’t transfer lease because doing so is deemed to be potentially unsafe

Release note: None

Release justification: Low risk, resolves flaky test.
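
As a hedged illustration of the classification this commit describes (not the real kvserver code), treating the snapshot-related lease-transfer rejection as just another retriable replication-change error might look roughly like the sketch below; both sentinel errors are placeholders for conditions the real code recognizes structurally.

```go
package kvsplit

import "errors"

// Placeholder sentinels for the two transient conditions discussed above.
var (
	errDescriptorChanged          = errors.New("descriptor changed")
	errLeaseTransferNeedsSnapshot = errors.New(
		"lease transfer rejected because the target may need a snapshot")
)

// isRetriableReplicationChangeError sketches the idea: both races are
// transient, so AdminSplit's retry loop should reload the descriptor and
// try again instead of returning the error to the client.
func isRetriableReplicationChangeError(err error) bool {
	return errors.Is(err, errDescriptorChanged) ||
		errors.Is(err, errLeaseTransferNeedsSnapshot)
}
```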

87137: storage: default to TableFormatPebblev1 in backups r=itsbilal,dt a=jbowens

If the v22.2 upgrade has not yet been finalized, so we're not permitted
to use the new TableFormatPebblev2 sstable format, default to
TableFormatPebblev1 which is the format used by v22.1 internally.

This change is intended to allow us to remove code for understanding the
old RocksDB table format version sooner (e.g., in v23.1).

Release justification: low-risk updates to existing functionality
Release note: None
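
A rough sketch of the described default, assuming a hypothetical `clusterSupportsPebblev2` flag in place of the real cluster-version check (the actual backup code path may differ):

```go
package backupfmt

import "github.com/cockroachdb/pebble/sstable"

// backupSSTableFormat mirrors the behavior described above: until the
// v22.2 upgrade is finalized, fall back to the v22.1-era format.
// clusterSupportsPebblev2 is a stand-in for the real version check.
func backupSSTableFormat(clusterSupportsPebblev2 bool) sstable.TableFormat {
	if clusterSupportsPebblev2 {
		return sstable.TableFormatPebblev2
	}
	return sstable.TableFormatPebblev1
}
```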

87152: sql: encode either 0 or 1 spans in scan gists r=mgartner a=mgartner

#### dev: add rewritable paths for pkg/sql/opt/exec/explain tests

This commit adds fixtures in
`pkg/sql/opt/testutils/opttester/testfixtures` as rewritable paths for
tests in `pkg/sql/opt/exec/explain`. This prevents
`dev test pkg/sql/opt/exec/explain` from erring when the `--rewrite`
flag is used.

Release justification: This is a test-only change.

Release note: None

#### sql: encode either 0 or 1 spans in scan gists

In plan gists, we no longer encode the exact number of spans for scans
so that two queries with the same plan but a different number of spans
have the same gist.

In addition, plan gists are now decoded with the `OnlyShape` flag which
prints any non-zero number of spans as "1+ spans" and removes attributes
like "missing stats" from scans.

Fixes #87138

Release justification: This is a minor change that makes plan gist
instrumentation more scalable.

Release note (bug fix): The Explain Tab inside the Statement Details
page now groups plans that have the same shape but a different number of
spans in corresponding scans.
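
A tiny illustrative sketch of the encoding rule; the `encodeScanSpanCount` helper is hypothetical, not the real plan-gist encoder:

```go
package plangist

// encodeScanSpanCount collapses the span count recorded in a scan's gist:
// zero spans stay zero, and any non-zero count is encoded identically, so
// plans that differ only in the number of spans share the same gist.
func encodeScanSpanCount(numSpans int) int {
	if numSpans == 0 {
		return 0
	}
	return 1
}
```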


87154: roachtest: stop cockroach gracefully when upgrading nodes r=yuzefovich a=yuzefovich

This commit makes it so that we stop cockroach nodes gracefully when
upgrading them. Previous abrupt behavior of stopping the nodes during
the upgrade could lead to test flakes because the nodes were not
being properly drained.

Here is one scenario for how one of the flakes can occur (`pq: version mismatch in
flow request: 65; this node accepts 69 through 69`, which means that
a gateway running an older version asks another node running a newer
version to do DistSQL computation, but the versions are not DistSQL
compatible):
- we are in a state where node 1 is running a newer version while node
2 is running an older version. Importantly, node 1 was upgraded
"abruptly", meaning that it wasn't properly drained; in particular, it
didn't send the DistSQL draining notification through gossip.
- the newer node has already been started but its DistSQL server hasn't been
started yet (although it can already accept incoming RPCs - see comments
on `distsql.ServerImpl.Start` for more details). This means that the newer
node has **not** sent a gossip update about its DistSQL version.
- node 2 acts as the gateway for a query that reads some data for which node
1 is the leaseholder. During physical planning, the older node
2 checks whether the newer node 1 is "healthy and compatible", and node 1 is
deemed both healthy (because it can accept incoming RPCs) and
compatible (because node 2 hasn't received the updated DistSQL version of
node 1, since it hasn't been sent yet). As a result, node 2 plans a read
on node 1.
- when node 1 receives that request, it errors out with a "version
mismatch" error.

This whole problem is solved if we stop nodes gracefully when upgrading
them. In particular, node 1 would first disseminate its draining
notification across the cluster, so during physical planning it will only
be considered once it has already communicated its updated DistSQL
version, at which point it would be deemed DistSQL-incompatible.

I verified that this scenario is possible (with manual adjustments of the
version upgrade test and cockroach binary to insert a delay) and that
it's fixed by this commit. I believe it is likely that other flake types
have the same root cause, but I haven't verified it.

Fixes: #87104.

Release justification: test-only change.

Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Jackson Owens <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
@craig craig bot closed this as completed in a420d39 Aug 31, 2022