roachtest: change-replicas/mixed-version failed #98429

Closed
cockroach-teamcity opened this issue Mar 11, 2023 · 1 comment
Labels
A-testing Testing tools and infrastructure branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.

@cockroach-teamcity
Member

cockroach-teamcity commented Mar 11, 2023

roachtest.change-replicas/mixed-version failed with artifacts on master @ 5b2a5670cbbe895d76602c230390816e783e0caa:

test artifacts and logs in: /artifacts/change-replicas/mixed-version/run_1
(mixed_version_change_replicas.go:163).1: failed to move 1 replicas from n1 to n4 using gateway n3
(assertions.go:264).Fail: 
	Error Trace:	/go/src/github.com/cockroachdb/cockroach/mixed_version_change_replicas.go:102
	            				/go/src/github.com/cockroachdb/cockroach/panic.go:890
	            				/go/src/github.com/cockroachdb/cockroach/test_impl.go:298
	            				/go/src/github.com/cockroachdb/cockroach/mixed_version_change_replicas.go:163
	            				/go/src/github.com/cockroachdb/cockroach/versionupgrade.go:169
	            				/go/src/github.com/cockroachdb/cockroach/mixed_version_change_replicas.go:326
	            				/go/src/github.com/cockroachdb/cockroach/test_runner.go:975
	            				/go/src/github.com/cockroachdb/cockroach/asm_amd64.s:1594
	Error:      	Received unexpected error:
	            	context canceled
	Test:       	change-replicas/mixed-version
(require.go:1264).NoError: FailNow called

Parameters: ROACHTEST_cloud=gce, ROACHTEST_cpu=4, ROACHTEST_encrypted=false, ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/replication

This test on roachdash | Improve this report!

Jira issue: CRDB-25261

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv-replication labels Mar 11, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.1 milestone Mar 11, 2023
@erikgrinaker
Contributor

erikgrinaker commented Mar 13, 2023

We weren't able to move an r445 replica from n1 to n4 because n4 already had a learner:

06:34:50 mixed_version_change_replicas.go:135: moving replicas from n1 to n4 via gateway n3 using ALTER TABLE RELOCATE
06:34:51 mixed_version_change_replicas.go:133: 1 ranges failed, retrying
06:34:51 mixed_version_change_replicas.go:135: moving replicas from n1 to n4 via gateway n3 using ALTER TABLE RELOCATE
06:34:51 mixed_version_change_replicas.go:133: 1 ranges failed, retrying
06:34:51 mixed_version_change_replicas.go:135: moving replicas from n1 to n4 via gateway n3 using ALTER TABLE RELOCATE
06:34:51 mixed_version_change_replicas.go:133: 1 ranges failed, retrying
06:34:51 mixed_version_change_replicas.go:135: moving replicas from n1 to n4 via gateway n3 using ALTER TABLE RELOCATE
06:34:52 mixed_version_change_replicas.go:133: 1 ranges failed, retrying
06:34:52 mixed_version_change_replicas.go:135: moving replicas from n1 to n4 via gateway n3 using ALTER TABLE RELOCATE
06:34:54 mixed_version_change_replicas.go:133: 1 ranges failed, retrying
06:34:54 mixed_version_change_replicas.go:135: moving replicas from n1 to n4 via gateway n3 using ALTER TABLE RELOCATE
06:34:57 mixed_version_change_replicas.go:133: 1 ranges failed, retrying
06:34:57 mixed_version_change_replicas.go:135: moving replicas from n1 to n4 via gateway n3 using ALTER TABLE RELOCATE
06:35:02 mixed_version_change_replicas.go:133: 1 ranges failed, retrying
06:35:02 mixed_version_change_replicas.go:135: moving replicas from n1 to n4 via gateway n3 using ALTER TABLE RELOCATE
06:35:07 mixed_version_change_replicas.go:133: 1 ranges failed, retrying
06:35:07 mixed_version_change_replicas.go:135: moving replicas from n1 to n4 via gateway n3 using ALTER TABLE RELOCATE
06:35:07 mixed_version_change_replicas.go:160: failed to move r445 from n1 to n4 via n3: trying to add({ChangeType:ADD_VOTER Target:n4,s4}) to a store that already has a LEARNER

At 06:34:50 the n4 replica was in fact a learner, but an outgoing one (VOTER_DEMOTING_LEARNER in the descriptor below), since it had been transferring its replica to n3.

I230311 06:34:50.711229 6741 13@kv/kvserver/replica_command.go:2324 ⋮ [T1,n4,s4,r445/18:‹/Table/170/1/7{8-9}›,*kvpb.AdminChangeReplicasRequest] 365  change replicas (add [] remove []): existing descriptor r445:‹/Table/170/1/7{8-9}› [(n1,s1):20, (n4,s4):18VOTER_DEMOTING_LEARNER, (n2,s2):21, (n3,s3):22VOTER_INCOMING, next=23, gen=182, sticky=9223372036.854775807,2147483647]
I230311 06:34:50.714806 6741 13@kv/kvserver/replica_command.go:2324 ⋮ [T1,n4,s4,r445/18:‹/Table/170/1/7{8-9}›,*kvpb.AdminChangeReplicasRequest] 366  change replicas (add [] remove []): existing descriptor r445:‹/Table/170/1/7{8-9}› [(n1,s1):20, (n4,s4):18VOTER_DEMOTING_LEARNER, (n2,s2):21, (n3,s3):22VOTER_INCOMING, next=23, gen=182, sticky=9223372036.854775807,2147483647]
I230311 06:34:50.718173 17253 13@kv/kvserver/replica_raft.go:370 ⋮ [T1,n3,s3,r445/22:‹/Table/170/1/7{8-9}›] 328  proposing LEAVE_JOINT: after=[(n1,s1):20 (n4,s4):18LEARNER (n2,s2):21 (n3,s3):22] next=23
I230311 06:34:50.718173 17253 13@kv/kvserver/replica_raft.go:370 ⋮ [T1,n3,s3,r445/22:‹/Table/170/1/7{8-9}›] 328  proposing LEAVE_JOINT: after=[(n1,s1):20 (n4,s4):18LEARNER (n2,s2):21 (n3,s3):22] next=23
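
To illustrate why the relocation kept failing once the LEAVE_JOINT above left n4 as a LEARNER, here is a minimal sketch of a check that rejects adding a voter to a node that already holds a replica. The `Replica` type and `checkAddVoter` helper are hypothetical and only model the behavior seen in the logs; this is not CockroachDB's actual validation code.

```go
package main

import "fmt"

// ReplicaType uses the replica type names that appear in the logs.
type ReplicaType string

const (
	VoterFull ReplicaType = "VOTER_FULL" // replicas printed without a type suffix
	Learner   ReplicaType = "LEARNER"
)

// Replica is a simplified, hypothetical stand-in for a replica descriptor entry.
type Replica struct {
	NodeID int
	Type   ReplicaType
}

// checkAddVoter (hypothetical) rejects an ADD_VOTER whose target node already
// holds a replica of the range, naming the existing replica's type.
func checkAddVoter(desc []Replica, target int) error {
	for _, r := range desc {
		if r.NodeID == target {
			return fmt.Errorf("trying to add voter to n%d, which already has a %s", target, r.Type)
		}
	}
	return nil
}

func main() {
	// Descriptor after the proposed LEAVE_JOINT: n4 remains as a LEARNER.
	desc := []Replica{{1, VoterFull}, {4, Learner}, {2, VoterFull}, {3, VoterFull}}
	fmt.Println(checkAddVoter(desc, 4)) // rejected: n4 still has a learner
	fmt.Println(checkAddVoter(desc, 5)) // accepted: n5 has no replica
}
```

Under this model, every retry fails until something removes the leftover learner from n4.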

However, n4 had earlier (at 06:34:13) failed to remove itself from the range because it was the leaseholder:

E230311 06:34:13.445036 2407 kv/kvserver/replica_raft.go:429 ⋮ [T1,n4,s4,r445/18:‹/Table/170/1/7{8-9}›] 80  (n4,s4):18VOTER_DEMOTING_LEARNER received invalid ChangeReplicasTrigger LEAVE_JOINT: after=[(n1,s1):20 (n4,s4):18LEARNER (n2,s2):21 (n3,s3):22] next=23 to remove self (leaseholder); lhRemovalAllowed: true; current desc: r445:‹/Table/170/1/7{8-9}› [(n1,s1):20, (n4,s4):18VOTER_DEMOTING_LEARNER, (n2,s2):21, (n3,s3):22VOTER_INCOMING, next=23, gen=182, sticky=9223372036.854775807,2147483647]; proposed desc: r445:‹/Table/170/1/7{8-9}› [(n1,s1):20, (n4,s4):18LEARNER, (n2,s2):21, (n3,s3):22, next=23, gen=183, sticky=9223372036.854775807,2147483647]: replica cannot hold lease
E230311 06:34:13.459599 2413 kv/kvserver/replica_raft.go:429 ⋮ [T1,n4,s4,r445/18:‹/Table/170/1/7{8-9}›] 81  (n4,s4):18VOTER_DEMOTING_LEARNER received invalid ChangeReplicasTrigger LEAVE_JOINT: after=[(n1,s1):20 (n4,s4):18LEARNER (n2,s2):21 (n3,s3):22] next=23 to remove self (leaseholder); lhRemovalAllowed: true; current desc: r445:‹/Table/170/1/7{8-9}› [(n1,s1):20, (n4,s4):18VOTER_DEMOTING_LEARNER, (n2,s2):21, (n3,s3):22VOTER_INCOMING, next=23, gen=182, sticky=9223372036.854775807,2147483647]; proposed desc: r445:‹/Table/170/1/7{8-9}› [(n1,s1):20, (n4,s4):18LEARNER, (n2,s2):21, (n3,s3):22, next=23, gen=183, sticky=9223372036.854775807,2147483647]: replica cannot hold lease

While relocating replicas, we do disable the replicate queue, which I guess prevented the configuration change from completing. I'll submit a PR to speed up replicate queue processing and resume processing between retries, which hopefully makes this less flaky.

@shralex You were involved in some related work with leases and configuration changes, can you check if there's anything we need to examine further here? Is the failed configuration change benign?

@erikgrinaker erikgrinaker added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-testing Testing tools and infrastructure and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Mar 13, 2023
@erikgrinaker erikgrinaker self-assigned this Mar 13, 2023
craig bot pushed a commit that referenced this issue Mar 13, 2023
98491: roachtest: maybe deflake `change-replicas/mixed-version` r=erikgrinaker a=erikgrinaker

Touches #98429.

Epic: none
Release note: None

98504: upgrade: Remove two to-be-deleted-V22_2 cluster versions r=Xiang-Gu a=Xiang-Gu

This commit removed the following two cluster versions and their associated upgrade logic and tests:
 - V22_2UpgradeSequenceToBeReferencedByID
 - V22_2UpdateInvalidColumnIDsInSequenceBackReferences

Informs: #96763, #96751
Release Note: None

98508: kv: deflake TestAbortCountConflictingWrites r=nvanbenschoten a=nvanbenschoten

Fixes #96839.

The test was made flaky by 5129578. See the comment in #96839 for an explanation.

This commit resolves that flakiness.

Release note: None

Co-authored-by: Erik Grinaker
Co-authored-by: Xiang Gu
Co-authored-by: Nathan VanBenschoten