roachtest: replicate/wide failed #50729

Closed
cockroach-teamcity opened this issue Jun 27, 2020 · 14 comments · Fixed by #56735

@cockroach-teamcity
Member

(roachtest).replicate/wide failed on master@8f768ad14cfb3f514db6d40465b2dd60ee1f2890:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/replicate/wide/run_1
	test_runner.go:804: test timed out (10m0s)

More

Artifacts: /replicate/wide

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Jun 27, 2020
@cockroach-teamcity cockroach-teamcity added this to the 20.2 milestone Jun 27, 2020
@darinpp
Contributor

darinpp commented Jul 7, 2020

This timed out in the last step: decreasing the replication factor to 5 and verifying that the number of replicas per range falls again (128e2dc).
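For readers unfamiliar with this kind of check, the step described above amounts to polling until no range has more replicas than the new target, and giving up at a deadline. The following is a minimal sketch of that pattern, not the roachtest's actual code: `overReplicated`, `waitForReplication`, and the `fetch` callback (which stands in for however the test reads per-range replica counts) are all hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// overReplicated returns the IDs of ranges that still have more than
// target replicas, sorted for stable output. Hypothetical helper.
func overReplicated(replicasPerRange map[int]int, target int) []int {
	var ids []int
	for id, n := range replicasPerRange {
		if n > target {
			ids = append(ids, id)
		}
	}
	sort.Ints(ids)
	return ids
}

// waitForReplication polls fetch until every range has at most target
// replicas, or returns an error naming the laggards once timeout elapses.
// fetch is an assumed callback returning range ID -> replica count.
func waitForReplication(fetch func() map[int]int, target int, timeout, interval time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		pending := overReplicated(fetch(), target)
		if len(pending) == 0 {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("timed out after %s: ranges still above %d replicas: %v",
				timeout, target, pending)
		}
		time.Sleep(interval)
	}
}

func main() {
	// Simulated cluster: each poll removes one surplus replica per range.
	counts := map[int]int{4: 7, 8: 7}
	fetch := func() map[int]int {
		for id, n := range counts {
			if n > 5 {
				counts[id] = n - 1
			}
		}
		return counts
	}
	err := waitForReplication(fetch, 5, time.Second, 10*time.Millisecond)
	fmt.Println("converged:", err == nil)
}
```

In the failure reported here, a loop of this shape would never see the pending list become empty, so the surrounding test hits its 10m0s timeout instead.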

@irfansharif
Contributor

Darin, mind if I assign it to you? From a very brief glance at the logs, the area of the codebase I suspect is responsible is the allocator.

E200627 11:26:20.984817 161942 kv/kvserver/queue.go:1087  [n1,replicate,s1,r4/1:/System{/tsd-tse}] could not select an appropriate replica to be removed
E200627 11:26:20.989962 161952 kv/kvserver/queue.go:1087  [n1,replicate,s1,r8/14:/Table/1{2-3}] could not select an appropriate replica to be removed
E200627 11:26:21.010495 161989 kv/kvserver/queue.go:1087  [n1,replicate,s1,r7/11:/Table/1{1-2}] could not select an appropriate replica to be removed
E200627 11:26:21.015722 161999 kv/kvserver/queue.go:1087  [n1,replicate,s1,r31/12:/Table/3{5-6}] could not select an appropriate replica to be removed
E200627 11:26:21.020770 162009 kv/kvserver/queue.go:1087  [n1,replicate,s1,r21/6:/Table/2{5-6}] could not select an appropriate replica to be removed
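When triaging a larger log, it can help to tally which ranges keep hitting this error. The sketch below assumes the log tag format seen in the lines above (`[n1,replicate,s1,r4/1:...]`) and extracts the range ID that follows `r`; `stuckRanges` and the regular expression are illustrative helpers, not part of CockroachDB.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// removalFailure is the message emitted by the replicate queue in the
// log excerpt above.
const removalFailure = "could not select an appropriate replica to be removed"

// rangeTag matches the ",r<rangeID>/<replicaID>:" portion of the log tag.
// The format is inferred from the excerpt and is an assumption.
var rangeTag = regexp.MustCompile(`,r(\d+)/\d+:`)

// stuckRanges counts, per range ID, how many log lines report that no
// replica could be selected for removal.
func stuckRanges(logText string) map[int]int {
	counts := make(map[int]int)
	for _, line := range strings.Split(logText, "\n") {
		if !strings.Contains(line, removalFailure) {
			continue
		}
		m := rangeTag.FindStringSubmatch(line)
		if m == nil {
			continue
		}
		id, err := strconv.Atoi(m[1])
		if err != nil {
			continue
		}
		counts[id]++
	}
	return counts
}

func main() {
	logText := `E200627 11:26:20.984817 161942 kv/kvserver/queue.go:1087  [n1,replicate,s1,r4/1:/System{/tsd-tse}] could not select an appropriate replica to be removed
E200627 11:26:20.989962 161952 kv/kvserver/queue.go:1087  [n1,replicate,s1,r8/14:/Table/1{2-3}] could not select an appropriate replica to be removed`
	for id, n := range stuckRanges(logText) {
		fmt.Printf("r%d: %d failure(s)\n", id, n)
	}
}
```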

Also consider #50865, a bug in which the split queue gets wedged for some reason. I'm not sure it has any bearing here, but it's worth keeping on your radar.

@irfansharif irfansharif assigned darinpp and unassigned andreimatei Jul 13, 2020
@cockroach-teamcity
Member Author

(roachtest).replicate/wide failed on master@a16eb55ed96239dcd288aa1c2f80f306559f0f0b:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/replicate/wide/run_1
	test_runner.go:801: test timed out (10m0s)

More

Artifacts: /replicate/wide

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity
Member Author

(roachtest).replicate/wide failed on master@3edbe4aeb3c7300e6690cb2222a8d5c01e920bf4:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/replicate/wide/run_1
	test_runner.go:801: test timed out (10m0s)

More

Artifacts: /replicate/wide

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity
Member Author

(roachtest).replicate/wide failed on master@ebd5c732f83009acc9c6f5859ca95e74e5453a1c:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/replicate/wide/run_1
	test_runner.go:801: test timed out (10m0s)

More

Artifacts: /replicate/wide

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity
Member Author

(roachtest).replicate/wide failed on master@bbbedabbf6ea0b1ff6fc799a0c04a75295a9f4c2:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/replicate/wide/run_1
	test_runner.go:801: test timed out (10m0s)

More

Artifacts: /replicate/wide

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity
Member Author

(roachtest).replicate/wide failed on master@69ffd78d5bbab0d8f77cf1f2254e8a5fcbdf902f:

		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1839
		  |   | runtime.main
		  |   | 	/usr/local/go/src/runtime/proc.go:203
		  |   | runtime.goexit
		  |   | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		  | Wraps: (2) 3 safe details enclosed
		  | Wraps: (3) 8: dead
		  | Error types: (1) *withstack.withStack (2) *safedetails.withSafeDetails (3) *errors.errorString
		Wraps: (5) secondary error attachment
		  | 1: dead
		  | (1) attached stack trace
		  |   | main.glob..func14
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1143
		  |   | main.wrap.func1
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:267
		  |   | github.com/spf13/cobra.(*Command).execute
		  |   | 	/home/agent/work/.go/pkg/mod/github.com/spf13/[email protected]/command.go:830
		  |   | github.com/spf13/cobra.(*Command).ExecuteC
		  |   | 	/home/agent/work/.go/pkg/mod/github.com/spf13/[email protected]/command.go:914
		  |   | github.com/spf13/cobra.(*Command).Execute
		  |   | 	/home/agent/work/.go/pkg/mod/github.com/spf13/[email protected]/command.go:864
		  |   | main.main
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1839
		  |   | runtime.main
		  |   | 	/usr/local/go/src/runtime/proc.go:203
		  |   | runtime.goexit
		  |   | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		  | Wraps: (2) 3 safe details enclosed
		  | Wraps: (3) 1: dead
		  | Error types: (1) *withstack.withStack (2) *safedetails.withSafeDetails (3) *errors.errorString
		Wraps: (6) attached stack trace
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1143
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:267
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/pkg/mod/github.com/spf13/[email protected]/command.go:830
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/pkg/mod/github.com/spf13/[email protected]/command.go:914
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/pkg/mod/github.com/spf13/[email protected]/command.go:864
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1839
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:203
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (7) 3 safe details enclosed
		Wraps: (8) 9: dead
		Error types: (1) errors.Unclassified (2) *secondary.withSecondaryError (3) *secondary.withSecondaryError (4) *secondary.withSecondaryError (5) *secondary.withSecondaryError (6) *withstack.withStack (7) *safedetails.withSafeDetails (8) *errors.errorString

More

Artifacts: /replicate/wide

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity
Member Author

(roachtest).replicate/wide failed on master@7425e857e62fe4280f614f9076f310322cc78649:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/replicate/wide/run_1
	test_runner.go:801: test timed out (10m0s)

More

Artifacts: /replicate/wide

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity
Member Author

(roachtest).replicate/wide failed on master@8b91062f9351d18f9104aff567cb152df162021e:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/replicate/wide/run_1
	test_runner.go:801: test timed out (10m0s)

More

Artifacts: /replicate/wide

See this test on roachdash
powered by pkg/cmd/internal/issues

@knz
Contributor

knz commented Aug 31, 2020

Symptom: replication fails to converge after 10 minutes.

@irfansharif
Contributor

+cc @nvanbenschoten, @tbg for triage/routing.

@knz knz removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Aug 31, 2020
@cockroach-teamcity
Member Author

(roachtest).replicate/wide failed on master@57e160b1fcc41dd12b595953729728007fd3fbda:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/replicate/wide/run_1
	test_runner.go:801: test timed out (10m0s)

More

Artifacts: /replicate/wide

See this test on roachdash
powered by pkg/cmd/internal/issues

@nvanbenschoten nvanbenschoten added the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Sep 2, 2020
@tbg tbg assigned tbg and unassigned darinpp Sep 8, 2020
@tbg tbg removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Sep 8, 2020
@cockroach-teamcity
Member Author

(roachtest).replicate/wide failed on master@38115d0cc366243bcbae1658057cb0438e23565e:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/replicate/wide/run_1
	test_runner.go:801: test timed out (10m0s)

More

Artifacts: /replicate/wide

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity
Member Author

(roachtest).replicate/wide failed on master@dc5544839735faaa04075e0d9e021ddba721f3bb:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/replicate/wide/run_1
	test_runner.go:801: test timed out (10m0s)

More

Artifacts: /replicate/wide

See this test on roachdash
powered by pkg/cmd/internal/issues

tbg added a commit to tbg/cockroach that referenced this issue Nov 25, 2020
This used to live in the replicate queue, but there are other
entry points to replication changes, notably the store rebalancer
which caused cockroachdb#54444.

Move the check into the guts of replication changes, where it is
guaranteed to be invoked.

Fixes cockroachdb#50729
Touches cockroachdb#54444 (release-20.2)

Release note (bug fix): in rare situations, an automated replication
change could result in a loss of quorum. This would require down nodes
and a simultaneous change in the replication factor. Note that a change
in the replication factor can occur automatically if the cluster
consists of fewer than five available nodes. Experimentally, the
likelihood of encountering this issue, even under contrived conditions,
was small.
craig bot pushed a commit that referenced this issue Dec 3, 2020
56735: kvserverpb: move quorum safeguard into execChangeReplicasTxn r=aayushshah15 a=tbg

This used to live in the replicate queue, but there are other
entry points to replication changes, notably the store rebalancer
which caused #54444.

Move the check into the guts of replication changes, where it is
guaranteed to be invoked.

Fixes #50729
Touches #54444 (release-20.2)

@aayushshah15 only requesting your review since you're in the area.
Feel free to opt out.

Release note (bug fix): in rare situations, an automated replication
change could result in a loss of quorum. This would require down nodes
and a simultaneous change in the replication factor. Note that a change
in the replication factor can occur automatically if the cluster
consists of fewer than five available nodes. Experimentally, the
likelihood of encountering this issue, even under contrived conditions,
was small.
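To illustrate the idea behind the fix, here is a minimal sketch of a quorum safeguard placed in the one function that every replication change passes through, rather than in a single caller. It is an illustration under stated assumptions, not the code from #56735: `ReplicaID`, `checkQuorumAfterChange`, `execChange`, and the `isLive` callback are hypothetical names, and the real check in `execChangeReplicasTxn` may differ in its details.

```go
package main

import (
	"errors"
	"fmt"
)

// ReplicaID is a hypothetical stand-in for a replica descriptor.
type ReplicaID int

// errLosesQuorum marks a change that would leave no live majority.
var errLosesQuorum = errors.New("replication change would lose quorum")

// checkQuorumAfterChange returns an error if fewer than a majority of the
// proposed replicas are live. isLive is an assumed liveness callback.
func checkQuorumAfterChange(proposed []ReplicaID, isLive func(ReplicaID) bool) error {
	if len(proposed) == 0 {
		return errLosesQuorum
	}
	live := 0
	for _, r := range proposed {
		if isLive(r) {
			live++
		}
	}
	quorum := len(proposed)/2 + 1
	if live < quorum {
		return fmt.Errorf("%w: %d of %d replicas live, need %d",
			errLosesQuorum, live, len(proposed), quorum)
	}
	return nil
}

// execChange models the common execution path: any caller (a replicate
// queue, a store rebalancer, ...) gets the safeguard because it lives here.
func execChange(proposed []ReplicaID, isLive func(ReplicaID) bool, apply func()) error {
	if err := checkQuorumAfterChange(proposed, isLive); err != nil {
		return err
	}
	apply()
	return nil
}

func main() {
	down := map[ReplicaID]bool{2: true, 3: true}
	isLive := func(r ReplicaID) bool { return !down[r] }

	// Five replicas with two down: three live, which is a majority.
	fmt.Println(execChange([]ReplicaID{1, 2, 3, 4, 5}, isLive, func() {}))

	// Shrinking to {1, 2, 3} with 2 and 3 down leaves one live of three.
	fmt.Println(execChange([]ReplicaID{1, 2, 3}, isLive, func() {}))
}
```

The design point is placement: a check that lives only in one caller is skipped by every other entry point, whereas a check in the shared execution path cannot be bypassed.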


Co-authored-by: Tobias Grieger <[email protected]>
@craig craig bot closed this as completed in 5178559 Dec 3, 2020