Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: TestReplicateRogueRemovedNode failed under stress #41466

Closed
cockroach-teamcity opened this issue Oct 9, 2019 · 5 comments · Fixed by #41577
Closed

storage: TestReplicateRogueRemovedNode failed under stress #41466

cockroach-teamcity opened this issue Oct 9, 2019 · 5 comments · Fixed by #41577
Assignees
Labels
C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot.
Milestone

Comments

@cockroach-teamcity
Copy link
Member

SHA: https://github.com/cockroachdb/cockroach/commits/06026bab5cdb3695ddeba209c94175cd9460b170

Parameters:

TAGS=
GOFLAGS=-race -parallel=2

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=TestReplicateRogueRemovedNode PKG=github.com/cockroachdb/cockroach/pkg/storage TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1529580&tab=buildLog


I191009 13:46:37.389866 155128 gossip/gossip.go:394  [n1] NodeDescriptor set to node_id:1 address:<network_field:"tcp" address_field:"127.0.0.1:41247" > attrs:<> locality:<> ServerVersion:<major_val:0 minor_val:0 patch:0 unstable:0 > build_tag:"" started_at:0 cluster_name:"" sql_address:<network_field:"" address_field:"" > 
W191009 13:46:37.469762 155128 gossip/gossip.go:1517  [n2] no incoming or outgoing connections
I191009 13:46:37.470751 155128 gossip/gossip.go:394  [n2] NodeDescriptor set to node_id:2 address:<network_field:"tcp" address_field:"127.0.0.1:46037" > attrs:<> locality:<> ServerVersion:<major_val:0 minor_val:0 patch:0 unstable:0 > build_tag:"" started_at:0 cluster_name:"" sql_address:<network_field:"" address_field:"" > 
I191009 13:46:37.473391 155115 gossip/client.go:124  [n2] started gossip client to 127.0.0.1:41247
W191009 13:46:37.601306 155128 gossip/gossip.go:1517  [n3] no incoming or outgoing connections
I191009 13:46:37.602579 155128 gossip/gossip.go:394  [n3] NodeDescriptor set to node_id:3 address:<network_field:"tcp" address_field:"127.0.0.1:33251" > attrs:<> locality:<> ServerVersion:<major_val:0 minor_val:0 patch:0 unstable:0 > build_tag:"" started_at:0 cluster_name:"" sql_address:<network_field:"" address_field:"" > 
I191009 13:46:37.605071 155378 gossip/client.go:124  [n3] started gossip client to 127.0.0.1:41247
I191009 13:46:37.708934 155128 storage/replica_command.go:1586  [s1,r1/1:/M{in-ax}] change replicas (add [(n2,s2):2LEARNER] remove []): existing descriptor r1:/M{in-ax} [(n1,s1):1, next=2, gen=0]
I191009 13:46:37.723687 155128 storage/replica_raft.go:291  [s1,r1/1:/M{in-ax},txn=aebd41b8] proposing ADD_REPLICA[(n2,s2):2LEARNER]: after=[(n1,s1):1 (n2,s2):2LEARNER] next=3
I191009 13:46:37.747437 155128 storage/store_snapshot.go:978  [s1,r1/1:/M{in-ax}] sending LEARNER snapshot 9ea31f8a at applied index 20
I191009 13:46:37.771366 155128 storage/store_snapshot.go:1021  [s1,r1/1:/M{in-ax}] streamed snapshot to (n2,s2):2LEARNER: kv pairs: 65, log entries: 0, rate-limit: 8.0 MiB/sec, 0.02s
I191009 13:46:37.781668 155539 storage/replica_raftstorage.go:794  [s2,r1/2:{-}] applying LEARNER snapshot [id=9ea31f8a index=20]
I191009 13:46:37.792427 155539 storage/replica_raftstorage.go:815  [s2,r1/2:/M{in-ax}] applied LEARNER snapshot [total=10ms ingestion=4@1ms id=9ea31f8a index=20]
I191009 13:46:37.811038 155128 storage/replica_command.go:1586  [s1,r1/1:/M{in-ax}] change replicas (add [(n2,s2):2] remove []): existing descriptor r1:/M{in-ax} [(n1,s1):1, (n2,s2):2LEARNER, next=3, gen=1]
I191009 13:46:37.858450 155128 storage/replica_raft.go:291  [s1,r1/1:/M{in-ax},txn=4a962a10] proposing ADD_REPLICA[(n2,s2):2]: after=[(n1,s1):1 (n2,s2):2] next=3
I191009 13:46:37.911571 155128 storage/replica_command.go:1586  [s1,r1/1:/M{in-ax}] change replicas (add [(n3,s3):3LEARNER] remove []): existing descriptor r1:/M{in-ax} [(n1,s1):1, (n2,s2):2, next=3, gen=2]
I191009 13:46:37.940694 155128 storage/replica_raft.go:291  [s1,r1/1:/M{in-ax},txn=1de48c65] proposing ADD_REPLICA[(n3,s3):3LEARNER]: after=[(n1,s1):1 (n2,s2):2 (n3,s3):3LEARNER] next=4
I191009 13:46:37.997036 155128 storage/store_snapshot.go:978  [s1,r1/1:/M{in-ax}] sending LEARNER snapshot 49ed8f93 at applied index 25
I191009 13:46:37.999805 155128 storage/store_snapshot.go:1021  [s1,r1/1:/M{in-ax}] streamed snapshot to (n3,s3):3LEARNER: kv pairs: 72, log entries: 0, rate-limit: 8.0 MiB/sec, 0.02s
I191009 13:46:38.004350 155768 storage/replica_raftstorage.go:794  [s3,r1/3:{-}] applying LEARNER snapshot [id=49ed8f93 index=25]
I191009 13:46:38.008232 155768 storage/replica_raftstorage.go:815  [s3,r1/3:/M{in-ax}] applied LEARNER snapshot [total=3ms ingestion=4@1ms id=49ed8f93 index=25]
I191009 13:46:38.030970 155128 storage/replica_command.go:1586  [s1,r1/1:/M{in-ax}] change replicas (add [(n3,s3):3] remove []): existing descriptor r1:/M{in-ax} [(n1,s1):1, (n2,s2):2, (n3,s3):3LEARNER, next=4, gen=3]
I191009 13:46:38.090145 155128 storage/replica_raft.go:291  [s1,r1/1:/M{in-ax},txn=06a6ce04] proposing ADD_REPLICA[(n3,s3):3]: after=[(n1,s1):1 (n2,s2):2 (n3,s3):3] next=4
@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Oct 9, 2019
@cockroach-teamcity cockroach-teamcity added this to the 19.2 milestone Oct 9, 2019
@bdarnell
Copy link
Contributor

Not sure why it didn't pull out the right stack trace, but the real failure is a nil pointer exception in HandleRaftUncoalescedRequest. Looks like we've seen the same stack trace back in 2.1 via sentry in #36361, but closed it as unreproducible.

[13:46:39] :	 [Step 2/2] panic: runtime error: invalid memory address or nil pointer dereference [recovered]
[13:46:39] :	 [Step 2/2] 	panic: runtime error: invalid memory address or nil pointer dereference
[13:46:39] :	 [Step 2/2] [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1ff0ba3]
[13:46:39] :	 [Step 2/2] 
[13:46:39] :	 [Step 2/2] goroutine 155787 [running]:
[13:46:39] :	 [Step 2/2] github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).Recover(0xc002256960, 0x5ac8980, 0xc0078f8930)
[13:46:39] :	 [Step 2/2] 	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:181 +0x155
[13:46:39] :	 [Step 2/2] panic(0x47c8ce0, 0x8063660)
[13:46:39] :	 [Step 2/2] 	/usr/local/go/src/runtime/panic.go:522 +0x1b5
[13:46:39] :	 [Step 2/2] github.com/cockroachdb/cockroach/pkg/storage.(*Store).HandleRaftUncoalescedRequest(0x0, 0x5ac8980, 0xc0078f8930, 0xc00004c800, 0x5a8ba40, 0xc0002f1520, 0xc00127c278)
[13:46:39] :	 [Step 2/2] 	/go/src/github.com/cockroachdb/cockroach/pkg/storage/store_raft.go:148 +0xa3
[13:46:39] :	 [Step 2/2] github.com/cockroachdb/cockroach/pkg/storage.(*Store).uncoalesceBeats(0x0, 0x5ac8980, 0xc0078f8930, 0xc00127c2d0, 0x1, 0x1, 0x100000001, 0x0, 0x0, 0x300000003, ...)
[13:46:39] :	 [Step 2/2] 	/go/src/github.com/cockroachdb/cockroach/pkg/storage/store_raft.go:110 +0x34c
[13:46:39] :	 [Step 2/2] github.com/cockroachdb/cockroach/pkg/storage.(*Store).HandleRaftRequest(0x0, 0x5ac8980, 0xc0078f8930, 0xc00004c600, 0x5a8ba40, 0xc0002f1520, 0xc00127c270)
[13:46:39] :	 [Step 2/2] 	/go/src/github.com/cockroachdb/cockroach/pkg/storage/store_raft.go:129 +0x248
[13:46:39] :	 [Step 2/2] github.com/cockroachdb/cockroach/pkg/storage_test.(*mtcStoreRaftMessageHandler).HandleRaftRequest(0xc002930f30, 0x5ac8980, 0xc0078f8930, 0xc00004c600, 0x5a8ba40, 0xc0002f1520, 0x0)
[13:46:39] :	 [Step 2/2] 	/go/src/github.com/cockroachdb/cockroach/pkg/storage/client_raft_helpers_test.go:136 +0xb8
[13:46:39] :	 [Step 2/2] github.com/cockroachdb/cockroach/pkg/storage_test.(*unreliableRaftHandler).HandleRaftRequest(0xc0065bf3c0, 0x5ac8980, 0xc0078f8930, 0xc00004c600, 0x5a8ba40, 0xc0002f1520, 0x58)
[13:46:39] :	 [Step 2/2] 	/go/src/github.com/cockroachdb/cockroach/pkg/storage/client_raft_helpers_test.go:83 +0x39e
[13:46:39] :	 [Step 2/2] github.com/cockroachdb/cockroach/pkg/storage.(*RaftTransport).handleRaftRequest(0xc007334000, 0x5ac8980, 0xc0078f8930, 0xc00004c400, 0x5a8ba40, 0xc0002f1520, 0x0)
[13:46:39] :	 [Step 2/2] 	/go/src/github.com/cockroachdb/cockroach/pkg/storage/raft_transport.go:299 +0x2e0
[13:46:39] :	 [Step 2/2] github.com/cockroachdb/cockroach/pkg/storage.(*RaftTransport).RaftMessageBatch.func1.1.1(0x5b19de0, 0xc0025fb2a0, 0xc007334000, 0x5ac8980, 0xc0078f8930, 0xc00813bf58, 0x69115f)
[13:46:39] :	 [Step 2/2] 	/go/src/github.com/cockroachdb/cockroach/pkg/storage/raft_transport.go:370 +0x248
[13:46:39] :	 [Step 2/2] github.com/cockroachdb/cockroach/pkg/storage.(*RaftTransport).RaftMessageBatch.func1.1(0x5ac8980, 0xc0078f8930)
[13:46:39] :	 [Step 2/2] 	/go/src/github.com/cockroachdb/cockroach/pkg/storage/raft_transport.go:378 +0x75
[13:46:39] :	 [Step 2/2] github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1(0xc0025fb2c0, 0xc002256960, 0xc0078f8900)
[13:46:39] :	 [Step 2/2] 	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:196 +0x160
[13:46:39] :	 [Step 2/2] created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
[13:46:39] :	 [Step 2/2] 	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:189 +0xc4

@bdarnell
Copy link
Contributor

The crash is because we receive a message type that's not in this map:

sm.raftRcvdMessages[raftpb.MsgProp] = sm.RaftRcvdMsgProp
sm.raftRcvdMessages[raftpb.MsgApp] = sm.RaftRcvdMsgApp
sm.raftRcvdMessages[raftpb.MsgAppResp] = sm.RaftRcvdMsgAppResp
sm.raftRcvdMessages[raftpb.MsgVote] = sm.RaftRcvdMsgVote
sm.raftRcvdMessages[raftpb.MsgVoteResp] = sm.RaftRcvdMsgVoteResp
sm.raftRcvdMessages[raftpb.MsgPreVote] = sm.RaftRcvdMsgPreVote
sm.raftRcvdMessages[raftpb.MsgPreVoteResp] = sm.RaftRcvdMsgPreVoteResp
sm.raftRcvdMessages[raftpb.MsgSnap] = sm.RaftRcvdMsgSnap
sm.raftRcvdMessages[raftpb.MsgHeartbeat] = sm.RaftRcvdMsgHeartbeat
sm.raftRcvdMessages[raftpb.MsgHeartbeatResp] = sm.RaftRcvdMsgHeartbeatResp
sm.raftRcvdMessages[raftpb.MsgTransferLeader] = sm.RaftRcvdMsgTransferLeader
sm.raftRcvdMessages[raftpb.MsgTimeoutNow] = sm.RaftRcvdMsgTimeoutNow

Compare this to the upstream list https://github.com/etcd-io/etcd/blob/3ef2ad8e115449a7004b628a873e2629855ed468/raft/raftpb/raft.proto#L37-L55. We're missing MsgHup, MsgBeat, MsgUnreachable, MsgSnapStatus, MsgCheckQuorum, MsgReadIndex, and MsgReadIndexResp.

MsgReadIndex{,Resp} is related to a feature we don't use. The rest are all (as far as I know) internal "fake" message types that are never actually sent over the wire or seen by anything outside the raft code. It reproduces in about 10 minutes so I'll add some logging to see what message type we're encountering here.

@nvanbenschoten
Copy link
Member

I was looking into this because I hit it in #41507. I found that the message type was a MsgApp, but that the Store receiver was nil.

@bdarnell
Copy link
Contributor

Ah, in that case the recent appearance of this issue means it may be tied to the test refactoring in #40892. cc @ajwerner

@ajwerner ajwerner assigned ajwerner and unassigned bdarnell Oct 14, 2019
@ajwerner
Copy link
Contributor

Ack

craig bot pushed a commit that referenced this issue Oct 14, 2019
41577: storage: Deflake TestReplicateRogueRemovedNode r=ajwerner a=bdarnell

Some recently-refactored test harness code needed to check for nil
stores.

Fixes #41466

Release note: None

41579: sqlsmith: fix merge join to exit when it's out of indexes r=mjibson a=mjibson

Release note: None

Co-authored-by: Ben Darnell <[email protected]>
Co-authored-by: Matt Jibson <[email protected]>
@craig craig bot closed this as completed in 8315fd5 Oct 14, 2019
bdarnell added a commit to bdarnell/cockroach that referenced this issue Oct 16, 2019
Some recently-refactored test harness code needed to check for nil
stores.

Fixes cockroachdb#41466

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants