stability: raft panic on delta #9946

Closed

mberhault opened this issue Oct 13, 2016 · 10 comments

@mberhault (Contributor)

sha: ce09bd8 (from Oct 6th)

Node 104.196.31.237 died with:

panic: 6 state.commit 780682 is out of range [780994, 781065] [recovered]
        panic: 6 state.commit 780682 is out of range [780994, 781065]

goroutine 161 [running]:
panic(0x161be60, 0xc425842270)
        /usr/local/go/src/runtime/panic.go:500 +0x1a1
github.com/cockroachdb/cockroach/cli.initBacktrace.func2(0x161be60, 0xc425842270)
        /go/src/github.com/cockroachdb/cockroach/cli/backtrace.go:98 +0xca
github.com/cockroachdb/cockroach/util/stop.(*Stopper).Recover(0xc420408000)
        /go/src/github.com/cockroachdb/cockroach/util/stop/stopper.go:177 +0x56
panic(0x161be60, 0xc425842270)
        /usr/local/go/src/runtime/panic.go:458 +0x243
github.com/cockroachdb/cockroach/storage.(*raftLogger).Panicf(0xc4258421e0, 0x17fe054, 0x2b, 0xc425e2f540, 0x4, 0x4)
        /go/src/github.com/cockroachdb/cockroach/storage/raft.go:111 +0x107
github.com/coreos/etcd/raft.(*raft).loadState(0xc423056780, 0x45a, 0x1, 0xbe98a, 0x0, 0x0, 0x0)
        /go/src/github.com/coreos/etcd/raft/raft.go:1091 +0x1db
github.com/coreos/etcd/raft.newRaft(0xc4270672d0, 0x0)
        /go/src/github.com/coreos/etcd/raft/raft.go:289 +0xc9b
github.com/coreos/etcd/raft.NewRawNode(0xc4270672d0, 0x0, 0x0, 0x0, 0x2575de0, 0x2205890, 0x196da20)
        /go/src/github.com/coreos/etcd/raft/rawnode.go:79 +0x71
github.com/cockroachdb/cockroach/storage.(*Replica).withRaftGroupLocked(0xc420d03500, 0x196c000, 0xc4270677f0, 0xa7a66b, 0xc420d035a0)
        /go/src/github.com/cockroachdb/cockroach/storage/replica.go:414 +0x28a
github.com/cockroachdb/cockroach/storage.(*Replica).withRaftGroup(0xc420d03500, 0xc4270677f0, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/storage/replica.go:475 +0x92
github.com/cockroachdb/cockroach/storage.(*Store).processRaftRequest(0xc4201c2c00, 0x7f091fc590a0, 0xc424bf1920, 0xc42440c780, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/storage/store.go:2576 +0xb83
github.com/cockroachdb/cockroach/storage.(*Store).processRequestQueue(0xc4201c2c00, 0x2569)
        /go/src/github.com/cockroachdb/cockroach/storage/store.go:2774 +0x180
github.com/cockroachdb/cockroach/storage.(*raftScheduler).worker(0xc4203efce0, 0xc420408000)
        /go/src/github.com/cockroachdb/cockroach/storage/scheduler.go:204 +0x308
github.com/cockroachdb/cockroach/storage.(*raftScheduler).Start.func1()
        /go/src/github.com/cockroachdb/cockroach/storage/scheduler.go:160 +0x33
github.com/cockroachdb/cockroach/util/stop.(*Stopper).RunWorker.func1(0xc420408000, 0xc4203f8960)
        /go/src/github.com/cockroachdb/cockroach/util/stop/stopper.go:191 +0x7d
created by github.com/cockroachdb/cockroach/util/stop.(*Stopper).RunWorker
        /go/src/github.com/cockroachdb/cockroach/util/stop/stopper.go:192 +0x66
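
For context on what the panic means: when the Raft group is recreated, etcd/raft validates the durable HardState against the local raft log and panics if the recorded commit index falls outside the log's bounds. A minimal sketch of that invariant, with the numbers from the panic plugged in (the standalone helper and main below are illustrative, not the actual etcd/raft code):

```go
package main

import "fmt"

// checkHardState paraphrases the invariant etcd/raft enforces in loadState
// when a node (re)creates its Raft group: the durable HardState.Commit must
// fall within [committed, lastIndex] of the local log. If the on-disk log has
// been truncated past the recorded commit index, the check fails and raft
// panics, which is what the stack trace above shows.
func checkHardState(commit, logCommitted, logLastIndex uint64) error {
	if commit < logCommitted || commit > logLastIndex {
		return fmt.Errorf("state.commit %d is out of range [%d, %d]",
			commit, logCommitted, logLastIndex)
	}
	return nil
}

func main() {
	// Values from the panic: HardState.Commit is 780682, while the surviving
	// log only covers [780994, 781065].
	if err := checkHardState(780682, 780994, 781065); err != nil {
		fmt.Println("would panic:", err)
	}
}
```
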
@mberhault (Contributor, Author)

Upgraded to SHA 9ef38d7; still failing.

@petermattis (Collaborator)

That SHA does not contain PR #9814. But the stack trace here is different and we're definitely holding Replica.raftMu, so it's possible this is a completely different problem from #9814.

@petermattis (Collaborator)

#9841 was a discrepancy between the on-disk and in-memory state. Since the node is still failing with a new SHA (an in-memory discrepancy would not survive a restart), we've corrupted the on-disk Raft state for the range. That's worrisome.

@mberhault Is it only one node that is crashing like this, or are there multiple?

@bdarnell I'll try to do some debugging on this today, but could use assistance here.

@mberhault (Contributor, Author)

It's the only one dying that way.

@bdarnell (Contributor)

I'm looking, but there's not much help in the logs because it's restarted so much that we no longer have logs for the first failure. Backtrace appears to be failing; the following line appears before each failure:

[bcd] 2016/10/13 11:53:41 Tracer failed to run: fork/exec /opt/backtrace/bin/ptrace: cannot allocate memory

I'm not sure whether that's an actual memory problem or not.

From cockroach debug raft-log /mnt/data1 9577 (the range ID appears in the full log file but not in the panic quoted above), we see that the oldest remaining log entry (which is consistent with the indexes given in the message) is empty, indicating either an election or a failed membership change:

/Local/RangeID/9577/u/RaftLog/logIndex:780995: Term:1749 Index:780995 : EMPTY
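
For readers unfamiliar with empty raft entries: etcd/raft appends an entry with no payload both when a newly elected leader starts its term and when a pending or rejected membership change is replaced with a no-op, which is why the dump alone can't distinguish the two cases mentioned above. A hypothetical classifier over dumped entries (types come from etcd's raftpb; the helper itself is not real CockroachDB code):

```go
package main

import (
	"fmt"

	"github.com/coreos/etcd/raft/raftpb"
)

// classifyEntry illustrates why an empty entry is ambiguous: an EntryNormal
// with no data can be a new leader's no-op or a neutralized conf change.
func classifyEntry(ent raftpb.Entry) string {
	switch {
	case ent.Type == raftpb.EntryNormal && len(ent.Data) == 0:
		return "EMPTY (election no-op or failed membership change)"
	case ent.Type == raftpb.EntryConfChange:
		return "conf change"
	default:
		return "normal command"
	}
}

func main() {
	// The oldest surviving entry from the debug output above.
	fmt.Println(classifyEntry(raftpb.Entry{Term: 1749, Index: 780995, Type: raftpb.EntryNormal}))
}
```
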

The failing replica is n7,s13,r9577. Here's the last few range descriptor changes for this range, as observed by the store n7,s13. Note that that store is not a member of the range, nor has it been for the last several membership changes:

1476316704.227014825,0 /Local/Range/"\xbb\x89\xfd\a\x04\x1e\xc54\xeb{?\x12ef61a5b0-53c6-4dc5-be38-cbebfccda85b\x00\x01\xf8\ni+"/RangeDescriptor: [/Table/51/1/505562890517576511/"ef61a5b0-53c6-4dc5-be38-cbebfccda85b"/682283, /Table/51/1/505898145147312142/"fea22909-aafc-4dd6-970b-1ba5c50f4498"/1942317)
        Raw:range_id:9577 start_key:"\273\211\375\007\004\036\3054\353{?\022ef61a5b0-53c6-4dc5-be38-cbebfccda85b\000\001\370\ni+" end_key:"\273\211\375\007\005O\256\302\010\330\016\022fea22909-aafc-4dd6-970b-1ba5c50f4498\000\001\370\035\243-" replicas:<node_id:5 store_id:10 replica_id:4 > replicas:<node_id:10 store_id:20 replica_id:2 > next_replica_id:6 

1476312198.261206535,0 /Local/Range/"\xbb\x89\xfd\a\x04\x1e\xc54\xeb{?\x12ef61a5b0-53c6-4dc5-be38-cbebfccda85b\x00\x01\xf8\ni+"/RangeDescriptor: [/Table/51/1/505562890517576511/"ef61a5b0-53c6-4dc5-be38-cbebfccda85b"/682283, /Table/51/1/505898145147312142/"fea22909-aafc-4dd6-970b-1ba5c50f4498"/1942317)
        Raw:range_id:9577 start_key:"\273\211\375\007\004\036\3054\353{?\022ef61a5b0-53c6-4dc5-be38-cbebfccda85b\000\001\370\ni+" end_key:"\273\211\375\007\005O\256\302\010\330\016\022fea22909-aafc-4dd6-970b-1ba5c50f4498\000\001\370\035\243-" replicas:<node_id:5 store_id:10 replica_id:4 > replicas:<node_id:10 store_id:20 replica_id:2 > replicas:<node_id:9 store_id:18 replica_id:5 > next_replica_id:6 

1476311436.012667533,0 /Local/Range/"\xbb\x89\xfd\a\x04\x1e\xc54\xeb{?\x12ef61a5b0-53c6-4dc5-be38-cbebfccda85b\x00\x01\xf8\ni+"/RangeDescriptor: [/Table/51/1/505562890517576511/"ef61a5b0-53c6-4dc5-be38-cbebfccda85b"/682283, /Table/51/1/505898145147312142/"fea22909-aafc-4dd6-970b-1ba5c50f4498"/1942317)
        Raw:range_id:9577 start_key:"\273\211\375\007\004\036\3054\353{?\022ef61a5b0-53c6-4dc5-be38-cbebfccda85b\000\001\370\ni+" end_key:"\273\211\375\007\005O\256\302\010\330\016\022fea22909-aafc-4dd6-970b-1ba5c50f4498\000\001\370\035\243-" replicas:<node_id:5 store_id:10 replica_id:4 > replicas:<node_id:10 store_id:20 replica_id:2 > next_replica_id:5 

n7,s13 appears several days earlier as replica_id:1, so it was present when this range was created via split:

1476025232.695883253,0 /Local/Range/"\xbb\x89\xfd\a\x04\x1e\xc54\xeb{?\x12ef61a5b0-53c6-4dc5-be38-cbebfccda85b\x00\x01\xf8\ni+"/RangeDescriptor: [/Table/51/1/505562890517576511/"ef61a5b0-53c6-4dc5-be38-cbebfccda85b"/682283, /Table/51/1/505898145147312142/"fea22909-aafc-4dd6-970b-1ba5c50f4498"/1942317)
        Raw:range_id:9577 start_key:"\273\211\375\007\004\036\3054\353{?\022ef61a5b0-53c6-4dc5-be38-cbebfccda85b\000\001\370\ni+" end_key:"\273\211\375\007\005O\256\302\010\330\016\022fea22909-aafc-4dd6-970b-1ba5c50f4498\000\001\370\035\243-" replicas:<node_id:7 store_id:13 replica_id:1 > replicas:<node_id:10 store_id:20 replica_id:2 > replicas:<node_id:9 store_id:17 replica_id:3 > next_replica_id:4 

The other store on node 7 never appears as a member of this range, which appears to rule out any problems related to the two-stores-per-node configuration. But I don't understand why this replica has not only escaped garbage collection, but has also been keeping up to date with membership changes that it shouldn't have known about.
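
To make the GC puzzle concrete, the question the replica GC path should be answering is whether this store still appears in the authoritative copy of the range descriptor. A rough sketch using stand-in types (not CockroachDB's actual API) and the replica sets from the descriptors quoted above:

```go
package main

import "fmt"

// Stand-in types for illustration only.
type ReplicaDescriptor struct {
	NodeID, StoreID, ReplicaID int
}

type RangeDescriptor struct {
	RangeID  int
	Replicas []ReplicaDescriptor
}

// shouldGC reports whether a local replica on the given store is stale: if
// the authoritative copy of the range descriptor no longer lists this store,
// the local replica ought to be garbage collected.
func shouldGC(current RangeDescriptor, storeID int) bool {
	for _, r := range current.Replicas {
		if r.StoreID == storeID {
			return false
		}
	}
	return true
}

func main() {
	// The latest descriptor seen above lists only stores 10 and 20, so the
	// replica on store 13 should have been eligible for GC.
	current := RangeDescriptor{
		RangeID: 9577,
		Replicas: []ReplicaDescriptor{
			{NodeID: 5, StoreID: 10, ReplicaID: 4},
			{NodeID: 10, StoreID: 20, ReplicaID: 2},
		},
	}
	fmt.Println("GC replica on store 13:", shouldGC(current, 13))
}
```
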

@bdarnell (Contributor)

The most recent copies of range descriptor 9577 are:

[email protected]: 1476316704.227014825,0 /Local/Range/"\xbb\x89\xfd\a\x04\x1e\xc54\xeb{?\x12ef61a5b0-53c6-4dc5-be38-cbebfccda85b\x00\x01\xf8\ni+"/RangeDescriptor: [/Table/51/1/505562890517576511/"ef61a5b0-53c6-4dc5-be38-cbebfccda85b"/682283, /Table/51/1/505898145147312142/"fea22909-aafc-4dd6-970b-1ba5c50f4498"/1942317)
range-descriptors1.txt:[email protected]:    Raw:range_id:9577 start_key:"\273\211\375\007\004\036\3054\353{?\022ef61a5b0-53c6-4dc5-be38-cbebfccda85b\000\001\370\ni+" end_key:"\273\211\375\007\005O\256\302\010\330\016\022fea22909-aafc-4dd6-970b-1ba5c50f4498\000\001\370\035\243-" replicas:<node_id:5 store_id:10 replica_id:4 > replicas:<node_id:10 store_id:20 replica_id:2 > next_replica_id:6 
--
[email protected]: 1476678090.460871340,0 /Local/Range/"\xbb\x89\xfd\a\x04\x1e\xc54\xeb{?\x12ef61a5b0-53c6-4dc5-be38-cbebfccda85b\x00\x01\xf8\ni+"/RangeDescriptor: [/Table/51/1/505562890517576511/"ef61a5b0-53c6-4dc5-be38-cbebfccda85b"/682283, /Table/51/1/505898145147312142/"fea22909-aafc-4dd6-970b-1ba5c50f4498"/1942317)
range-descriptors2.txt:[email protected]:    Raw:range_id:9577 start_key:"\273\211\375\007\004\036\3054\353{?\022ef61a5b0-53c6-4dc5-be38-cbebfccda85b\000\001\370\ni+" end_key:"\273\211\375\007\005O\256\302\010\330\016\022fea22909-aafc-4dd6-970b-1ba5c50f4498\000\001\370\035\243-" replicas:<node_id:5 store_id:10 replica_id:4 > replicas:<node_id:10 store_id:20 replica_id:2 > replicas:<node_id:2 store_id:4 replica_id:7 > next_replica_id:8 
[email protected]: 1476678090.460871340,0 /Local/Range/"\xbb\x89\xfd\a\x04\x1e\xc54\xeb{?\x12ef61a5b0-53c6-4dc5-be38-cbebfccda85b\x00\x01\xf8\ni+"/RangeDescriptor: [/Table/51/1/505562890517576511/"ef61a5b0-53c6-4dc5-be38-cbebfccda85b"/682283, /Table/51/1/505898145147312142/"fea22909-aafc-4dd6-970b-1ba5c50f4498"/1942317)
range-descriptors2.txt:[email protected]:   Raw:range_id:9577 start_key:"\273\211\375\007\004\036\3054\353{?\022ef61a5b0-53c6-4dc5-be38-cbebfccda85b\000\001\370\ni+" end_key:"\273\211\375\007\005O\256\302\010\330\016\022fea22909-aafc-4dd6-970b-1ba5c50f4498\000\001\370\035\243-" replicas:<node_id:5 store_id:10 replica_id:4 > replicas:<node_id:10 store_id:20 replica_id:2 > replicas:<node_id:2 store_id:4 replica_id:7 > next_replica_id:8 
[email protected]: 1476678090.460871340,0 /Local/Range/"\xbb\x89\xfd\a\x04\x1e\xc54\xeb{?\x12ef61a5b0-53c6-4dc5-be38-cbebfccda85b\x00\x01\xf8\ni+"/RangeDescriptor: [/Table/51/1/505562890517576511/"ef61a5b0-53c6-4dc5-be38-cbebfccda85b"/682283, /Table/51/1/505898145147312142/"fea22909-aafc-4dd6-970b-1ba5c50f4498"/1942317)
range-descriptors2.txt:[email protected]:   Raw:range_id:9577 start_key:"\273\211\375\007\004\036\3054\353{?\022ef61a5b0-53c6-4dc5-be38-cbebfccda85b\000\001\370\ni+" end_key:"\273\211\375\007\005O\256\302\010\330\016\022fea22909-aafc-4dd6-970b-1ba5c50f4498\000\001\370\035\243-" replicas:<node_id:5 store_id:10 replica_id:4 > replicas:<node_id:10 store_id:20 replica_id:2 > replicas:<node_id:2 store_id:4 replica_id:7 > next_replica_id:8 

So there are three replicas in agreement and four days ahead of the one that's crashing.

We see a replica change transaction involving store 13 here:

[email protected]: 1476678083.311846024,0 /Local/Range/"\xbb\x89\xfd\a\x04\x1e\xc54\xeb{?\x12ef61a5b0-53c6-4dc5-be38-cbebfccda85b\x00\x01\xf8\ni+"/RangeDescriptor: [/Table/51/1/505562890517576511/"ef61a5b0-53c6-4dc5-be38-cbebfccda85b"/682283, /Table/51/1/505898145147312142/"fea22909-aafc-4dd6-970b-1ba5c50f4498"/1942317)
range-descriptors2.txt:[email protected]:   Raw:range_id:9577 start_key:"\273\211\375\007\004\036\3054\353{?\022ef61a5b0-53c6-4dc5-be38-cbebfccda85b\000\001\370\ni+" end_key:"\273\211\375\007\005O\256\302\010\330\016\022fea22909-aafc-4dd6-970b-1ba5c50f4498\000\001\370\035\243-" replicas:<node_id:5 store_id:10 replica_id:4 > replicas:<node_id:10 store_id:20 replica_id:2 > next_replica_id:7 
--
[email protected]: 1476322983.106191807,0 /Local/Range/"\xbb\x89\xfd\a\x04\x1e\xc54\xeb{?\x12ef61a5b0-53c6-4dc5-be38-cbebfccda85b\x00\x01\xf8\ni+"/RangeDescriptor: [/Table/51/1/505562890517576511/"ef61a5b0-53c6-4dc5-be38-cbebfccda85b"/682283, /Table/51/1/505898145147312142/"fea22909-aafc-4dd6-970b-1ba5c50f4498"/1942317)
range-descriptors2.txt:[email protected]:   Raw:range_id:9577 start_key:"\273\211\375\007\004\036\3054\353{?\022ef61a5b0-53c6-4dc5-be38-cbebfccda85b\000\001\370\ni+" end_key:"\273\211\375\007\005O\256\302\010\330\016\022fea22909-aafc-4dd6-970b-1ba5c50f4498\000\001\370\035\243-" replicas:<node_id:5 store_id:10 replica_id:4 > replicas:<node_id:10 store_id:20 replica_id:2 > replicas:<node_id:7 store_id:13 replica_id:6 > next_replica_id:7 
--
[email protected]: 1476316704.227014825,0 /Local/Range/"\xbb\x89\xfd\a\x04\x1e\xc54\xeb{?\x12ef61a5b0-53c6-4dc5-be38-cbebfccda85b\x00\x01\xf8\ni+"/RangeDescriptor: [/Table/51/1/505562890517576511/"ef61a5b0-53c6-4dc5-be38-cbebfccda85b"/682283, /Table/51/1/505898145147312142/"fea22909-aafc-4dd6-970b-1ba5c50f4498"/1942317)
range-descriptors2.txt:[email protected]:   Raw:range_id:9577 start_key:"\273\211\375\007\004\036\3054\353{?\022ef61a5b0-53c6-4dc5-be38-cbebfccda85b\000\001\370\ni+" end_key:"\273\211\375\007\005O\256\302\010\330\016\022fea22909-aafc-4dd6-970b-1ba5c50f4498\000\001\370\035\243-" replicas:<node_id:5 store_id:10 replica_id:4 > replicas:<node_id:10 store_id:20 replica_id:2 > next_replica_id:6 

The replica was added to store 13 at Thu Oct 13 09:43:03 2016, but the latest copy of the range descriptor on store 13 is from Thu Oct 13 07:58:24 2016. The replica was removed four days later (I think the four-day gap is just the time when the cluster was down). I think what happened is that something went wrong while store 13 was applying its preemptive snapshot, so it was non-functional after being added.
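
One way to state the consistency requirement implied by that diagnosis: once a snapshot truncates the log up to some index, the durable HardState.Commit must not be left behind that index, or the next startup trips the "out of range" panic above. The sketch below only illustrates that requirement; it is not the actual CockroachDB snapshot-application code or the eventual fix:

```go
package main

import (
	"fmt"

	"github.com/coreos/etcd/raft/raftpb"
)

// applySnapshotHardState shows the kind of adjustment snapshot application
// needs: if the durable HardState lags behind the snapshot's index or term,
// it must be advanced so that the truncated log and the HardState agree.
// Illustrative only.
func applySnapshotHardState(hs raftpb.HardState, snapIndex, snapTerm uint64) raftpb.HardState {
	if hs.Commit < snapIndex {
		hs.Commit = snapIndex
	}
	if hs.Term < snapTerm {
		hs.Term = snapTerm
	}
	return hs
}

func main() {
	// A stale HardState (commit 780682) combined with truncation up to index
	// 780994 reproduces the shape of this crash; bumping Commit keeps the two
	// pieces of on-disk state consistent.
	hs := applySnapshotHardState(raftpb.HardState{Term: 1749, Commit: 780682}, 780994, 1749)
	fmt.Printf("commit after snapshot: %d\n", hs.Commit)
}
```
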

I don't see anything here that explains why nodes other than the one with the raft state corruption are running out of memory and crashing, so that's the next thing to investigate.

I think we will likely be able to recover this without wiping the whole cluster by wiping store 13 and letting the cluster repair itself. But we need to figure out why other nodes are failing first.

@bdarnell (Contributor)

Filed a separate issue for the memory usage (which looks unrelated): #10050

@petermattis (Collaborator)

@bdarnell Anything else you think we should investigate here? There were a number of bugs fixed recently which might have been related.

petermattis removed their assignment Dec 1, 2016
@tamird (Contributor) commented Dec 1, 2016

This is probably #11591, no?

EDIT: no, probably not, since the raft log is not empty.

@bdarnell (Contributor) commented Dec 2, 2016

It looks more like #9037, which was fixed in #10690. I think we can consider this closed unless it starts happening again.

bdarnell closed this as completed Dec 2, 2016