stability: raft panic on delta #9946

Closed

mberhault opened this issue Oct 13, 2016 · 10 comments

@mberhault (Contributor)

sha: ce09bd8 (from Oct 6th)

Node 104.196.31.237 died with:

panic: 6 state.commit 780682 is out of range [780994, 781065] [recovered]
        panic: 6 state.commit 780682 is out of range [780994, 781065]

goroutine 161 [running]:
panic(0x161be60, 0xc425842270)
        /usr/local/go/src/runtime/panic.go:500 +0x1a1
github.com/cockroachdb/cockroach/cli.initBacktrace.func2(0x161be60, 0xc425842270)
        /go/src/github.com/cockroachdb/cockroach/cli/backtrace.go:98 +0xca
github.com/cockroachdb/cockroach/util/stop.(*Stopper).Recover(0xc420408000)
        /go/src/github.com/cockroachdb/cockroach/util/stop/stopper.go:177 +0x56
panic(0x161be60, 0xc425842270)
        /usr/local/go/src/runtime/panic.go:458 +0x243
github.com/cockroachdb/cockroach/storage.(*raftLogger).Panicf(0xc4258421e0, 0x17fe054, 0x2b, 0xc425e2f540, 0x4, 0x4)
        /go/src/github.com/cockroachdb/cockroach/storage/raft.go:111 +0x107
github.com/coreos/etcd/raft.(*raft).loadState(0xc423056780, 0x45a, 0x1, 0xbe98a, 0x0, 0x0, 0x0)
        /go/src/github.com/coreos/etcd/raft/raft.go:1091 +0x1db
github.com/coreos/etcd/raft.newRaft(0xc4270672d0, 0x0)
        /go/src/github.com/coreos/etcd/raft/raft.go:289 +0xc9b
github.com/coreos/etcd/raft.NewRawNode(0xc4270672d0, 0x0, 0x0, 0x0, 0x2575de0, 0x2205890, 0x196da20)
        /go/src/github.com/coreos/etcd/raft/rawnode.go:79 +0x71
github.com/cockroachdb/cockroach/storage.(*Replica).withRaftGroupLocked(0xc420d03500, 0x196c000, 0xc4270677f0, 0xa7a66b, 0xc420d035a0)
        /go/src/github.com/cockroachdb/cockroach/storage/replica.go:414 +0x28a
github.com/cockroachdb/cockroach/storage.(*Replica).withRaftGroup(0xc420d03500, 0xc4270677f0, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/storage/replica.go:475 +0x92
github.com/cockroachdb/cockroach/storage.(*Store).processRaftRequest(0xc4201c2c00, 0x7f091fc590a0, 0xc424bf1920, 0xc42440c780, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/storage/store.go:2576 +0xb83
github.com/cockroachdb/cockroach/storage.(*Store).processRequestQueue(0xc4201c2c00, 0x2569)
        /go/src/github.com/cockroachdb/cockroach/storage/store.go:2774 +0x180
github.com/cockroachdb/cockroach/storage.(*raftScheduler).worker(0xc4203efce0, 0xc420408000)
        /go/src/github.com/cockroachdb/cockroach/storage/scheduler.go:204 +0x308
github.com/cockroachdb/cockroach/storage.(*raftScheduler).Start.func1()
        /go/src/github.com/cockroachdb/cockroach/storage/scheduler.go:160 +0x33
github.com/cockroachdb/cockroach/util/stop.(*Stopper).RunWorker.func1(0xc420408000, 0xc4203f8960)
        /go/src/github.com/cockroachdb/cockroach/util/stop/stopper.go:191 +0x7d
created by github.com/cockroachdb/cockroach/util/stop.(*Stopper).RunWorker
        /go/src/github.com/cockroachdb/cockroach/util/stop/stopper.go:192 +0x66
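
For context on what the panic means: when the Raft group is recreated, etcd/raft validates the durable HardState against the local raft log and panics if the recorded commit index falls outside the log's bounds. A minimal sketch of that invariant, with the numbers from the panic plugged in (the standalone helper and main below are illustrative, not the actual etcd/raft code):

```go
package main

import "fmt"

// checkHardState paraphrases the invariant etcd/raft enforces in loadState
// when a node (re)creates its Raft group: the durable HardState.Commit must
// fall within [committed, lastIndex] of the local log. If the on-disk log has
// been truncated past the recorded commit index, the check fails and raft
// panics, which is what the stack trace above shows.
func checkHardState(commit, logCommitted, logLastIndex uint64) error {
	if commit < logCommitted || commit > logLastIndex {
		return fmt.Errorf("state.commit %d is out of range [%d, %d]",
			commit, logCommitted, logLastIndex)
	}
	return nil
}

func main() {
	// Values from the panic: HardState.Commit is 780682, while the surviving
	// log only covers [780994, 781065].
	if err := checkHardState(780682, 780994, 781065); err != nil {
		fmt.Println("would panic:", err)
	}
}
```
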
@mberhault (Contributor, Author)

Upgraded to SHA 9ef38d7; still failing.

@petermattis (Collaborator)

That SHA does not contain PR #9814. But the stack trace here is different and we're definitely holding Replica.raftMu, so it's possible this is a completely different problem from #9814.

@petermattis (Collaborator)

#9841 was a discrepancy between the on-disk and in-memory state. Since the node is still failing with a new SHA (an in-memory discrepancy would not survive a restart), we've corrupted the on-disk Raft state for the range. That's worrisome.

@mberhault Is it only one node that is crashing like this, or are there multiple?

@bdarnell I'll try to do some debugging on this today, but could use assistance here.

@mberhault (Contributor, Author)

It's the only one dying that way.

@bdarnell (Contributor)

I'm looking, but there's not much help in the logs because it's restarted so much that we no longer have logs for the first failure. Backtrace appears to be failing; the following line appears before each failure:

[bcd] 2016/10/13 11:53:41 Tracer failed to run: fork/exec /opt/backtrace/bin/ptrace: cannot allocate memory

I'm not sure whether that's an actual memory problem or not.

From cockroach debug raft-log /mnt/data1 9577 (the range ID appears in the full log file but not in the panic quoted above), we see that the oldest remaining log entry (which is consistent with the indexes given in the message) is empty, indicating either an election or a failed membership change:

/Local/RangeID/9577/u/RaftLog/logIndex:780995: Term:1749 Index:780995 : EMPTY
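
For readers unfamiliar with empty raft entries: etcd/raft appends an entry with no payload both when a newly elected leader starts its term and when a pending or rejected membership change is replaced with a no-op, which is why the dump alone can't distinguish the two cases mentioned above. A hypothetical classifier over dumped entries (types come from etcd's raftpb; the helper itself is not real CockroachDB code):

```go
package main

import (
	"fmt"

	"github.com/coreos/etcd/raft/raftpb"
)

// classifyEntry illustrates why an empty entry is ambiguous: an EntryNormal
// with no data can be a new leader's no-op or a neutralized conf change.
func classifyEntry(ent raftpb.Entry) string {
	switch {
	case ent.Type == raftpb.EntryNormal && len(ent.Data) == 0:
		return "EMPTY (election no-op or failed membership change)"
	case ent.Type == raftpb.EntryConfChange:
		return "conf change"
	default:
		return "normal command"
	}
}

func main() {
	// The oldest surviving entry from the debug output above.
	fmt.Println(classifyEntry(raftpb.Entry{Term: 1749, Index: 780995, Type: raftpb.EntryNormal}))
}
```
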

The failing replica is n7,s13,r9577. Here's the last few range descriptor changes for this range, as observed by the store n7,s13. Note that that store is not a member of the range, nor has it been for the last several membership changes:

1476316704.227014825,0 /Local/Range/"\xbb\x89\xfd\a\x04\x1e\xc54\xeb{?\x12ef61a5b0-53c6-4dc5-be38-cbebfccda85b\x00\x01\xf8\ni+"/RangeDescriptor: [/Table/51/1/505562890517576511/"ef61a5b0-53c6-4dc5-be38-cbebfccda85b"/682283, /Table/51/1/505898145147312142/"fea22909-aafc-4dd6-970b-1ba5c50f4498"/1942317)
        Raw:range_id:9577 start_key:"\273\211\375\007\004\036\3054\353{?\022ef61a5b0-53c6-4dc5-be38-cbebfccda85b\000\001\370\ni+" end_key:"\273\211\375\007\005O\256\302\010\330\016\022fea22909-aafc-4dd6-970b-1ba5c50f4498\000\001\370\035\243-" replicas:<node_id:5 store_id:10 replica_id:4 > replicas:<node_id:10 store_id:20 replica_id:2 > next_replica_id:6 

1476312198.261206535,0 /Local/Range/"\xbb\x89\xfd\a\x04\x1e\xc54\xeb{?\x12ef61a5b0-53c6-4dc5-be38-cbebfccda85b\x00\x01\xf8\ni+"/RangeDescriptor: [/Table/51/1/505562890517576511/"ef61a5b0-53c6-4dc5-be38-cbebfccda85b"/682283, /Table/51/1/505898145147312142/"fea22909-aafc-4dd6-970b-1ba5c50f4498"/1942317)
        Raw:range_id:9577 start_key:"\273\211\375\007\004\036\3054\353{?\022ef61a5b0-53c6-4dc5-be38-cbebfccda85b\000\001\370\ni+" end_key:"\273\211\375\007\005O\256\302\010\330\016\022fea22909-aafc-4dd6-970b-1ba5c50f4498\000\001\370\035\243-" replicas:<node_id:5 store_id:10 replica_id:4 > replicas:<node_id:10 store_id:20 replica_id:2 > replicas:<node_id:9 store_id:18 replica_id:5 > next_replica_id:6 

1476311436.012667533,0 /Local/Range/"\xbb\x89\xfd\a\x04\x1e\xc54\xeb{?\x12ef61a5b0-53c6-4dc5-be38-cbebfccda85b\x00\x01\xf8\ni+"/RangeDescriptor: [/Table/51/1/505562890517576511/"ef61a5b0-53c6-4dc5-be38-cbebfccda85b"/682283, /Table/51/1/505898145147312142/"fea22909-aafc-4dd6-970b-1ba5c50f4498"/1942317)
        Raw:range_id:9577 start_key:"\273\211\375\007\004\036\3054\353{?\022ef61a5b0-53c6-4dc5-be38-cbebfccda85b\000\001\370\ni+" end_key:"\273\211\375\007\005O\256\302\010\330\016\022fea22909-aafc-4dd6-970b-1ba5c50f4498\000\001\370\035\243-" replicas:<node_id:5 store_id:10 replica_id:4 > replicas:<node_id:10 store_id:20 replica_id:2 > next_replica_id:5 

n7,s13 appears several days earlier as replica_id:1, so it was present when this range was created via split:

1476025232.695883253,0 /Local/Range/"\xbb\x89\xfd\a\x04\x1e\xc54\xeb{?\x12ef61a5b0-53c6-4dc5-be38-cbebfccda85b\x00\x01\xf8\ni+"/RangeDescriptor: [/Table/51/1/505562890517576511/"ef61a5b0-53c6-4dc5-be38-cbebfccda85b"/682283, /Table/51/1/505898145147312142/"fea22909-aafc-4dd6-970b-1ba5c50f4498"/1942317)
        Raw:range_id:9577 start_key:"\273\211\375\007\004\036\3054\353{?\022ef61a5b0-53c6-4dc5-be38-cbebfccda85b\000\001\370\ni+" end_key:"\273\211\375\007\005O\256\302\010\330\016\022fea22909-aafc-4dd6-970b-1ba5c50f4498\000\001\370\035\243-" replicas:<node_id:7 store_id:13 replica_id:1 > replicas:<node_id:10 store_id:20 replica_id:2 > replicas:<node_id:9 store_id:17 replica_id:3 > next_replica_id:4 

The other store on node 7 never appears as a member of this range, which appears to rule out any problems related to the two-stores-per-node configuration. But I don't understand why this replica has not only escaped garbage collection, but has also been keeping up to date with membership changes that it shouldn't have known about.
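
To make the GC puzzle concrete, the question the replica GC path should be answering is whether this store still appears in the authoritative copy of the range descriptor. A rough sketch using stand-in types (not CockroachDB's actual API) and the replica sets from the descriptors quoted above:

```go
package main

import "fmt"

// Stand-in types for illustration only.
type ReplicaDescriptor struct {
	NodeID, StoreID, ReplicaID int
}

type RangeDescriptor struct {
	RangeID  int
	Replicas []ReplicaDescriptor
}

// shouldGC reports whether a local replica on the given store is stale: if
// the authoritative copy of the range descriptor no longer lists this store,
// the local replica ought to be garbage collected.
func shouldGC(current RangeDescriptor, storeID int) bool {
	for _, r := range current.Replicas {
		if r.StoreID == storeID {
			return false
		}
	}
	return true
}

func main() {
	// The latest descriptor seen above lists only stores 10 and 20, so the
	// replica on store 13 should have been eligible for GC.
	current := RangeDescriptor{
		RangeID: 9577,
		Replicas: []ReplicaDescriptor{
			{NodeID: 5, StoreID: 10, ReplicaID: 4},
			{NodeID: 10, StoreID: 20, ReplicaID: 2},
		},
	}
	fmt.Println("GC replica on store 13:", shouldGC(current, 13))
}
```
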

@bdarnell (Contributor)

The most recent copies of range descriptor 9577 are:

[email protected]: 1476316704.227014825,0 /Local/Range/"\xbb\x89\xfd\a\x04\x1e\xc54\xeb{?\x12ef61a5b0-53c6-4dc5-be38-cbebfccda85b\x00\x01\xf8\ni+"/RangeDescriptor: [/Table/51/1/505562890517576511/"ef61a5b0-53c6-4dc5-be38-cbebfccda85b"/682283, /Table/51/1/505898145147312142/"fea22909-aafc-4dd6-970b-1ba5c50f4498"/1942317)
range-descriptors1.txt:[email protected]:    Raw:range_id:9577 start_key:"\273\211\375\007\004\036\3054\353{?\022ef61a5b0-53c6-4dc5-be38-cbebfccda85b\000\001\370\ni+" end_key:"\273\211\375\007\005O\256\302\010\330\016\022fea22909-aafc-4dd6-970b-1ba5c50f4498\000\001\370\035\243-" replicas:<node_id:5 store_id:10 replica_id:4 > replicas:<node_id:10 store_id:20 replica_id:2 > next_replica_id:6 
--
[email protected]: 1476678090.460871340,0 /Local/Range/"\xbb\x89\xfd\a\x04\x1e\xc54\xeb{?\x12ef61a5b0-53c6-4dc5-be38-cbebfccda85b\x00\x01\xf8\ni+"/RangeDescriptor: [/Table/51/1/505562890517576511/"ef61a5b0-53c6-4dc5-be38-cbebfccda85b"/682283, /Table/51/1/505898145147312142/"fea22909-aafc-4dd6-970b-1ba5c50f4498"/1942317)
range-descriptors2.txt:[email protected]:    Raw:range_id:9577 start_key:"\273\211\375\007\004\036\3054\353{?\022ef61a5b0-53c6-4dc5-be38-cbebfccda85b\000\001\370\ni+" end_key:"\273\211\375\007\005O\256\302\010\330\016\022fea22909-aafc-4dd6-970b-1ba5c50f4498\000\001\370\035\243-" replicas:<node_id:5 store_id:10 replica_id:4 > replicas:<node_id:10 store_id:20 replica_id:2 > replicas:<node_id:2 store_id:4 replica_id:7 > next_replica_id:8 
[email protected]: 1476678090.460871340,0 /Local/Range/"\xbb\x89\xfd\a\x04\x1e\xc54\xeb{?\x12ef61a5b0-53c6-4dc5-be38-cbebfccda85b\x00\x01\xf8\ni+"/RangeDescriptor: [/Table/51/1/505562890517576511/"ef61a5b0-53c6-4dc5-be38-cbebfccda85b"/682283, /Table/51/1/505898145147312142/"fea22909-aafc-4dd6-970b-1ba5c50f4498"/1942317)
range-descriptors2.txt:[email protected]:   Raw:range_id:9577 start_key:"\273\211\375\007\004\036\3054\353{?\022ef61a5b0-53c6-4dc5-be38-cbebfccda85b\000\001\370\ni+" end_key:"\273\211\375\007\005O\256\302\010\330\016\022fea22909-aafc-4dd6-970b-1ba5c50f4498\000\001\370\035\243-" replicas:<node_id:5 store_id:10 replica_id:4 > replicas:<node_id:10 store_id:20 replica_id:2 > replicas:<node_id:2 store_id:4 replica_id:7 > next_replica_id:8 
[email protected]: 1476678090.460871340,0 /Local/Range/"\xbb\x89\xfd\a\x04\x1e\xc54\xeb{?\x12ef61a5b0-53c6-4dc5-be38-cbebfccda85b\x00\x01\xf8\ni+"/RangeDescriptor: [/Table/51/1/505562890517576511/"ef61a5b0-53c6-4dc5-be38-cbebfccda85b"/682283, /Table/51/1/505898145147312142/"fea22909-aafc-4dd6-970b-1ba5c50f4498"/1942317)
range-descriptors2.txt:[email protected]:   Raw:range_id:9577 start_key:"\273\211\375\007\004\036\3054\353{?\022ef61a5b0-53c6-4dc5-be38-cbebfccda85b\000\001\370\ni+" end_key:"\273\211\375\007\005O\256\302\010\330\016\022fea22909-aafc-4dd6-970b-1ba5c50f4498\000\001\370\035\243-" replicas:<node_id:5 store_id:10 replica_id:4 > replicas:<node_id:10 store_id:20 replica_id:2 > replicas:<node_id:2 store_id:4 replica_id:7 > next_replica_id:8 

So there are three replicas in agreement and four days ahead of the one that's crashing.

We see a replica change transaction involving store 13 here:

[email protected]: 1476678083.311846024,0 /Local/Range/"\xbb\x89\xfd\a\x04\x1e\xc54\xeb{?\x12ef61a5b0-53c6-4dc5-be38-cbebfccda85b\x00\x01\xf8\ni+"/RangeDescriptor: [/Table/51/1/505562890517576511/"ef61a5b0-53c6-4dc5-be38-cbebfccda85b"/682283, /Table/51/1/505898145147312142/"fea22909-aafc-4dd6-970b-1ba5c50f4498"/1942317)
range-descriptors2.txt:[email protected]:   Raw:range_id:9577 start_key:"\273\211\375\007\004\036\3054\353{?\022ef61a5b0-53c6-4dc5-be38-cbebfccda85b\000\001\370\ni+" end_key:"\273\211\375\007\005O\256\302\010\330\016\022fea22909-aafc-4dd6-970b-1ba5c50f4498\000\001\370\035\243-" replicas:<node_id:5 store_id:10 replica_id:4 > replicas:<node_id:10 store_id:20 replica_id:2 > next_replica_id:7 
--
[email protected]: 1476322983.106191807,0 /Local/Range/"\xbb\x89\xfd\a\x04\x1e\xc54\xeb{?\x12ef61a5b0-53c6-4dc5-be38-cbebfccda85b\x00\x01\xf8\ni+"/RangeDescriptor: [/Table/51/1/505562890517576511/"ef61a5b0-53c6-4dc5-be38-cbebfccda85b"/682283, /Table/51/1/505898145147312142/"fea22909-aafc-4dd6-970b-1ba5c50f4498"/1942317)
range-descriptors2.txt:[email protected]:   Raw:range_id:9577 start_key:"\273\211\375\007\004\036\3054\353{?\022ef61a5b0-53c6-4dc5-be38-cbebfccda85b\000\001\370\ni+" end_key:"\273\211\375\007\005O\256\302\010\330\016\022fea22909-aafc-4dd6-970b-1ba5c50f4498\000\001\370\035\243-" replicas:<node_id:5 store_id:10 replica_id:4 > replicas:<node_id:10 store_id:20 replica_id:2 > replicas:<node_id:7 store_id:13 replica_id:6 > next_replica_id:7 
--
[email protected]: 1476316704.227014825,0 /Local/Range/"\xbb\x89\xfd\a\x04\x1e\xc54\xeb{?\x12ef61a5b0-53c6-4dc5-be38-cbebfccda85b\x00\x01\xf8\ni+"/RangeDescriptor: [/Table/51/1/505562890517576511/"ef61a5b0-53c6-4dc5-be38-cbebfccda85b"/682283, /Table/51/1/505898145147312142/"fea22909-aafc-4dd6-970b-1ba5c50f4498"/1942317)
range-descriptors2.txt:[email protected]:   Raw:range_id:9577 start_key:"\273\211\375\007\004\036\3054\353{?\022ef61a5b0-53c6-4dc5-be38-cbebfccda85b\000\001\370\ni+" end_key:"\273\211\375\007\005O\256\302\010\330\016\022fea22909-aafc-4dd6-970b-1ba5c50f4498\000\001\370\035\243-" replicas:<node_id:5 store_id:10 replica_id:4 > replicas:<node_id:10 store_id:20 replica_id:2 > next_replica_id:6 

The replica was added to store 13 at Thu Oct 13 09:43:03 2016, but the latest copy of the range descriptor on store 13 is from Thu Oct 13 07:58:24 2016. The replica was removed four days later (I think the four-day gap is just the time when the cluster was down). I think what happened is that something went wrong while store 13 was applying its preemptive snapshot, so it was non-functional after being added.
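
One way to state the consistency requirement implied by that diagnosis: once a snapshot truncates the log up to some index, the durable HardState.Commit must not be left behind that index, or the next startup trips the "out of range" panic above. The sketch below only illustrates that requirement; it is not the actual CockroachDB snapshot-application code or the eventual fix:

```go
package main

import (
	"fmt"

	"github.com/coreos/etcd/raft/raftpb"
)

// applySnapshotHardState shows the kind of adjustment snapshot application
// needs: if the durable HardState lags behind the snapshot's index or term,
// it must be advanced so that the truncated log and the HardState agree.
// Illustrative only.
func applySnapshotHardState(hs raftpb.HardState, snapIndex, snapTerm uint64) raftpb.HardState {
	if hs.Commit < snapIndex {
		hs.Commit = snapIndex
	}
	if hs.Term < snapTerm {
		hs.Term = snapTerm
	}
	return hs
}

func main() {
	// A stale HardState (commit 780682) combined with truncation up to index
	// 780994 reproduces the shape of this crash; bumping Commit keeps the two
	// pieces of on-disk state consistent.
	hs := applySnapshotHardState(raftpb.HardState{Term: 1749, Commit: 780682}, 780994, 1749)
	fmt.Printf("commit after snapshot: %d\n", hs.Commit)
}
```
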

I don't see anything here that explains why nodes other than the one with the raft state corruption are running out of memory and crashing, so that's the next thing to investigate.

I think we will likely be able to recover this without wiping the whole cluster by wiping store 13 and letting the cluster repair itself. But we need to figure out why other nodes are failing first.

@bdarnell (Contributor)

Filed a separate issue for the memory usage (which looks unrelated): #10050

@petermattis (Collaborator)

@bdarnell Anything else you think we should investigate here? There were a number of bugs fixed recently which might have been related.

petermattis removed their assignment Dec 1, 2016
@tamird (Contributor) commented Dec 1, 2016

This is probably #11591, no?

EDIT: no, probably not, since the raft log is not empty.

@bdarnell (Contributor) commented Dec 2, 2016

It looks more like #9037, which was fixed in #10690. I think we can consider this closed unless it starts happening again.

bdarnell closed this as completed Dec 2, 2016