stability: replicaDescCache: clobbering {2 2 5} with {6 6 5} #7224
Backing up data/logs on nodes 4 and 5. It'll take a little while; the data dir is ~80GiB.
There's another instance of this in #7184.
Thanks @mberhault. I'll look into the data files and try to find out which Range is the offending one, and what replicas and versions of meta entries we have. We shouldn't be seeing two different replicas with the same ReplicaID, naturally.
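For context, here is a minimal Go sketch of the invariant behind the panic in the issue title; the types and the cache layout are simplified stand-ins for illustration, not the actual CockroachDB code:

```go
package main

import "fmt"

// Simplified stand-ins for the real descriptor types; names and fields
// here are assumptions for illustration only.
type ReplicaDescriptor struct {
	NodeID    int
	StoreID   int
	ReplicaID int
}

type cacheKey struct {
	RangeID   int
	ReplicaID int
}

// replicaDescCache remembers which (node, store) we believe owns a given
// ReplicaID of a range. A ReplicaID is assigned at most once per range,
// so a cached entry must never change to a different replica.
type replicaDescCache struct {
	m map[cacheKey]ReplicaDescriptor
}

func (c *replicaDescCache) insert(rangeID int, desc ReplicaDescriptor) {
	k := cacheKey{RangeID: rangeID, ReplicaID: desc.ReplicaID}
	if old, ok := c.m[k]; ok && old != desc {
		// Two distinct replicas claiming the same ReplicaID breaks the
		// invariant; this is the shape of the panic in the issue title.
		panic(fmt.Sprintf("replicaDescCache: clobbering %v with %v", old, desc))
	}
	c.m[k] = desc
}

func main() {
	c := &replicaDescCache{m: map[cacheKey]ReplicaDescriptor{}}
	c.insert(3667, ReplicaDescriptor{NodeID: 2, StoreID: 2, ReplicaID: 5})
	// A message from a different store claiming the same ReplicaID panics,
	// e.g. "clobbering {2 2 5} with {6 6 5}".
	c.insert(3667, ReplicaDescriptor{NodeID: 6, StoreID: 6, ReplicaID: 5})
}
```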
@bdarnell: sorry, that's where my update went. I apparently pasted the "gamma rollback -> clobber!" problem into the wrong bug; it was supposed to be part of this one. I'll shuffle things around a bit.
We should also back up nodes 1, 2, 3 (god knows where the meta ranges are, or who sends the Raft message that kills things).
The stacktrace is:
I updated the original issue message with more details, including the message on each node:
Node 5 is crashing with:
If I were to take a guess, I would guess that the message comes in both orders. Can you verify that for Node5? (i.e. try a bunch of times; you should see it randomly alternate between the two messages.)
Right you are. No need to run it again, just grep through the logs:
Messages with #7227 patched in: Node 4:
Node 5:
Thanks, looks like I need to update the PR (store-id switched).
range 3667 is on nodes 2, 5, and 4. Here's the history for node4:
I'm going to run the same for node5 but am anticipating that we will have to look at Node 2 as well (which is running at the moment).
Now wait for it, node5:
This means that Node5 and Node4 have conflicting range descriptors - one thinks that 6 has been added at replicaID 5, the other one thinks it's 2.
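For illustration, a small Go sketch of the comparison being described: given two copies of the same range's descriptor, find the ReplicaIDs they assign to different replicas. The types, the helper `conflictingReplicaIDs`, and the example replica sets are hypothetical, not the actual r3667 data:

```go
package main

import "fmt"

// Simplified stand-ins; field names are assumptions, not the real protos.
type ReplicaDescriptor struct {
	NodeID    int
	StoreID   int
	ReplicaID int
}

type RangeDescriptor struct {
	RangeID  int
	Replicas []ReplicaDescriptor
}

// conflictingReplicaIDs reports every ReplicaID that two copies of the same
// range descriptor assign to different (node, store) pairs.
func conflictingReplicaIDs(a, b RangeDescriptor) []string {
	byID := func(d RangeDescriptor) map[int]ReplicaDescriptor {
		m := make(map[int]ReplicaDescriptor)
		for _, r := range d.Replicas {
			m[r.ReplicaID] = r
		}
		return m
	}
	bm := byID(b)
	var out []string
	for id, ra := range byID(a) {
		if rb, ok := bm[id]; ok && ra != rb {
			out = append(out, fmt.Sprintf("replicaID %d: %v vs %v", id, ra, rb))
		}
	}
	return out
}

func main() {
	// Hypothetical replica sets, loosely modeled on the disagreement above.
	node4 := RangeDescriptor{RangeID: 3667, Replicas: []ReplicaDescriptor{
		{NodeID: 2, StoreID: 2, ReplicaID: 1},
		{NodeID: 5, StoreID: 5, ReplicaID: 3},
		{NodeID: 4, StoreID: 4, ReplicaID: 4},
		{NodeID: 6, StoreID: 6, ReplicaID: 5},
	}}
	node5 := RangeDescriptor{RangeID: 3667, Replicas: []ReplicaDescriptor{
		{NodeID: 5, StoreID: 5, ReplicaID: 3},
		{NodeID: 4, StoreID: 4, ReplicaID: 4},
		{NodeID: 2, StoreID: 2, ReplicaID: 5},
	}}
	for _, c := range conflictingReplicaIDs(node4, node5) {
		fmt.Println(c) // replicaID 5: {6 6 5} vs {2 2 5}
	}
}
```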
@mberhault I think this is sufficiently broken to bring down the rest of the nodes and pull backups from the ones that haven't been backed up yet.
The timestamps in the diff above are:
We can see that Node4 itself proposed the replica change which led to its local range descriptor being written at that time.
Shot in the dark: if, by some mechanism, we would sometimes delete intents that should commit, we'd probably see behaviour like this - the Raft group updated, but the on-disk state (at least partially) lost, similar to #5291.
Node2 is also boring; it only receives a snapshot at a time between the first and second conflicting ChangeReplicas (so it must have been caused by the first, though it's awfully late for that).
Here's a breakdown of all proposals from the logs.
Scratch my post three comments above; I can't use grep. Node5 really is in on the action:
It grabs the lease from node4, proposes its conf change, sends a few snapshots, and gives the lease back to 4 (which then does the conflicting replica change).
Gist of all sent or received snapshots: https://gist.github.com/tschottdorf/c9630dfce6a66c49a8e5c655053b0e2d
tl;dr (not in temporal order):
To comment on my "shot in the dark" earlier: nodes 4 and 5 have different versions of the same key, which shouldn't be the result of an operation doing the wrong thing at the replicated level (both replicas apply the same commands). We should look at the Raft logs too to see what index they have applied.
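A tiny sketch of that check, with hypothetical applied-index values (not CockroachDB API, just the reasoning encoded as code):

```go
package main

import "fmt"

// diagnose encodes the reasoning above: replicas that have applied the same
// Raft log prefix must have identical on-disk state, so comparing applied
// indexes tells us on which side of Raft the divergence happened. The inputs
// would come from each node's Raft state (e.g. a debug dump).
func diagnose(appliedA, appliedB uint64, sameValue bool) string {
	switch {
	case appliedA != appliedB:
		return "different applied indexes: one replica may simply be lagging"
	case sameValue:
		return "same applied index and same value: no divergence for this key"
	default:
		return "same applied index but different values: state diverged below Raft (e.g. a lost write)"
	}
}

func main() {
	// Hypothetical indexes for node4 and node5.
	fmt.Println(diagnose(1042, 1042, false))
}
```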
This is happening pretty regularly. Relevant excerpt from the acceptance log:
Cannot hurt to have this for cockroachdb#7224.
This repros quickly with
as pointed out by @cuongdo and @BramGruneir. I'm investigating. It looks like the first node proposes the two conflicting upreplications that soon thereafter lead to the clobbering. I hope it's the same issue as here, because if so, this will hopefully be well understood soon.
Looks at least similar:
i.e. node3 and node2 both at some point messaged us and claimed to have replicaID=2 for range 1. Here are the descriptors for range 1:
That seems clean. Reading bottom to top, the range first splits, then adds store2 at replicaid=2, then adds store3 at replicaid=3. Then, finally, another split. But the panic message indicates something different, namely that both store3 and store2 talked to us claiming they had replica=2. And in fact, going through the info messages produced by the replica change, we see
and the second one
This means that store3 actually has replicaID=2 and store2 should have replicaID=3. I'm going to look at their data files next.
Nodes 2 and 3 agree as far as the range descriptors and meta entries are concerned (the latter are read with
The plot is thickening. It looks like the descriptors we write are somehow messed up: when store 3 gets replica_id 2, we write a desc with replica_id=2. Meaning that we probably mix up replica_id and store_id somewhere. Which means that when store 3 gets its replica first, and then store 2, they will be mixed up - it should be (3,2) (2,3) but you get (3,3) (2,2). However, the in-memory state doesn't have that issue. Still speculation, but I have a repro with binaries that I can put assertions into and a <10min repro cycle, so I hope to nail this one soon.
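To make the suspected mix-up concrete, here is a toy Go sketch; the descriptor type and helper names are invented for illustration (the real update happens when ChangeReplicas writes the new descriptor), and, as the later comments show, this particular theory turned out not to explain the test repro:

```go
package main

import "fmt"

// Simplified stand-ins; names and fields are assumptions for illustration.
type ReplicaDescriptor struct {
	NodeID    int
	StoreID   int
	ReplicaID int
}

type RangeDescriptor struct {
	NextReplicaID int
	Replicas      []ReplicaDescriptor
}

// addReplicaCorrect gives the new store the next free ReplicaID.
func addReplicaCorrect(d *RangeDescriptor, nodeID, storeID int) {
	d.Replicas = append(d.Replicas,
		ReplicaDescriptor{NodeID: nodeID, StoreID: storeID, ReplicaID: d.NextReplicaID})
	d.NextReplicaID++
}

// addReplicaBuggy models the suspected mix-up: the StoreID lands in the
// ReplicaID field. If store 3 is added before store 2, the (store, replica)
// pairs come out as (3,3) and (2,2) instead of (3,2) and (2,3).
func addReplicaBuggy(d *RangeDescriptor, nodeID, storeID int) {
	d.Replicas = append(d.Replicas,
		ReplicaDescriptor{NodeID: nodeID, StoreID: storeID, ReplicaID: storeID}) // bug: should be d.NextReplicaID
	d.NextReplicaID++
}

func main() {
	good := &RangeDescriptor{NextReplicaID: 2}
	bad := &RangeDescriptor{NextReplicaID: 2}
	for _, store := range []int{3, 2} { // store 3 gets its replica before store 2
		addReplicaCorrect(good, store, store)
		addReplicaBuggy(bad, store, store)
	}
	fmt.Println("expected:", good.Replicas) // [{3 3 2} {2 2 3}]
	fmt.Println("observed:", bad.Replicas)  // [{3 3 3} {2 2 2}]
}
```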
The tests were sloppy in setting up the initial cluster with the downloaded data directories. There was a window of time during which the pre-restore cluster and the post-restore cluster had nodes running simultaneously. See cockroachdb#7224.
The wrong plot thickened: the allocator tests were mixing two separate clusters and the panic was an artifact of RangeDescriptors in disagreement. #7625
Just a heads up that we lost the logs for the gamma backup.
I believe this issue is no longer relevant since 2f1f092 went in. @tschottdorf?
The underlying bug may still be relevant since it has never been resolved, but the mechanism by which it would fail would have changed completely. I'm OK with closing this issue since a lot of water has gone under the bridge and there's nothing actionable (and we'll remember this issue should this happen again), but others may disagree.
I'd vote to keep it open since this thread may still be useful in identifying the underlying bug, but I don't feel too strongly either way.
Could you give this issue a new title?
I'd rather not; "clobbering" is how I remember/search for this issue. Better to close it than retitle it at this point, I think.
Closing for now.
Revisiting this in light of #9265: I don't think it's related. ChangeReplicas does have a bug that would allow
summary:
gamma cluster upgraded from 093e1e3 to 0b950ce crashed multiple nodes with "stability: panic on basictracer.errAssertionFailed" #7185.
Nodes 4 (104.196.98.24) and 5 (104.196.39.184) crashed with:
Attempted upgrade to 0b950ce, still crashing with same trace.
Node 4 is crashing with:
Node 5 is crashing with: