stability: replicaDescCache: clobbering {2 2 5} with {6 6 5} #7224

Closed
tbg opened this issue Jun 14, 2016 · 74 comments

Labels
S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting

Comments

@tbg
Member

tbg commented Jun 14, 2016

summary:

F160613 16:02:44.438105 storage/store.go:2186  store=5:5 ([]=/mnt/data) replicaDescCache: clobbering {6 6 5} with {2 2 5}
goroutine 104 [running]:
github.com/cockroachdb/cockroach/util/log.getStacks(0xecef0d001, 0x0, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/util/log/clog.go:862 +0xb8
github.com/cockroachdb/cockroach/util/log.(*loggingT).outputLogEntry(0x2879de0, 0x3, 0x228d6f3, 0x10, 0x88a, 0xc8249602d0, 0x4a)
        /go/src/github.com/cockroachdb/cockroach/util/log/clog.go:728 +0x9c5
github.com/cockroachdb/cockroach/util/log.addStructured(0x0, 0x0, 0xc800000003, 0x2, 0x1b617a0, 0x2a, 0xc822f7b980, 0x3, 0x3)
        /go/src/github.com/cockroachdb/cockroach/util/log/structured.go:57 +0xfc
github.com/cockroachdb/cockroach/util/log.logDepth(0x0, 0x0, 0x1, 0xc800000003, 0x1b617a0, 0x2a, 0xc822f7b980, 0x3, 0x3)
        /go/src/github.com/cockroachdb/cockroach/util/log/log.go:73 +0x7b
github.com/cockroachdb/cockroach/util/log.Fatalf(0x1b617a0, 0x2a, 0xc822f7b980, 0x3, 0x3)
        /go/src/github.com/cockroachdb/cockroach/util/log/log.go:179 +0x66
github.com/cockroachdb/cockroach/storage.(*Store).cacheReplicaDescriptorLocked(0xc82018e280, 0xe53, 0x200000002, 0xc800000005)
        /go/src/github.com/cockroachdb/cockroach/storage/store.go:2186 +0x288
github.com/cockroachdb/cockroach/storage.(*Store).handleRaftMessage(0xc82018e280, 0xc8243dc7e0, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/storage/store.go:1985 +0x4e5
github.com/cockroachdb/cockroach/storage.(*Store).processRaft.func1()
        /go/src/github.com/cockroachdb/cockroach/storage/store.go:2066 +0x552
github.com/cockroachdb/cockroach/util/stop.(*Stopper).RunWorker.func1(0xc820343570, 0xc82139cbb0)
        /go/src/github.com/cockroachdb/cockroach/util/stop/stopper.go:139 +0x52
created by github.com/cockroachdb/cockroach/util/stop.(*Stopper).RunWorker
        /go/src/github.com/cockroachdb/cockroach/util/stop/stopper.go:140 +0x62

Attempted upgrade to 0b950ce, still crashing with same trace.

Node 4 is crashing with:

F160614 20:26:27.555255 storage/store.go:2243  store=4:4 ([]=/mnt/data) replicaDescCache: clobbering {2 2 5} with {6 6 5}

Node 5 is crashing with:

F160614 20:26:31.004285 storage/store.go:2243  store=5:5 ([]=/mnt/data) replicaDescCache: clobbering {6 6 5} with {2 2 5}
tbg added the S-1-stability label Jun 14, 2016
@tbg
Member Author

tbg commented Jun 14, 2016

clobbering {2 2 5} with {6 6 5} means that we had ReplicaID=5 stored as NodeID=2, StoreID=2 but that an attempt was made to cache NodeID=6 StoreID=6. Can you pull a data dump so that we can preserve that state? We should also improve that message - it doesn't even tell us the RangeID.
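For illustration, here is a minimal sketch of the kind of check behind that fatal. The Go types and helper below are assumptions standing in for the real storage/roachpb definitions; only the cache semantics - a (RangeID, ReplicaID) key mapping to a NodeID/StoreID pair - are taken from the message above.

package main

import "log"

// Hypothetical stand-ins for the real descriptor types.
type ReplicaDescriptor struct {
	NodeID, StoreID, ReplicaID int
}

type cacheKey struct {
	RangeID, ReplicaID int
}

// replicaDescCache remembers which NodeID/StoreID was last seen for a given
// (RangeID, ReplicaID). The invariant: within a range, a ReplicaID refers to
// exactly one node/store.
var replicaDescCache = map[cacheKey]ReplicaDescriptor{}

func cacheReplicaDescriptor(rangeID int, desc ReplicaDescriptor) {
	key := cacheKey{RangeID: rangeID, ReplicaID: desc.ReplicaID}
	if old, ok := replicaDescCache[key]; ok && old != desc {
		// The condition behind the fatal in this issue: two different
		// (NodeID, StoreID) pairs claim the same ReplicaID for the range.
		log.Fatalf("replicaDescCache: clobbering %v with %v", old, desc)
	}
	replicaDescCache[key] = desc
}

func main() {
	cacheReplicaDescriptor(3667, ReplicaDescriptor{NodeID: 2, StoreID: 2, ReplicaID: 5})
	cacheReplicaDescriptor(3667, ReplicaDescriptor{NodeID: 6, StoreID: 6, ReplicaID: 5}) // would fatal here
}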

@mberhault
Contributor

backing up data/logs on nodes 4 and 5. It'll take a little while. data dir is ~80GiB
locations: /mnt/data/backup.7224 and ~/logs.7224

@bdarnell
Contributor

There's another instance of this in #7184.

@tbg
Member Author

tbg commented Jun 14, 2016

Thanks @mberhault. I'll look into the data files and will try to find out which Range is the offending one, and what replicas and versions of meta entries we have. We shouldn't be seeing two different replicas with the same ReplicaID, naturally.

@mberhault
Contributor

@bdarnell: sorry, that's where my update went. I apparently pasted the "gamma rollback -> clobber!" problem into the wrong bug by mistake; it was supposed to be part of this one. I'll shuffle things around a bit.

@tbg
Member Author

tbg commented Jun 14, 2016

We should also back up nodes 1,2,3 (god knows where the meta ranges are, or who sends the Raft message that kills things)

@tbg
Member Author

tbg commented Jun 14, 2016

The stacktrace is s.cacheReplicaDescriptorLocked(req.GroupID, req.FromReplica). How are the NodeIDs distributed? The panic posted above hints at Node6 sending a message from a ReplicaID which is also known to the recipient for another peer. Is the crash message the same for both failing nodes?

@mberhault
Contributor

I updated the original issue message with more details, including the message on each node:
Node 4 is crashing with:

F160614 20:26:27.555255 storage/store.go:2243  store=4:4 ([]=/mnt/data) replicaDescCache: clobbering {2 2 5} with {6 6 5}

Node 5 is crashing with:

F160614 20:26:31.004285 storage/store.go:2243  store=5:5 ([]=/mnt/data) replicaDescCache: clobbering {6 6 5} with {2 2 5}

@tbg
Member Author

tbg commented Jun 14, 2016

If I were to take a guess, I'd guess that the message comes in both orders - can you verify that for Node5? (i.e. try a bunch of times; you should see it alternate randomly between the two messages)

@mberhault
Contributor

right you are. no need to run it again, just grep through the logs:

4: 104.196.98.24
logs/cockroach.stderr:F160613 15:55:40.307717 storage/store.go:2226  store=4:4 ([]=/mnt/data) replicaDescCache: clobbering {2 2 5} with {6 6 5}
logs/cockroach.stderr:F160613 16:02:49.620753 storage/store.go:2186  store=4:4 ([]=/mnt/data) replicaDescCache: clobbering {6 6 5} with {2 2 5}
logs/cockroach.stderr:F160613 17:09:43.303798 storage/store.go:2186  store=4:4 ([]=/mnt/data) replicaDescCache: clobbering {6 6 5} with {2 2 5}
logs/cockroach.stderr:F160614 20:26:27.555255 storage/store.go:2243  store=4:4 ([]=/mnt/data) replicaDescCache: clobbering {2 2 5} with {6 6 5}

5: 104.196.39.184
logs/cockroach.stderr:F160613 15:49:20.269201 storage/store.go:2226  store=5:5 ([]=/mnt/data) replicaDescCache: clobbering {6 6 5} with {2 2 5}
logs/cockroach.stderr:F160613 15:55:56.176292 storage/store.go:2186  store=5:5 ([]=/mnt/data) replicaDescCache: clobbering {2 2 5} with {6 6 5}
logs/cockroach.stderr:F160613 16:02:44.438105 storage/store.go:2186  store=5:5 ([]=/mnt/data) replicaDescCache: clobbering {6 6 5} with {2 2 5}
logs/cockroach.stderr:F160614 20:26:31.004285 storage/store.go:2243  store=5:5 ([]=/mnt/data) replicaDescCache: clobbering {6 6 5} with {2 2 5}

@mberhault
Contributor

Messages with #7227 patched in:

Node 4:

E160614 21:19:56.316503 storage/replica_command.go:1877  replica: range=2341 [/System/"tsd\x12cr.store.intentbytes\x00\x01\x89\xf8\x066\x014"-/System/"tsd\x12cr.store.livebytes\x00\x01\x89\xf8\x065\xe93"), checksum mismatch: e = 3d25dd6a5b188a71efc8e21a5609eed686327cb99194247f676cb0f4f35b3296f73c3b22769737a513f5c1f96dcb82d76a298bae8e2b00eca02fd0ed165bf744, v = 4eccc4fb1f914635699566c9eb6a099f44fdf778c330f51e8b6dc744c19c21c640205d8573346035841d88f95eb823ce89fec2fcc822b766ee879f661881860f
F160614 21:19:56.579455 storage/store.go:2248  store 3667, range 4: clobbering {NodeID:2 StoreID:2 ReplicaID:5} with {NodeID:6 StoreID:6 ReplicaID:5} in replicaDescCache; have replica range=3667 [/Table/51/1/146604641168573251/"ea15ce2b-daec-4e93-b8b1-3e356489504b"/2340522-/Table/51/1/148855674433925941/"a34511ab-48b9-4c14-87cc-f18ec0ae991d"/1559934)
goroutine 86 [running]:
github.com/cockroachdb/cockroach/util/log.getStacks(0xecef26b01, 0x0, 0x0, 0x0)
        /home/marc/cockroach/src/github.com/cockroachdb/cockroach/util/log/clog.go:862 +0xb8
github.com/cockroachdb/cockroach/util/log.(*loggingT).outputLogEntry(0x27eeea0, 0x3, 0x221d6cc, 0x10, 0x8c8, 0xc821019a40, 0x131)
        /home/marc/cockroach/src/github.com/cockroachdb/cockroach/util/log/clog.go:728 +0x9c5
github.com/cockroachdb/cockroach/util/log.addStructured(0x0, 0x0, 0xc800000003, 0x2, 0x1b94300, 0x51, 0xc82471d9c8, 0x5, 0x5)
        /home/marc/cockroach/src/github.com/cockroachdb/cockroach/util/log/structured.go:57 +0xfc
github.com/cockroachdb/cockroach/util/log.logDepth(0x0, 0x0, 0x1, 0xc800000003, 0x1b94300, 0x51, 0xc82471d9c8, 0x5, 0x5)
        /home/marc/cockroach/src/github.com/cockroachdb/cockroach/util/log/log.go:73 +0x7b
github.com/cockroachdb/cockroach/util/log.Fatalf(0x1b94300, 0x51, 0xc82471d9c8, 0x5, 0x5)
        /home/marc/cockroach/src/github.com/cockroachdb/cockroach/util/log/log.go:179 +0x66
github.com/cockroachdb/cockroach/storage.(*Store).cacheReplicaDescriptorLocked(0xc8202882c0, 0xe53, 0x600000006, 0xc800000005)
        /home/marc/cockroach/src/github.com/cockroachdb/cockroach/storage/store.go:2248 +0x3ef
github.com/cockroachdb/cockroach/storage.(*Store).handleRaftMessage(0xc8202882c0, 0xc821bd7e60, 0x0, 0x0)
        /home/marc/cockroach/src/github.com/cockroachdb/cockroach/storage/store.go:2041 +0x4e5
github.com/cockroachdb/cockroach/storage.(*Store).processRaft.func1()
        /home/marc/cockroach/src/github.com/cockroachdb/cockroach/storage/store.go:2123 +0x552
github.com/cockroachdb/cockroach/util/stop.(*Stopper).RunWorker.func1(0xc8203741c0, 0xc820e9b350)
        /home/marc/cockroach/src/github.com/cockroachdb/cockroach/util/stop/stopper.go:139 +0x52
created by github.com/cockroachdb/cockroach/util/stop.(*Stopper).RunWorker
        /home/marc/cockroach/src/github.com/cockroachdb/cockroach/util/stop/stopper.go:140 +0x62

Node 5:

E160614 21:20:00.988990 storage/replica_command.go:1877  replica: range=2390 [/System/"tsd\x12cr.node.sys.cpu.user.percent\x00\x01\x89\xf8\x065\xf45"-/System/"tsd\x12cr.node.sys.gc.pause.percent\x00\x01\x89\xf8\x066 2"), checksum mismatch: e = 81431216f559043e551181be5e6ac47a9cfa23c13bf7e9b504d02eb33825000a588b010f936d4ca68fed8ddf7f1ae8329cd67397130faec5f2d21ae7772e44fb, v = 701abbb298e2e346ed1e31b5cc9ee8ed33f440dd8fd13b53ff6e9dfb6a4e89e5f2695025a0fb65547230210430598e7bbf9a58266e1ed2259fc8bed90967f765
F160614 21:20:01.012564 storage/store.go:2248  store 3667, range 5: clobbering {NodeID:2 StoreID:2 ReplicaID:5} with {NodeID:6 StoreID:6 ReplicaID:5} in replicaDescCache; have replica range=3667 [/Table/51/1/146604641168573251/"ea15ce2b-daec-4e93-b8b1-3e356489504b"/2340522-/Table/51/1/148855674433925941/"a34511ab-48b9-4c14-87cc-f18ec0ae991d"/1559934)
goroutine 60 [running]:
github.com/cockroachdb/cockroach/util/log.getStacks(0xecef26c01, 0x0, 0x0, 0x0)
        /home/marc/cockroach/src/github.com/cockroachdb/cockroach/util/log/clog.go:862 +0xb8
github.com/cockroachdb/cockroach/util/log.(*loggingT).outputLogEntry(0x27eeea0, 0x3, 0x221d6cc, 0x10, 0x8c8, 0xc8210b28c0, 0x131)
        /home/marc/cockroach/src/github.com/cockroachdb/cockroach/util/log/clog.go:728 +0x9c5
github.com/cockroachdb/cockroach/util/log.addStructured(0x0, 0x0, 0xc800000003, 0x2, 0x1b94300, 0x51, 0xc823f9b9b8, 0x5, 0x5)
        /home/marc/cockroach/src/github.com/cockroachdb/cockroach/util/log/structured.go:57 +0xfc
github.com/cockroachdb/cockroach/util/log.logDepth(0x0, 0x0, 0x1, 0xc800000003, 0x1b94300, 0x51, 0xc823f9b9b8, 0x5, 0x5)
        /home/marc/cockroach/src/github.com/cockroachdb/cockroach/util/log/log.go:73 +0x7b
github.com/cockroachdb/cockroach/util/log.Fatalf(0x1b94300, 0x51, 0xc823f9b9b8, 0x5, 0x5)
        /home/marc/cockroach/src/github.com/cockroachdb/cockroach/util/log/log.go:179 +0x66
github.com/cockroachdb/cockroach/storage.(*Store).cacheReplicaDescriptorLocked(0xc8201d82c0, 0xe53, 0x600000006, 0xc800000005)
        /home/marc/cockroach/src/github.com/cockroachdb/cockroach/storage/store.go:2248 +0x3ef
github.com/cockroachdb/cockroach/storage.(*Store).handleRaftMessage(0xc8201d82c0, 0xc82422d9e0, 0x0, 0x0)
        /home/marc/cockroach/src/github.com/cockroachdb/cockroach/storage/store.go:2041 +0x4e5
github.com/cockroachdb/cockroach/storage.(*Store).processRaft.func1()
        /home/marc/cockroach/src/github.com/cockroachdb/cockroach/storage/store.go:2123 +0x552
github.com/cockroachdb/cockroach/util/stop.(*Stopper).RunWorker.func1(0xc82031a310, 0xc820ebd160)
        /home/marc/cockroach/src/github.com/cockroachdb/cockroach/util/stop/stopper.go:139 +0x52
created by github.com/cockroachdb/cockroach/util/stop.(*Stopper).RunWorker
        /home/marc/cockroach/src/github.com/cockroachdb/cockroach/util/stop/stopper.go:140 +0x62

@tbg
Member Author

tbg commented Jun 14, 2016

Thanks, looks like I need to update the PR (store-id switched)

tbg added a commit to tbg/cockroach that referenced this issue Jun 14, 2016
tbg added a commit to tbg/cockroach that referenced this issue Jun 14, 2016
@tbg
Member Author

tbg commented Jun 14, 2016

range 3667 is on nodes 2, 5, and 4. Here's the history for node4:

(nodes@last_assigned_replica_id), old to new
3,5,6@3
3,5,6,4@4
4,5,6
4,5
4,5,2@5 

I'm going to run the same for node5 but am anticipating that we will have to look at Node 2 as well (which is running at the moment).

@tbg
Member Author

tbg commented Jun 15, 2016

now wait for it:

node5:

3,5,6@3
3,5,6,4@4
4,5,6
4,5
4,5,6@5

This means that Node5 and Node4 have conflicting range descriptors - one thinks that node 6 has been added at ReplicaID 5, the other thinks it was node 2.
I'm going to have to double-check whether one of them may be an intent.

22:06 $ diff gamma{4,5}
1,3c1,3
< cockroach-gamma-4
< 1465832576.787583102,0 /Local/Range/"\xbb\x89\xfd\x02\b\xd8$\xf7F\xc3C\x12ea15ce2b-daec-4e93-b8b1-3e356489504b\x00\x01\xf8#\xb6\xaa"/RangeDescriptor: [/Table/51/1/146604641168573251/"ea15ce2b-daec-4e93-b8b1-3e356489504b"/2340522, /Table/51/1/148855674433925941/"a34511ab-48b9-4c14-87cc-f18ec0ae991d"/1559934)
<   Raw:range_id:3667 start_key:"\273\211\375\002\010\330$\367F\303C\022ea15ce2b-daec-4e93-b8b1-3e356489504b\000\001\370#\266\252" end_key:"\273\211\375\002\020\327r}os5\022a34511ab-48b9-4c14-87cc-f18ec0ae991d\000\001\370\027\315~" replicas:<node_id:4 store_id:4 replica_id:4 > replicas:<node_id:5 store_id:5 replica_id:2 > replicas:<node_id:2 store_id:2 replica_id:5 > next_replica_id:6
---
> cockroach-gamma-5
> 1465832522.814160856,0 /Local/Range/"\xbb\x89\xfd\x02\b\xd8$\xf7F\xc3C\x12ea15ce2b-daec-4e93-b8b1-3e356489504b\x00\x01\xf8#\xb6\xaa"/RangeDescriptor: [/Table/51/1/146604641168573251/"ea15ce2b-daec-4e93-b8b1-3e356489504b"/2340522, /Table/51/1/148855674433925941/"a34511ab-48b9-4c14-87cc-f18ec0ae991d"/1559934)
>   Raw:range_id:3667 start_key:"\273\211\375\002\010\330$\367F\303C\022ea15ce2b-daec-4e93-b8b1-3e356489504b\000\001\370#\266\252" end_key:"\273\211\375\002\020\327r}os5\022a34511ab-48b9-4c14-87cc-f18ec0ae991d\000\001\370\027\315~" replicas:<node_id:4 store_id:4 replica_id:4 > replicas:<node_id:5 store_id:5 replica_id:2 > replicas:<node_id:6 store_id:6 replica_id:5 > next_replica_id:6

@tbg
Member Author

tbg commented Jun 15, 2016

@mberhault I think this is sufficiently broken to bring down the rest of the nodes and pull backups from the ones that haven't been backed up yet.

@tbg
Member Author

tbg commented Jun 15, 2016

The timestamps in the diff above are Mon, 13 Jun 2016 15:42:56 GMT and Mon, 13 Jun 2016 15:42:02 GMT, respectively. That is, not even close.
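(For reference, those are just the wall-clock prefixes of the version timestamps in the diff, read as Unix seconds plus fractional nanoseconds - a quick hypothetical sketch of the conversion:)

package main

import (
	"fmt"
	"time"
)

func main() {
	// Wall-clock parts of the two RangeDescriptor versions in the diff above,
	// expressed as nanoseconds since the Unix epoch.
	for _, ns := range []int64{1465832576787583102, 1465832522814160856} {
		fmt.Println(time.Unix(0, ns).UTC())
	}
	// Prints 2016-06-13 15:42:56.787583102 +0000 UTC and
	// 2016-06-13 15:42:02.814160856 +0000 UTC.
}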

@tbg
Member Author

tbg commented Jun 15, 2016

ubuntu@cockroach-gamma-4:~$ grep ' 3667' logs.7224/cockroach.stderr
W160613 14:00:09.850845 raft/raft.go:593  [group 3667] 4 stepped down to follower since quorum is not active
I160613 15:42:09.760249 storage/replica_command.go:1531  range 3667: new leader lease replica {4 4 4} 2016-06-13 15:42:05.60192409 +0000 UTC 8.582510867s following replica {5 5 2} 2016-06-13 14:01:23.490070436 +0000 UTC 1h40m42.111853654s [physicalTime=2016-06-13 15:42:09.760204119 +0000 UTC]
W160613 15:42:12.831446 raft/raft.go:593  [group 3667] 4 stepped down to follower since quorum is not active
I160613 15:42:56.787411 storage/replica_raftstorage.go:587  generated snapshot for range 3667 at index 3238 in 2.085009622s. encoded size=35078995, 87412 KV pairs, 17 log entries
I160613 15:42:56.949600 storage/replica.go:1438  raft: proposing ADD_REPLICA {NodeID:2 StoreID:2 ReplicaID:5} for range 3667
I160613 15:43:15.633267 storage/replica_raftstorage.go:587  generated snapshot for range 3667 at index 3241 in 256.354269ms. encoded size=35081113, 87413 KV pairs, 20 log entries
I160613 15:49:00.406459 storage/replica_command.go:1538  range 3667: replicaID 4 transfer raft leadership to replicaID 2
W160613 15:55:02.876732 storage/raft_log_queue.go:114  storage/raft_log_queue.go:99: raft log's oldest index (0) is less than the first index (3253) for range 3667

We can see that Node4 itself proposed the replica change which led to its local range descriptor being written at 15:42:56 GMT. But, very interestingly, we see at 15:42:09 that the node takes the leader lease following node2 - that node is added to that range for the very first time just before, by node5 (the conflicting ChangeReplicas).

A shot in the dark: if, by some mechanism, we sometimes deleted intents that should commit, we'd probably see behaviour like this - the Raft group updated, but the on-disk state (at least partially) lost, similar to #5291.

@tbg
Member Author

tbg commented Jun 15, 2016

Node5, meanwhile, isn't the leader and only receives a snapshot. (Edit: totally wrong - disregard this post and see the third post below; the log below is actually from node 6.)

logs/cockroach.stderr.3:I160613 15:42:29.706134 storage/replica_raftstorage.go:681  replica 5 received snapshot for range 3667 at index 3235. encoded size=35079019, 87413 KV pairs, 14 log entries
logs/cockroach.stderr.3:I160613 15:42:30.728217 storage/replica_raftstorage.go:683  replica 5 applied snapshot for range 3667 in 1.022038112s

@tbg
Member Author

tbg commented Jun 15, 2016

Node2 is also boring: it only receives a snapshot at a time between the first and second conflicting ChangeReplicas (so it must be caused by the first, though it's awfully late for that).

logs/cockroach.stderr.2:I160613 15:43:16.272650 storage/replica_raftstorage.go:681  replica 5 received snapshot for range 3667 at index 3241. encoded size=35081113, 87413 KV pairs, 20 log entries
logs/cockroach.stderr.2:I160613 15:43:16.592619 storage/replica_raftstorage.go:683  replica 5 applied snapshot for range 3667 in 319.898665ms

@tbg
Member Author

tbg commented Jun 15, 2016

Here's a breakdown of all propositions from the logs.

22:44 $ for l in $(cat ~/production/hosts/gamma); do echo $l; ssh $l "grep -R 'proposing ADD_REPLICA.*range 3667' ~/logs"; done
[email protected] #1
[email protected] #2
[email protected] #3
/home/ubuntu/logs/cockroach.stderr.2:I160610 16:06:55.125630 storage/replica.go:1385  raft: proposing ADD_REPLICA {4 4 4} for range 3667
[email protected] #4
/home/ubuntu/logs/cockroach.stderr:I160613 15:42:56.949600 storage/replica.go:1438  raft: proposing ADD_REPLICA {NodeID:2 StoreID:2 ReplicaID:5} for range 3667
[email protected] #5
/home/ubuntu/logs/cockroach.stderr:I160613 15:42:03.172227 storage/replica.go:1438  raft: proposing ADD_REPLICA {NodeID:6 StoreID:6 ReplicaID:5} for range 3667
[email protected] #6

@tbg
Member Author

tbg commented Jun 15, 2016

Scratch the post three above, I can't use grep. Node5 really is in on the action:

ubuntu@cockroach-gamma-5:~$ grep -R ' 3667' logs/cockroach.stderr* | grep 160613
logs/cockroach.stderr:W160613 14:01:42.774373 raft/raft.go:593  [group 3667] 2 stepped down to follower since quorum is not active
logs/cockroach.stderr:I160613 15:42:00.857693 storage/replica_command.go:1531  range 3667: new leader lease replica {5 5 2} 2016-06-13 14:01:23.490070436 +0000 UTC 1h40m42.111853654s following replica {4 4 4} 2016-06-10 22:11:24.129040512 +0000 UTC 63h49m59.361029924s [physicalTime=2016-06-13 15:42:00.857647871 +0000 UTC]
logs/cockroach.stderr:I160613 15:42:02.813966 storage/replica_raftstorage.go:587  generated snapshot for range 3667 at index 3229 in 1.955395552s. encoded size=35074262, 87410 KV pairs, 8 log entries
logs/cockroach.stderr:I160613 15:42:03.172227 storage/replica.go:1438  raft: proposing ADD_REPLICA {NodeID:6 StoreID:6 ReplicaID:5} for range 3667
logs/cockroach.stderr:I160613 15:42:08.862460 storage/replica_raftstorage.go:587  generated snapshot for range 3667 at index 3232 in 426.023892ms. encoded size=35076362, 87411 KV pairs, 11 log entries
logs/cockroach.stderr:I160613 15:42:09.751016 storage/replica_command.go:1538  range 3667: replicaID 2 transfer raft leadership to replicaID 4
logs/cockroach.stderr:I160613 15:42:22.476634 storage/replica_raftstorage.go:587  generated snapshot for range 3667 at index 3235 in 426.068352ms. encoded size=35079019, 87413 KV pairs, 14 log entries
logs/cockroach.stderr:I160613 15:42:23.715516 storage/replica_raftstorage.go:587  generated snapshot for range 3667 at index 3235 in 569.103901ms. encoded size=35079019, 87413 KV pairs, 14 log entries
logs/cockroach.stderr:I160613 15:42:25.336997 storage/replica_raftstorage.go:587  generated snapshot for range 3667 at index 3235 in 288.042422ms. encoded size=35079019, 87413 KV pairs, 14 log entries
logs/cockroach.stderr:W160613 15:42:25.631158 raft/raft.go:593  [group 3667] 2 stepped down to follower since quorum is not active
logs/cockroach.stderr:I160613 15:42:27.735413 storage/replica_raftstorage.go:587  generated snapshot for range 3667 at index 3235 in 226.824326ms. encoded size=35079019, 87413 KV pairs, 14 log entries
logs/cockroach.stderr:W160613 15:44:09.030377 raft/raft.go:593  [group 3667] 2 stepped down to follower since quorum is not active
logs/cockroach.stderr:I160613 15:49:00.418396 storage/replica_command.go:1531  range 3667: new leader lease replica {5 5 2} 2016-06-13 15:46:53.021458351 +0000 UTC 2m11.655136997s following replica {4 4 4} 2016-06-13 15:42:05.60192409 +0000 UTC 4m47.419534261s [physicalTime=2016-06-13 15:49:00.418359155 +0000 UTC]
logs/cockroach.stderr:W160613 15:49:07.832094 raft/raft.go:593  [group 3667] 2 stepped down to follower since quorum is not active

It grabs the lease from node4, proposes its conf change, sends a few snapshots, and gives the lease back to 4 (which then does the conflicting replica change).

@tbg
Member Author

tbg commented Jun 15, 2016

gist: all sent or received snapshots https://gist.github.com/tschottdorf/c9630dfce6a66c49a8e5c655053b0e2d

tl;dr (not in temporal order)
2 receives at 3241 -> from 4 after it's done its conflicting replica change
4 sends at 3238,3241 -> last one goes to 2
5 sends at 3229, 3232, 3235x4 -> goes to 6
6 receives at 3235 -> comes from 5

@tbg
Member Author

tbg commented Jun 15, 2016

To comment on my "shot in the dark" earlier: nodes 4 and 5 have different versions of the same key, which should not be the cause of an operation doing the wrong thing at the replicated level. We should look at the Raft logs too to see what index they have applied.

tbg added a commit to tbg/cockroach that referenced this issue Jun 21, 2016
tbg added a commit to tbg/cockroach that referenced this issue Jun 22, 2016
@cuongdo
Contributor

cuongdo commented Jun 29, 2016

This is happening pretty regularly with the TestRebalance3to5Small acceptance test. Here's a failed run. Full node logs are in the "Artifacts" sub-tab.

Relevant excerpt from acceptance log:

F160629 01:06:04.139942 storage/store.go:2299  store {ClusterID:{UUID:d840ed4f-e385-42d2-8eb7-faddc8ef2b71} NodeID:2 StoreID:2}, range 1: clobbering {NodeID:3 StoreID:3 ReplicaID:3} with {NodeID:2 StoreID:2 ReplicaID:3} in replicaDescCache; have replica range=1 [/Min-/System/"update-cluster")
goroutine 33 [running]:
github.com/cockroachdb/cockroach/util/log.getStacks(0xecf051500, 0x0, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/util/log/clog.go:862 +0xb8
github.com/cockroachdb/cockroach/util/log.(*loggingT).outputLogEntry(0x29644c0, 0x3, 0x2342c26, 0x10, 0x8fb, 0xc820b4ed00, 0xf6)
        /go/src/github.com/cockroachdb/cockroach/util/log/clog.go:726 +0x178
github.com/cockroachdb/cockroach/util/log.addStructured(0x0, 0x0, 0xc800000003, 0x2, 0x1c92680, 0x52, 0xc8206939c8, 0x5, 0x5)
        /go/src/github.com/cockroachdb/cockroach/util/log/structured.go:57 +0xfc
github.com/cockroachdb/cockroach/util/log.logDepth(0x0, 0x0, 0x1, 0xc800000003, 0x1c92680, 0x52, 0xc8206939c8, 0x5, 0x5)
        /go/src/github.com/cockroachdb/cockroach/util/log/log.go:73 +0x7b
github.com/cockroachdb/cockroach/util/log.Fatalf(0x1c92680, 0x52, 0xc8206939c8, 0x5, 0x5)
        /go/src/github.com/cockroachdb/cockroach/util/log/log.go:179 +0x66
github.com/cockroachdb/cockroach/storage.(*Store).cacheReplicaDescriptorLocked(0xc820162000, 0x1, 0x200000002, 0xc800000003)
        /go/src/github.com/cockroachdb/cockroach/storage/store.go:2299 +0x3e9
github.com/cockroachdb/cockroach/storage.(*Store).handleRaftMessage(0xc820162000, 0xc8205c97a0, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/storage/store.go:2093 +0x515
github.com/cockroachdb/cockroach/storage.(*Store).processRaft.func1()
        /go/src/github.com/cockroachdb/cockroach/storage/store.go:2174 +0x552
github.com/cockroachdb/cockroach/util/stop.(*Stopper).RunWorker.func1(0xc820350cb0, 0xc820181640)
        /go/src/github.com/cockroachdb/cockroach/util/stop/stopper.go:142 +0x52
created by github.com/cockroachdb/cockroach/util/stop.(*Stopper).RunWorker
        /go/src/github.com/cockroachdb/cockroach/util/stop/stopper.go:143 +0x62

tbg added a commit to tbg/cockroach that referenced this issue Jun 29, 2016
@tbg
Member Author

tbg commented Jun 30, 2016

This repros quickly with

make test TESTTIMEOUT=48h PKG=./acceptance TESTS=Rebalance_3To5Small TESTFLAGS='-v --remote --key-name google_compute_engine --cwd=allocator_terraform -tf.keep-cluster-fail'

as pointed out by @cuongdo and @BramGruneir.

I'm investigating. It looks like the first node proposes the two conflicting up-replications that soon thereafter lead it to hit the clobbering fatal. I hope it's the same issue as this one; if so, it should hopefully be well understood soon.

@tbg
Member Author

tbg commented Jun 30, 2016

Looks at least similar:

F160630 19:32:47.259015 storage/store.go:2328  store {ClusterID:{UUID:d840ed4f-e385-42d2-8eb7-faddc8ef2b71} NodeID:1 StoreID:1}, range 1: clobbering {NodeID:2 StoreID:2 ReplicaID:2} with {NodeID:3 StoreID:3 ReplicaID:2} in replicaDescCache; have replica range=1 [/Min-/System/"update-cluster")

i.e. node3 and node2 both at some point messaged us and claimed to have replicaID=2 for range 1.

Here are the descriptors for range 1:

1464889133.146965811,0 /Local/Range/""/RangeDescriptor: [/Min, /System/"update-cluster")
        Raw:range_id:1 start_key:"" end_key:"\004update-cluster" replicas:<node_id:1 store_id:1 replica_id:1 > replicas:<node_id:2 store_id:2 replica_id:2 > replicas:<node_id:3 store_id:3 replica_id:3 > next_replica_id:4

1464886890.124237716,0 /Local/Range/""/RangeDescriptor: [/Min, /Table/11)
        Raw:range_id:1 start_key:"" end_key:"\223" replicas:<node_id:1 store_id:1 replica_id:1 > replicas:<node_id:2 store_id:2 replica_id:2 > replicas:<node_id:3 store_id:3 replica_id:3 > next_replica_id:4

1464886661.532168512,0 /Local/Range/""/RangeDescriptor: [/Min, /Table/11)
        Raw:range_id:1 start_key:"" end_key:"\223" replicas:<node_id:1 store_id:1 replica_id:1 > replicas:<node_id:2 store_id:2 replica_id:2 > next_replica_id:3

1464803321.941369236,34 /Local/Range/""/RangeDescriptor: [/Min, /Table/11)
        Raw:range_id:1 start_key:"" end_key:"\223" replicas:<node_id:1 store_id:1 replica_id:1 > next_replica_id:2

1464803321.689672339,0 /Local/Range/""/RangeDescriptor: [/Min, /Max)
        Raw:range_id:1 start_key:"" end_key:"\377\377" replicas:<node_id:1 store_id:1 replica_id:1 > next_replica_id:2

That seems clean. Reading bottom to top, the range first splits, then adds store2 at replicaid=2, then adds store3 at replicaid=3. Then, finally, another split. But the panic message indicates something different, namely that both store3 and store2 talked to us claiming they had replica=2. And in fact, going through the info messages produced by the replica change, we see

I160630 19:30:32.375925 storage/replica_command.go:2709  change replicas of 1: read existing descriptor range_id:1 start_key:"" end_key:"\223" replicas:<node_id:1 store_id:1 replica
_id:1 > next_replica_id:2

and the second one

I160630 19:30:32.753520 storage/replica_command.go:2709  change replicas of 1: read existing descriptor range_id:1 start_key:"" end_key:"\223" replicas:<node_id:1 store_id:1 replica_id:1 > replicas:<node_id:3 store_id:3 replica_id:2 > next_replica_id:3

This means that store3 actually has replicaID=2 and store2 should have replicaID=3. I'm going to look at their data files next.

@tbg
Member Author

tbg commented Jun 30, 2016

Nodes 2 and 3 agree as far as the range descriptors and meta entries are concerned (the latter are read with ./cockroach debug keys data0 --from=$(echo -n -e '\003') --to=$(echo -n -e '\004') --values | less -S). Maybe I'm cross-eyed and was reading the debug output wrong - checking.

@tbg
Member Author

tbg commented Jul 5, 2016

The plot is thickening. It looks like the descriptors we write are somehow messed up - we probably mix up replica_id and store_id somewhere when writing them. Which means that when store 3 gets its replica first (replica_id 2) and then store 2 (replica_id 3), they come out mixed up: it should be (store 3, replica 2), (store 2, replica 3), but you get (3,3), (2,2). However, the in-memory state doesn't have that issue.

Still speculation, but I have a repro with binaries that I can put assertions into and a <10min repro cycle, so I hope to nail this one soon.
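To illustrate the hypothesis (nothing here is the real code - the descriptor type and helper are made up; the actual path is the ChangeReplicas code referenced in the logs above):

package main

import "fmt"

// Made-up stand-in for the real replica descriptor type.
type ReplicaDescriptor struct {
	NodeID, StoreID, ReplicaID int
}

// addReplica shows how the descriptor update is supposed to work: the new
// replica gets the range's next_replica_id, independent of its StoreID.
func addReplica(replicas []ReplicaDescriptor, nodeID, storeID int, nextReplicaID *int) []ReplicaDescriptor {
	replicas = append(replicas, ReplicaDescriptor{
		NodeID:  nodeID,
		StoreID: storeID,
		// Correct: take the next replica ID. The suspected bug would amount
		// to writing storeID here instead, yielding (3,3) and (2,2) below.
		ReplicaID: *nextReplicaID,
	})
	*nextReplicaID++
	return replicas
}

func main() {
	next := 2
	var replicas []ReplicaDescriptor
	replicas = addReplica(replicas, 3, 3, &next) // store 3 added first -> replica_id 2
	replicas = addReplica(replicas, 2, 2, &next) // store 2 added next  -> replica_id 3
	fmt.Println(replicas, "next_replica_id:", next)
}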

tbg added a commit to tbg/cockroach that referenced this issue Jul 5, 2016
The tests were sloppy in setting up the initial cluster with
the downloaded data directories. There was a window of time
during which the pre-restore cluster and the post-restore
cluster had nodes running simultaneously. See cockroachdb#7224.
@tbg
Member Author

tbg commented Jul 5, 2016

The wrong plot thickened: The allocator tests were mixing two separate clusters and the panic was an artifact of RangeDescriptors in disagreement. #7625
Back to square one on this one. I'll at least put some more useful debugging in.

tbg added a commit to tbg/cockroach that referenced this issue Jul 5, 2016
tbg added a commit to tbg/cockroach that referenced this issue Jul 5, 2016
@tbg
Member Author

tbg commented Jul 7, 2016

Just a heads up that we lost the logs for the gamma backup.

@tamird
Contributor

tamird commented Sep 6, 2016

I believe this issue is no longer relevant since 2f1f092 went in. @tschottdorf?

@tbg
Member Author

tbg commented Sep 6, 2016

The underlying bug may still be relevant since it has never been resolved, but the mechanism by which it would fail would have changed completely. I'm ok with closing this issue - a lot of water has gone under the bridge since then and there's nothing actionable (and we'll remember this issue should it happen again) - but others may disagree.

@bdarnell
Contributor

bdarnell commented Sep 7, 2016

I'd vote to keep it open since this thread may still be useful in identifying the underlying bug, but I don't feel too strongly either way.

@tamird
Contributor

tamird commented Sep 7, 2016

Could you give this issue a new title?

@bdarnell
Contributor

bdarnell commented Sep 7, 2016

I'd rather not; "clobbering" is how I remember/search for this issue. Better to close it than retitle it at this point I think.

@tbg
Member Author

tbg commented Sep 7, 2016

Closing for now.

@bdarnell
Contributor

Revisiting this in light of #9265: I don't think it's related. While ChangeReplicas does have a bug that would allow meta2 records to become inconsistent with the primary range-local copy of the descriptor (and this could eventually lead to data loss as the replica GC queue incorrectly deleted things), I don't think it could lead to this kind of split-brain within the range (and this comment appears to indicate that we're not missing any meta records). #9265 always leads to the loss of meta records, not of the record that is co-located with the transaction record.
