stability: freeze-cluster failure #7238

Closed · mberhault opened this issue Jun 15, 2016 · 8 comments

@mberhault (Contributor)

sha: c17928f

Tried to freeze nodes on beta prior to upgrade. All nodes were healthy at the time.
Started cockroach freeze-cluster against node 1:

$ cockroach freeze-cluster --host=104.196.97.139 --ca-cert=certs/ca.crt --cert=certs/root.client.crt --key=certs/root.client.key
proposed freeze to 4238 ranges
waiting for 6 stores to apply operation
node 1, store 1: 1 replicas report wrong status: [{NodeID:1 StoreID:1 ReplicaID:3}]
node 3, store 3: ready
node 4, store 4: 1 replicas report wrong status: [{NodeID:4 StoreID:4 ReplicaID:6}]
node 2, store 2: 997 replicas report wrong status [truncated]: [{NodeID:2 StoreID:2 ReplicaID:4} {NodeID:2 StoreID:2 ReplicaID:5} {NodeID:2 StoreID:2 ReplicaID:1} {NodeID:2 StoreID:2 ReplicaID:9} {NodeID:2 StoreID:2 ReplicaID:6} {NodeID:2 StoreID:2 ReplicaID:4} {NodeID:2 StoreID:2 ReplicaID:7} {NodeID:2 StoreID:2 ReplicaID:4} {NodeID:2 StoreID:2 ReplicaID:6} {NodeID:2 StoreID:2 ReplicaID:2}]
node 6, store 6: 1 replicas report wrong status: [{NodeID:6 StoreID:6 ReplicaID:4}]
node 5, store 5: 1 replicas report wrong status: [{NodeID:5 StoreID:5 ReplicaID:1}]
waiting for 5 stores to apply operation
.....
waiting for 5 stores to apply operation
node 4, store 4: 1 replicas report wrong status: [{NodeID:4 StoreID:4 ReplicaID:6}]
node 1, store 1: 1 replicas report wrong status: [{NodeID:1 StoreID:1 ReplicaID:3}]
node 2, store 2: 997 replicas report wrong status [truncated]: [{NodeID:2 StoreID:2 ReplicaID:4} {NodeID:2 StoreID:2 ReplicaID:5} {NodeID:2 StoreID:2 ReplicaID:1} {NodeID:2 StoreID:2 ReplicaID:9} {NodeID:2 StoreID:2 ReplicaID:6} {NodeID:2 StoreID:2 ReplicaID:4} {NodeID:2 StoreID:2 ReplicaID:7} {NodeID:2 StoreID:2 ReplicaID:4} {NodeID:2 StoreID:2 ReplicaID:6} {NodeID:2 StoreID:2 ReplicaID:2}]
node 6, store 6: 1 replicas report wrong status: [{NodeID:6 StoreID:6 ReplicaID:4}]
node 5, store 5: 1 replicas report wrong status: [{NodeID:5 StoreID:5 ReplicaID:1}]
waiting for 5 stores to apply operation
Error: rpc error: code = 2 desc = timed out waiting for 5 stores to report freeze
Failed running "freeze-cluster"

I stopped the block_writer and photos apps sometime after I started the freeze-cluster command, but long before it terminated.
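
For readers following along: the output above is consistent with a client-side loop that proposes the freeze once and then polls every store until each one reports all of its replicas frozen, giving up after a deadline. A minimal sketch in Go of that wait loop, using hypothetical storeStatus/pollStores stand-ins rather than CockroachDB's actual API:

    // A minimal sketch, assuming hypothetical types; not CockroachDB's code.
    package main

    import (
    	"errors"
    	"fmt"
    	"time"
    )

    // storeStatus is a stand-in for whatever each store reports back.
    type storeStatus struct {
    	NodeID, StoreID int
    	WrongReplicas   []int // replica IDs that have not applied the freeze
    }

    // pollStores stands in for the RPC that asks every store how many of its
    // replicas have applied the freeze command.
    func pollStores() []storeStatus {
    	return nil // a real client would issue an RPC here
    }

    // waitForFreeze polls until no store reports unfrozen replicas, or times out.
    func waitForFreeze(timeout time.Duration) error {
    	deadline := time.Now().Add(timeout)
    	for time.Now().Before(deadline) {
    		pending := 0
    		for _, s := range pollStores() {
    			if len(s.WrongReplicas) > 0 {
    				pending++
    				fmt.Printf("node %d, store %d: %d replicas report wrong status\n",
    					s.NodeID, s.StoreID, len(s.WrongReplicas))
    			}
    		}
    		if pending == 0 {
    			return nil // every store has applied the freeze
    		}
    		fmt.Printf("waiting for %d stores to apply operation\n", pending)
    		time.Sleep(time.Second)
    	}
    	return errors.New("timed out waiting for stores to report freeze")
    }

    func main() {
    	if err := waitForFreeze(time.Minute); err != nil {
    		fmt.Println("Error:", err)
    	}
    }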

@mberhault (Contributor, Author)

Copying logs directories on all nodes to ~/logs.7238

@tbg (Member) commented Jun 15, 2016

We just tried the freeze again and it reported 0 newly frozen ranges (meaning that at least all of the leaders we talked to the first time had actually applied the freeze). I'm not surprised to see a few stores with one offending replica (after all, that might be one that had been removed but not gc'ed), but store 2 with its 997 replicas is definitely off.

When we restarted the cluster, it didn't seem to recover from the freeze on its own, and subsequent attempts to unfreeze failed (they looked like they were getting stuck on node 2 again).

My feeling is that it's not only the freeze that didn't go as planned here.
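
A toy illustration of the removed-but-not-gc'ed theory above, using hypothetical types rather than CockroachDB's code: a replica that was rebalanced away but not yet garbage-collected never applies the freeze command, so the store keeps counting it among the replicas with the wrong status.

    // A toy sketch under the stated assumption; the types are hypothetical.
    package main

    import "fmt"

    type replicaState struct {
    	NodeID, StoreID, ReplicaID int
    	Frozen                     bool // has this replica applied the freeze?
    	PendingGC                  bool // removed from the range, awaiting gc
    }

    // wrongStatus mimics the per-store check in the CLI output: any replica
    // that has not applied the freeze is reported, stale or not.
    func wrongStatus(replicas []replicaState) []replicaState {
    	var wrong []replicaState
    	for _, r := range replicas {
    		if !r.Frozen {
    			wrong = append(wrong, r)
    		}
    	}
    	return wrong
    }

    func main() {
    	replicas := []replicaState{
    		// A stale replica: removed from the range, so it never froze.
    		{NodeID: 1, StoreID: 1, ReplicaID: 3, Frozen: false, PendingGC: true},
    		// A live replica that applied the freeze normally.
    		{NodeID: 1, StoreID: 1, ReplicaID: 8, Frozen: true},
    	}
    	w := wrongStatus(replicas)
    	fmt.Printf("%d replicas report wrong status: %v\n", len(w), w)
    }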


@mberhault (Contributor, Author)

Attempted another restart as well as freeze-cluster --undo, to no avail.
Any idea what I can do to unwedge this?

@mberhault (Contributor, Author)

Raft debug endpoint (for node 2): https://104.196.24.126:8080/_status/raft

Sample frozen ranges:

    {1492 /Table/51/1/332447685264762846/"201dc0ce-556c-46d0-bd1a-a70f928f07ae"/2062006 /Table/51/1/334716605612099869/"0a8adc1d-a9f4-428e-92e7-e49badc56d15"/148468 [{6 6 1} {5 5 2} {4 4 3}] 4}
    {3610 /Table/51/1/3395665550234296895/"6bd2f5b3-3af9-4fe7-b7c2-c0ffd2eaa929"/1274405 /Table/51/1/3397907315085389800/"9804dadf-1cca-4e6c-9276-25412f43cb83"/19208 [{6 6 1} {5 5 2} {4 4 3}] 4}
    {4228 /Table/51/1/918089996508004678/"d6ac6055-6db2-41df-8c56-5678a265b258"/1734081 /Table/51/1/920306392201122977/"02a2e92f-2454-4415-aa02-e4e38b28d4de"/592690 [{1 1 1} {2 2 2} {6 6 3}] 4}
    {230 /System/"update-cluster" /Table/11 [{2 2 4} {5 5 2} {1 1 5}] 6}
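
For anyone decoding the lines above: they look like Go's default rendering of a range descriptor, i.e. {RangeID StartKey EndKey [{NodeID StoreID ReplicaID} ...] NextReplicaID}. A rough sketch that reproduces the last sample, using an approximated struct rather than the real proto:

    // An approximation of the descriptor printed above; not the actual proto.
    package main

    import "fmt"

    type replicaDescriptor struct {
    	NodeID, StoreID, ReplicaID int
    }

    type rangeDescriptor struct {
    	RangeID       int
    	StartKey      string
    	EndKey        string
    	Replicas      []replicaDescriptor
    	NextReplicaID int
    }

    func main() {
    	// Range 230 from the samples: the system range from
    	// /System/"update-cluster" to /Table/11, on nodes 2, 5, and 1.
    	r230 := rangeDescriptor{
    		RangeID:  230,
    		StartKey: `/System/"update-cluster"`,
    		EndKey:   "/Table/11",
    		Replicas: []replicaDescriptor{
    			{NodeID: 2, StoreID: 2, ReplicaID: 4},
    			{NodeID: 5, StoreID: 5, ReplicaID: 2},
    			{NodeID: 1, StoreID: 1, ReplicaID: 5},
    		},
    		NextReplicaID: 6,
    	}
    	// Prints {230 /System/"update-cluster" /Table/11 [{2 2 4} {5 5 2} {1 1 5}] 6},
    	// matching the last sample line above.
    	fmt.Println(r230)
    }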

@mberhault (Contributor, Author)

Data and logs on 104.196.24.126 copied to /mnt/data/backup.7238 and ~/logs.7238.
Restarting the node.

@tbg (Member) commented Jun 15, 2016

The Raft log for range 230 is not spectacular: it's been truncated to ~20 entries and shows nothing but the leader lease going back and forth between node 2 and node 5.

@mberhault (Contributor, Author)

I believe this will be deprecated by proposer-eval-kv. @bdarnell, can you please confirm and close?

@bdarnell (Contributor)

Let's wait until propEvalKV is done and we can verify that the upgrade goes smoothly. (And then we can replace this issue with one to rip out all the freeze-cluster machinery.)

bdarnell added a commit to bdarnell/cockroach that referenced this issue Apr 11, 2017
This command was unfinished, and now it never will be as freezing the
cluster is no longer an acceptable upgrade process.

Closes cockroachdb#7238
Closes cockroachdb#7038
Closes cockroachdb#7928