
stability: Zero QPS on cyan #17741

Closed
bdarnell opened this issue Aug 17, 2017 · 16 comments

bdarnell (Contributor) commented Aug 17, 2017

The cyan (continuous deployment) cluster has been seeing zero qps for most of the last four hours.

image

The start of the problems corresponds to an upgrade spanning the following range of commits (more than we'd like to see in continuous deployment, but the CI server was having trouble): 8cf2141...c13339d

bdarnell self-assigned this Aug 17, 2017
bdarnell (Contributor, Author) commented:

In node 1's logs, just before everything goes to zero, we see lots of RPCs (both reads and writes) hanging on a handful of different ranges:

W170817 16:08:06.602921 2934 kv/dist_sender.go:1159  [client=127.0.0.1:48534,user=root,n1] have been waiting 1m0s sending RPC to r1055 for batch: [txn: 4777040d], Scan [/Table/51/1/2343040423567134585,/Table/51/1/2343040423567134586)
W170817 16:08:08.448937 2950 kv/dist_sender.go:1159  [client=127.0.0.1:48512,user=root,n1] have been waiting 1m0s sending RPC to r1055 for batch: [txn: aa80d7a8], Scan [/Table/51/1/2342652009156662599,/Table/51/1/2342652009156662600)
W170817 16:08:09.971278 2902 kv/dist_sender.go:1159  [client=127.0.0.1:48476,user=root,n1] have been waiting 1m0s sending RPC to r1055 for batch: [txn: 48eb8bf0], Scan [/Table/51/1/2343993231727215555,/Table/51/1/2343993231727215556)
W170817 16:08:10.910791 2908 kv/dist_sender.go:1159  [client=127.0.0.1:48450,user=root,n1] have been waiting 1m0s sending RPC to r4534 for batch: [txn: 63b26d09], Scan [/Table/51/1/8888489703156528817,/Table/51/1/8888489703156528818)
W170817 16:08:11.968480 2903 kv/dist_sender.go:1159  [client=127.0.0.1:48472,user=root,n1] have been waiting 1m0s sending RPC to r5002 for batch: [txn: 56f72e34], Scan [/Table/51/1/-196030775519770680,/Table/51/1/-196030775519770679)
W170817 16:08:13.008812 2864 kv/dist_sender.go:1159  [client=127.0.0.1:48548,user=root,n1] have been waiting 1m0s sending RPC to r4534 for batch: [txn: 9abe57ad], BeginTransaction [/Table/51/1/8886733918494989349/0,/Min), Put [/Table/51/1/8886733918494989349/0,/Min), EndTransaction [/Table/51/1/8886733918494989349/0,/Min)
W170817 16:08:14.510813 2925 kv/dist_sender.go:1159  [client=127.0.0.1:48576,user=root,n1] have been waiting 1m0s sending RPC to r2505 for batch: [txn: f2ea383e], BeginTransaction [/Table/51/1/-191524590013694750/0,/Min), Put [/Table/51/1/-191524590013694750/0,/Min), EndTransaction [/Table/51/1/-191524590013694750/0,/Min)
W170817 16:08:14.520626 2904 kv/dist_sender.go:1159  [client=127.0.0.1:48468,user=root,n1] have been waiting 1m0s sending RPC to r4525 for batch: [txn: 8a0a3cba], BeginTransaction [/Table/51/1/-6432910072453335426/0,/Min), Put [/Table/51/1/-6432910072453335426/0,/Min), EndTransaction [/Table/51/1/-6432910072453335426/0,/Min)
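
As context for these warnings, the slow-RPC message is the product of a timer-plus-select pattern along the following lines. This is only an illustrative sketch, not the actual kv/dist_sender.go code, and sendWithSlowWarning is an invented name: the sender runs the RPC in a goroutine and periodically warns while it remains outstanding.

    package main

    import (
        "context"
        "errors"
        "log"
        "time"
    )

    // sendWithSlowWarning is an illustrative stand-in for the behavior above:
    // run the RPC in a goroutine and log a warning each time the warn timer
    // fires while the RPC is still outstanding.
    func sendWithSlowWarning(ctx context.Context, warnAfter time.Duration, send func() error) error {
        done := make(chan error, 1)
        go func() { done <- send() }()

        warn := time.NewTimer(warnAfter)
        defer warn.Stop()
        start := time.Now()
        for {
            select {
            case err := <-done:
                return err
            case <-warn.C:
                log.Printf("have been waiting %s sending RPC", time.Since(start))
                warn.Reset(warnAfter)
            case <-ctx.Done():
                return ctx.Err()
            }
        }
    }

    func main() {
        // Simulate an RPC that hangs for a while, as in the logs above.
        err := sendWithSlowWarning(context.Background(), time.Second, func() error {
            time.Sleep(2500 * time.Millisecond)
            return errors.New("range unavailable")
        })
        log.Println("RPC finished:", err)
    }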

During the zero-qps periods, everything is pretty stable:

I170817 16:14:55.956538 38 server/status/runtime.go:223  [n1] runtime stats: 4.9 GiB RSS, 343 goroutines, 171 MiB/38 MiB/300 MiB GO alloc/idle/total, 3.9 GiB/4.5 GiB CGO alloc/total, 729.56cgo/sec, 0.04/0.01 %(u/s)time, 0.00 %gc (0x)
I170817 16:15:05.956140 38 server/status/runtime.go:223  [n1] runtime stats: 4.9 GiB RSS, 343 goroutines, 192 MiB/24 MiB/300 MiB GO alloc/idle/total, 3.9 GiB/4.5 GiB CGO alloc/total, 829.33cgo/sec, 0.04/0.01 %(u/s)time, 0.00 %gc (0x)
I170817 16:15:15.959986 38 server/status/runtime.go:223  [n1] runtime stats: 4.9 GiB RSS, 343 goroutines, 119 MiB/67 MiB/300 MiB GO alloc/idle/total, 3.9 GiB/4.5 GiB CGO alloc/total, 701.93cgo/sec, 0.06/0.01 %(u/s)time, 0.00 %gc (1x)

The logs on nodes 2-4 look similar, at least on a first pass.

Nodes 5 and 6 have a lot of quiescence-related log spam during the zero-qps periods (and before), involving the same ranges as the other nodes' hanging RPCs:

I170817 16:17:31.992910 132 storage/replica.go:3067  [n5,s5,r1055/1:/Table/51/1/234{21766…-66996…}] not quiescing: 1 pending commands
I170817 16:17:32.110714 146 storage/replica.go:3067  [n5,s5,r4525/1:/Table/51/1/-643{7114…-2634…}] not quiescing: 1 pending commands
I170817 16:17:32.110810 140 storage/replica.go:3067  [n5,s5,r2505/3:/Table/51/1/-1{94001…-89520…}] not quiescing: 1 pending commands
I170817 16:17:32.110859 144 storage/replica.go:3067  [n5,s5,r5002/3:/Table/51/1/-19{84573…-40017…}] not quiescing: 1 pending commands
I170817 16:17:32.192577 133 storage/replica.go:3067  [n5,s5,r4144/1:/Table/51/1/311{39871…-84675…}] not quiescing: 1 pending commands
I170817 16:17:32.192615 136 storage/replica.go:3067  [n5,s5,r1055/1:/Table/51/1/234{21766…-66996…}] not quiescing: 1 pending commands
I170817 16:17:32.192651 139 storage/replica.go:3067  [n5,s5,r4534/1:/Table/51/1/888{53072…-98069…}] not quiescing: 1 pending commands
I170817 16:17:32.310790 131 storage/replica.go:3067  [n5,s5,r2505/3:/Table/51/1/-1{94001…-89520…}] not quiescing: 1 pending commands
I170817 16:17:32.310871 105 storage/replica.go:3067  [n5,s5,r5002/3:/Table/51/1/-19{84573…-40017…}] not quiescing: 1 pending commands
I170817 16:10:00.510680 133 storage/replica.go:3067  [n6,s6,r5002/1:/Table/51/1/-19{84573…-40017…}] not quiescing: 1 pending commands
I170817 16:10:00.710845 124 storage/replica.go:3067  [n6,s6,r2505/1:/Table/51/1/-1{94001…-89520…}] not quiescing: 1 pending commands
I170817 16:10:00.711337 119 storage/replica.go:3067  [n6,s6,r5002/1:/Table/51/1/-19{84573…-40017…}] not quiescing: 1 pending commands
I170817 16:10:00.910374 137 storage/replica.go:3067  [n6,s6,r2505/1:/Table/51/1/-1{94001…-89520…}] not quiescing: 1 pending commands
I170817 16:10:00.910828 142 storage/replica.go:3067  [n6,s6,r5002/1:/Table/51/1/-19{84573…-40017…}] not quiescing: 1 pending commands
I170817 16:10:01.111205 129 storage/replica.go:3067  [n6,s6,r2505/1:/Table/51/1/-1{94001…-89520…}] not quiescing: 1 pending commands
I170817 16:10:01.111370 122 storage/replica.go:3067  [n6,s6,r5002/1:/Table/51/1/-19{84573…-40017…}] not quiescing: 1 pending commands
I170817 16:10:01.211252 139 storage/replica.go:3067  [n6,s6,r2505/1:/Table/51/1/-1{94001…-89520…}] not quiescing: 1 pending commands
I170817 16:10:01.211670 140 storage/replica.go:3067  [n6,s6,r5002/1:/Table/51/1/-19{84573…-40017…}] not quiescing: 1 pending commands

Both nodes 5 and 6 had this "not quiescing" log spam in the 16:00 and 18:00 hours, but only node 6 had it in the 19:00 hour (and everything looks fine in the 17:00 hour). Looking further back in the logs, it looks like something similar happened on 8/15, on node 4 this time, so the commit range above may be a red herring.

tbg (Member) commented Aug 17, 2017

I'm just looking at https://cockroach-cyan-0001.crdb.io:8080/#/reports/range/1 and have seen multiple term+leader changes just from refreshing the page a bunch of times.

bdarnell (Contributor, Author) commented:

After I filed this issue, it stopped happening. So while there's definitely some issue to be investigated here, it's not as acute as it appeared on Thursday.

Cyan is in trouble again right now (after an upgrade spanning these commits), with a slightly different failure mode: now the nodes are unable to update their liveness heartbeats.

bdarnell (Contributor, Author) commented:

Raft ticks got to be very irregular around this time:

image

This persisted for a while after the cluster recovered, then went away on its own. It's unclear why (maybe #17617?)

petermattis (Collaborator) commented:

> This persisted for a while after the cluster recovered, then went away on its own. It's unclear why (maybe #17617?)

The only way #17617 could be having an effect here is if Replica.mu was being held for a long period of time and I don't think that happens.

Hmm, we do hold Replica.mu across a few disk operations: Replica.maybeSetCorrupt, Replica.initRaftMuLockedReplicaMuLocked and Replica.setReplicaIDRaftMuLockedMuLocked.
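
To illustrate why that would matter (a toy sketch, not CockroachDB code; the names below are invented): any disk I/O performed while holding the mutex extends the critical section, so goroutines that need the same lock, such as the Raft tick path, stall for the full I/O latency.

    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    // toyReplica is a stand-in, not storage.Replica.
    type toyReplica struct {
        mu sync.Mutex
    }

    // slowUpdate holds the lock across a simulated disk operation.
    func (r *toyReplica) slowUpdate() {
        r.mu.Lock()
        defer r.mu.Unlock()
        time.Sleep(100 * time.Millisecond) // stand-in for disk I/O under the lock
    }

    // tick represents work (like Raft ticking) that also needs the lock and is
    // therefore delayed for as long as slowUpdate holds it.
    func (r *toyReplica) tick() {
        r.mu.Lock()
        defer r.mu.Unlock()
    }

    func main() {
        var r toyReplica
        go r.slowUpdate()
        time.Sleep(time.Millisecond) // let slowUpdate grab the lock first
        start := time.Now()
        r.tick()
        fmt.Println("tick blocked for", time.Since(start))
    }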

bdarnell (Contributor, Author) commented:

Some of the issues here (e.g. the ping-ponging of the range 1 lease between two replicas) are similar to what is seen in #17524. However, I don't think it can be the deadlock fixed in #17796 because the "have been waiting %s attempting to acquire quota" message never appears in the logs.

tbg (Member) commented Aug 22, 2017

You can make sure by checking all the goroutine dumps for the relevant stack containing maybeAcquireProposalQuota.

tbg (Member) commented Aug 22, 2017

Or even better, a select in timer.go.
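
For reference, one way to run that check is a short throwaway program (hypothetical, not part of the repo) that prints every goroutine stanza in a dump mentioning either marker:

    package main

    import (
        "bufio"
        "fmt"
        "os"
        "strings"
    )

    // scanDump prints each goroutine stanza (stanzas are separated by blank
    // lines in a debug=2 goroutine dump) that mentions any of the markers.
    func scanDump(path string, markers ...string) error {
        f, err := os.Open(path)
        if err != nil {
            return err
        }
        defer f.Close()

        var stanza []string
        flush := func() {
            text := strings.Join(stanza, "\n")
            for _, m := range markers {
                if strings.Contains(text, m) {
                    fmt.Println(text)
                    fmt.Println()
                    break
                }
            }
            stanza = stanza[:0]
        }

        sc := bufio.NewScanner(f)
        for sc.Scan() {
            line := sc.Text()
            if line == "" {
                flush()
                continue
            }
            stanza = append(stanza, line)
        }
        flush()
        return sc.Err()
    }

    func main() {
        if len(os.Args) < 2 {
            fmt.Fprintln(os.Stderr, "usage: scandump <goroutine-dump-file>")
            os.Exit(1)
        }
        if err := scanDump(os.Args[1], "maybeAcquireProposalQuota", "timer.go"); err != nil {
            fmt.Fprintln(os.Stderr, err)
            os.Exit(1)
        }
    }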

nvanbenschoten (Member) commented:

@bdarnell I wouldn't rule the possibility out because of that. The deadlock fixed in #17796 would happen just before the log message.

bdarnell (Contributor, Author) commented:

OK. I'll try to catch a stack dump next time it happens, then (I didn't save one from last time).

bdarnell (Contributor, Author) commented:

Stack dumps attached. No occurrence of maybeAcquireProposalQuota or timer.go.

cyanstacks.zip

bdarnell (Contributor, Author) commented Sep 1, 2017

I've reproduced this on adriatic by doing a bunch of rolling restarts (now with PreVote disabled). It looks like a small number of ranges get stuck in leader-not-leaseholder situations.

image
image
image

The leaseholder (node 5) has a lease request that is endlessly reproposing (for reasonTicks, except for one reasonNewLeaderOrConfigChange at the beginning):

2017/09/01 21:04:37.774844 | 1356.078250 | [async] storage.pendingLeaseRequest: requesting lease
-- | -- | --
21:04:37.774847 | .     4 | ... txnid:00f6f293-f3bb-490f-ae29-f6aef0848025
21:04:37.774876 | .    28 | ... [n5,s5,r680/1:/Table/51/1/67{52956…-70950…}] read-write path
21:04:37.774918 | .    43 | ... [n5,s5,r680/1:/Table/51/1/67{52956…-70950…}] applied timestamp cache
21:04:37.793343 | . 18425 | ... [n5,s5,r680/1:/Table/51/1/67{52956…-70950…}] evaluated request
21:04:37.793354 | .    11 | ... [n5,s5,r680/1:/Table/51/1/67{52956…-70950…}] acquired {raft,replica}mu
21:04:38.284993 | .491639 | ... [n5,s5,r680/1:/Table/51/1/67{52956…-70950…}] retry proposal 7b8f2de63bd730ae: reasonNewLeaderOrConfigChange
21:04:38.286033 | .  1040 | ... [n5,s5,r680/1:/Table/51/1/67{52956…-70950…}] applied timestamp cache
21:04:38.286434 | .   401 | ... [n5,s5,r680/1:/Table/51/1/67{52956…-70950…}] evaluated request
21:04:38.286440 | .     5 | ... [n5,s5,r680/1:/Table/51/1/67{52956…-70950…}] acquired {raft,replica}mu
21:04:40.974011 | 2.687571 | ... [n5,s5,r680/1:/Table/51/1/67{52956…-70950…}] retry proposal 596cbaf30588b342: reasonTicks
21:04:40.974171 | .   160 | ... [n5,s5,r680/1:/Table/51/1/67{52956…-70950…}] applied timestamp cache
21:04:40.975472 | .  1301 | ... [n5,s5,r680/1:/Table/51/1/67{52956…-70950…}] evaluated request
21:04:40.975478 | .     6 | ... [n5,s5,r680/1:/Table/51/1/67{52956…-70950…}] acquired {raft,replica}mu
21:04:43.974180 | 2.998702 | ... [n5,s5,r680/1:/Table/51/1/67{52956…-70950…}] retry proposal 4a71207ca4288f12: reasonTicks
21:04:43.975004 | .   824 | ... [n5,s5,r680/1:/Table/51/1/67{52956…-70950…}] applied timestamp cache
21:04:43.975167 | .   163 | ... [n5,s5,r680/1:/Table/51/1/67{52956…-70950…}] evaluated request
21:04:43.975175 | .     8 | ... [n5,s5,r680/1:/Table/51/1/67{52956…-70950…}] acquired {raft,replica}mu
21:04:46.974760 | 2.999585 | ... [n5,s5,r680/1:/Table/51/1/67{52956…-70950…}] retry proposal 5e6bf046e0bf4fce: reasonTicks
21:04:46.975148 | .   387 | ... [n5,s5,r680/1:/Table/51/1/67{52956…-70950…}] applied timestamp cache
21:04:46.975281 | .   133 | ... [n5,s5,r680/1:/Table/51/1/67{52956…-70950…}] evaluated request
21:04:46.975287 | .     6 | ... [n5,s5,r680/1:/Table/51/1/67{52956…-70950…}] acquired {raft,replica}mu
21:04:49.974233 | 2.998947 | ... [n5,s5,r680/1:/Table/51/1/67{52956…-70950…}] retry proposal 2e83db299f6c1979: reasonTicks
21:04:49.974625 | .   392 | ... [n5,s5,r680/1:/Table/51/1/67{52956…-70950…}] applied timestamp cache
21:04:49.975096 | .   470 | ... [n5,s5,r680/1:/Table/51/1/67{52956…-70950…}] evaluated request
21:04:49.975102 | .     7 | ... [n5,s5,r680/1:/Table/51/1/67{52956…-70950…}] acquired {raft,replica}mu
21:04:52.974680 | 2.999578 | ... [n5,s5,r680/1:/Table/51/1/67{52956…-70950…}] retry proposal 155808b6638d8bef: reasonTicks

Node 6 also has an endlessly-looping lease request:

2017/09/01 21:05:44.251439 | 2643.330366 | [async] storage.pendingLeaseRequest: requesting lease
-- | -- | --
21:05:44.251446 | .     7 | ... txnID:e053d438-be29-4ba4-9d9d-c6a1f8cfd279
21:05:44.251666 | .   221 | ... [n6,s6,r680/3:/Table/51/1/67{52956…-70950…}] read-write path
21:05:44.251760 | .    94 | ... [n6,s6,r680/3:/Table/51/1/67{52956…-70950…}] applied timestamp cache
21:05:44.252148 | .   388 | ... [n6,s6,r680/3:/Table/51/1/67{52956…-70950…}] evaluated request
21:05:44.252157 | .     9 | ... [n6,s6,r680/3:/Table/51/1/67{52956…-70950…}] acquired {raft,replica}mu
21:05:47.196910 | 2.944753 | ... [n6,s6,r680/3:/Table/51/1/67{52956…-70950…}] retry proposal 25a668687d25d222: reasonTicks
21:05:47.197014 | .   104 | ... [n6,s6,r680/3:/Table/51/1/67{52956…-70950…}] applied timestamp cache
21:05:47.197186 | .   172 | ... [n6,s6,r680/3:/Table/51/1/67{52956…-70950…}] evaluated request
21:05:47.197196 | .    10 | ... [n6,s6,r680/3:/Table/51/1/67{52956…-70950…}] acquired {raft,replica}mu
21:05:50.197276 | 3.000080 | ... [n6,s6,r680/3:/Table/51/1/67{52956…-70950…}] retry proposal 0c466cd96ac0ca11: reasonTicks
21:05:50.198462 | .  1186 | ... [n6,s6,r680/3:/Table/51/1/67{52956…-70950…}] applied timestamp cache
21:05:50.198800 | .   338 | ... [n6,s6,r680/3:/Table/51/1/67{52956…-70950…}] evaluated request
21:05:50.198812 | .    13 | ... [n6,s6,r680/3:/Table/51/1/67{52956…-70950…}] acquired {raft,replica}mu
21:05:53.205698 | 3.006886 | ... [n6,s6,r680/3:/Table/51/1/67{52956…-70950…}] retry proposal 1351d0e9e0c1d013: reasonTicks
21:05:53.206552 | .   854 | ... [n6,s6,r680/3:/Table/51/1/67{52956…-70950…}] applied timestamp cache
21:05:53.206702 | .   149 | ... [n6,s6,r680/3:/Table/51/1/67{52956…-70950…}] evaluated request
21:05:53.206708 | .     7 | ... [n6,s6,r680/3:/Table/51/1/67{52956…-70950…}] acquired {raft,replica}mu
21:05:56.196328 | 2.989620 | ... [n6,s6,r680/3:/Table/51/1/67{52956…-70950…}] retry proposal 1badab42245da721: reasonTicks
21:05:56.196545 | .   217 | ... [n6,s6,r680/3:/Table/51/1/67{52956…-70950…}] applied timestamp cache
21:05:56.196706 | .   161 | ... [n6,s6,r680/3:/Table/51/1/67{52956…-70950…}] evaluated request
21:05:56.196713 | .     7 | ... [n6,s6,r680/3:/Table/51/1/67{52956…-70950…}] acquired {raft,replica}mu
21:05:59.199920 | 3.003206 | ... [n6,s6,r680/3:/Table/51/1/67{52956…-70950…}] retry proposal 220ae9e9bdd65749: reasonTicks

Node 1 (the raft leader) doesn't have anything related to r680 in its active requests except requests that it is forwarding to node 5 (so it does, or at least did, recognize node 5's lease despite being the node that is marked as "no lease" in the range debug page).

petermattis (Collaborator) commented:

We're seeing this same problem on adriatic-06 right now:

2017/09/02 14:37:07.357471 | 6247.802504 | [async] storage.pendingLeaseRequest: requesting lease
-- | -- | --
14:37:07.357474 | .     3 | ... txnid:31311ebf-4b19-4a3c-a39e-cd838992dae2
14:37:09.482159 | 2.124685 | ... [n6,s6,r119/3:/Table/51/1/-33{30130…-12356…}] read-write path
14:37:09.482196 | .    37 | ... [n6,s6,r119/3:/Table/51/1/-33{30130…-12356…}] applied timestamp cache
14:37:09.482397 | .   201 | ... [n6,s6,r119/3:/Table/51/1/-33{30130…-12356…}] evaluated request
14:37:09.482409 | .    12 | ... [n6,s6,r119/3:/Table/51/1/-33{30130…-12356…}] acquired {raft,replica}mu
14:37:25.771457 | 16.289049 | ... [n6,s6,r119/3:/Table/51/1/-33{30130…-12356…}] retry proposal 58aeec2b879cdee2: reasonNewLeader
14:37:25.771706 | .   248 | ... [n6,s6,r119/3:/Table/51/1/-33{30130…-12356…}] applied timestamp cache
14:37:25.772344 | .   638 | ... [n6,s6,r119/3:/Table/51/1/-33{30130…-12356…}] evaluated request
14:37:25.772352 | .     8 | ... [n6,s6,r119/3:/Table/51/1/-33{30130…-12356…}] acquired {raft,replica}mu
14:37:55.046477 | 29.274125 | ... [n6,s6,r119/3:/Table/51/1/-33{30130…-12356…}] retry proposal 39c0a32ab741b0aa: reasonTicks
14:37:55.046744 | .   267 | ... [n6,s6,r119/3:/Table/51/1/-33{30130…-12356…}] applied timestamp cache
14:37:55.047016 | .   272 | ... [n6,s6,r119/3:/Table/51/1/-33{30130…-12356…}] evaluated request
14:37:55.047024 | .     8 | ... [n6,s6,r119/3:/Table/51/1/-33{30130…-12356…}] acquired {raft,replica}mu

The Range debug page for r119:

screen shot 2017-09-02 at 12 32 01 pm

So n6 is trying to figure out who has the lease for r119 and is somehow failing to forward the proposal to n1 (the leader). Or perhaps it is forwarding the proposal but n1 isn't handling it. I'm going to turn on some replica and raft vmodule logs.

petermattis (Collaborator) commented:

With --vmodule=replica=4, on n6 I see:

I170902 16:49:17.878079 527 storage/replica.go:3073 [n6,s6,r119/3:/Table/51/1/-33{30130…-12356…}] proposing command 57b2495fe93f8e4f: 1 RequestLease
I170902 16:49:17.879010 124 storage/replica.go:4020 [n6,s6,r119/3:/Table/51/1/-33{30130…-12356…}] sending raft request range_id:119 from_replica:<node_id:6 store_id:6 replica_id:3 > to_replica:<node_id:1 store_id:1 replica_id:2 > message:<type:MsgProp to:2 from:3 term:0 logTerm:0 index:0 entries:<Term:0 Index:0 Type:EntryNormal Data:

Shortly after on n1:

I170902 16:49:17.880274 101 storage/replica.go:3111 [n1,s1,r119/2:/Table/51/1/-33{30130…-12356…}] unquiescing
I170902 16:49:17.880786 101 vendor/github.com/coreos/etcd/raft/raft.go:840 [n1,s1,r119/2:/Table/51/1/-33{30130…-12356…}] 2 [term 117] transfer leadership to 3 is in progress; dropping proposal

n1 is trying to transfer Raft leadership to n6 whenever it tries to quiesce r119. Because the leadership transfer is in progress, n1 is dropping the RequestLease proposal. Still to figure out is why the Raft leadership transfer isn't finishing.

petermattis (Collaborator) commented:

This isn't good: n1 is trying to quiesce r119. The unquiescing is due to n6 responding with a heartbeat. n1 shouldn't be trying to quiesce when it isn't the leaseholder:

I170902 16:49:20.769732 62 storage/replica.go:3098 [n1,s1,r119/2:/Table/51/1/-33{30130…-12356…}] quiescing
I170902 16:49:20.877099 82 storage/replica.go:3111 [n1,s1,r119/2:/Table/51/1/-33{30130…-12356…}] unquiescing

The code in question is:

	if l := *r.mu.state.Lease; !l.OwnedBy(r.store.StoreID()) &&
		r.isLeaseValidRLocked(l, r.store.Clock().Now()) {
		if log.V(4) {
			log.Infof(ctx, "not quiescing: not leaseholder")
		}

Should that be || !r.isLeaseValidRLocked()? It's not clear to me why r.isLeaseValidRLocked is returning false, but it must be; otherwise we wouldn't be quiescing.

Since we're quiescing, we're not ticking the Raft group on n1. Not ticking the Raft group means we're never timing out the leadership transfer.
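
To make that interaction concrete, here is a toy model (invented names; this is not etcd/raft) of the two behaviors described above: the leader drops proposals while a leadership transfer is pending, and the only place a stalled transfer gets abandoned is the tick loop, so a quiesced leader that never ticks keeps dropping proposals indefinitely.

    package main

    import "fmt"

    // toyLeader models just the two behaviors discussed above; it is not etcd/raft.
    type toyLeader struct {
        leadTransferee  uint64 // 0 means no transfer in progress
        electionElapsed int
        electionTimeout int
    }

    // propose mirrors the leader dropping proposals while a transfer is pending.
    func (l *toyLeader) propose(cmd string) bool {
        if l.leadTransferee != 0 {
            fmt.Printf("transfer leadership to %d is in progress; dropping proposal %q\n",
                l.leadTransferee, cmd)
            return false
        }
        return true
    }

    // tick is the only place a stalled transfer is abandoned. A quiesced
    // replica never ticks, so the transfer is never abandoned and proposals
    // keep getting dropped.
    func (l *toyLeader) tick() {
        l.electionElapsed++
        if l.electionElapsed >= l.electionTimeout && l.leadTransferee != 0 {
            l.leadTransferee = 0
            l.electionElapsed = 0
        }
    }

    func main() {
        l := &toyLeader{leadTransferee: 3, electionTimeout: 10}
        fmt.Println("accepted:", l.propose("RequestLease")) // dropped while transferring
        for i := 0; i < 10; i++ {                           // ticking eventually abandons the transfer
            l.tick()
        }
        fmt.Println("accepted:", l.propose("RequestLease")) // accepted once the transfer is abandoned
    }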

petermattis (Collaborator) commented:

Ah, the lease for r119 is at epoch 33, but node liveness for n6 is at epoch 35. This is causing r.isLeaseValidRLocked() to return false.
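
Sketching the suggested fix against the snippet quoted above (this follows the || !r.isLeaseValidRLocked() suggestion in this thread; the actual commit may differ in detail): decline to quiesce unless the replica owns a currently valid lease, so a leader in this state keeps ticking and can recover.

    // Sketch only, not necessarily the exact patch: refuse to quiesce unless
    // this replica owns a valid lease.
    if l := *r.mu.state.Lease; !l.OwnedBy(r.store.StoreID()) ||
        !r.isLeaseValidRLocked(l, r.store.Clock().Now()) {
        if log.V(4) {
            log.Infof(ctx, "not quiescing: not leaseholder or lease invalid")
        }
        // ... decline to quiesce, as in the other "not quiescing" branches.
    }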

petermattis added a commit to petermattis/cockroach that referenced this issue Sep 2, 2017
Fix check for whether we own a valid lease.

Fixes cockroachdb#17741
bdarnell pushed a commit to bdarnell/cockroach that referenced this issue Sep 2, 2017
Fix check for whether we own a valid lease.

Fixes cockroachdb#17741
petermattis added a commit to petermattis/cockroach that referenced this issue Sep 5, 2017
Fix check for whether we own a valid lease.

Fixes cockroachdb#17741