stability: Zero QPS on cyan #17741
The cyan (continuous deployment) cluster has been seeing zero qps for most of the last four hours.
The start of the problems corresponds to an upgrade spanning the following range of commits (more than we'd like to see in continuous deployment, but the CI server was having trouble): 8cf2141...c13339d

Comments
In node 1's logs, just before everything goes to zero, we see lots of RPCs (both reads and writes) hanging on a handful of different ranges:
During the zero-qps periods, everything is pretty stable:
The logs on nodes 2-4 look similar, at least on a first pass. Nodes 5 and 6 have a lot of quiescence-related log spam during the zero-qps periods (and before), involving the same ranges as the other nodes' hanging RPCs:
Both nodes 5 and 6 had this "not quiescing" log spam in the 16:00 and 18:00 hours, but only node 6 had it in the 19:00 hour (and everything looks fine in the 17:00 hour). Looking further back in the logs, it looks like something similar happened on 8/15, on node 4 this time, so the commit range above may be a red herring.
I'm just looking at https://cockroach-cyan-0001.crdb.io:8080/#/reports/range/1 and have seen multiple term+leader changes just refreshing a bunch of times.
After I filed this issue, it stopped happening. So while there's definitely some issue to be investigated here, it's not as acute as it appeared on Thursday. Cyan is in trouble again right now (after an upgrade spanning these commits), with a slightly different failure mode: now the nodes are unable to update their liveness heartbeats.
Raft ticks got to be very irregular around this time. This persisted for a while after the cluster recovered, then went away on its own. It's unclear why (maybe #17617?).
The only way #17617 could be having an effect here is if […]. Hmm, we do hold […]
You can make sure by checking all the goroutine dumps for the relevant stack containing […]
or even better, a […]
OK. I'll try to catch a stack dump next time it happens, then (I didn't save one from last time).
Stack dumps attached. No occurrence of maybeAcquireProposalQuota or timer.go.
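For reference, a goroutine dump like the ones attached here can be captured over HTTP; below is a minimal Go sketch, assuming the node exposes Go's standard `/debug/pprof` handlers on its HTTP port (scheme, host, and port are placeholders to adjust for the cluster in question):

```go
// A minimal sketch of capturing a goroutine dump from a running node,
// assuming the node serves Go's standard /debug/pprof handlers on its
// HTTP port (scheme, host, and port below are placeholders).
package main

import (
	"io"
	"net/http"
	"os"
)

func main() {
	// debug=2 prints the full stack of every goroutine, which is what you
	// want when grepping for a specific frame such as
	// maybeAcquireProposalQuota.
	resp, err := http.Get("http://localhost:8080/debug/pprof/goroutine?debug=2")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	if _, err := io.Copy(os.Stdout, resp.Body); err != nil {
		panic(err)
	}
}
```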
I've reproduced this on […]. The leaseholder (node 5) has a lease request that is endlessly reproposing (for […]):
Node 6 also has an endlessly-looping lease request:
Node 1 (the raft leader) doesn't have anything related to r680 in its active requests except requests that it is forwarding to node 5 (so it does, or at least did, recognize node 5's lease despite being the node that is marked as "no lease" in the range debug page).
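To clarify what "endlessly reproposing" looks like mechanically, here is a hypothetical sketch (all names invented, not CockroachDB's actual proposal code): a command whose Raft proposal never commits gets re-submitted on every refresh interval, so the originating request hangs indefinitely while the logs fill with reproposals.

```go
// Hypothetical sketch of an "endlessly reproposing" command: if the
// proposal never commits -- for example because the Raft group is
// quiesced or leaderless -- it is re-submitted every refresh interval.
package main

import (
	"context"
	"fmt"
	"time"
)

type proposal struct {
	committed <-chan struct{} // closed once the command has applied
	data      []byte
}

// waitForCommand blocks until the proposal applies, reproposing it each
// time the refresh interval elapses without a commit.
func waitForCommand(ctx context.Context, propose func([]byte) error, p proposal, refresh time.Duration) error {
	ticker := time.NewTicker(refresh)
	defer ticker.Stop()
	for {
		select {
		case <-p.committed:
			return nil // the command finally applied
		case <-ticker.C:
			// Not committed in time: propose it again and keep waiting.
			if err := propose(p.data); err != nil {
				return err
			}
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

func main() {
	committed := make(chan struct{}) // never closed: the proposal never applies
	p := proposal{committed: committed, data: []byte("lease request")}
	propose := func(b []byte) error {
		fmt.Printf("reproposing %q\n", b)
		return nil
	}
	// Short timeout so the example terminates; the real hang has none.
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	_ = waitForCommand(ctx, propose, p, 200*time.Millisecond)
}
```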
We're seeing this same problem on […]
The Range debug page for […]. So […]
With […]
Shortly after on […]
This isn't good, […]
The code in question is:
Should that be […]? Since we're quiescing, we're not ticking the Raft group on […]
Ah, the lease for […]
Fix check for whether we own a valid lease. Fixes cockroachdb#17741
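To make the distinction in that fix's title concrete — "owning the lease" versus "owning a valid lease" — here is a deliberately simplified, hypothetical sketch (not the actual CockroachDB code): if logic that decides whether a quiesced replica can stay asleep uses the first check where it needs the second, an expired lease never gets renewed, nothing ticks the Raft group, and requests on the range hang.

```go
// Hypothetical, simplified illustration (not CockroachDB's actual code)
// of "we own the lease" versus "we own a valid lease". A quiesced
// replica relying on the first check sees no reason to wake up even
// after its lease has expired.
package main

import (
	"fmt"
	"time"
)

type lease struct {
	ownerStoreID int       // store that holds the lease
	expiration   time.Time // when the lease stops being usable
}

// Buggy check: ownership alone says nothing about whether the lease is
// still usable.
func ownsLease(l lease, storeID int) bool {
	return l.ownerStoreID == storeID
}

// Fixed check: we must own the lease *and* it must not have expired yet.
func ownsValidLease(l lease, storeID int, now time.Time) bool {
	return l.ownerStoreID == storeID && now.Before(l.expiration)
}

func main() {
	const myStoreID = 5
	expired := lease{ownerStoreID: myStoreID, expiration: time.Now().Add(-time.Minute)}

	// The buggy check reports true for an expired lease, so a quiesced
	// replica relying on it would happily keep sleeping.
	fmt.Println(ownsLease(expired, myStoreID))                  // true
	fmt.Println(ownsValidLease(expired, myStoreID, time.Now())) // false
}
```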