stability: raft issues after multiple days of load #5970
Does the cluster remain hosed when you stop all load generators? Do the […]? Might also try some manual […].

-- Tobias
yup, first thing I tried was shutting down the load generators, no help so far.
And some very long-running SQL, but also not much:
Raft is not completely dead; new leader leases are still being granted (at least on some ranges). In particular I see that range 1 is able to grant leader leases. The new snapshot logging is suspicious: we saw these every second or two for a while (grepped to just the snapshot lines):
Then at 15:15:45, the snapshot messages abruptly stop; there are none in the rest of the log (node0.log.txt.gz, which contains six more hours). The total size of the snapshots is 16MB, which is not enough to cause memory problems (this cluster is configured to split at 8MB, right? so it's above target but not too bad). The time taken to generate the snapshots, however, is a problem: if the raft thread is blocked for nearly a second at a time, it's not going to be able to maintain its leadership.

Snapshot messages are much less frequent in the other nodes' logs, but in any case none of them generate snapshots in the last several hours. None of them appear to be receiving any snapshots recently either, even when node 0 was generating them so frequently. The snapshots must be getting dropped somewhere that we don't have logging: maybe the call to […].

In the goroutine stack traces I see a few threads sending RangeLookup RPCs. There have been some changes in that area recently; could something be getting stuck there? I also see that the […]
One of the nodes has a ~60MB goroutine dump, mostly thanks to 20k goroutines in […].
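(For reference, a dump like that can be produced with Go's standard pprof facilities; this is a generic sketch, not the CockroachDB debug endpoint, and the address is made up.)

```go
package main

import (
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers
	"os"
	"runtime/pprof"
)

func main() {
	// Write full stacks for every goroutine to stdout (debug level 2).
	_ = pprof.Lookup("goroutine").WriteTo(os.Stdout, 2)

	// Or expose the same data over HTTP at /debug/pprof/goroutine?debug=2.
	_ = http.ListenAndServe("localhost:6060", nil)
}
```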
Should I keep this running? I'd like to at least restart the servers and throw more load at it. But if you still want to poke around, I can leave it be; I'm in the middle of bringing up a GCE beta cluster anyway.
It's fine with me to restart them; I've gotten what I think I can from the running instances. Let's restart them one at a time and see if restarting one node is sufficient to get things unwedged or if we need to cycle them all. Those […]
The client-side counterparts to these goroutines are nowhere to be seen. It looks like a client-side stream is not getting shut down correctly even though its goroutine is exiting.
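For what it's worth, "shutting down a client-side stream correctly" in gRPC-Go comes down to cancelling the stream's context or draining it to EOF. A hedged sketch against a hypothetical generated client (pb.SomeServiceClient, Watch, and the import paths are placeholders, not CockroachDB RPCs):

```go
package main

import (
	"io"

	"golang.org/x/net/context" // context package used by gRPC-Go circa 2016

	pb "example.com/placeholder/servicepb" // hypothetical generated gRPC code
)

// watchOnce shows the shutdown discipline: without the deferred cancel (or a
// Recv-until-EOF), the server-side stream goroutine can be left running even
// after the client-side goroutine has returned.
func watchOnce(ctx context.Context, client pb.SomeServiceClient) error {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // tears down the stream so the server goroutine can exit

	stream, err := client.Watch(ctx, &pb.WatchRequest{}) // hypothetical server-streaming RPC
	if err != nil {
		return err
	}
	for {
		if _, err := stream.Recv(); err != nil {
			if err == io.EOF {
				return nil // stream drained cleanly
			}
			return err
		}
	}
}
```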
Ok, restarts didn't help much.
Lots of things being thrown into the logs:
Earlier I mentioned #5911 (a change to range descriptor lookups) as a possible cause of this problem, but looking back at the git log it was not present in 2c32112, where this issue was first reported, so it's not responsible.

Is this a new bug since beta-20160407, or has it been around a while? (If it's new we may not want to publish a new beta this week.)

Looking at […]

Gossip looks healthy in […]

I'm going to restart some nodes with […]
Some relevant logs from node 0 (I've turned off […]):
Node 0 is replica ID 6 for range 1. Replica ID 4 is node 4, the one that is down. I'm not sure which node is replica ID 7.

We see in this line […] that node 0 is far ahead of replica 7 in the logs, so replica 7 cannot become leader. However, it keeps timing out, starting an election and incrementing its term faster than node 0 can complete an election and become leader. If the […].

Whatever's going on, we're having a lot of contested elections; it's not just range 1. During the brief time that this logging was turned on, this node started 22140 elections and won 62 of them. This build includes a change in upstream raft that was intended to reduce the likelihood of contested elections. Two-phase elections (aka the PreVote RPC) in raft would mitigate the pathological behavior here. Increasing the random factor after a failed election should also help (and would be simpler). For a quick fix to bring the cluster back up we should be able to increase the tick interval and/or the raft election timeout.
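For illustration, these are the knobs in terms of the etcd/raft Config (the values are invented, PreVote may not have been available in the vendored raft at the time, and the import path reflects a later etcd release):

```go
package storage

import "go.etcd.io/etcd/raft/v3" // illustrative; the vendored package then was github.com/coreos/etcd/raft

func newRaftConfig(id uint64, storage raft.Storage) *raft.Config {
	return &raft.Config{
		ID:      id,
		Storage: storage,

		// The election timeout is ElectionTick * tick interval. Raising either
		// gives a briefly-stalled leader more slack before followers campaign.
		ElectionTick:  15, // invented value, larger than the usual 10
		HeartbeatTick: 1,

		// Two-phase elections: a candidate first asks peers whether it could
		// win before bumping its term, which keeps a lagging replica like
		// replica 7 above from endlessly inflating terms.
		PreVote: true,

		MaxSizePerMsg:   1 << 20,
		MaxInflightMsgs: 256,
	}
}
```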
Blocking the processRaft goroutine for too long is problematic. In extreme cases it can cause heartbeats to be missed and new elections to start (a major cause of cockroachdb#5970). This commit moves the work of snapshot generation to an asynchronous goroutine. Fixes cockroachdb#6204.
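A minimal sketch of the shape of that change, with invented names (store, generateSnapshot, and snapshotReady are placeholders, not the actual CockroachDB code): the slow snapshot generation runs on its own goroutine, and only the finished result is handed back to the single raft-processing goroutine, so heartbeats are no longer delayed.

```go
package storage

import "log"

// snapshot is a stand-in for the serialized raft snapshot data.
type snapshot struct {
	rangeID int64
	data    []byte
}

type store struct {
	snapshotReady chan snapshot // drained by the single raft-processing goroutine
	stopper       chan struct{} // closed on shutdown
}

// generateSnapshot stands in for the slow engine iteration that used to run
// inline on the raft goroutine for close to a second per snapshot.
func (s *store) generateSnapshot(rangeID int64) (snapshot, error) {
	return snapshot{rangeID: rangeID}, nil
}

// maybeSendSnapshotAsync moves snapshot generation off the raft goroutine so
// that a slow snapshot can no longer block heartbeats and trigger elections.
func (s *store) maybeSendSnapshotAsync(rangeID int64) {
	go func() {
		snap, err := s.generateSnapshot(rangeID)
		if err != nil {
			log.Printf("range %d: snapshot generation failed: %v", rangeID, err)
			return
		}
		// Hand the finished snapshot back; raft state itself is still only
		// touched from the raft-processing goroutine.
		select {
		case s.snapshotReady <- snap:
		case <-s.stopper:
		}
	}()
}
```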
This was looking a lot better yesterday after the snapshot changes, but it's having problems again today. It looks like we're still getting into cycles where nodes are pushing each others' terms up without being able to complete elections.
Is there anything left to investigate here? Seems stale to me.
Yeah, I don't think there's anything useful here any more.
Current build SHA: 2c32112
The beta cluster has been up for about a week now, with a few binary upgrades since then.
I've had photos running pretty much the whole time (albeit slowed down to a crawl), and recently restarted the block_writer with tolerate_errors.
Right now, the cluster is pretty much hosed; even UI queries are failing with deadline exceeded.
At this point, all the nodes are spitting out errors such as:
As well as a lot of deadline exceeded errors:
I'm attaching the logs, which include at least the info since the last restart, although some of the raft log index messages go back further.
The nodes are still running, but not doing much good.
node0.log.txt.gz
node1.log.txt.gz
node2.log.txt.gz
node3.log.txt.gz
node4.log.txt.gz