
stability: raft issues after multiple days of load #5970

Closed
mberhault opened this issue Apr 10, 2016 · 12 comments
Labels
S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting

@mberhault
Contributor

Current build SHA: 2c32112
The beta cluster has been up for about a week now, with a few binary upgrades in that time.

I've had the photos app running pretty much the whole time (albeit slowed to a crawl), and recently restarted block_writer with tolerate_errors.

Right now the cluster is pretty much hosed; even the UI fails to query, returning deadline-exceeded errors.

At this point, all the nodes are spitting out errors such as:

W160410 21:32:51.112347 storage/raft_log_queue.go:116  storage/raft_log_queue.go:101: raft log's oldest index (0) is less than the first index (1092) for range 1364

As well as a lot of deadline exceeded errors:

E160410 21:32:53.183333 server/status.go:514  failed to send RPC: too many errors encountered (3 of 3 total): rpc error: code = 4 desc = context deadline exceeded

I'm attaching the logs, which include at least everything since the last restart, although some of the raft log index messages go back further.

The nodes are still running, but not doing much good.
node0.log.txt.gz
node1.log.txt.gz
node2.log.txt.gz
node3.log.txt.gz
node4.log.txt.gz

@mberhault changed the title from "stability: raft issues after multiple days of heavy load" to "stability: raft issues after multiple days of load" on Apr 10, 2016
@tbg
Member

tbg commented Apr 10, 2016

Does the cluster remain hosed when you stop all load generators? Do the various endpoints under /debug/ show anything interesting? In particular, look for long-running traces (maybe internal retries are piling up load on top of load).

Might also try some manual EXPLAIN (TRACE) SELECT * FROM <something> LIMIT 1.
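
For reference, a minimal sketch of issuing that trace statement from a Go client, assuming the standard lib/pq driver; the connection URL and the table name ("mytable", standing in for <something>) are placeholders, not values from this cluster:

package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // CockroachDB speaks the PostgreSQL wire protocol
)

func main() {
	// Connection URL and table name are illustrative placeholders.
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/mydb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// The trace statement suggested above; output shape varies by version,
	// so scan each row generically and print it.
	rows, err := db.Query(`EXPLAIN (TRACE) SELECT * FROM mytable LIMIT 1`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	cols, err := rows.Columns()
	if err != nil {
		log.Fatal(err)
	}
	vals := make([]interface{}, len(cols))
	ptrs := make([]interface{}, len(cols))
	for i := range vals {
		ptrs[i] = &vals[i]
	}
	for rows.Next() {
		if err := rows.Scan(ptrs...); err != nil {
			log.Fatal(err)
		}
		fmt.Println(vals...)
	}
}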


@mberhault
Contributor Author

Yup, the first thing I tried was shutting down the load generators; no help so far.
There is a bunch of active stuff, but not much visibility:

2016/04/10 18:21:29.388562  12734.285710    (*Replica).Send
18:21:29.388569  .     7    ... admin path
18:21:29.388574  .     5    ... split begins
18:21:29.388935  .   362    ... found split key
2016/04/10 18:21:29.378723  12734.295666    split
18:21:29.378735  .    12    ... request leader lease (attempt #1)
2016/04/10 18:19:54.960327  12828.714107    replicate
18:19:54.960339  .    12    ... request leader lease (attempt #1)

And some very long-running SQL, but also not much to see:

... skipping a few
2016/04/10 21:10:04.794624  2692.141457 
21:10:04.796261  .  1637    [redacted]
21:10:07.241128 2.444867    ... ERROR: failed to send RPC: too many errors encountered (3 of 3 total): rpc error: code = 4 desc = context deadline exceeded
2016/04/10 14:49:31.097265  25525.838880    
14:49:31.098626  .  1361    [redacted]
14:49:34.483643 3.385017    ... SELECT 0
2016/04/10 10:00:37.911235  42859.024998    
10:00:37.916731  .  5496    [redacted]
10:00:42.038170 4.121438    ... SELECT 0

@bdarnell
Contributor

Raft is not completely dead; new leader leases are still being granted (at least on some ranges). In particular I see that range 1 is able to grant leader leases.

The new snapshot logging is suspicious: we saw these every second or two for a while (grepped to just the snapshot lines):

I160410 15:57:41.217037 storage/replica_raftstorage.go:408  generated snapshot for range 935 at index 47810 in 548.084124ms. encoded size=16867493, 22760 KV pairs, 40760 log entries
I160410 15:57:42.225531 storage/replica_raftstorage.go:408  generated snapshot for range 281 at index 45998 in 799.799166ms. encoded size=16208299, 23037 KV pairs, 37278 log entries
I160410 15:57:42.993298 storage/replica_raftstorage.go:408  generated snapshot for range 492 at index 50170 in 702.356633ms. encoded size=17201984, 23072 KV pairs, 42687 log entries
I160410 15:57:44.375823 storage/replica_raftstorage.go:408  generated snapshot for range 935 at index 47842 in 564.862804ms. encoded size=16873189, 22760 KV pairs, 40792 log entries
I160410 15:57:45.374257 storage/replica_raftstorage.go:408  generated snapshot for range 281 at index 46030 in 959.487556ms. encoded size=16213995, 23037 KV pairs, 37310 log entries

Then at 15:57:45, the snapshot messages abruptly stop; there are none in the rest of the log (node0.log.txt.gz, which contains six more hours).

The total size of the snapshots is 16MB, which is not enough to cause memory problems (this cluster is configured to split at 8MB, right? so it's above target but not too bad). The time taken to generate the snapshots, however, is a problem. If the raft thread is blocked for nearly a second at a time it's not going to be able to maintain its leadership.
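
To make that kind of stall visible, one could wrap each unit of work on the raft processing goroutine with a timing check. A minimal, self-contained sketch with illustrative names and thresholds (this is not CockroachDB code):

package main

import (
	"log"
	"time"
)

// warnIfSlow wraps a unit of work on a single-threaded event loop (such as a
// raft processing goroutine) and logs when it exceeds a budget. A 500ms+
// snapshot generation inside such a loop would show up immediately.
func warnIfSlow(name string, budget time.Duration, fn func()) {
	start := time.Now()
	fn()
	if elapsed := time.Since(start); elapsed > budget {
		log.Printf("%s took %s (budget %s); raft heartbeats may be delayed",
			name, elapsed, budget)
	}
}

func main() {
	// Example: with a heartbeat interval around 200ms, any single step near
	// that length risks missed heartbeats and spurious elections.
	warnIfSlow("generate snapshot", 200*time.Millisecond, func() {
		time.Sleep(550 * time.Millisecond) // stand-in for a slow snapshot
	})
}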

Snapshot messages are much less frequent in the other nodes' logs, but in any case none of them generate snapshots in the last several hours. And none of them appear to have been receiving any snapshots recently either, even when node 0 was generating them so frequently. The snapshots must be getting dropped somewhere that we don't have logging: maybe the call to canApplySnapshot in Store.handleRaftMessage?

In the goroutine stack traces I see a few threads sending RangeLookup RPCs. There have been some changes in that area recently; could something be getting stuck there? I also see that the processRaft goroutine is (mostly) idle, so whatever stopped the snapshots wasn't a deadlock.

@tbg
Member

tbg commented Apr 11, 2016

One of the nodes has a ~60MB goroutine dump, mostly thanks to 20k goroutines in RaftMessage:

[...]
19440: select [995 minutes] [Created by grpc.(*Server).serveStreams.func1 @ server.go:324]
    storage   raft_transport.go:132        (*RaftTransport).RaftMessage(#314, #162912, *, 0, 0)
    storage   raft.pb.go:155               _MultiRaft_RaftMessage_Handler(#6, #314, #162911, *, 0, 0)
    grpc      server.go:575                (*Server).processStreamingRPC(#244, #162910, *, *, #217, #30, 0, 0, 0)
    grpc      server.go:650                (*Server).handleStream(#244, #162910, *, *, 0)
    grpc      server.go:323                (*Server).serveStreams.func1.1(*, #244, #162910, *, *)
[...]
25: select [981 minutes] [Created by stop.(*Stopper).RunAsyncTask @ stopper.go:182]
    storage   replica.go:492               (*Replica).requestLeaderLease.func1.1(*, #162893, *, 0x618011)
    storage   replica.go:455               (*Replica).requestLeaderLease.func1()
    stop      stopper.go:181               (*Stopper).RunAsyncTask.func1(#854, #26, 0x12, 0x200, *)
[...]

@mberhault
Contributor Author

Should I keep this running? I'd like to at least restart the servers and throw more load at them, but if you still want to poke around I can leave it be; I'm in the middle of bringing up a GCE beta cluster anyway.

@bdarnell
Contributor

It's fine with me to restart them; I've gotten what I think I can from the running instances. Let's restart them one at a time and see if restarting one node is sufficient to get things unwedged or if we need to cycle them all.

Those RaftMessage goroutines on node 2 are unexpected, as are the counterpart workers:

19439 @ 0x5fc5d3 0x60b3d7 0x60a932 0xf639e6 0xf64411 0x6859a6 0x685ae2 0xd6b829 0xd6c6f5 0xd776e4 0x83e61e 0x8a312b 0x8a33e8 0x8e11c2 0x62d6c1
#       0xf639e6        google.golang.org/grpc/transport.(*recvBufferReader).Read+0x7e6                         /go/src/google.golang.org/grpc/transport/transport.go:141
#       0xf64411        google.golang.org/grpc/transport.(*Stream).Read+0x71                                    /go/src/google.golang.org/grpc/transport/transport.go:294
#       0x6859a6        io.ReadAtLeast+0xe6                                                                     /usr/local/go/src/io/io.go:297
#       0x685ae2        io.ReadFull+0x62                                                                        /usr/local/go/src/io/io.go:315
#       0xd6b829        google.golang.org/grpc.(*parser).recvMsg+0xb9                                           /go/src/google.golang.org/grpc/rpc_util.go:218
#       0xd6c6f5        google.golang.org/grpc.recv+0x45                                                        /go/src/google.golang.org/grpc/rpc_util.go:302
#       0xd776e4        google.golang.org/grpc.(*serverStream).RecvMsg+0xe4                                     /go/src/google.golang.org/grpc/stream.go:411
#       0x83e61e        github.com/cockroachdb/cockroach/storage.(*multiRaftRaftMessageServer).Recv+0x7e        /go/src/github.com/cockroachdb/cockroach/storage/raft.pb.go:174
#       0x8a312b        github.com/cockroachdb/cockroach/storage.(*RaftTransport).RaftMessage.func1.1.1+0x4b    /go/src/github.com/cockroachdb/cockroach/storage/raft_transport.go:111
#       0x8a33e8        github.com/cockroachdb/cockroach/storage.(*RaftTransport).RaftMessage.func1.1+0x48      /go/src/github.com/cockroachdb/cockroach/storage/raft_transport.go:128
#       0x8e11c2        github.com/cockroachdb/cockroach/util/stop.(*Stopper).RunWorker.func1+0x52              /go/src/github.com/cockroachdb/cockroach/util/stop/stopper.go:139

The client-side counterparts to these goroutines are nowhere to be seen. It looks like a client-side stream is not getting shut down correctly even though its goroutine is exiting.
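
A sketch of the client-side lifecycle discipline whose absence is suspected here, using placeholder types rather than the real RaftTransport code:

package transport

import (
	"context"
)

// raftStream abstracts the client half of a streaming RPC such as
// MultiRaft.RaftMessage; the concrete type would come from generated gRPC
// code. The interface exists only to keep this sketch self-contained.
type raftStream interface {
	Send(msg []byte) error
	CloseSend() error
}

// runSender shows the pattern: cancel the stream's context (or call
// CloseSend) whenever the sending goroutine exits, so the server side's
// Recv() returns instead of blocking forever and accumulating goroutines
// like the ones in the dump above.
func runSender(ctx context.Context, open func(context.Context) (raftStream, error), outgoing <-chan []byte) error {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // guarantees the server-side Recv unblocks with an error

	stream, err := open(ctx)
	if err != nil {
		return err
	}
	defer stream.CloseSend() // explicit half-close on the normal exit path

	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case msg, ok := <-outgoing:
			if !ok {
				return nil // producer finished; deferred cleanup closes the stream
			}
			if err := stream.Send(msg); err != nil {
				return err
			}
		}
	}
}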

@mberhault
Contributor Author

OK, restarts didn't help much.
We have long-running operations on all nodes failing because they are unable to look up ranges or acquire leader leases.

2016/04/11 18:15:17.923772  5.653404    txn coordinator
18:15:17.923907  .   136    ... meta descriptor lookup
18:15:17.924025  .   118    ... sending RPC to [/Min, /System/"tsd\x12cr.node.clock-offset.upper-bound-nanos\x00\x01\x89\xf8\x060J3")
18:15:20.924308 3.000283    ... error, trying next peer
2016/04/11 18:13:02.974418  140.602951  txn coordinator
18:13:02.974454  .    36    ... meta descriptor lookup
18:13:02.975664  .  1210    ... sending RPC to [/Table/51/1/8251526609206106940/"6a533d7c-76c2-4d79-b838-7c88c7243432"/264423, /Table/51/1/8269692506984464053/"6a533d7c-76c2-4d79-b838-7c88c7243432"/306510)
18:13:05.976438 3.000774    ... error, trying next peer
18:13:08.976636 3.000198    ... error, trying next peer
18:15:15.235423 3.000230    ... (82 events discarded)
18:15:18.235644 3.000221    ... error, trying next peer
18:15:21.235867 3.000223    ... error, trying next peer
18:15:21.237287  .  1420    ... reply error: *roachpb.RangeNotFoundError
18:15:21.237338  .    51    ... meta descriptor lookup
18:15:21.237410  .    73    ... sending RPC to [/Table/51/1/8251526609206106940/"6a533d7c-76c2-4d79-b838-7c88c7243432"/264423, /Table/51/1/8269692506984464053/"6a533d7c-76c2-4d79-b838-7c88c7243432"/306510)
2016/04/11 18:09:50.952033  332.625870  replicate
18:09:50.952052  .    20    ... request leader lease (attempt #1)
18:09:54.715607 3.763555    ... request leader lease (attempt #2)
18:09:56.919720 2.204113    ... request leader lease (attempt #3)
18:09:59.103851 2.184132    ... request leader lease (attempt #4)
18:12:16.245212 2.245566    ... (45 events discarded)
18:12:18.684175 2.438964    ... request leader lease (attempt #50)
18:12:21.851104 3.166929    ... request leader lease (attempt #51)
18:12:24.999594 3.148490    ... request leader lease (attempt #52)
18:12:27.553002 2.553408    ... request leader lease (attempt #53)
18:12:31.456912 3.903910    ... request leader lease (attempt #54)
2016/04/11 18:01:54.798955  808.779460  replica consistency checker
18:01:54.798998  .    42    ... request leader lease (attempt #1)
2016/04/11 18:00:28.766352  894.812152  split
18:00:28.766391  .    39    ... request leader lease (attempt #1)

Lots of things are being thrown into the logs:

W160411 18:15:46.659002 gossip/gossip.go:889  first range unavailable or cluster not initialized
W160411 18:15:46.890878 storage/raft_log_queue.go:116  storage/raft_log_queue.go:101: raft log's oldest index (0) is less than the first index (9187) for range 660
W160411 18:15:46.945653 server/node.go:598  error recording status summaries: failed to send RPC: too many errors encountered (3 of 3 total): rpc error: code = 4 desc = context deadline exceeded
W160411 18:15:46.945817 ts/db.go:98  error writing time series data: failed to send RPC: too many errors encountered (3 of 3 total): rpc error: code = 4 desc = context deadline exceeded
W160411 18:15:47.245894 server/node.go:665  [node=1] unable to log node_restart event for node 1: failed to send RPC: too many errors encountered (3 of 3 total): rpc error: code = 4 desc = context deadline exceeded

@bdarnell
Contributor

Earlier I mentioned #5911 (a change to range descriptor lookups) as a possible cause of this problem, but looking back at the git log it was not present in 2c32112 where this issue was first reported, so it's not responsible. Is this a new bug since beta-20160407, or has it been around a while? (If it's new we may not want to publish a new beta this week). Looking at git log beta-20160407..2c32112, the most suspicious changes that could potentially produce these symptoms are gossip-related: #5901 and its follow-ups.

Gossip looks healthy in /_status/gossip/local: the cluster-id and the first range descriptor are present on all nodes (except node 4 which has crashed). The sentinel is not present, though, which means that none of the nodes holding a replica for range 1 have been able to get a lease. Other ranges, however, have been able to get leases.
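
For reference, a rough way to repeat that gossip check from a script; the admin HTTP port and the exact key names in the response are assumptions, so treat this as a sketch rather than a precise tool:

package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"strings"
)

func main() {
	// Assumed admin HTTP port; adjust for the node being inspected.
	resp, err := http.Get("http://localhost:8080/_status/gossip/local")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	// Key names here are assumptions; the point is just a quick presence
	// check for the cluster ID, first-range descriptor, and sentinel entries.
	for _, key := range []string{"cluster-id", "first-range", "sentinel"} {
		fmt.Printf("%-12s present: %v\n", key, strings.Contains(string(body), key))
	}
}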

I'm going to restart some nodes with raft=1 added to --vmodule to see if we can figure out anything about what's going on with raft elections. I'm leaving node 4 down per @tschottdorf's request in #6000.

@bdarnell
Contributor

Some relevant logs from node 0 (I've turned off raft=1 because it's incredibly verbose):

I160413 05:12:15.658024 raft/raft.go:469  [group 1] 6 became follower at term 10154
I160413 05:12:15.658214 raft/raft.go:238  [group 1] newRaft 6 [peers: [4,6,7], term: 10154, commit: 46548, applied: 46548, lastindex: 328170, lastterm: 182]
I160413 05:12:16.788710 raft/raft.go:571  [group 1] 6 [term: 10154] received a MsgVote message with higher term from 7 [term: 10210]
I160413 05:12:16.788732 raft/raft.go:469  [group 1] 6 became follower at term 10210
I160413 05:12:16.788915 raft/raft.go:807  [group 1] 6 [logterm: 182, index: 328170, vote: 0] rejected vote from 7 [logterm: 99, index: 46548] at term 10210
I160413 05:12:22.220802 raft/raft.go:549  [group 1] 6 is starting a new election at term 10210
I160413 05:12:22.220847 raft/raft.go:482  [group 1] 6 became candidate at term 10211
I160413 05:12:22.220864 raft/raft.go:531  [group 1] 6 received vote from 6 at term 10211
I160413 05:12:22.221215 raft/raft.go:524  [group 1] 6 [logterm: 182, index: 328170] sent vote request to 4 at term 10211
I160413 05:12:22.221479 raft/raft.go:524  [group 1] 6 [logterm: 182, index: 328170] sent vote request to 7 at term 10211
I160413 05:12:26.411202 raft/raft.go:571  [group 1] 6 [term: 10211] received a MsgVote message with higher term from 7 [term: 10213]
I160413 05:12:26.411552 raft/raft.go:469  [group 1] 6 became follower at term 10213
I160413 05:12:26.412114 raft/raft.go:807  [group 1] 6 [logterm: 182, index: 328170, vote: 0] rejected vote from 7 [logterm: 99, index: 46548] at term 10213

Node 0 is replica ID 6 for range 1. Replica ID 4 is node 4, the one that is down. I'm not sure which node is replica ID 7. We see in this line

I160413 05:12:16.788915 raft/raft.go:807  [group 1] 6 [logterm: 182, index: 328170, vote: 0] rejected vote from 7 [logterm: 99, index: 46548] at term 10210

that node 0 is far ahead of replica 7 in the logs, so replica 7 cannot become leader. However, it keeps timing out, starting an election and incrementing its term faster than node 0 can complete an election and become leader.

If the processRaft goroutine were getting blocked, so that the ticker wasn't getting handled in a timely fashion, that would probably produce these results. But it looks like the processRaft goroutine is spending most of its time idling in its select loop, so that doesn't appear to be the case. (#6008 and #6012 could definitely contribute to problems here, but we don't see evidence that they are doing so in this case.) It's also possible that the extremely verbose logging is distorting what I'm looking at; the processRaft goroutine may never have become idle due to the logging, causing different failures in this configuration than are seen in regular use.

Whatever's going on, we're having a lot of contested elections; it's not just range 1. During the brief time that this logging was turned on, this node started 22140 elections and won 62 of them. This build includes a change in upstream raft that was intended to reduce the likelihood of contested elections.

Two-phase elections (aka the PreVote RPC) in raft would mitigate the pathological behavior here. Increasing the random factor after a failed election should also help (and would be simpler). For a quick fix to bring the cluster back up we should be able to increase the tick interval and/or the raft election timeout.
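
A sketch of those two mitigations expressed as an etcd/raft configuration, using the field names of the current upstream API; whether the raft version vendored at the time of this issue exposed PreVote is not something this sketch asserts:

package main

import (
	"go.etcd.io/etcd/raft/v3"
)

// newRaftConfig shows the two knobs discussed above: a larger election
// timeout (in ticks) so slow followers don't start disruptive elections as
// eagerly, and PreVote so a candidate polls peers before incrementing its
// term. A replica with a stale log (like replica 7 here) then can no longer
// push everyone's term up without being able to win.
func newRaftConfig(id uint64, storage raft.Storage) *raft.Config {
	return &raft.Config{
		ID:              id,
		Storage:         storage,
		ElectionTick:    15, // e.g. 15 ticks instead of the usual 10
		HeartbeatTick:   3,
		PreVote:         true,
		MaxSizePerMsg:   1 << 20,
		MaxInflightMsgs: 256,
	}
}

func main() {
	storage := raft.NewMemoryStorage()
	_ = newRaftConfig(1, storage)
}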

bdarnell added a commit to bdarnell/cockroach that referenced this issue Apr 22, 2016
Blocking the processRaft goroutine for too long is problematic. In
extreme cases it can cause heartbeats to be missed and new elections to
start (a major cause of cockroachdb#5970). This commit moves the work of snapshot
generation to an asynchronous goroutine.

Fixes cockroachdb#6204.
bdarnell added a commit to bdarnell/cockroach that referenced this issue Apr 25, 2016
Blocking the processRaft goroutine for too long is problematic. In
extreme cases it can cause heartbeats to be missed and new elections to
start (a major cause of cockroachdb#5970). This commit moves the work of snapshot
generation to an asynchronous goroutine.

Fixes cockroachdb#6204.
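
A minimal, self-contained sketch of the idea in this commit message, with illustrative names rather than the actual CockroachDB code: snapshot generation runs on its own goroutine, so the raft loop keeps ticking and heartbeating while the snapshot is built.

package main

import (
	"fmt"
	"time"
)

type snapshot struct {
	rangeID int64
	data    []byte
}

// generateSnapshot stands in for the expensive encoding work that was
// previously done inline on the raft processing goroutine.
func generateSnapshot(rangeID int64) snapshot {
	time.Sleep(500 * time.Millisecond)
	return snapshot{rangeID: rangeID, data: []byte("...")}
}

func main() {
	ready := make(chan snapshot, 1)
	tick := time.NewTicker(200 * time.Millisecond) // stand-in raft tick interval
	defer tick.Stop()

	// Kick off snapshot generation asynchronously so the loop below never blocks.
	go func() { ready <- generateSnapshot(1) }()

	for i := 0; i < 5; i++ {
		select {
		case <-tick.C:
			// Heartbeats keep flowing while the snapshot is being generated.
			fmt.Println("tick: heartbeats sent on time")
		case snap := <-ready:
			fmt.Printf("snapshot for range %d ready (%d bytes), sending to follower\n",
				snap.rangeID, len(snap.data))
		}
	}
}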
@bdarnell
Contributor

This was looking a lot better yesterday after the snapshot changes, but it's having problems again today. It looks like we're still getting into cycles where nodes push each other's terms up without being able to complete elections.

@tamird tamird added the S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting label May 5, 2016
@petermattis petermattis added this to the Q3 milestone Jul 11, 2016
@tamird
Contributor

tamird commented Sep 9, 2016

Is there anything left to investigate here? Seems stale to me.

@bdarnell
Contributor

bdarnell commented Sep 9, 2016

Yeah, I don't think there's anything useful here any more.

@bdarnell bdarnell closed this as completed Sep 9, 2016