stability: 20k+ goroutines deadlocked on beginCmd, no more progress #5368

Closed
tbg opened this issue Mar 18, 2016 · 4 comments

tbg commented Mar 18, 2016

(to be updated shortly; want an issue number)

@tbg tbg added this to the Beta milestone Mar 18, 2016
@tbg tbg changed the title stability: 100k+ goroutines deadlocked on beginCmd, no more progress stability: 20k+ goroutines deadlocked on beginCmd, no more progress Mar 18, 2016
tbg commented Mar 18, 2016

Tasks running on the offending node:

40539  server/node.go:701
1      storage/queue.go:380

Somehow all servers and clients appear healthy (regular leader lease activity as well), but nothing is really making progress. Losing a Raft command could explain some of that (it would block all client commands but not internal activity), I suppose.
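
For illustration only (this is not the actual beginCmds/command-queue code; `beginCmd`, `cmd`, and `done` are hypothetical names), here is a minimal Go sketch of that failure shape: later overlapping commands wait on the completion channel of an earlier command, so one command whose Raft proposal is lost wedges everything queued behind it.

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical sketch, not the real command queue: a command waits on the
// completion channels of earlier overlapping commands before it may execute.
type cmd struct {
	done chan struct{} // closed when the command finishes applying
}

func beginCmd(prereqs []*cmd) *cmd {
	for _, p := range prereqs {
		<-p.done // if a prerequisite never completes, we block here forever
	}
	return &cmd{done: make(chan struct{})}
}

func main() {
	// A command whose Raft proposal was lost: its done channel never closes.
	lost := &cmd{done: make(chan struct{})}

	// Every later overlapping command piles up inside beginCmd, which is the
	// shape of the thousands of blocked goroutines in the dumps.
	for i := 0; i < 5; i++ {
		go func(i int) {
			beginCmd([]*cmd{lost})
			fmt.Println("command", i, "ran") // never reached
		}(i)
	}

	time.Sleep(time.Second)
	fmt.Println("all later commands are still stuck behind the lost command")
}
```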

grep -c beginCmds *.goroutine
big_guy.ec2-54-208-199-194.compute-1.amazonaws.com.goroutine:24848
ec2-54-209-133-121.compute-1.amazonaws.com.goroutine:1
ec2-54-209-150-36.compute-1.amazonaws.com.goroutine:0
ec2-54-209-69-52.compute-1.amazonaws.com.goroutine:2

net/trace shows over 40k active traces (and growing, which makes sense since DistSender uses hard timeouts and then retries). Queue activity is "boring": only the Raft log queue ran (no GC has taken place yet).
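
As an aside, a hedged sketch of the retry pattern that would make the trace count grow; `sendWithRetries` and the timeout values are made up for illustration and are not the actual DistSender code.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Illustrative only: a DistSender-style loop with a hard per-attempt timeout.
// Each attempt that hits the timeout is abandoned and retried; if the
// server-side work never finishes, every abandoned attempt can leave an
// active trace behind, consistent with the growing net/trace count above.
func sendWithRetries(ctx context.Context, send func(context.Context) error) error {
	for {
		attemptCtx, cancel := context.WithTimeout(ctx, 100*time.Millisecond)
		err := send(attemptCtx)
		cancel()
		if err == nil {
			return nil
		}
		if ctx.Err() != nil {
			return ctx.Err() // the caller's own deadline expired; give up
		}
		// Per-attempt deadline hit: retry, leaving the previous attempt behind.
	}
}

func main() {
	stuck := func(ctx context.Context) error {
		<-ctx.Done() // pretend the RPC never completes
		return ctx.Err()
	}
	ctx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
	defer cancel()
	fmt.Println(sendWithRetries(ctx, stuck)) // context deadline exceeded
}
```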

Logs and data copied to logs.5368 on the server. Goroutine dumps etc. attached.
Archive.zip

tbg commented Mar 18, 2016

The UI sometimes gets its data, but then cuts out irregularly, so it's not completely happy. I wonder what's causing that; I would've expected it to either work or not, with nothing in between.

It shows three nodes up and this one as down: ip-172-31-58-172:26257 (it isn't down).

tbg commented Mar 18, 2016

ubuntu@ip-172-31-58-174:~$ ./cockroach debug range --ca-cert certs/ca.crt --key certs/root.client.key --cert certs/root.client.crt ls
Error: scan failed: client/rpc_sender.go:58: roachpb.Batch RPC failed: context deadline exceeded

Failed running "debug"

(the same command on ip-172-31-58-173 fails the same way)

Similar for other commands. Not sure what's left to try here; looks like we've gotten all the info out of this?

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Mar 24, 2016
In cockroachdb#5368, I was seeing that a call to `redirectOnOrAcquireLeaderLease`
had been stuck for 541 minutes. This function selects on a leader lease
channel, but also selects on a context cancellation. This means that the
context should have timed out. It looks like we had dropped the original
context with timeout in `Node.Batch`, which came from `kv.sendOne`. This
change should properly link these two contexts together so that the
timeout in the stuck command would work correctly.
nvanbenschoten added five further commits to nvanbenschoten/cockroach referencing this issue on Mar 25 and Mar 26, 2016, all carrying the same commit message as above.
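
A rough Go sketch of the pattern that commit message describes, using hypothetical names (`waitForLease`, `leaseAcquired`); the real code paths are `redirectOnOrAcquireLeaderLease`, `Node.Batch`, and `kv.sendOne` as named above.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Hypothetical sketch of the wait described in the commit message: select on
// the lease-acquisition channel and on ctx.Done(). The timeout can only fire
// if the caller's context (carrying the deadline set in kv.sendOne) is the
// one actually passed in. If Node.Batch drops that context and substitutes a
// fresh context.Background(), the second case never triggers and the
// goroutine can sit here indefinitely (541 minutes, in the dumps).
func waitForLease(ctx context.Context, leaseAcquired <-chan struct{}) error {
	select {
	case <-leaseAcquired:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	leaseCh := make(chan struct{}) // never signaled in this example

	// With the client's timeout propagated, the wait unblocks with an error.
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()
	fmt.Println(waitForLease(ctx, leaseCh)) // context deadline exceeded

	// With the timeout dropped, the same call would block forever:
	//   waitForLease(context.Background(), leaseCh)
}
```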
@bdarnell

Fixed by #5551, right?
