stability: 20k+ goroutines deadlocked on beginCmd, no more progress #5368

Closed
tbg opened this issue Mar 18, 2016 · 4 comments

tbg commented Mar 18, 2016

(to be updated shortly; want an issue number)

@tbg tbg added this to the Beta milestone Mar 18, 2016
@tbg tbg changed the title stability: 100k+ goroutines deadlocked on beginCmd, no more progress stability: 20k+ goroutines deadlocked on beginCmd, no more progress Mar 18, 2016
tbg commented Mar 18, 2016

Tasks running on the offending node:

40539  server/node.go:701
1      storage/queue.go:380

Somehow all servers and clients appear healthy (regular leader lease activity as well), but nothing is really making progress. Losing a Raft command could explain some of that (it would block all client commands but not internal activity), I suppose.
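
For illustration only (this is not the actual beginCmds/command-queue code; `beginCmd`, `cmd`, and `done` are hypothetical names), here is a minimal Go sketch of that failure shape: later overlapping commands wait on the completion channel of an earlier command, so one command whose Raft proposal is lost wedges everything queued behind it.

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical sketch, not the real command queue: a command waits on the
// completion channels of earlier overlapping commands before it may execute.
type cmd struct {
	done chan struct{} // closed when the command finishes applying
}

func beginCmd(prereqs []*cmd) *cmd {
	for _, p := range prereqs {
		<-p.done // if a prerequisite never completes, we block here forever
	}
	return &cmd{done: make(chan struct{})}
}

func main() {
	// A command whose Raft proposal was lost: its done channel never closes.
	lost := &cmd{done: make(chan struct{})}

	// Every later overlapping command piles up inside beginCmd, which is the
	// shape of the thousands of blocked goroutines in the dumps.
	for i := 0; i < 5; i++ {
		go func(i int) {
			beginCmd([]*cmd{lost})
			fmt.Println("command", i, "ran") // never reached
		}(i)
	}

	time.Sleep(time.Second)
	fmt.Println("all later commands are still stuck behind the lost command")
}
```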

grep -c beginCmds *.goroutine
big_guy.ec2-54-208-199-194.compute-1.amazonaws.com.goroutine:24848
ec2-54-209-133-121.compute-1.amazonaws.com.goroutine:1
ec2-54-209-150-36.compute-1.amazonaws.com.goroutine:0
ec2-54-209-69-52.compute-1.amazonaws.com.goroutine:2

net/trace shows over 40k active traces (and growing, which makes sense since DistSender uses hard timeouts and then retries). Queue activity is "boring": only the Raft log queue ran (no GC has taken place yet).
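
As an aside, a hedged sketch of the retry pattern that would make the trace count grow; `sendWithRetries` and the timeout values are made up for illustration and are not the actual DistSender code.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Illustrative only: a DistSender-style loop with a hard per-attempt timeout.
// Each attempt that hits the timeout is abandoned and retried; if the
// server-side work never finishes, every abandoned attempt can leave an
// active trace behind, consistent with the growing net/trace count above.
func sendWithRetries(ctx context.Context, send func(context.Context) error) error {
	for {
		attemptCtx, cancel := context.WithTimeout(ctx, 100*time.Millisecond)
		err := send(attemptCtx)
		cancel()
		if err == nil {
			return nil
		}
		if ctx.Err() != nil {
			return ctx.Err() // the caller's own deadline expired; give up
		}
		// Per-attempt deadline hit: retry, leaving the previous attempt behind.
	}
}

func main() {
	stuck := func(ctx context.Context) error {
		<-ctx.Done() // pretend the RPC never completes
		return ctx.Err()
	}
	ctx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
	defer cancel()
	fmt.Println(sendWithRetries(ctx, stuck)) // context deadline exceeded
}
```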

Logs and data copied to logs.5368 on the server. Goroutine dumps etc. attached.
Archive.zip

tbg commented Mar 18, 2016

The UI sometimes gets its data, but then cuts out irregularly, so it's not completely happy. I wonder what's causing that; I would've expected it to either work or not, with nothing in between.

It shows three nodes up and this one as down: ip-172-31-58-172:26257 (it isn't down).

tbg commented Mar 18, 2016

ubuntu@ip-172-31-58-174:~$ ./cockroach debug range --ca-cert certs/ca.crt --key certs/root.client.key --cert certs/root.client.crt ls
Error: scan failed: client/rpc_sender.go:58: roachpb.Batch RPC failed: context deadline exceeded

Failed running "debug"

(the same command on ip-172-31-58-173 fails the same way)

Similar for other commands. Not sure what's left to try here; looks like we've gotten all the info out of this?

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Mar 24, 2016
In cockroachdb#5368, I was seeing that a call to `redirectOnOrAcquireLeaderLease`
had been stuck for 541 minutes. This function selects on a leader lease
channel, but also selects on a context cancellation. This means that the
context should have timed out. It looks like we had dropped the original
context with timeout in `Node.Batch`, which came from `kv.sendOne`. This
change should properly link these two contexts together so that the
timeout in the stuck command would work correctly.
nvanbenschoten added five further commits to nvanbenschoten/cockroach referencing this issue on Mar 25 and Mar 26, 2016, all carrying the same commit message as above.
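
A rough Go sketch of the pattern that commit message describes, using hypothetical names (`waitForLease`, `leaseAcquired`); the real code paths are `redirectOnOrAcquireLeaderLease`, `Node.Batch`, and `kv.sendOne` as named above.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Hypothetical sketch of the wait described in the commit message: select on
// the lease-acquisition channel and on ctx.Done(). The timeout can only fire
// if the caller's context (carrying the deadline set in kv.sendOne) is the
// one actually passed in. If Node.Batch drops that context and substitutes a
// fresh context.Background(), the second case never triggers and the
// goroutine can sit here indefinitely (541 minutes, in the dumps).
func waitForLease(ctx context.Context, leaseAcquired <-chan struct{}) error {
	select {
	case <-leaseAcquired:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	leaseCh := make(chan struct{}) // never signaled in this example

	// With the client's timeout propagated, the wait unblocks with an error.
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()
	fmt.Println(waitForLease(ctx, leaseCh)) // context deadline exceeded

	// With the timeout dropped, the same call would block forever:
	//   waitForLease(context.Background(), leaseCh)
}
```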
@bdarnell

Fixed by #5551, right?
