stability: 20k+ goroutines deadlocked on beginCmd, no more progress #5368
Tasks running on the offending node:
Somehow all servers and clients are happy (regular leader lease activity as well), but nothing is really going forward. Losing a Raft command could explain some of that (blocking all client commands, but not blocking internal stuff), I suppose.
Logs copied and data copied to
The UI sometimes gets its data, but then cuts out irregularly. So it's not completely happy. Wonder what's causing that - I would've expected it to work or not, but nothing in between. It shows three nodes up and this one as down: ip-172-31-58-172:26257 (it isn't).
Similar for other commands. Not sure what's left to try here. Looks like we've got all the info out of this?
In cockroachdb#5368, I was seeing that a call to `redirectOnOrAcquireLeaderLease` had been stuck for 541 minutes. This function selects on a leader lease channel, but also selects on a context cancellation. This means that the context should have timed out. It looks like we had dropped the original context with timeout in `Node.Batch`, which came from `kv.sendOne`. This change should properly link these two contexts together so that the timeout in the stuck command would work correctly.
Fixed by #5551, right?
(to be updated shortly; want an issue number)