roachtest: tpch_concurrency failed #79469
I think we're stuck on query 13 on node 1 trying to acquire the file descriptors from the semaphore:
We have seen this in a customer setting, but that issue was fixed by the cancellation fixes several months ago. I wonder whether we are in a legitimate livelock or whether we again have cancellation issues (#79084 is a somewhat plausible culprit). My guess is that it's the former, so we might want to just bump the default from 256 FDs to, say, 1024. I don't think it's a release blocker, so I'm removing the corresponding label.
Indeed, I think #79084 introduced a regression with cancellation. I have a fix. Adding a release-blocker label to 22.1 and 21.2 branches.
Could you say a bit more about how you determined that #79084 caused this?
I think I have a decent explanation in the commit message in #79716, copy-pasting (with some edits) for convenience. #79084 fixed the way inboxes handle regular query errors so that now the gRPC In particular, the following sequence of events can happen:
Thus, we get into a state that the inbox is never drained, so the What this test encountered was exactly the leak of file descriptors.
Thanks for the explanation. Should we revert the corresponding PR on 22.1 as well? I think any more comprehensive fix will likely need to wait until 22.1.1.
Yes, I think so. It didn't seem as urgent on Friday, so I delayed the revert till today.
roachtest.tpch_concurrency failed with artifacts on master @ 63ea9139e2ca996e38b5fe7c7b43a97e625242f5:
Help
See: roachtest README
See: How To Investigate (internal)
This test on roachdash | Improve this report!
Jira issue: CRDB-14877