roachtest: restore/nodeShutdown/worker failed #80821
The restore job failed on node 2, which picked up the job after node 3 shut down during execution. A first pass of the logs suggests a potential problem in kv land. It seems that node 2 picked the job up cleanly:
But node 2's logs suggest some problems handling the shut-down node's raft queue ( note
Immediately after the log messages above, this log line appears:
I do wish this error message was a bit more descriptive. This error message was formed when the
See: restore_job:777
The errors are a consequence of n3 having been shut down by the test in the moments before the job failed. They don't suggest a KV problem per se. The source of a context cancellation is a real annoyance to track down. I worked on, but never merged, a way to possibly make this better here. Luckily, there is an active proposal to add context cancellation reasons upstream. This doesn't help today, but hopefully one day it will. In the meantime, from a KV point of view, this isn't likely to be a blocker, but if Bulk can figure out what exactly timed out here, we could take a closer look. Given that this is 21.1, which isn't seeing many changes, the risk of having broken anything is low, so I would not mark this as a blocker.
@tbg thank you for the explanation! I'll remove the release blocker and will reinvestigate if another failure pops up.
Taking KV off. @msbutler, if you find something pointing to KV, feel free to kick it back over.
roachtest.restore/nodeShutdown/worker failed with artifacts on release-21.1 @ f275355bdb6b1c4698185c2ad003298b149359ec:
Reproduce
To reproduce, try:
# From https://go.crdb.dev/p/roachstress, perhaps edited lightly.
caffeinate ./roachstress.sh restore/nodeShutdown/worker
This test on roachdash | Improve this report!
Jira issue: CRDB-15502