roachtest: backup/nodeShutdown/coordinator/n4cpu4 failed #68787
The BACKUP is retrying once with:
but then it appears to fail with:
My suspicion is that we are still planning a spec on node 2 (the node we shut down) on retry, and when distsql tries to set up a flow it realizes it cannot dial the node. Ideally, ...
In cockroachdb#68787 we saw a backup job fail on retry because it attempted to plan a flow on a node that had been shut down. This should not usually happen, since every time we plan the flow we fetch the nodes that can participate in the distsql flow. If we do encounter a `breaker open` error, we can retry in the hope that the next time the flow is planned it doesn't attempt to dial the dead node.
Release note: None
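The retry idea in the commit message above can be sketched roughly as follows. This is a minimal illustration, not the actual utilccl change: `errBreakerOpen`, `isRetryable`, and `runWithRetry` are hypothetical names, and matching on the error text is an assumption made only to keep the example self-contained.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// errBreakerOpen is a hypothetical stand-in for the circuit breaker error
// returned when dialing a node that has been shut down.
var errBreakerOpen = errors.New("breaker open")

// isRetryable reports whether err looks like a "breaker open" failure.
// Matching on the message text is an assumption of this sketch.
func isRetryable(err error) bool {
	return err != nil && strings.Contains(err.Error(), "breaker open")
}

// runWithRetry calls plan up to maxAttempts times. It stops on success or on
// the first non-retryable error, and returns the number of attempts made
// together with the last error.
func runWithRetry(maxAttempts int, plan func(attempt int) error) (int, error) {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err = plan(attempt)
		if err == nil || !isRetryable(err) {
			return attempt, err
		}
	}
	return maxAttempts, err
}

func main() {
	// Simulated planning: the first attempt tries to dial the dead node,
	// the second attempt only plans on live nodes and succeeds.
	attempts, err := runWithRetry(3, func(attempt int) error {
		if attempt == 1 {
			return fmt.Errorf("dialing n2: %w", errBreakerOpen)
		}
		return nil
	})
	fmt.Println(attempts, err)
}
```

The point of the sketch is that each attempt re-plans from scratch, so a retry gets a chance to pick up a node list that no longer includes the dead node. Any other error is returned immediately.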
68899: utilccl: add ErrBreakerOpen to bulk retryable error r=pbardea a=adityamaru
In #68787 we saw a backup job fail on retry because it attempted to plan a flow on a node that had been shut down. This should not usually happen, since every time we plan the flow we fetch the nodes that can participate in the distsql flow. If we do encounter a `breaker open` error, we can retry in the hope that the next time the flow is planned it doesn't attempt to dial the dead node.
Release note: None
68961: backupccl: deflake TestBackupRestoreSystemJobProgress r=pbardea a=adityamaru
In the case of backup, we only update the job's fraction progressed once we have exported a complete span. This change adds logic to wait until at least one complete span has been exported before checking for a progress update.
Release note: None
Co-authored-by: Aditya Maru
roachtest.backup/nodeShutdown/coordinator/n4cpu4 failed with artifacts on release-21.1 @ c4d0e7baee3925541eed599ae771abb95c97732b:
Reproduce
To reproduce, try:
# From https://go.crdb.dev/p/roachstress, perhaps edited lightly.
caffeinate ./roachstress.sh backup/nodeShutdown/coordinator/n4cpu4
Jira issue: CRDB-9262