sql/importer: TestImportWorkerFailure failed #102839
The full failure looks like this:
I wasn't able to repro this, although the failure has happened twice. I wonder if someone with strong jobs mojo could take a look?
Using stress, after 2885 runs I got:
Assigning @cucaroach to skip this test (and mark this issue with the skipped-test label); we can put it on Bugs to Fix afterwards if the fix is not obvious.
On the first one that complained about the transport closing, I suspect that is this. My guess is that this SELECT is being distributed, and thus when a node (other than the gateway) is terminated, the execution of the distributed query fails, but this code doesn't retry it and assumes any error is a failure of the job, so the job fails? I have no idea at all about that second one -- that one looks just... wrong... right?
Thanks for looking! I believe the internal executor will always run with DistSQLMode of 0 (i.e. off), but maybe the DistSender gRPC failed? It's a little surprising that a raw gRPC error made it all the way to the results iterator without getting wrapped/adorned with more info; if I could repro it, I could try to fix that.
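(For illustration only, not the actual importer or jobs code: a minimal Go sketch of the classification being discussed, where a raw gRPC transport error is treated as retryable rather than as a permanent job failure. The isRetryable, runQuery, and runWithRetry names are invented for this sketch; only the grpc status and codes packages are real.)

```go
// Sketch: classify a raw gRPC transport failure as retryable instead of
// treating every error from the results iterator as a permanent job failure.
package main

import (
	"context"
	"errors"
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// isRetryable reports whether err looks like a transient RPC failure, e.g.
// "transport is closing" when a remote node is terminated mid-query.
func isRetryable(err error) bool {
	if s, ok := status.FromError(err); ok {
		switch s.Code() {
		case codes.Unavailable, codes.Aborted:
			return true
		}
	}
	return errors.Is(err, context.DeadlineExceeded)
}

// runQuery stands in for iterating the internal executor's results; here it
// always fails with a transport-style error to exercise the retry path.
func runQuery(ctx context.Context) error {
	return status.Error(codes.Unavailable, "transport is closing")
}

// runWithRetry retries runQuery on transient errors instead of failing on
// the first error it sees.
func runWithRetry(ctx context.Context, attempts int) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = runQuery(ctx); err == nil || !isRetryable(err) {
			return err
		}
		fmt.Printf("attempt %d failed with transient error: %v\n", i+1, err)
	}
	return err
}

func main() {
	if err := runWithRetry(context.Background(), 3); err != nil {
		fmt.Println("giving up:", err)
	}
}
```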
Informs cockroachdb#102839 Release note: none
103016: kvserver: deflake TestAcquireLeaseTimeout r=erikgrinaker a=tbg

There are a few callers that use non-cancelable contexts, and if one of them got into the test we'd deadlock. Touches #102975. (The backport will close it.) Epic: none Release note: None

103033: import: skip TestImportWorkerFailure r=cucaroach a=cucaroach

Informs #102839 Release note: none

Co-authored-by: Tobias Grieger <[email protected]> Co-authored-by: Tommy Reilly <[email protected]>
That's not the case anymore, as of #101486.
Right, but that hasn't been backported to 23.1 as far as I can tell. And wouldn't we expect to see some distsql error wrapping the underlying gRPC error, like "inbox communication error"?
Hm, I didn't notice that this is 23.1. Indeed, #101486 should be ignored. However, #93218 could make some internal queries use DistSQL when they didn't previously (context is in https://cockroachlabs.slack.com/archives/C0168LW5THS/p1679603545588689?thread_ts=1679437525.752149&cid=C0168LW5THS):
I don't think it's guaranteed to always be the case.
sql/importer.TestImportWorkerFailure failed with artifacts on release-23.1 @ e84e2cabef32a59d88041132f059a6768dfe56bb:
Parameters:
I should note that this has only failed under race and deadlock, and I've also seen it fail under stress. It's tempting to just disable it under those conditions to get the test running again but leave this bug open to try to get a grip on the legit failures.
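(A rough sketch of the "disable under those conditions" idea, assuming CockroachDB's pkg/testutils/skip helpers UnderRace, UnderDeadlock, and UnderStress; the messages and placement are illustrative, not the actual change.)

```go
// Sketch: skip only in the configurations where the flake has been observed,
// so the test keeps running in ordinary builds.
package importer

import (
	"testing"

	"github.com/cockroachdb/cockroach/pkg/testutils/skip"
)

func TestImportWorkerFailure(t *testing.T) {
	// Skip under the builds where the failure has been seen.
	skip.UnderDeadlock(t, "flaky under deadlock; see #102839")
	skip.UnderRace(t, "flaky under race; see #102839")
	skip.UnderStress(t, "flaky under stress; see #102839")

	// ... the rest of the test runs the import and shuts down a worker node ...
}
```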
sql/importer.TestImportWorkerFailure failed with artifacts on release-23.1 @ 78dae31f503cec8e00fa2f18ed6a65da6042acbe:
Parameters:
…deadlock
Better to have the test running some of the time to catch regressions. Informs: cockroachdb#102839 Epic: None Release note: None
Removed the skipped-test label; it's not skipped on 23.1 at all, and it's skipped only under deadlock, race, and stress on master.
Informs: cockroachdb#102839 Release note: None
Informs: cockroachdb#102839 Release note: None
When this test was added five years ago, it was checked in skipped, with this comment:
// TODO(mjibson): Although this test passes most of the time it still
// sometimes fails because not all kinds of failures caused by shutting a
// node down are detected and retried.
t.Skip("flakey due to undetected kinds of failures when the node is shutdown")
With this history in mind, I have a proposal for this test. Maybe @dt or @stevendanna could comment on whether this makes sense:
Five years ago, in cockroachdb#26881, we changed import to retry on worker failures, which made imports much more resilient to transient failures like nodes going down. As part of this work we created `TestImportWorkerFailure`, which shuts down one node during an import and checks that the import succeeded. Unfortunately, this test was checked in skipped, because though imports were much more resilient to node failures, they were not completely resilient in every possible scenario, making the test flakey. Two months ago, in cockroachdb#105712, we unskipped this test and discovered that in some cases the import statement succeeded but only imported a partial dataset. This non-atomicity seems like a bigger issue than whether the import is able to succeed in every possible transient failure scenario. This PR changes `TestImportWorkerFailure` to remove successful import as a necessary condition for test success. Instead, the test now only checks whether the import was atomic; that is, whether a successful import imported all data or a failed import imported none. This is more in line with what we can guarantee about imports today. This PR also completely unskips `TestImportWorkerFailure` so that we can test the atomicity of imports more thoroughly. Fixes: cockroachdb#102839 Release note: None
Five years ago, in cockroachdb#26881, we changed import to retry on worker failures, which made imports much more resilient to transient failures like nodes going down. As part of this work we created `TestImportWorkerFailure`, which shuts down one node during an import and checks that the import succeeded. Unfortunately, this test was checked in skipped, because though imports were much more resilient to node failures, they were not completely resilient in every possible scenario, making the test flakey. Two months ago, in cockroachdb#105712, we unskipped this test and discovered that in some cases the import statement succeeded but only imported a partial dataset. This non-atomicity seems like a bigger issue than whether the import is able to succeed in every possible transient failure scenario, and is tracked separately in cockroachdb#108547. This PR changes `TestImportWorkerFailure` to remove successful import as a necessary condition for test success. Instead, the test now only checks whether the import was atomic; that is, whether a successful import imported all data or a failed import imported none. This is more in line with what we can guarantee about imports today. This PR also unskips `TestImportWorkerFailure` under stress so that we can test the atomicity of imports more thoroughly. Fixes: cockroachdb#102839 Release note: None
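(A hedged sketch of the atomicity check described in this PR: whether the IMPORT succeeds or fails after a node is shut down, the target table should contain either all rows or none. The checkImportAtomicity helper and its parameters are placeholder names, not the real test's code.)

```go
// Sketch: verify import atomicity after a node shutdown. On success the full
// dataset must be present; on failure nothing must be left behind.
package importer

import (
	"database/sql"
	"testing"
)

func checkImportAtomicity(t *testing.T, db *sql.DB, importErr error, expectedRows int) {
	t.Helper()
	var got int
	if err := db.QueryRow(`SELECT count(*) FROM t`).Scan(&got); err != nil {
		t.Fatal(err)
	}
	if importErr == nil && got != expectedRows {
		t.Fatalf("import succeeded but only %d of %d rows are present", got, expectedRows)
	}
	if importErr != nil && got != 0 {
		t.Fatalf("import failed (%v) yet %d rows were left behind", importErr, got)
	}
}
```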
108210: cli: add limited statement_statistics to debug zip r=j82w a=j82w

This adds statement_statistics to the debug zip. It is limited to the transaction fingerprint ids in the transaction_contention_events table, because the statement_statistics table maps the fingerprint to the query text. It also adds the top 100 statements by CPU usage. Closes: #108180 Release note (cli change): Added limited statement_statistics to the debug zip.

108382: ccl/sqlproxyccl: serve a dirty cache whenever the watcher fails r=JeffSwenson a=jaylim-crl

Previously, we would invalidate all tenant metadata entries whenever the watcher failed. This can cause issues when the directory server fails (e.g. the Kubernetes API server is down). It is possible that existing SQL pods are still up, but we're invalidating the entire directory cache; we should allow incoming requests with existing SQL pods to connect to those pods. This commit addresses the issue by serving a stale cache whenever the watcher fails, rather than invalidating the cache. Release note: None Epic: CC-25053

108626: importer: only check import *atomicity* in TestImportWorkerFailure r=dt,yuzefovich,cucaroach a=michae2

Five years ago, in #26881, we changed import to retry on worker failures, which made imports much more resilient to transient failures like nodes going down. As part of this work we created `TestImportWorkerFailure`, which shuts down one node during an import and checks that the import succeeded. Unfortunately, this test was checked in skipped, because though imports were much more resilient to node failures, they were not completely resilient in every possible scenario, making the test flakey. Two months ago, in #105712, we unskipped this test and discovered that in some cases the import statement succeeded but only imported a partial dataset. This non-atomicity seems like a bigger issue than whether the import is able to succeed in every possible transient failure scenario, and is tracked separately in #108547. This PR changes `TestImportWorkerFailure` to remove successful import as a necessary condition for test success. Instead, the test now only checks whether the import was atomic; that is, whether a successful import imported all data or a failed import imported none. This is more in line with what we can guarantee about imports today. Fixes: #102839 Release note: None

Co-authored-by: j82w <[email protected]> Co-authored-by: Jay <[email protected]> Co-authored-by: Michael Erickson <[email protected]>
Five years ago, in cockroachdb#26881, we changed import to retry on worker failures, which made imports much more resilient to transient failures like nodes going down. As part of this work we created `TestImportWorkerFailure`, which shuts down one node during an import and checks that the import succeeded. Unfortunately, this test was checked in skipped, because though imports were much more resilient to node failures, they were not completely resilient in every possible scenario, making the test flakey. Two months ago, in cockroachdb#105712, we unskipped this test and discovered that in some cases the import statement succeeded but only imported a partial dataset. This non-atomicity seems like a bigger issue than whether the import is able to succeed in every possible transient failure scenario, and is tracked separately in cockroachdb#108547. This PR changes `TestImportWorkerFailure` to remove successful import as a necessary condition for test success. Instead, the test now only checks whether the import was atomic; that is, whether a successful import imported all data or a failed import imported none. This is more in line with what we can guarantee about imports today. Fixes: cockroachdb#102839 Release note: None
sql/importer.TestImportWorkerFailure failed with artifacts on release-23.1 @ 4226a83871bbce776bc9389fca5cf084b4bb7632:
Parameters:
TAGS=bazel,gss,race
Help
See also: How To Investigate a Go Test Failure (internal)
Jira issue: CRDB-27679