importccl: restart IMPORT on worker node failure #26881
Merged
Conversation
PTAL @dt
dt
approved these changes
Jun 26, 2018
// TODO(mjibson): Although this test passes most of the time it still
// sometimes fails because not all kinds of failures caused by shutting a
// node down are detected and retried.
t.Skip()
nit: i generally like to pass a brief explanation (or filed issue #) to Skip.
fixed
Attempt to detect a context canceled error in IMPORT which is caused by a node going away in the dist SQL run. Send a special error back to the job registry indicating a restart should happen instead of a failure.

We are shipping this with a skipped test because it is flaky. We are OK doing that because it is still better than what we had before in many cases, just not all. We will work to improve the other things so that we can correctly detect when IMPORT can be restarted due to a node outage, which will allow us to unskip this test.

Fixes #25866
Fixes #25480

Release note (bug fix): IMPORT now detects node failure and will restart instead of fail.
bors r+
craig bot
pushed a commit
that referenced
this pull request
Jun 26, 2018
26811: kv: update the TCS's txn on requests way out r=andreimatei a=andreimatei

The TxnCoordSender maintains a copy of the transaction record, used for things like heartbeating and creating new transactions after a TransactionAbortedError. This copy is supposed to be kept in sync with the client.Txn's copy. Before this patch, the syncing was done by updating the TCS's txn when a response comes back from a request. This patch moves to updating the TCS's txn on a request's way out, in addition to continuing to update it when a request comes back. Besides being the sane thing to do™, this ensures that, if the heartbeat loop triggers before the response to the BeginTransaction's batch comes back, the transaction already has the key set. Without this patch, if the heartbeat loop triggered before the BeginTxn response, it would heartbeat key /Min, which is nonsensical (and created load on range 0 for TPCC load tests).

Release note: None

26881: importccl: restart IMPORT on worker node failure r=mjibson a=mjibson

Attempt to detect a context canceled error in IMPORT which is caused by a node going away in the dist SQL run. Send a special error back to the job registry indicating a restart should happen instead of a failure.

Fixes #25866
Fixes #25480

Release note (bug fix): IMPORT now detects node failure and will restart instead of fail.

26968: settings: bump minimum supported version to v2.0 r=nvanbenschoten a=nvanbenschoten

We're currently shipping v2.1 alphas, so enforce a minimum binary version of v2.0. This ensures that no one can upgrade directly from v1.1 to v2.1. Instead, they need to make a pit stop in v2.0.

Release note: None

26984: storageccl: retry SST chunks with new splits on err r=dt a=dt

Simpler alternative to #26930. Closes #26930.

Previously an ImportRequest would fail to add SSTables that spanned the boundaries of the target range(s). This reattempts the AddSSTable call with re-chunked SSTables that avoid spanning the bounds returned in the range mismatch error. It does this by iterating the SSTable to build and add smaller SSTables for either side of the split.

This error currently happens rarely in practice -- we usually explicitly split ranges immediately before sending an Import with matching bounds to them. Usually the empty, just-split range has no reason to split again, so the Import usually succeeds. However in some cases, like resuming a prior RESTORE, we may be re-Importing into ranges that are *not* empty and could have split at points other than those picked by the RESTORE statement.

Fixes #17819. Subsumes #24299. Closes #24299.

Release note: none.

Co-authored-by: Andrei Matei <[email protected]>
Co-authored-by: Matt Jibson <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: David Taylor <[email protected]>
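The re-chunking idea in the 26984 commit message can be illustrated with a minimal sketch. `splitSpan` is a hypothetical helper operating on plain string keys; the real code iterates an SSTable and builds two smaller SSTables on either side of the boundary reported in the range-mismatch error before retrying AddSSTable.

```go
package main

import "fmt"

// splitSpan partitions a sorted set of keys at a range boundary: keys below
// the boundary go into the left chunk, the rest into the right chunk. Each
// chunk can then be retried as its own (smaller) SSTable that no longer
// spans the range boundary.
func splitSpan(keys []string, boundary string) (left, right []string) {
	for _, k := range keys {
		if k < boundary {
			left = append(left, k)
		} else {
			right = append(right, k)
		}
	}
	return left, right
}

func main() {
	keys := []string{"a", "c", "f", "m", "t"}
	// "g" stands in for the end key reported by the range-mismatch error.
	left, right := splitSpan(keys, "g")
	fmt.Println(left, right) // [a c f] [m t]
}
```

Note that this split is driven by the boundary the server reports at failure time, which is why it also handles ranges that split at points the RESTORE statement did not pick.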
Build succeeded
michae2
added a commit
to michae2/cockroach
that referenced
this pull request
Aug 11, 2023
Five years ago, in cockroachdb#26881, we changed import to retry on worker failures, which made imports much more resilient to transient failures like nodes going down. As part of this work we created `TestImportWorkerFailure`, which shuts down one node during an import and checks that the import succeeded. Unfortunately, this test was checked in skipped, because though imports were much more resilient to node failures, they were not completely resilient in every possible scenario, making the test flaky.

Two months ago, in cockroachdb#105712, we unskipped this test and discovered that in some cases the import statement succeeded but only imported a partial dataset. This non-atomicity seems like a bigger issue than whether the import is able to succeed in every possible transient failure scenario.

This PR changes `TestImportWorkerFailure` to remove successful import as a necessary condition for test success. Instead, the test now only checks whether the import was atomic; that is, whether a successful import imported all data or a failed import imported none. This is more in line with what we can guarantee about imports today. This PR also completely unskips `TestImportWorkerFailure` so that we can test the atomicity of imports more thoroughly.

Fixes: cockroachdb#102839

Release note: None
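The weakened test condition (atomicity rather than success) can be sketched as a small check. `checkImportAtomicity` is a hypothetical helper for illustration, not the actual test code in `TestImportWorkerFailure`.

```go
package main

import "fmt"

// checkImportAtomicity encodes the invariant the updated test verifies: after
// a node death, the import may succeed or fail, but a successful import must
// have imported all expected rows and a failed import must have left none
// behind. Any partial row count indicates a non-atomic import.
func checkImportAtomicity(succeeded bool, rowCount, expected int) error {
	switch {
	case succeeded && rowCount != expected:
		return fmt.Errorf("import succeeded but only %d of %d rows present", rowCount, expected)
	case !succeeded && rowCount != 0:
		return fmt.Errorf("import failed but %d rows were left behind", rowCount)
	}
	return nil
}

func main() {
	fmt.Println(checkImportAtomicity(true, 100, 100))  // <nil>
	fmt.Println(checkImportAtomicity(false, 0, 100))   // <nil>
	fmt.Println(checkImportAtomicity(true, 40, 100) != nil) // true: non-atomic
}
```

Framing the check this way lets the test pass regardless of which way the race with the dying node resolves, while still catching the partial-dataset bug described above.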
michae2
added a commit
to michae2/cockroach
that referenced
this pull request
Aug 14, 2023
michae2
added a commit
to michae2/cockroach
that referenced
this pull request
Aug 14, 2023
craig bot
pushed a commit
that referenced
this pull request
Aug 14, 2023
108210: cli: add limited statement_statistics to debug zip r=j82w a=j82w

This adds statement_statistics to the debug zip. It is limited to the transaction fingerprint IDs in the transaction_contention_events table. This is because the statement_statistics table maps the fingerprint to the query text. It also adds the top 100 statements by CPU usage.

Closes: #108180

Release note (cli change): Added limited statement_statistics to the debug zip.

108382: ccl/sqlproxyccl: serve a dirty cache whenever the watcher fails r=JeffSwenson a=jaylim-crl

Previously, we would invalidate all tenant metadata entries whenever the watcher failed. This can cause issues when the directory server fails (e.g. the Kubernetes API server is down). It is possible that existing SQL pods are still up, but we're invalidating the entire directory cache. We should allow incoming requests with existing SQL pods to connect to those pods. This commit addresses the issue by serving a stale cache whenever the watcher fails and not invalidating the cache.

Release note: None

Epic: CC-25053

108626: importer: only check import *atomicity* in TestImportWorkerFailure r=dt,yuzefovich,cucaroach a=michae2

Five years ago, in #26881, we changed import to retry on worker failures, which made imports much more resilient to transient failures like nodes going down. As part of this work we created `TestImportWorkerFailure`, which shuts down one node during an import and checks that the import succeeded. Unfortunately, this test was checked in skipped, because though imports were much more resilient to node failures, they were not completely resilient in every possible scenario, making the test flaky.

Two months ago, in #105712, we unskipped this test and discovered that in some cases the import statement succeeded but only imported a partial dataset. This non-atomicity seems like a bigger issue than whether the import is able to succeed in every possible transient failure scenario, and is tracked separately in #108547.

This PR changes `TestImportWorkerFailure` to remove successful import as a necessary condition for test success. Instead, the test now only checks whether the import was atomic; that is, whether a successful import imported all data or a failed import imported none. This is more in line with what we can guarantee about imports today.

Fixes: #102839

Release note: None

Co-authored-by: j82w <[email protected]>
Co-authored-by: Jay <[email protected]>
Co-authored-by: Michael Erickson <[email protected]>
michae2
added a commit
to michae2/cockroach
that referenced
this pull request
Aug 16, 2023
michae2
added a commit
that referenced
this pull request
Aug 16, 2023
yuzefovich
pushed a commit
to yuzefovich/cockroach
that referenced
this pull request
Nov 8, 2023
yuzefovich
pushed a commit
to yuzefovich/cockroach
that referenced
this pull request
Nov 8, 2023