importccl: restart IMPORT on worker node failure #26881
Merged
Conversation
PTAL @dt
dt
approved these changes
Jun 26, 2018
// TODO(mjibson): Although this test passes most of the time it still
// sometimes fails because not all kinds of failures caused by shutting a
// node down are detected and retried.
t.Skip()
nit: i generally like to pass a brief explanation (or filed issue #) to Skip.
fixed
Attempt to detect a context canceled error in IMPORT which is caused by a node going away in the dist SQL run. Send a special error back to the job registry indicating a restart should happen instead of a failure.

We are shipping this with a skipped test because it is flaky. We are OK doing that because it is still better than what we had before in many cases, just not all. We will work to improve the other things so that we can correctly detect when IMPORT can be restarted due to a node outage, which will allow us to unskip this test.

Fixes #25866
Fixes #25480

Release note (bug fix): IMPORT now detects node failure and will restart instead of fail.
bors r+
craig bot
pushed a commit
that referenced
this pull request
Jun 26, 2018
26811: kv: update the TCS's txn on requests way out r=andreimatei a=andreimatei

The TxnCoordSender maintains a copy of the transaction record, used for things like heartbeating and creating new transactions after a TransactionAbortedError. This copy is supposed to be kept in sync with the client.Txn's copy. Before this patch, the syncing was done by updating the TCS's txn when a response comes back from a request. This patch moves to updating the TCS's txn on a request's way out, in addition to continuing to update it when a request comes back. Besides being the sane thing to do™, this ensures that, if the heartbeat loop triggers before the response to the BeginTransaction's batch comes back, the transaction already has the key set. Without this patch, if the heartbeat loop triggered before the BeginTxn response, it would heartbeat key /Min, which is nonsensical (and created load on range 0 for TPCC load tests).

Release note: None

26881: importccl: restart IMPORT on worker node failure r=mjibson a=mjibson

Attempt to detect a context canceled error in IMPORT which is caused by a node going away in the dist SQL run. Send a special error back to the job registry indicating a restart should happen instead of a failure.

Fixes #25866
Fixes #25480

Release note (bug fix): IMPORT now detects node failure and will restart instead of fail.

26968: settings: bump minimum supported version to v2.0 r=nvanbenschoten a=nvanbenschoten

We're currently shipping v2.1 alphas, so enforce a minimum binary version of v2.0. This ensures that no one can upgrade directly from v1.1 to v2.1. Instead, they need to make a pit stop in v2.0.

Release note: None

26984: storageccl: retry SST chunks with new splits on err r=dt a=dt

Simpler alternative to #26930. Closes #26930.

Previously an ImportRequest would fail to add SSTables that spanned the boundaries of the target range(s). This reattempts the AddSSTable call with re-chunked SSTables that avoid spanning the bounds returned in the range mismatch error. It does this by iterating the SSTable to build and add smaller SSTables for either side of the split.

This error currently happens rarely in practice -- we usually explicitly split ranges immediately before sending an Import with matching bounds to them. Usually the empty, just-split range has no reason to split again, so the Import usually succeeds. However in some cases, like resuming a prior RESTORE, we may be re-Importing into ranges that are *not* empty and could have split at points other than those picked by the RESTORE statement.

Fixes #17819. Subsumes #24299. Closes #24299.

Release note: none.

Co-authored-by: Andrei Matei <[email protected]>
Co-authored-by: Matt Jibson <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: David Taylor <[email protected]>
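The re-chunking idea in the 26984 commit message can be illustrated with a minimal sketch. `splitSpan` is a hypothetical helper operating on plain string keys; the real code iterates an SSTable and builds two smaller SSTables on either side of the boundary reported in the range-mismatch error before retrying AddSSTable.

```go
package main

import "fmt"

// splitSpan partitions a sorted set of keys at a range boundary: keys below
// the boundary go into the left chunk, the rest into the right chunk. Each
// chunk can then be retried as its own (smaller) SSTable that no longer
// spans the range boundary.
func splitSpan(keys []string, boundary string) (left, right []string) {
	for _, k := range keys {
		if k < boundary {
			left = append(left, k)
		} else {
			right = append(right, k)
		}
	}
	return left, right
}

func main() {
	keys := []string{"a", "c", "f", "m", "t"}
	// "g" stands in for the end key reported by the range-mismatch error.
	left, right := splitSpan(keys, "g")
	fmt.Println(left, right) // [a c f] [m t]
}
```

Note that this split is driven by the boundary the server reports at failure time, which is why it also handles ranges that split at points the RESTORE statement did not pick.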
Build succeeded
michae2
added a commit
to michae2/cockroach
that referenced
this pull request
Aug 11, 2023
Five years ago, in cockroachdb#26881, we changed import to retry on worker failures, which made imports much more resilient to transient failures like nodes going down. As part of this work we created `TestImportWorkerFailure`, which shuts down one node during an import and checks that the import succeeded. Unfortunately, this test was checked in skipped, because though imports were much more resilient to node failures, they were not completely resilient in every possible scenario, making the test flaky.

Two months ago, in cockroachdb#105712, we unskipped this test and discovered that in some cases the import statement succeeded but only imported a partial dataset. This non-atomicity seems like a bigger issue than whether the import is able to succeed in every possible transient failure scenario.

This PR changes `TestImportWorkerFailure` to remove successful import as a necessary condition for test success. Instead, the test now only checks whether the import was atomic; that is, whether a successful import imported all data or a failed import imported none. This is more in line with what we can guarantee about imports today. This PR also completely unskips `TestImportWorkerFailure` so that we can test the atomicity of imports more thoroughly.

Fixes: cockroachdb#102839

Release note: None
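The weakened test condition (atomicity rather than success) can be sketched as a small check. `checkImportAtomicity` is a hypothetical helper for illustration, not the actual test code in `TestImportWorkerFailure`.

```go
package main

import "fmt"

// checkImportAtomicity encodes the invariant the updated test verifies: after
// a node death, the import may succeed or fail, but a successful import must
// have imported all expected rows and a failed import must have left none
// behind. Any partial row count indicates a non-atomic import.
func checkImportAtomicity(succeeded bool, rowCount, expected int) error {
	switch {
	case succeeded && rowCount != expected:
		return fmt.Errorf("import succeeded but only %d of %d rows present", rowCount, expected)
	case !succeeded && rowCount != 0:
		return fmt.Errorf("import failed but %d rows were left behind", rowCount)
	}
	return nil
}

func main() {
	fmt.Println(checkImportAtomicity(true, 100, 100))  // <nil>
	fmt.Println(checkImportAtomicity(false, 0, 100))   // <nil>
	fmt.Println(checkImportAtomicity(true, 40, 100) != nil) // true: non-atomic
}
```

Framing the check this way lets the test pass regardless of which way the race with the dying node resolves, while still catching the partial-dataset bug described above.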
michae2
added a commit
to michae2/cockroach
that referenced
this pull request
Aug 14, 2023
michae2
added a commit
to michae2/cockroach
that referenced
this pull request
Aug 14, 2023
craig bot
pushed a commit
that referenced
this pull request
Aug 14, 2023
108210: cli: add limited statement_statistics to debug zip r=j82w a=j82w

This adds statement_statistics to the debug zip. It is limited to the transaction fingerprint IDs in the transaction_contention_events table. This is because the statement_statistics table maps the fingerprint to the query text. It also adds the top 100 statements by CPU usage.

Closes: #108180

Release note (cli change): Added limited statement_statistics to the debug zip.

108382: ccl/sqlproxyccl: serve a dirty cache whenever the watcher fails r=JeffSwenson a=jaylim-crl

Previously, we would invalidate all tenant metadata entries whenever the watcher failed. This can cause issues when the directory server fails (e.g. the Kubernetes API server is down). It is possible that existing SQL pods are still up, but we're invalidating the entire directory cache. We should allow incoming requests with existing SQL pods to connect to those pods. This commit addresses the issue by serving a stale cache whenever the watcher fails and not invalidating the cache.

Release note: None

Epic: CC-25053

108626: importer: only check import *atomicity* in TestImportWorkerFailure r=dt,yuzefovich,cucaroach a=michae2

Five years ago, in #26881, we changed import to retry on worker failures, which made imports much more resilient to transient failures like nodes going down. As part of this work we created `TestImportWorkerFailure`, which shuts down one node during an import and checks that the import succeeded. Unfortunately, this test was checked in skipped, because though imports were much more resilient to node failures, they were not completely resilient in every possible scenario, making the test flaky.

Two months ago, in #105712, we unskipped this test and discovered that in some cases the import statement succeeded but only imported a partial dataset. This non-atomicity seems like a bigger issue than whether the import is able to succeed in every possible transient failure scenario, and is tracked separately in #108547.

This PR changes `TestImportWorkerFailure` to remove successful import as a necessary condition for test success. Instead, the test now only checks whether the import was atomic; that is, whether a successful import imported all data or a failed import imported none. This is more in line with what we can guarantee about imports today.

Fixes: #102839

Release note: None

Co-authored-by: j82w <[email protected]>
Co-authored-by: Jay <[email protected]>
Co-authored-by: Michael Erickson <[email protected]>
michae2
added a commit
to michae2/cockroach
that referenced
this pull request
Aug 16, 2023
michae2
added a commit
that referenced
this pull request
Aug 16, 2023
yuzefovich
pushed a commit
to yuzefovich/cockroach
that referenced
this pull request
Nov 8, 2023
yuzefovich
pushed a commit
to yuzefovich/cockroach
that referenced
this pull request
Nov 8, 2023