
importccl: restart IMPORT on worker node failure #26881

Merged (1 commit) on Jun 26, 2018
Conversation

@maddyblue (Contributor) commented Jun 21, 2018

Attempt to detect a context canceled error in IMPORT caused by a node
going away during the dist SQL run. Send a special error back to the
job registry indicating that a restart should happen instead of a failure.

We are shipping this with a skipped test because it is flaky. We are
ok with that because this is still better than the previous behavior in
many cases, just not all. We will keep improving failure detection so
that we can correctly detect when IMPORT can be restarted due to a node
outage, which will let us unskip this test.

Fixes #25866
Fixes #25480

Release note (bug fix): IMPORT now detects node failure and restarts
instead of failing.

@maddyblue maddyblue requested review from dt and a team June 21, 2018 07:18

@maddyblue (Author): PTAL @dt

// TODO(mjibson): Although this test passes most of the time it still
// sometimes fails because not all kinds of failures caused by shutting a
// node down are detected and retried.
t.Skip()
Reviewer (Member):

nit: I generally like to pass a brief explanation (or filed issue #) to Skip.

@maddyblue (Author): fixed

@maddyblue (Author): bors r+

craig bot pushed a commit that referenced this pull request Jun 26, 2018
26811: kv: update the TCS's txn on requests way out r=andreimatei a=andreimatei

The TxnCoordSender maintains a copy of the transaction record, used for
things like heartbeating and creating new transactions after a
TransactionAbortedError. This copy is supposed to be kept in sync with
the client.Txn's copy. Before this patch, the syncing was done by
updating the TCS's txn when a response comes back from a request.
This patch moves to updating the TCS's txn on a request's way out, in
addition to continuing to update it when a request comes back.
Besides being the sane thing to do™, this ensures that, if the heartbeat
loop triggers before the response to the BeginTransaction batch comes
back, the transaction already has its key set. Without this patch, if
the heartbeat loop triggered before the BeginTxn response, it would
heartbeat key /Min, which is nonsensical (and created load on range 0
in TPCC load tests).

Release note: None

26881: importccl: restart IMPORT on worker node failure r=mjibson a=mjibson


26968: settings: bump minimum supported version to v2.0 r=nvanbenschoten a=nvanbenschoten

We're currently shipping v2.1 alphas, so enforce a minimum binary
version of v2.0. This ensures that no one can upgrade directly from
v1.1 to v2.1. Instead, they need to make a pit stop in v2.0.

Release note: None

26984: storageccl: retry SST chunks with new splits on err r=dt a=dt

Simpler alternative to #26930.

Closes #26930.

Previously, an ImportRequest would fail to add SSTables that spanned the
boundaries of the target range(s). This reattempts the AddSSTable call
with re-chunked SSTables that avoid spanning the bounds returned in the
range mismatch error. It does this by iterating the SSTable to build and
add smaller SSTables for either side of the split.

This error currently happens rarely in practice -- we usually explicitly
split ranges immediately before sending an Import with matching bounds
to them. Usually the empty, just-split range has no reason to split
again, so the Import usually succeeds.

However in some cases, like resuming a prior RESTORE, we may be
re-Importing into ranges that are *not* empty and could have split at
points other than those picked by the RESTORE statement.

Fixes #17819.

Subsumes #24299.
Closes #24299.

Release note: none.

Co-authored-by: Andrei Matei <[email protected]>
Co-authored-by: Matt Jibson <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: David Taylor <[email protected]>
@craig (bot) commented Jun 26, 2018: Build succeeded

@craig craig bot merged commit 935b986 into cockroachdb:master Jun 26, 2018
@maddyblue maddyblue deleted the import-retry branch June 26, 2018 18:22
michae2 added a commit to michae2/cockroach that referenced this pull request Aug 11, 2023
Five years ago, in cockroachdb#26881, we changed import to retry on worker
failures, which made imports much more resilient to transient failures
like nodes going down. As part of this work we created
`TestImportWorkerFailure` which shuts down one node during an import,
and checks that the import succeeded. Unfortunately, this test was
checked-in skipped, because though imports were much more resilient to
node failures, they were not completely resilient in every possible
scenario, making the test flakey.

Two months ago, in cockroachdb#105712, we unskipped this test and discovered that
in some cases the import statement succeeded but only imported a partial
dataset. This non-atomicity seems like a bigger issue than whether the
import is able to succeed in every possible transient failure scenario.

This PR changes `TestImportWorkerFailure` to remove successful import as
a necessary condition for test success. Instead, the test now only
checks whether the import was atomic; that is, whether a successful
import imported all data or a failed import imported none. This is more
in line with what we can guarantee about imports today.

This PR also completely unskips `TestImportWorkerFailure` so that we can
test the atomicity of imports more thoroughly.

Fixes: cockroachdb#102839

Release note: None
michae2 added a commit to michae2/cockroach that referenced this pull request Aug 14, 2023

michae2 added a commit to michae2/cockroach that referenced this pull request Aug 14, 2023
craig bot pushed a commit that referenced this pull request Aug 14, 2023
108210: cli: add limited statement_statistics to debug zip r=j82w a=j82w

This adds statement_statistics to the debug zip. It is limited to the transaction fingerprint ids in the transaction_contention_events table, because the statement_statistics table maps the fingerprint to the query text. It also adds the top 100 statements by CPU usage.

closes: #108180

Release note (cli change): Added limited statement_statistics to the debug zip.

108382: ccl/sqlproxyccl: serve a dirty cache whenever the watcher fails r=JeffSwenson a=jaylim-crl

Previously, we would invalidate all tenant metadata entries whenever the
watcher failed. This can cause issues when the directory server fails
(e.g. the Kubernetes API server is down). It is possible that existing
SQL pods are still up, but we would be invalidating the entire directory
cache. We should allow incoming requests for existing SQL pods to
connect to those pods.

This commit addresses the issue by serving a stale cache whenever the
watcher fails, rather than invalidating the cache.

Release note: None

Epic: CC-25053

108626: importer: only check import *atomicity* in TestImportWorkerFailure r=dt,yuzefovich,cucaroach a=michae2

Five years ago, in #26881, we changed import to retry on worker failures, which made imports much more resilient to transient failures like nodes going down. As part of this work we created `TestImportWorkerFailure`, which shuts down one node during an import and checks that the import succeeded. Unfortunately, this test was checked in skipped, because though imports were much more resilient to node failures, they were not completely resilient in every possible scenario, making the test flaky.

Two months ago, in #105712, we unskipped this test and discovered that in some cases the import statement succeeded but only imported a partial dataset. This non-atomicity seems like a bigger issue than whether the import is able to succeed in every possible transient failure scenario, and is tracked separately in #108547.

This PR changes `TestImportWorkerFailure` to remove successful import as a necessary condition for test success. Instead, the test now only checks whether the import was atomic; that is, whether a successful import imported all data or a failed import imported none. This is more in line with what we can guarantee about imports today.

Fixes: #102839

Release note: None

Co-authored-by: j82w <[email protected]>
Co-authored-by: Jay <[email protected]>
Co-authored-by: Michael Erickson <[email protected]>
michae2 added a commit to michae2/cockroach that referenced this pull request Aug 16, 2023

michae2 added a commit that referenced this pull request Aug 16, 2023

yuzefovich pushed a commit to yuzefovich/cockroach that referenced this pull request Nov 8, 2023

yuzefovich pushed a commit to yuzefovich/cockroach that referenced this pull request Nov 8, 2023
Successfully merging this pull request may close these issues:

jobs: retry transient errors
import: allow worker node failure

3 participants