
import: rolling back IMPORT INTO is slow #62799

Closed
dankinder opened this issue Mar 30, 2021 · 7 comments
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) no-issue-activity O-community Originated from the community X-blathers-triaged blathers was able to find an owner X-stale

Comments

@dankinder
Copy link

dankinder commented Mar 30, 2021

Is your feature request related to a problem? Please describe.
I have observed that once there is a non-trivial amount of data in a cluster, reverting an import takes a VERY long time, much longer than the import itself would have taken. For example, on nodes that had a few hundred GB of data, an import that took ~1 hour took close to 24 hours to revert when it failed. (I have observed this with a variety of schemas, but in every case I was importing CSV data into a single table.)
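
For illustration, an IMPORT INTO of roughly this shape (the column list and the file URL here are placeholders, not the actual schema or source):

-- Placeholder columns and file path, shown only to illustrate the kind of statement involved.
IMPORT INTO links_importing (url, fetched_at)
CSV DATA ('nodelocal://1/links.csv');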

I'm assuming not much effort has been put into optimizing revert, understandably. But I could see this being a huge issue if a customer takes downtime for a service or feature in order to do a short import (since it takes the table offline) and then finds that it's stuck reverting for hours, with nothing they can do to remediate.

It surprises me too, since I would have expected that in the worst case CRDB would simply have to iterate over and rewrite all sstables that contain data from the target table, erasing rows with newer timestamps. That is not a quick operation, but it should not take hours on a few hundred GB of data.

In #51512 @dt suggested an optimization for importing into empty tables but I'm hoping a more generalized improvement could be made.

FWIW here are some screenshots of what the CPU is doing on a node during a revert:
[Screenshots: three CPU profile views captured on a node during the revert, Mar 30, 2021 at 11:23, 11:24, and 11:25 AM]

Jira issue: CRDB-2665

@dankinder dankinder added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Mar 30, 2021
@blathers-crl
Copy link

blathers-crl bot commented Mar 30, 2021

Hello, I am Blathers. I am here to help you get the issue triaged.

I have CC'd a few people who may be able to assist you:

If we have not gotten back to your issue within a few business days, you can try the following:

  • Join our community slack channel and ask on #cockroachdb.
  • Try to find someone from here if you know they worked closely on the area and CC them.

🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is otan.

@blathers-crl blathers-crl bot added O-community Originated from the community X-blathers-triaged blathers was able to find an owner labels Mar 30, 2021
@dt dt changed the title Speed up revert of import import: rolling back IMPORT INTO is slow Mar 30, 2021
@dt
Copy link
Member

dt commented Mar 30, 2021

Interestingly, none of those look like they're actually busy doing the revert. I wonder where it is slowing down.

FWIW, I implemented the mentioned optimization for empty tables in #52754. A more generalized optimization is to use time-bound scans to find the keys added, which @adityamaru implemented in #58039.

Just to confirm (I think you mentioned this previously): you're doing these tests on the 21.1 branch, correct? Assuming so, you should be seeing Aditya's work. That said, those profiles don't actually show the iteration, so maybe it is something else. I'll probably need to set up a reproduction to dig in more. Even if we reproduce it, we likely won't be able to do anything for 21.1 this close to the release, but we could potentially look for easy wins on the master branch / 21.2.
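
For reference, a generic way to confirm which build a cluster is running, from a SQL session:

-- Returns the full build string, including the CockroachDB version.
SELECT version();

-- Shows the active cluster version, which can lag the binary version until the upgrade is finalized.
SHOW CLUSTER SETTING version;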

@dt
Copy link
Member

dt commented Mar 30, 2021

Oh, huh, I just looked and I'm not sure that we actually enabled the optimization added in #58039, in non-test code: we typically add optimization paths like this initially disabled, then do production tests with it enabled and disabled before flipping the default, and it looks like we never flipped the default here. Maybe we can sneak that in.

In the meantime, if you are building from source, you could try adding EnableTimeBoundIteratorOptimization: true here

@dankinder
Copy link
Author

I now have a revert that is not only slow but not completing at all (I'm now on v21.1.1). It may or may not be the same issue, but per the forum thread @ajwerner suggested discussing it here.

Here are the goroutine stacks that contain the word revert, after letting this revert run over the weekend without any node restarts or other interference.
thestacks.txt

The only log lines I see that seem related to the revert are the ones below; it logs this about once every 6.5 hours.

I210531 12:18:57.567474 91554 sql/revert.go:89 ⋮ [n24,job=‹659740178040946689›] 35188  reverting table ‹links_importing› (54) to time 1621407124.604026590,2147483647
I210531 18:54:47.929845 91554 sql/revert.go:89 ⋮ [n24,job=‹659740178040946689›] 38413  reverting table ‹links_importing› (54) to time 1621407124.604026590,2147483647
I210601 01:29:42.477554 91554 sql/revert.go:89 ⋮ [n24,job=‹659740178040946689›] 41625  reverting table ‹links_importing› (54) to time 1621407124.604026590,2147483647
I210601 08:04:08.426072 91554 sql/revert.go:89 ⋮ [n24,job=‹659740178040946689›] 44835  reverting table ‹links_importing› (54) to time 1621407124.604026590,2147483647
I210601 14:40:46.605632 91554 sql/revert.go:89 ⋮ [n24,job=‹659740178040946689›] 48082  reverting table ‹links_importing› (54) to time 1621407124.604026590,2147483647
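
One way to poll a stuck revert like this from SQL; the job ID below is the one from the log lines above:

-- Check the reverting job's status and progress from the jobs list.
SELECT job_id, job_type, status, fraction_completed, error
  FROM [SHOW JOBS]
 WHERE job_id = 659740178040946689;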

@dankinder
Copy link
Author

dankinder commented Jun 1, 2021

A few more notes: these hosts are under light load, with low CPU and disk utilization. Grepping the logs for errors, the only things I see of interest are some failures from the consistency checker:

E210531 19:22:20.128792 69257521 kv/kvserver/replica_proposal.go:247 ⋮ [n16,s96,r519138/6:‹/Table/73/1/"la{dying…-karti…}›] 40966  context deadline exceeded
E210531 19:22:20.129030 69257969 kv/kvserver/consistency_queue.go:191 ⋮ [n16,consistencyChecker,s96,r519138/6:‹/Table/73/1/"la{dying…-karti…}›] 40967  computing own checksum: ‹rpc error: code = DeadlineExceeded desc = context deadline exceeded›
E210531 19:22:20.129211 69257969 kv/kvserver/queue.go:1093 ⋮ [n16,consistencyChecker,s96,r519138/6:‹/Table/73/1/"la{dying…-karti…}›] 40968  ‹operation "consistencyChecker queue process replica 519138" timed out after 1m50s›: computing own checksum: ‹rpc error: code = DeadlineExceeded desc = context deadline exceeded›

And this error repeatedly:

I210602 17:08:14.797872 13976 server/auto_upgrade.go:52 ⋮ [n24] 62109  failed attempt to upgrade cluster version, error: ‹show-version›: ‹operation "show cluster setting version" timed out after 2m0s›: value differs between local setting (‹[18 8 8 20 16 2 24 0 32 22]›) and KV (‹[18 8 8 20 16 2 24 0 32 0]›); try again later (<nil> after 1m59.481074415s)

And spurts of this from time to time:

E210531 03:33:22.794900 54630402 kv/kvserver/queue.go:1093 ⋮ [n6,merge,s36,r387964/21:‹/Table/54/2/"korea.ac.kr"/"li…›] 31647  ‹none of the remaining voters [n4,s20] are legal additions to (n6,s36):21,(n4,s24):6,(n10,s56):20›

@dankinder
Copy link
Author

I was finally able to get this particular revert to finish by setting kv.bulk_io_write.revert_range_time_bound_iterator.enabled to false. After 5 days it was still going (presumably just restarting over and over, based on the logs I pasted); once I set that to false, it finished within the next 12 hours.
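
For reference, the workaround in SQL form (the setting name is the one mentioned above):

-- Fall back to full scans for RevertRange instead of time-bound iteration.
SET CLUSTER SETTING kv.bulk_io_write.revert_range_time_bound_iterator.enabled = false;

-- Confirm the new value.
SHOW CLUSTER SETTING kv.bulk_io_write.revert_range_time_bound_iterator.enabled;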

Based on reading the code, it seems that revert failures, unlike other kinds of job failures, never get logged anywhere and never end up in the jobs table, because the error recorded there is the one from the failure of the original import job. It would be great if these errors were logged somewhere, since I can't say why the reverts were failing.

@github-actions
Copy link

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!
