
import: rolling back IMPORT INTO is slow #62799

Closed
dankinder opened this issue Mar 30, 2021 · 7 comments
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) no-issue-activity O-community Originated from the community X-blathers-triaged blathers was able to find an owner X-stale

Comments

@dankinder
Copy link

dankinder commented Mar 30, 2021

Is your feature request related to a problem? Please describe.
I have observed that once there is a non-trivial amount of data in a cluster, reverting an import takes a VERY long time, much longer than the import itself would have taken. For example, on nodes that had a few hundred GB of data, an import that took ~1 hour took close to 24 hours to revert when it failed. (I have observed this with a variety of schemas, but in every case I was importing CSV data into a single table.)
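
For illustration, an IMPORT INTO of roughly this shape (the column list and the file URL here are placeholders, not the actual schema or source):

-- Placeholder columns and file path, shown only to illustrate the kind of statement involved.
IMPORT INTO links_importing (url, fetched_at)
CSV DATA ('nodelocal://1/links.csv');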

I'm assuming not much effort has been put into optimizing revert, understandably. But I could see this being a huge issue if a customer takes downtime for a service or feature in order to do a short import (since it takes the table offline) and then finds that it's stuck reverting for hours, with nothing they can do to remediate.

It surprises me too, since I would have expected that in the worst case CRDB would simply have to iterate over and rewrite all sstables that contain data from the target table, erasing rows with newer timestamps. That is not a quick operation, but it should not take hours on a few hundred GB of data.

In #51512 @dt suggested an optimization for importing into empty tables but I'm hoping a more generalized improvement could be made.

FWIW here are some screenshots of what the CPU is doing on a node during a revert:
[Screenshots: three CPU profile views captured on a node during the revert, Mar 30, 2021 at 11:23, 11:24, and 11:25 AM]

Jira issue: CRDB-2665

@dankinder dankinder added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Mar 30, 2021
@blathers-crl
Copy link

blathers-crl bot commented Mar 30, 2021

Hello, I am Blathers. I am here to help you get the issue triaged.

I have CC'd a few people who may be able to assist you:

If we have not gotten back to your issue within a few business days, you can try the following:

  • Join our community slack channel and ask on #cockroachdb.
  • Try to find someone from here if you know they worked closely on the area and CC them.

🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is otan.

@blathers-crl blathers-crl bot added O-community Originated from the community X-blathers-triaged blathers was able to find an owner labels Mar 30, 2021
@dt dt changed the title Speed up revert of import import: rolling back IMPORT INTO is slow Mar 30, 2021
@dt
Copy link
Member

dt commented Mar 30, 2021

Interestingly, none of those look like they're actually busy doing the revert. I wonder where it is slowing down.

FWIW, I implemented the mentioned optimization for empty tables in #52754. A more generalized optimization is to use time-bound scans to find the keys added, which @adityamaru implemented in #58039.

Just to confirm (I think you mentioned this previously): you're doing these tests on the 21.1 branch, correct? Assuming so, you should be seeing Aditya's work. That said, those profiles don't actually show the iteration, so maybe it is something else. I'll probably need to set up a reproduction to dig in more. Even if we reproduce it, we likely won't be able to do anything for 21.1 this close to the release, but we could potentially look for easy wins on the master branch / 21.2.
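
For reference, a generic way to confirm which build a cluster is running, from a SQL session:

-- Returns the full build string, including the CockroachDB version.
SELECT version();

-- Shows the active cluster version, which can lag the binary version until the upgrade is finalized.
SHOW CLUSTER SETTING version;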

@dt
Copy link
Member

dt commented Mar 30, 2021

Oh, huh, I just looked and I'm not sure that we actually enabled the optimization added in #58039, in non-test code: we typically add optimization paths like this initially disabled, then do production tests with it enabled and disabled before flipping the default, and it looks like we never flipped the default here. Maybe we can sneak that in.

In the meantime, if you are building from source, you could try adding EnableTimeBoundIteratorOptimization: true here

@dankinder
Copy link
Author

I now have a revert that is not only slow but not completing at all (I'm now on v21.1.1). It may or may not be the same issue, but per the forum thread @ajwerner suggested discussing it here.

Here are the goroutine stacks that contain the word revert, after letting this revert run over the weekend without any node restarts or other interference.
thestacks.txt

The only log lines I see that seem related to the revert are the ones below; it logs this about once every 6.5 hours.

I210531 12:18:57.567474 91554 sql/revert.go:89 ⋮ [n24,job=‹659740178040946689›] 35188  reverting table ‹links_importing› (54) to time 1621407124.604026590,2147483647
I210531 18:54:47.929845 91554 sql/revert.go:89 ⋮ [n24,job=‹659740178040946689›] 38413  reverting table ‹links_importing› (54) to time 1621407124.604026590,2147483647
I210601 01:29:42.477554 91554 sql/revert.go:89 ⋮ [n24,job=‹659740178040946689›] 41625  reverting table ‹links_importing› (54) to time 1621407124.604026590,2147483647
I210601 08:04:08.426072 91554 sql/revert.go:89 ⋮ [n24,job=‹659740178040946689›] 44835  reverting table ‹links_importing› (54) to time 1621407124.604026590,2147483647
I210601 14:40:46.605632 91554 sql/revert.go:89 ⋮ [n24,job=‹659740178040946689›] 48082  reverting table ‹links_importing› (54) to time 1621407124.604026590,2147483647
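
One way to poll a stuck revert like this from SQL; the job ID below is the one from the log lines above:

-- Check the reverting job's status and progress from the jobs list.
SELECT job_id, job_type, status, fraction_completed, error
  FROM [SHOW JOBS]
 WHERE job_id = 659740178040946689;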

@dankinder
Copy link
Author

dankinder commented Jun 1, 2021

A few more notes: these hosts are under light load, with low CPU and disk utilization. Grepping the logs for errors, the only things I see of interest are some failures from the consistency checker:

E210531 19:22:20.128792 69257521 kv/kvserver/replica_proposal.go:247 ⋮ [n16,s96,r519138/6:‹/Table/73/1/"la{dying…-karti…}›] 40966  context deadline exceeded
E210531 19:22:20.129030 69257969 kv/kvserver/consistency_queue.go:191 ⋮ [n16,consistencyChecker,s96,r519138/6:‹/Table/73/1/"la{dying…-karti…}›] 40967  computing own checksum: ‹rpc error: code = DeadlineExceeded desc = context deadline exceeded›
E210531 19:22:20.129211 69257969 kv/kvserver/queue.go:1093 ⋮ [n16,consistencyChecker,s96,r519138/6:‹/Table/73/1/"la{dying…-karti…}›] 40968  ‹operation "consistencyChecker queue process replica 519138" timed out after 1m50s›: computing own checksum: ‹rpc error: code = DeadlineExceeded desc = context deadline exceeded›

And this error repeatedly:

I210602 17:08:14.797872 13976 server/auto_upgrade.go:52 ⋮ [n24] 62109  failed attempt to upgrade cluster version, error: ‹show-version›: ‹operation "show cluster setting version" timed out after 2m0s›: value differs between local setting (‹[18 8 8 20 16 2 24 0 32 22]›) and KV (‹[18 8 8 20 16 2 24 0 32 0]›); try again later (<nil> after 1m59.481074415s)

And spurts of this from time to time:

E210531 03:33:22.794900 54630402 kv/kvserver/queue.go:1093 ⋮ [n6,merge,s36,r387964/21:‹/Table/54/2/"korea.ac.kr"/"li…›] 31647  ‹none of the remaining voters [n4,s20] are legal additions to (n6,s36):21,(n4,s24):6,(n10,s56):20›

@dankinder
Copy link
Author

I was finally able to get this particular revert to finish by setting kv.bulk_io_write.revert_range_time_bound_iterator.enabled to false. After 5 days it was still going (presumably just restarting over and over, based on the logs I pasted); once I set that to false, it finished within the next 12 hours.
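
For reference, the workaround in SQL form (the setting name is the one mentioned above):

-- Fall back to full scans for RevertRange instead of time-bound iteration.
SET CLUSTER SETTING kv.bulk_io_write.revert_range_time_bound_iterator.enabled = false;

-- Confirm the new value.
SHOW CLUSTER SETTING kv.bulk_io_write.revert_range_time_bound_iterator.enabled;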

Based on reading the code, it seems that revert failures, unlike other kinds of job failures, never get logged anywhere and never end up in the jobs table, because the error recorded there is the one from the failure of the original import job. It would be great if these errors were logged somewhere, since I can't say why the reverts were failing.

@github-actions
Copy link

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!
