import: rolling back IMPORT INTO is slow #62799
Hello, I am Blathers. I am here to help you get the issue triaged. I have CC'd a few people who may be able to assist you.
If we have not gotten back to your issue within a few business days, you can try the following:
🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is otan.
Interestingly, none of those look like they're actually busy doing the revert, so I wonder where it is slowing down. FWIW, I implemented the mentioned optimization for empty tables in #52754. A more generalized optimization is to use time-bound scans to find the keys added, which @adityamaru implemented in #58039. Just to confirm: I think you mentioned previously that you're running these tests on the 21.1 branch, correct? Assuming so, you should be seeing Aditya's work. That said, those profiles don't actually show the iteration, so maybe it is something else. I'll probably need to set up a reproduction to dig more, but even if we reproduce it, we likely will not be able to do anything for 21.1 this close to the release. We could potentially see if there are any easy wins for the master branch / 21.2.
Oh, huh, I just looked and I'm not sure that we actually enabled the optimization added in #58039 in non-test code: we typically add optimization paths like this initially disabled, then do production tests with the path enabled and disabled before flipping the default, and it looks like we never flipped the default here. Maybe we can sneak that in. In the meantime, if you are building from source, you could try adding […]
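For illustration, here is a minimal Go sketch of the time-bound-scan idea referenced above: rather than scanning every key in the table span, the revert only visits versions whose MVCC timestamps fall inside the import's write window. The `kv` type and `timeBoundScan` function are hypothetical stand-ins, not CockroachDB's actual iterator API.

```go
// A minimal sketch (not CockroachDB's actual API) of the idea behind
// time-bound scans for revert: instead of scanning every key in the
// table span, only visit versions whose MVCC timestamps fall inside
// the import's write window, and delete those versions.
package main

import "fmt"

// kv is a hypothetical MVCC version: a key plus the wall time at
// which the value was written.
type kv struct {
	key string
	ts  int64 // wall-clock timestamp of this version
}

// timeBoundScan returns only the versions written in (start, end].
func timeBoundScan(versions []kv, start, end int64) []kv {
	var hits []kv
	for _, v := range versions {
		if v.ts > start && v.ts <= end {
			hits = append(hits, v)
		}
	}
	return hits
}

func main() {
	// Versions in the table span; ts 100 marks the start of the import.
	versions := []kv{
		{"a", 50},  // pre-existing row: must survive the revert
		{"b", 120}, // written by the import: must be deleted
		{"c", 130}, // written by the import: must be deleted
	}
	for _, v := range timeBoundScan(versions, 100, 200) {
		fmt.Printf("revert would delete %s@%d\n", v.key, v.ts)
	}
}
```

What makes this cheap on a large store is that a real time-bound iterator can skip entire sstables whose recorded min/max timestamps fall wholly outside the window, rather than filtering version by version as this toy loop does.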
I now have a revert that seems not only slow but is not completing at all (I'm not on v21.1.1). It may or may not be the same issue, but per the forum, @ajwerner suggested discussing it here. Here are the goroutine stacks that contain the word "revert", after letting this revert run over the weekend without any node restarts or other interference. The only log lines I see that seem related to the revert are these; it seems to log them once every 6.5 hours.
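For anyone wanting to capture stacks like these: CockroachDB serves Go's standard pprof endpoints on its HTTP port, so a full goroutine dump can be pulled and filtered for "revert". A small sketch follows; the URL, port, and lack of TLS are assumptions about the deployment.

```go
// Fetch the full goroutine dump from a node's pprof endpoint and
// print only the goroutine blocks that mention "revert".
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

func main() {
	// debug=2 returns full, unaggregated stacks for every goroutine.
	// Host, port, and plain HTTP are placeholders for your deployment.
	resp, err := http.Get("http://localhost:8080/debug/pprof/goroutine?debug=2")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Goroutine blocks are separated by blank lines; buffer one block
	// at a time and print it if any of its lines mention "revert".
	var block []string
	flush := func() {
		for _, l := range block {
			if strings.Contains(strings.ToLower(l), "revert") {
				fmt.Println(strings.Join(block, "\n") + "\n")
				break
			}
		}
		block = block[:0]
	}
	sc := bufio.NewScanner(resp.Body)
	sc.Buffer(make([]byte, 1<<20), 1<<20) // dumps can exceed the default buffer
	for sc.Scan() {
		line := sc.Text()
		if line == "" {
			flush()
			continue
		}
		block = append(block, line)
	}
	flush()
}
```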
A few more notes: these hosts are under light load, with low CPU and disk utilization. Grepping for errors in the logs, the only things I see of interest are some failures in the consistency checker:
And this error repeatedly:
And spurts of this from time to time:
I was finally able to get this particular revert to finish by setting […]. Based on looking at the code, it seems like revert failures, unlike other kinds of job failures, never get logged anywhere and don't end up being saved to the jobs table, because the error there is for the failure of the original import job. It would be great if job errors were logged somewhere, since I can't say why these reverts were failing.
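For what the jobs table does record, saved job errors can be inspected through the crdb_internal.jobs virtual table. A hedged sketch, assuming a locally running insecure node; the connection string is a placeholder:

```go
// Query crdb_internal.jobs for import jobs that recorded an error.
// Note the caveat above: revert failures themselves are not saved
// here, only the original job's error.
package main

import (
	"database/sql"
	"fmt"

	_ "github.com/lib/pq" // Postgres-wire driver; CockroachDB speaks pgwire
)

func main() {
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		panic(err)
	}
	defer db.Close()

	rows, err := db.Query(`
		SELECT job_id, job_type, status, error
		FROM crdb_internal.jobs
		WHERE job_type = 'IMPORT' AND error IS NOT NULL AND error != ''`)
	if err != nil {
		panic(err)
	}
	defer rows.Close()
	for rows.Next() {
		var id int64
		var typ, status, jobErr string
		if err := rows.Scan(&id, &typ, &status, &jobErr); err != nil {
			panic(err)
		}
		fmt.Printf("job %d (%s, %s): %s\n", id, typ, status, jobErr)
	}
}
```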
We have marked this issue as stale because it has been inactive for […]
Is your feature request related to a problem? Please describe.
I have observed that once there is a non-trivial amount of data in a cluster, reverting an import takes a VERY long time, much longer than the import itself took. For example, on nodes that had a few hundred GB of data, an import that took ~1 hour took close to 24 hours to revert when it failed. (I have observed this with a variety of schemas, but in every case importing CSV data into a single table.)
I'm assuming not much effort has been put into optimizing revert, understandably. But I could see this becoming a huge issue if a customer takes downtime on a service or feature in order to do a short import (since it takes the table offline) and then finds the table stuck reverting for hours, with nothing they can do to remediate.
It surprises me too, since I would have expected that in the worst case, CRDB would simply have to iterate over and rewrite all sstables that contain data from the target table, erasing rows with newer timestamps. That is not a quick operation, but it should not take hours on a few hundred GB of data.
In #51512 @dt suggested an optimization for importing into empty tables, but I'm hoping a more generalized improvement can be made.
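For context, here is a toy Go sketch of the empty-table fast path from #51512, contrasted with the general per-key revert. The `store` type and its methods are simplified stand-ins, not CockroachDB's code:

```go
// If the target table had no data before the import, the revert can
// clear the whole table span in one bulk deletion instead of finding
// and deleting individual imported keys.
package main

import "fmt"

// store is a hypothetical KV store over a table's key span.
type store struct {
	keys map[string]struct{}
}

// clearRange drops every key at once, standing in for a
// ClearRange-style bulk deletion over the table span.
func (s *store) clearRange() { s.keys = map[string]struct{}{} }

// revertImport chooses between the two paths.
func revertImport(s *store, wasEmpty bool, importedKeys []string) {
	if wasEmpty {
		// Fast path: nothing pre-existing to preserve.
		s.clearRange()
		return
	}
	// Slow path: delete only the imported keys, leaving
	// pre-existing data intact.
	for _, k := range importedKeys {
		delete(s.keys, k)
	}
}

func main() {
	s := &store{keys: map[string]struct{}{"b": {}, "c": {}}}
	revertImport(s, true, []string{"b", "c"})
	fmt.Println("keys left after fast-path revert:", len(s.keys))
}
```

The generalized improvement the issue asks for would make the slow path cheaper too, e.g. via the time-bound scans discussed in the comments above.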
FWIW, here are some screenshots of what the CPU is doing on a node during a revert: [screenshots omitted]
Jira issue: CRDB-2665