
backupccl: doesn't respect job cancel? CPU 100% during backup #34650

Closed
untoldone opened this issue Feb 6, 2019 · 11 comments
Labels
A-disaster-recovery C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-2-temp-unavailability Temp crashes or other availability problems. Can be worked around or resolved by restarting. S-3-ux-surprise Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption.

Comments

untoldone commented Feb 6, 2019

Ran an enterprise backup via BACKUP DATABASE bloomtext_prod TO '...' AS OF SYSTEM TIME '-10s' WITH revision_history, where ... was a GCP-storage-backed location. CPU spiked to 100% on 2 of the 3 machines in the cluster. I canceled the job, and it is now listed as canceled in SHOW JOBS, but CPU remained at 100% in the cluster. Seeing a lot of log entries that look like:

I190206 06:23:58.666311 295359 ccl/storageccl/export.go:124  [n1,s1,r1867/1:/Table/1362/{4/NULL…-5/NULL…}] export [/Table/1362/4/NULL/"<redacted>",/Table/1362/5)
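
For reference, a minimal sketch of the statements involved, with a hypothetical gs:// path standing in for the redacted one:

```sql
-- Start the backup (hypothetical bucket path; the original was redacted).
BACKUP DATABASE bloomtext_prod TO 'gs://my-bucket/backups/prod'
    AS OF SYSTEM TIME '-10s' WITH revision_history;

-- From another session: look up the backup's job ID, then cancel it.
SHOW JOBS;
CANCEL JOB 123456;  -- placeholder job ID taken from the SHOW JOBS output
```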

Please describe the issue you observed, and any steps we can take to reproduce it:

Not sure how we specifically got this to happen in our cluster :-/

Expected behavior

On job cancelation, the backup stops. It may in fact have stopped, but the log entries make it look like it didn't. Additionally, I suspect a backup shouldn't have driven the CPU to 100%?


Environment:
CockroachDB 2.1.3, Debian 9
3 machine cluster in GCP, 300GB SSD persistent, n1-standard-2 machines

Additional context
The 100% CPU caused an outage of our production service; CockroachDB was very slow on queries that otherwise return much faster.

untoldone (Author) commented Feb 6, 2019

As an aside, this is still ongoing: even after restarting machines in the cluster, the CPUs are still pegged and log entries like this keep coming. The log entry is the exact same text over and over.

@knz knz added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-3-ux-surprise Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption. S-2-temp-unavailability Temp crashes or other availability problems. Can be worked around or resolved by restarting. A-disaster-recovery labels Feb 6, 2019
knz (Contributor) commented Feb 6, 2019

@dt @mjibson @vivekmenezes can you triage this? Thanks

@knz knz changed the title Enterprise backup doesn't respect job cancel? CPU 100% during backup backupccl: doesn't respect job cancel? CPU 100% during backup Feb 6, 2019
untoldone (Author) commented:

As a heads up, about 6 hours ago the CPU dropped back down and the log entries stopped appearing, so this is no longer a time-sensitive issue for us. For more context: the CPU spike lasted about 5 hours. The DB is 4-5 GB today, and the hosts run in the same GCP region as the storage buckets being used.

High CPU during the backup is what led me to cancel the job initially. When I canceled, the cluster had been at 100% CPU for about 20 minutes and the job was self-reporting ~85% completion. I thought it was noteworthy that, despite doing the first 85% in 20 minutes, the CPUs stayed at 100% for another 4 hours after the job was canceled.
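
For reference, a minimal sketch of how that self-reported progress can be queried; fraction_completed is a standard column in the SHOW JOBS output:

```sql
-- Inspect self-reported backup progress; fraction_completed is a value in [0, 1].
SELECT job_id, status, fraction_completed
FROM [SHOW JOBS]
WHERE job_type = 'BACKUP';
```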

danhhz (Contributor) commented Feb 6, 2019

It doesn't look like it from the command you posted above, but any chance this was an incremental backup (...INCREMENTAL FROM...)? We fixed a bug in 2.1.4 that would have looked exactly like this, but it only affects incremental backups.
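
For anyone following along, an incremental backup is one written with the INCREMENTAL FROM clause, naming the prior backup location(s); a sketch with hypothetical paths:

```sql
-- Hypothetical paths; an incremental backup lists the earlier backup(s) it builds on.
BACKUP DATABASE bloomtext_prod TO 'gs://my-bucket/backups/incr'
    INCREMENTAL FROM 'gs://my-bucket/backups/full'
    AS OF SYSTEM TIME '-10s' WITH revision_history;
```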

maddyblue (Contributor) commented:

Also, did you reboot all of the nodes in the cluster, or only the ones at 100% CPU? It's possible the job coordinator was on a node without 100% CPU but was sending requests to the other nodes. That would explain why rebooting didn't fix the problem.

untoldone (Author) commented Feb 6, 2019

@mjibson Actually, I just checked: I thought I had restarted all 3, but I only restarted the 2 machines at 100%. @danhhz The command was copy-pasted from the job list (with the path removed); I also just double-checked to be sure.

dt (Member) commented Feb 6, 2019

Can you look in the logs from while it was stuck to see if you can find messages that look like export [/Table/1362/4/NULL/"<redacted>",/Table/1362/5) appearing repeatedly with the same key range (i.e., the same request being retried over and over)?

untoldone (Author) commented:

Yes, it was the exact same key, at least within a few-minute window. And it was after I had canceled the backup job that I was seeing the same key repeated.

tim-o (Contributor) commented Feb 7, 2019

@untoldone - are you still having a problem? Did restarting the last node yesterday fix it?

danhhz (Contributor) commented Feb 7, 2019

Hey @untoldone. We kept looking into this and realized the bug I mentioned also affects full backups (due to some quirks lining up just wrong). We hadn't previously known this, but the fix applies equally to full backups. It was merged in #32909 and backported to v2.1.4 in #33141. Your confirmation of the same range being repeatedly logged adds extra confidence to that diagnosis. So, I can happily report that upgrading to 2.1.5 (or at least 2.1.4) means you shouldn't see this again.

We have mechanisms in place that tear down all the distributed work started on behalf of a job when it is canceled, but they appear to have failed in this case. I've filed #34714 for a follow-up investigation into that.
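
As a quick sanity check after upgrading, the running server version can be confirmed with a trivial query:

```sql
-- Confirm the node is running the patched version (2.1.4 or later).
SELECT version();
```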

untoldone (Author) commented:

@danhhz Awesome. Sounds to me like this bug has been taken care of, then, unless you feel otherwise. Thanks for your help here!
