
backupccl: doesn't respect job cancel? CPU 100% during backup #34650

Closed
untoldone opened this issue Feb 6, 2019 · 11 comments
Labels
A-disaster-recovery C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-2-temp-unavailability Temp crashes or other availability problems. Can be worked around or resolved by restarting. S-3-ux-surprise Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption.

Comments

untoldone commented Feb 6, 2019

Ran an enterprise backup via BACKUP DATABASE bloomtext_prod TO '...' AS OF SYSTEM TIME '-10s' WITH revision_history, where ... was a GCP-storage-backed location. CPU spiked to 100% on 2 of the 3 machines in the cluster. I canceled the job, and it is now listed as canceled in SHOW JOBS, but CPU remained at 100% in the cluster. Seeing a lot of log entries that look like:

I190206 06:23:58.666311 295359 ccl/storageccl/export.go:124  [n1,s1,r1867/1:/Table/1362/{4/NULL…-5/NULL…}] export [/Table/1362/4/NULL/"<redacted>",/Table/1362/5)
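
For reference, a minimal sketch of the statements involved, with a hypothetical gs:// path standing in for the redacted one:

```sql
-- Start the backup (hypothetical bucket path; the original was redacted).
BACKUP DATABASE bloomtext_prod TO 'gs://my-bucket/backups/prod'
    AS OF SYSTEM TIME '-10s' WITH revision_history;

-- From another session: look up the backup's job ID, then cancel it.
SHOW JOBS;
CANCEL JOB 123456;  -- placeholder job ID taken from the SHOW JOBS output
```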

Please describe the issue you observed, and any steps we can take to reproduce it:

Not sure how we specifically got this to happen in our cluster :-/

Expected behavior

On job cancelation, the backup stops. It may in fact have stopped, but the log entries make it look like it didn't. Additionally, I suspect a backup shouldn't have driven the CPU to 100%?


Environment:
CockroachDB 2.1.3, Debian 9
3 machine cluster in GCP, 300GB SSD persistent, n1-standard-2 machines

Additional context
The 100% CPU caused an outage of our production service; CockroachDB was very slow on queries that otherwise return much faster.

untoldone (Author) commented Feb 6, 2019

As an aside, this is still ongoing: even after restarting machines in the cluster, the CPUs are still pegged and log entries like this keep coming. The log entry is the exact same text over and over.

@knz knz added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-3-ux-surprise Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption. S-2-temp-unavailability Temp crashes or other availability problems. Can be worked around or resolved by restarting. A-disaster-recovery labels Feb 6, 2019
knz (Contributor) commented Feb 6, 2019

@dt @mjibson @vivekmenezes can you triage this? Thanks

@knz knz changed the title Enterprise backup doesn't respect job cancel? CPU 100% during backup backupccl: doesn't respect job cancel? CPU 100% during backup Feb 6, 2019
untoldone (Author) commented:

As a heads up, about 6 hours ago the CPU dropped back down and the log entries stopped appearing, so this is no longer a time-sensitive issue for us. For more context: the CPU spike lasted about 5 hours. The DB is 4-5 GB today, and the hosts run in the same GCP region as the storage buckets being used.

High CPU during the backup is what led me to cancel the job initially. When I canceled, the cluster had been at 100% CPU for about 20 minutes and the job was self-reporting ~85% completion. I thought it was noteworthy that, despite doing the first 85% in 20 minutes, the CPUs stayed at 100% for another 4 hours after the job was canceled.
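
For reference, a minimal sketch of how that self-reported progress can be queried; fraction_completed is a standard column in the SHOW JOBS output:

```sql
-- Inspect self-reported backup progress; fraction_completed is a value in [0, 1].
SELECT job_id, status, fraction_completed
FROM [SHOW JOBS]
WHERE job_type = 'BACKUP';
```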

danhhz (Contributor) commented Feb 6, 2019

It doesn't look like it from the command you posted above, but any chance this was an incremental backup (...INCREMENTAL FROM...)? We fixed a bug in 2.1.4 that would have looked exactly like this, but it only affects incremental backups.
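
For anyone following along, an incremental backup is one written with the INCREMENTAL FROM clause, naming the prior backup location(s); a sketch with hypothetical paths:

```sql
-- Hypothetical paths; an incremental backup lists the earlier backup(s) it builds on.
BACKUP DATABASE bloomtext_prod TO 'gs://my-bucket/backups/incr'
    INCREMENTAL FROM 'gs://my-bucket/backups/full'
    AS OF SYSTEM TIME '-10s' WITH revision_history;
```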

maddyblue (Contributor) commented:

Also, did you reboot all of the nodes in the cluster, or only the ones at 100% CPU? It's possible the job coordinator was on a node without 100% CPU but was sending requests to the other nodes. That would explain why rebooting didn't fix the problem.

untoldone (Author) commented Feb 6, 2019

@mjibson Actually, I just checked: I thought I had restarted all 3, but I only restarted the 2 machines at 100%. @danhhz The command was copy-pasted from the job list (with the path removed); I also just double-checked to be sure.

dt (Member) commented Feb 6, 2019

Can you look in the logs from while it was stuck to see if you can find messages that look like export [/Table/1362/4/NULL/"<redacted>",/Table/1362/5) appearing repeatedly with the same key range (i.e., the same request being retried over and over)?

untoldone (Author) commented:

Yes, it was the exact same key, at least within a few-minute window. And it was after I had canceled the backup job that I was seeing the same key repeated.

tim-o (Contributor) commented Feb 7, 2019

@untoldone - are you still having a problem? Did restarting the last node yesterday fix it?

danhhz (Contributor) commented Feb 7, 2019

Hey @untoldone. We kept looking into this and realized the bug I mentioned also affects full backups (due to some quirks lining up just wrong). We hadn't previously known this, but the fix applies equally to full backups. It was merged in #32909 and backported to v2.1.4 in #33141. Your confirmation of the same range being repeatedly logged adds extra confidence to that diagnosis. So, I can happily report that upgrading to 2.1.5 (or at least 2.1.4) means you shouldn't see this again.

We have mechanisms in place that tear down all the distributed work started on behalf of a job when it is canceled, but they appear to have failed in this case. I've filed #34714 for a follow-up investigation into that.
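
As a quick sanity check after upgrading, the running server version can be confirmed with a trivial query:

```sql
-- Confirm the node is running the patched version (2.1.4 or later).
SELECT version();
```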

untoldone (Author) commented:

@danhhz Awesome. Sounds to me like this bug has been taken care of, then, unless you feel otherwise. Thanks for your help here!
