backupccl: doesn't respect job cancel? CPU 100% during backup #34650
As an aside, this is still ongoing: even after restarting machines in the cluster, the CPUs are still pegged and log entries like this keep appearing. The log entry is the same exact text over and over.
@dt @mjibson @vivekmenezes can you triage this? Thanks
As a heads up, about 6 hours ago the CPU dropped back down and the log entries stopped being emitted, so this is no longer a timely issue for us. For more context: CPU had spiked for about 5 hours. The DB is 4-5GB today, and the hosts are running in the same GCP region as the storage buckets being used. High CPU during the backup is what led me to cancel the job initially. When I canceled, it had been at 100% CPU for about 20 minutes and the job was self-reporting ~85% completion. I thought it was noteworthy that despite doing the first 85% in 20 minutes, the CPUs were still at 100% for another 4 hours after the job was canceled.
It doesn't look like it from the command you posted above, but any chance this was an incremental backup ( |
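For reference, an incremental backup in this era of CockroachDB is distinguished by an INCREMENTAL FROM clause naming the prior full backup's location. A hedged sketch of what such a command would look like (the gs:// paths here are hypothetical placeholders, not from this report):

```sql
-- Hypothetical sketch of CockroachDB (v2.1) incremental backup syntax.
-- Both storage paths below are illustrative placeholders.
BACKUP DATABASE bloomtext_prod TO 'gs://example-bucket/incr'
  AS OF SYSTEM TIME '-10s'
  INCREMENTAL FROM 'gs://example-bucket/full'
  WITH revision_history;
```

The command quoted later in this issue has no INCREMENTAL FROM clause, which is why it reads as a full backup.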
Also, did you reboot all of the nodes in the cluster, or only the ones at 100% CPU? It is possible the job coordinator was on a node without 100% CPU but was sending requests to the other nodes; that would explain why rebooting didn't fix the problem.
Can you look in the logs from while it was stuck to see if you can find messages that look like
Yes, it was the exact same key, at least within a few-minute window. I was still seeing the same key after I had canceled the backup job.
@untoldone - are you still having a problem? Did restarting the last node yesterday fix it? |
Hey @untoldone. We kept looking into this and realized the bug I mentioned also affects full backups (due to some quirks lining up just wrong). We hadn't previously known this, but the fix applies equally to full backups. It was merged in #32909 and backported to v2.1.4 in #33141. Your confirmation of the same range being repeatedly logged adds extra confidence to that diagnosis. So, I can happily report that upgrading to 2.1.5 (or at least 2.1.4) means you shouldn't see this again. We have mechanisms in place that tear down all the distributed work started on behalf of a job when it is canceled, but they appear to have failed in this case. I've filed #34714 to do a followup investigation into that.
@danhhz Awesome - sounds like this bug's been taken care of to me then unless you feel otherwise. Thanks for your help here! |
Please describe the issue you observed, and any steps we can take to reproduce it:
Ran an enterprise backup via
BACKUP DATABASE bloomtext_prod TO '...' AS OF SYSTEM TIME '-10s' WITH revision_history
where '...' was a GCP-storage-backed location. CPU spiked to 100% on 2 of 3 machines in the cluster. Canceled the job, and the job is now listed as canceled in SHOW JOBS, but CPU remained at 100% in the cluster. Seeing a lot of log entries that look like:
Not sure how we specifically got this to happen in our cluster :-/
Expected behavior
On job cancellation, the backup stops. It may in fact have stopped, but the log entries make it look like it didn't. Additionally, I suspect a backup shouldn't have driven the CPU to 100%.
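For completeness, the cancellation flow described above can be sketched as follows (the job ID shown is a made-up illustration; the real one comes from the SHOW JOBS output):

```sql
-- Find the running BACKUP job and its ID.
SHOW JOBS;
-- Cancel it; 27536791415282 is a hypothetical job_id from SHOW JOBS.
CANCEL JOB 27536791415282;
-- The job should now report a status of 'canceled'.
SHOW JOBS;
```

The observed bug is that the job's status flipped to canceled while the distributed export work kept running.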
Environment:
CockroachDB 2.1.3, Debian 9
3 machine cluster in GCP, 300GB SSD persistent, n1-standard-2 machines
Additional context
Service outage for our production service as a result of the 100% CPU; CockroachDB was very slow at queries that otherwise return much faster.