Scheduled backup doesn't complete #63526

Closed
charsleysa opened this issue Apr 13, 2021 · 3 comments
Labels
  • C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
  • O-community: Originated from the community
  • X-blathers-triaged: blathers was able to find an owner

Comments

@charsleysa

Describe the problem
Hi there, I asked in Slack and was advised to log the issue here.

Scheduled backups aren't completing: they get stuck at ~96% with 1 second remaining. There are no long-running queries that could be blocking them, and I would in any case expect the backup to take priority over running queries.

To Reproduce

Unknown

Expected behavior
Backup to finish in reasonable time (usually 2 minutes).

Additional data / screenshots
Debug zip supplied to [email protected]

Environment:

  • CockroachDB version 20.2.5
  • Server OS: Official Docker Image
  • Client app: N/A

Additional context
I tried decommissioning and replacing the node that appeared to be stuck, but the scheduled backup job simply failed and restarted with the same issue on another node.

Unscheduled backups also don't work, so no backup seems possible, which is a high-risk issue.

@charsleysa charsleysa added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Apr 13, 2021
@blathers-crl

blathers-crl bot commented Apr 13, 2021

Hello, I am Blathers. I am here to help you get the issue triaged.

Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here.

I have CC'd a few people who may be able to assist you:

  • @cockroachdb/bulk-io (found keywords: backup)

If we have not gotten back to your issue within a few business days, you can try the following:

  • Join our community slack channel and ask on #cockroachdb.
  • Try to find someone from here if you know they worked closely on the area and CC them.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@blathers-crl blathers-crl bot added O-community Originated from the community X-blathers-triaged blathers was able to find an owner labels Apr 13, 2021
@dt
Member

dt commented Apr 13, 2021

Unfortunately, the debug.zip failed to collect most of the files, including the job details (most files are empty and have a companion .err.txt file that says the collection timed out). Fortunately, what it did capture includes the node status map, which shows 10k+ intents on many nodes (e.g. see "intentcount": 16701, in nodes.json).

My guess is this is the known issue where some workloads can cause a leak/build-up of unresolved intents that prevent BACKUP from reading, tracked in #59704. In the short-term, running a BEGIN PRIORITY HIGH; SELECT COUNT(*) FROM my_tbl; COMMIT; where my_tbl is any table that has excess intents may help clear them up and get the backups un-stuck for now. In the longer term, the real fix for #59704 is in the works.
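Written out as separate statements, the workaround above might look like the following. This is only a sketch: my_tbl is a placeholder for whichever table has excess intents, and the statements would be run from a SQL client connected to the affected cluster.

```sql
-- Sketch of the high-priority read described above.
-- my_tbl is a placeholder: substitute a table that has excess intents.
BEGIN PRIORITY HIGH;

-- A full-table read at high priority; per the comment above, this may help
-- clear unresolved intents that are blocking BACKUP from reading.
SELECT COUNT(*) FROM my_tbl;

COMMIT;
```

If several tables are affected, the same transaction can be repeated once per table.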

I'm going to close this as a dupe of #59704, but please comment/re-open if the high-priority select doesn't help.

@dt dt closed this as completed Apr 13, 2021
@charsleysa
Author

Hi @dt

Thanks for your help! The workaround fixed our stuck backups.

It seems that a table that had severe contention for some days last week caused the number of intents to skyrocket to almost 9 million, and even today it was still at 300 thousand.

After running the high-priority transaction on the table, the intents cleared and the backups proceeded successfully.
