spanconfig: AUTO SPAN CONFIG job does not handle duplicate after being restored #70173

Closed
adityamaru opened this issue Sep 14, 2021 · 7 comments
Labels: A-disaster-recovery, A-zone-configs, branch-release-22.1, C-bug, GA-blocker, T-disaster-recovery

Comments

@adityamaru (Contributor) commented Sep 14, 2021

Describe the problem

A full cluster backup backs up all jobs (including automatic jobs) in the cluster. A full cluster restore into a fresh cluster then writes those jobs from the backup into the jobs system table of the cluster being restored into.

The AUTO SPAN CONFIG job is automatically created on cluster startup, so performing a full cluster restore results in two entries for the AUTO SPAN CONFIG job: one started by the restoring cluster, and one restored from the backup.

[Screenshot (Sep 13, 2021): SHOW AUTOMATIC JOBS output showing the duplicate AUTO SPAN CONFIG job entries]

To Reproduce

  1. Start a single node cluster
  2. BACKUP INTO 'nodelocal://0/foo'
  3. Shut down the cluster
  4. Start a fresh single node cluster that uses the same cockroach-data/extern directory.
  5. RESTORE FROM LATEST IN 'nodelocal://0/foo'
  6. SHOW AUTOMATIC JOBS

Expected behavior
Semantics need to be ironed out for cluster backup and restore, both when performed in a dedicated environment and when performed as a secondary tenant.

BACKUP TENANT and RESTORE TENANT performed by the system tenant should be okay and only result in a single entry for the span config job. This is because a tenant restore runs in the host tenant's registry and simply writes all keys from [TenantPrefix, TenantPrefix.End] into a newly created, empty tenant. The creation of a tenant does not trigger an AUTO SPAN CONFIG job, so the restored job entry should be the only one of its kind.
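
To make the contrast concrete, here is a toy Go model (purely illustrative; the function names and the single job string are invented, and this is not how CockroachDB structures any of this) of why a cluster restore ends with two AUTO SPAN CONFIG entries while a tenant restore ends with one:

```go
// Toy model: cluster startup creates the automatic job; tenant creation does not.
package main

import "fmt"

const autoSpanConfigJob = "AUTO SPAN CONFIG"

// clusterStartup models the restoring cluster starting up, which creates the
// automatic job before any restore runs.
func clusterStartup() []string {
	return []string{autoSpanConfigJob}
}

// restoreJobs models ingesting the jobs-table rows captured in the backup.
func restoreJobs(existing, backedUp []string) []string {
	return append(existing, backedUp...)
}

func main() {
	backedUp := []string{autoSpanConfigJob} // the backup captured the automatic job

	// Full cluster restore: the restoring cluster already started its own copy.
	fmt.Println(restoreJobs(clusterStartup(), backedUp)) // two entries

	// Tenant restore: the newly created tenant is empty, and creating it does
	// not trigger the automatic job.
	fmt.Println(restoreJobs(nil, backedUp)) // one entry
}
```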

Environment:

  • CockroachDB version: 21.2 onwards

Epic CRDB-8816

Jira issue: CRDB-9970

@adityamaru added the C-bug, A-zone-configs, and A-disaster-recovery labels on Sep 14, 2021
@irfansharif (Contributor)

I think we want to simply exclude backing up certain kinds of jobs, notably these automatic ones. Do we want to do the same for #68434?
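
For illustration, here's a minimal Go sketch of that idea under stated assumptions: the jobRow type, the filterJobsForBackup helper, and the automatic job-type name are all made up for this example and are not backupccl's actual code.

```go
// Hypothetical sketch: skip automatic jobs when gathering jobs rows for a
// cluster backup, so a cluster restore does not reintroduce them.
package main

import "fmt"

type jobRow struct {
	id      int64
	jobType string
}

// automaticJobTypes is an assumed set of job types the cluster recreates on
// its own and therefore would not need to back up.
var automaticJobTypes = map[string]bool{
	"AUTO SPAN CONFIG RECONCILIATION": true,
}

// filterJobsForBackup drops automatic jobs from the rows a cluster backup
// would otherwise capture.
func filterJobsForBackup(rows []jobRow) []jobRow {
	var kept []jobRow
	for _, r := range rows {
		if automaticJobTypes[r.jobType] {
			continue // recreated by the restoring cluster on startup
		}
		kept = append(kept, r)
	}
	return kept
}

func main() {
	rows := []jobRow{
		{id: 1, jobType: "BACKUP"},
		{id: 2, jobType: "AUTO SPAN CONFIG RECONCILIATION"},
	}
	fmt.Println(filterJobsForBackup(rows)) // only the BACKUP job remains
}
```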

@adityamaru (Contributor, Author)

Yeah, still trying to get a repro on #68434, but for some reason I don't see duplicate schedules on restore. Either I'm doing something wrong or we already have something in place; I'll check it out tomorrow.

@adityamaru (Contributor, Author) commented Sep 14, 2021

I'm not very familiar with the span config job, but is there checkpointed state that needs to be resumed from when restoring into a fresh cluster? If we don't back it up and simply rely on the new one created on the restoring cluster, will it reconcile to the same state as on the cluster that ran the backup, post-restore?

@irfansharif (Contributor)

There's no checkpointing, not yet -- right now it's just a scaffold of a job. When restoring, it'd be fine to discard the checkpointed state if any.

@dt (Member) commented Sep 20, 2021

My vote is that we modify the job so that, on Resume(), it checks whether there is a duplicate and, if so, chooses to either exit or cancel the other one. Right now there's no persisted state, so it's maybe fine to just say that RESTORE always cancels the restored copy. But if that changes in the future and the job carries some state, it isn't clear that the restored job is always the one we want to discard, so I'd rather the job itself make the choice of which one it keeps.
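
A rough sketch of what that could look like, with a made-up registry and job type standing in for CockroachDB's actual jobs machinery; the tie-break (lowest ID wins) is an arbitrary placeholder for whatever policy the job would actually use:

```go
// Hypothetical sketch of a singleton job that, on resume, detects a duplicate
// of its own type and either exits or cancels the other copy.
package main

import (
	"errors"
	"fmt"
)

type job struct {
	id      int64
	jobType string
	status  string // "running", "canceled", ...
}

// registry is a stand-in for whatever lets a resuming job inspect and cancel
// other jobs (the jobs system table in the real system).
type registry struct {
	jobs map[int64]*job
}

func (r *registry) runningOfType(t string) []*job {
	var out []*job
	for _, j := range r.jobs {
		if j.jobType == t && j.status == "running" {
			out = append(out, j)
		}
	}
	return out
}

func (r *registry) cancel(id int64) { r.jobs[id].status = "canceled" }

// errDuplicateSingleton tells this copy of the job to simply exit.
var errDuplicateSingleton = errors.New("another AUTO SPAN CONFIG job is already running")

// resume is where the job itself decides which copy survives. Here the copy
// with the lowest ID wins; a real implementation might prefer the copy with
// persisted state, once such state exists.
func resume(self *job, r *registry) error {
	for _, other := range r.runningOfType(self.jobType) {
		if other.id == self.id {
			continue
		}
		if other.id < self.id {
			return errDuplicateSingleton // defer to the older copy and exit
		}
		r.cancel(other.id) // this copy wins; cancel the duplicate
	}
	return nil
}

func main() {
	r := &registry{jobs: map[int64]*job{
		1: {id: 1, jobType: "AUTO SPAN CONFIG", status: "running"},
		2: {id: 2, jobType: "AUTO SPAN CONFIG", status: "running"},
	}}
	// Resuming the higher-ID copy: it notices job 1 and bows out.
	fmt.Println(resume(r.jobs[2], r))
}
```

Since there's no persisted state today, always discarding one fixed copy would work too; the value of deciding inside Resume() is that the policy can evolve later without RESTORE having to know about it.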

@blathers-crl (bot) commented Mar 22, 2022

Hi @irfansharif, please add branch-* labels to identify which branch(es) this release-blocker affects.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@irfansharif added the branch-release-22.1 label on Mar 22, 2022
@arulajmani (Collaborator)

Capturing a conversation from elsewhere: @adityamaru mentioned that #75060 might be addressed by this issue as well. If so, we should be able to run TestFullClusterBackup with the span config infrastructure once this issue is closed.

@dt changed the title from "backupccl: full cluster restore results in more than one AUTO SPAN CONFIG job" to "spanconfig: AUTO SPAN CONFIG job does not handle duplicate after being restored" on Mar 31, 2022
irfansharif added a commit to irfansharif/cockroach that referenced this issue Apr 1, 2022
Fixes cockroachdb#70173. When restoring a cluster, we don't want to end up with two
instances of the singleton reconciliation job.

Release note: None
@craig (bot) closed this as completed in 173087a on Apr 2, 2022
blathers-crl bot pushed a commit that referenced this issue on Apr 2, 2022, with the same commit message.
fqazi pushed a commit to fqazi/cockroach that referenced this issue on Apr 4, 2022, with the same commit message.