spanconfig: AUTO SPAN CONFIG job does not handle duplicate after being restored #70173

Closed
adityamaru opened this issue Sep 14, 2021 · 7 comments
Labels: A-disaster-recovery, A-zone-configs, branch-release-22.1, C-bug, GA-blocker, T-disaster-recovery

Comments

@adityamaru (Contributor) commented Sep 14, 2021

Describe the problem

A full cluster backup backs up all jobs (including automatic jobs) in the cluster. A full cluster restore into a fresh cluster then writes those jobs from the backup into the jobs system table of the cluster being restored into.

The AUTO SPAN CONFIG job is automatically created on cluster startup, so performing a full cluster restore results in two entries for the AUTO SPAN CONFIG job: one started by the restoring cluster, and one restored from the backup.

[Screenshot (Sep 13, 2021): SHOW AUTOMATIC JOBS output showing the duplicate AUTO SPAN CONFIG job entries]

To Reproduce

  1. Start a single node cluster
  2. BACKUP INTO 'nodelocal://0/foo'
  3. Shut down the cluster
  4. Start a fresh single node cluster that uses the same cockroach-data/extern directory.
  5. RESTORE FROM LATEST IN 'nodelocal://0/foo'
  6. SHOW AUTOMATIC JOBS

Expected behavior
Semantics need to be ironed out for cluster backup and restore, both when performed in a dedicated environment and when performed as a secondary tenant.

BACKUP TENANT and RESTORE TENANT performed by the system tenant should be okay and only result in a single entry for the span config job. This is because a tenant restore runs in the host tenant's registry and simply writes all keys from [TenantPrefix, TenantPrefix.End] into a newly created, empty tenant. The creation of a tenant does not trigger an AUTO SPAN CONFIG job, so the restored job entry should be the only one of its kind.
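
To make the contrast concrete, here is a toy Go model (purely illustrative; the function names and the single job string are invented, and this is not how CockroachDB structures any of this) of why a cluster restore ends with two AUTO SPAN CONFIG entries while a tenant restore ends with one:

```go
// Toy model: cluster startup creates the automatic job; tenant creation does not.
package main

import "fmt"

const autoSpanConfigJob = "AUTO SPAN CONFIG"

// clusterStartup models the restoring cluster starting up, which creates the
// automatic job before any restore runs.
func clusterStartup() []string {
	return []string{autoSpanConfigJob}
}

// restoreJobs models ingesting the jobs-table rows captured in the backup.
func restoreJobs(existing, backedUp []string) []string {
	return append(existing, backedUp...)
}

func main() {
	backedUp := []string{autoSpanConfigJob} // the backup captured the automatic job

	// Full cluster restore: the restoring cluster already started its own copy.
	fmt.Println(restoreJobs(clusterStartup(), backedUp)) // two entries

	// Tenant restore: the newly created tenant is empty, and creating it does
	// not trigger the automatic job.
	fmt.Println(restoreJobs(nil, backedUp)) // one entry
}
```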

Environment:

  • CockroachDB version: 21.2 onwards

Epic CRDB-8816

Jira issue: CRDB-9970

@adityamaru added the C-bug, A-zone-configs, and A-disaster-recovery labels on Sep 14, 2021
@irfansharif (Contributor)

I think we want to simply exclude backing up certain kinds of jobs, notably these automatic ones. Do we want to do the same for #68434?
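
For illustration, here's a minimal Go sketch of that idea under stated assumptions: the jobRow type, the filterJobsForBackup helper, and the automatic job-type name are all made up for this example and are not backupccl's actual code.

```go
// Hypothetical sketch: skip automatic jobs when gathering jobs rows for a
// cluster backup, so a cluster restore does not reintroduce them.
package main

import "fmt"

type jobRow struct {
	id      int64
	jobType string
}

// automaticJobTypes is an assumed set of job types the cluster recreates on
// its own and therefore would not need to back up.
var automaticJobTypes = map[string]bool{
	"AUTO SPAN CONFIG RECONCILIATION": true,
}

// filterJobsForBackup drops automatic jobs from the rows a cluster backup
// would otherwise capture.
func filterJobsForBackup(rows []jobRow) []jobRow {
	var kept []jobRow
	for _, r := range rows {
		if automaticJobTypes[r.jobType] {
			continue // recreated by the restoring cluster on startup
		}
		kept = append(kept, r)
	}
	return kept
}

func main() {
	rows := []jobRow{
		{id: 1, jobType: "BACKUP"},
		{id: 2, jobType: "AUTO SPAN CONFIG RECONCILIATION"},
	}
	fmt.Println(filterJobsForBackup(rows)) // only the BACKUP job remains
}
```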

@adityamaru (Contributor, Author)

Yeah, still trying to get a repro on #68434, but for some reason I don't see duplicate schedules on restore. Either I'm doing something wrong or we already have something in place; I'll check it out tomorrow.

@adityamaru (Contributor, Author) commented Sep 14, 2021

I'm not very familiar with the span config job, but is there checkpointed state that needs to be resumed from when restoring into a fresh cluster? If we don't back it up and simply rely on the new one created on the restoring cluster, will it reconcile to the same state as on the cluster that ran the backup, post-restore?

@irfansharif (Contributor)

There's no checkpointing, not yet -- right now it's just a scaffold of a job. When restoring, it'd be fine to discard the checkpointed state if any.

@dt (Member) commented Sep 20, 2021

My vote is that we modify the job so that, on Resume(), it checks whether there is a duplicate and, if so, chooses to either exit or cancel the other one. Right now there's no persisted state, so it's maybe fine to just say that RESTORE always cancels the restored copy. But if that changes in the future and the job carries some state, it isn't clear that the restored job is always the one we want to discard, so I'd rather the job itself make the choice of which one it keeps.
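
A rough sketch of what that could look like, with a made-up registry and job type standing in for CockroachDB's actual jobs machinery; the tie-break (lowest ID wins) is an arbitrary placeholder for whatever policy the job would actually use:

```go
// Hypothetical sketch of a singleton job that, on resume, detects a duplicate
// of its own type and either exits or cancels the other copy.
package main

import (
	"errors"
	"fmt"
)

type job struct {
	id      int64
	jobType string
	status  string // "running", "canceled", ...
}

// registry is a stand-in for whatever lets a resuming job inspect and cancel
// other jobs (the jobs system table in the real system).
type registry struct {
	jobs map[int64]*job
}

func (r *registry) runningOfType(t string) []*job {
	var out []*job
	for _, j := range r.jobs {
		if j.jobType == t && j.status == "running" {
			out = append(out, j)
		}
	}
	return out
}

func (r *registry) cancel(id int64) { r.jobs[id].status = "canceled" }

// errDuplicateSingleton tells this copy of the job to simply exit.
var errDuplicateSingleton = errors.New("another AUTO SPAN CONFIG job is already running")

// resume is where the job itself decides which copy survives. Here the copy
// with the lowest ID wins; a real implementation might prefer the copy with
// persisted state, once such state exists.
func resume(self *job, r *registry) error {
	for _, other := range r.runningOfType(self.jobType) {
		if other.id == self.id {
			continue
		}
		if other.id < self.id {
			return errDuplicateSingleton // defer to the older copy and exit
		}
		r.cancel(other.id) // this copy wins; cancel the duplicate
	}
	return nil
}

func main() {
	r := &registry{jobs: map[int64]*job{
		1: {id: 1, jobType: "AUTO SPAN CONFIG", status: "running"},
		2: {id: 2, jobType: "AUTO SPAN CONFIG", status: "running"},
	}}
	// Resuming the higher-ID copy: it notices job 1 and bows out.
	fmt.Println(resume(r.jobs[2], r))
}
```

Since there's no persisted state today, always discarding one fixed copy would work too; the value of deciding inside Resume() is that the policy can evolve later without RESTORE having to know about it.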

@blathers-crl (bot) commented Mar 22, 2022

Hi @irfansharif, please add branch-* labels to identify which branch(es) this release-blocker affects.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@irfansharif added the branch-release-22.1 label on Mar 22, 2022
@arulajmani (Collaborator)

Capturing a conversation from elsewhere: @adityamaru mentioned that #75060 might be addressed by this issue as well. If so, we should be able to run TestFullClusterBackup with the span config infrastructure once this issue is closed.

@dt changed the title from "backupccl: full cluster restore results in more than one AUTO SPAN CONFIG job" to "spanconfig: AUTO SPAN CONFIG job does not handle duplicate after being restored" on Mar 31, 2022
irfansharif added a commit to irfansharif/cockroach that referenced this issue Apr 1, 2022
Fixes cockroachdb#70173. When restoring a cluster, we don't want to end up with two
instances of the singleton reconciliation job.

Release note: None
@craig (bot) closed this as completed in 173087a on Apr 2, 2022
blathers-crl bot pushed a commit that referenced this issue on Apr 2, 2022, with the same commit message.
fqazi pushed a commit to fqazi/cockroach that referenced this issue on Apr 4, 2022, with the same commit message.