Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

backupccl: improve backup-restore roundtrip testing of multiple tenants #108745

Closed
msbutler opened this issue Aug 14, 2023 · 4 comments · Fixed by #113851
Closed

backupccl: improve backup-restore roundtrip testing of multiple tenants #108745

msbutler opened this issue Aug 14, 2023 · 4 comments · Fixed by #113851
Assignees
Labels
C-cleanup Tech debt, refactors, loose ends, etc. Solution not expected to significantly change behavior.

Comments

@msbutler
Copy link
Collaborator

msbutler commented Aug 14, 2023

A recent escalation revealed that the following queries from the system tenant would break RESTORE's of tenants:

CREATE TENANT foo
BACKUP  INTO 'nodelocal://1/clusterwide' with include_all_secondary_tenants;
CREATE TENANT baz
BACKUP  INTO LATEST IN 'nodelocal://1/clusterwide' with include_all_secondary_tenants;
RESTORE TENANT 2 FROM LATEST IN 'nodelocal://1/clusterwide';

We ought to develop a unit test which runs the following system tenant workload:

  1. Runs various operations on application tenants (like create, drop, alter tenant, run c2c, upgrade tenant).
  2. Runs incremental backups of individual tenants and with the include_all_virtual_clusters arg at various points in this workload
  3. Restores these incremental backups

A poor man's version of this test is in #108743

Jira issue: CRDB-30619

@msbutler msbutler added the C-cleanup Tech debt, refactors, loose ends, etc. Solution not expected to significantly change behavior. label Aug 14, 2023
@andy-kimball
Copy link
Contributor

@stevendanna, is there any update on this? We'd like to re-enable incremental backups.

@stevendanna
Copy link
Collaborator

@andy-kimball Will have an update for you on this one by the end of the week.

@andy-kimball
Copy link
Contributor

Following up here, any update?

@stevendanna
Copy link
Collaborator

stevendanna commented Nov 6, 2023

Apologies that this has sat for so long. I've added the tests outline in this issue here:

#113851

I've run a stress run with these new nemeses in place for 2 days on a GCE worker. I still have this running and will keep an eye on it over this week. While simple, I've run it with past bugs re-inserted to ensure that this randomised approach would have caught the issue over time.

We've hit one issue so far:

This is in a default-off feature, but I am currently investigating to make sure that it isn't indicative of another problem.

Note that correct tenant restores currently still requires that tenant IDs are never re-used. We believe that this is the case in CC with the exception of manual action. That will be the case until we either move to per-tenant backup schedules or remove the use of ClearRange in the tenant GC path. We likely should do both over time.

I think we can re-enable incrementals for CC once we (1) merge and backport this test, and (2) I finish investigating the above failure. Both of which I expect to happen this week.

NB: We are currently investigating a newly issue discovered the the CCT cluster related to protected timestamps. While not strictly related to incremental tenant backups, depending on what we find, it may also have an impact for incremental tenant backups. However, the investigation there is going well so we should know more shortly.

NB2: The universe heard my optimism and immediately rewarded it with a test failure. I've updated this message to reflect that. I think this failure wasn't caught on my GCEWorker as I had disabled metamorphic testing when investigating another issue.

craig bot pushed a commit that referenced this issue Nov 27, 2023
113851: backupccl: add tenant operations to backup nemesis r=dt a=stevendanna

This adds:

  - CREATE TENANT
  - DROP TENANT
  - ALTER TENANT .. RENAME
  - CREATE TENANT FROM REPLICATION

To our tenant backup nemesis tests. Additionally, it modifies the test to occasionally use cluster backups over tenant-level backups.

Fixes #108745

Release note: None

114876: streamingccl: allow CUTOVER TO LATEST before initial scan finishes r=stevendanna a=msbutler

This patch allows the user to execute ALTER TENANT x COMPLETE REPLICATION TO LATEST before the initial scan completes. After this cmd, the cutover time is set to the replicated start time.

Fixes: #114734

Epic: none

Co-authored-by: Steven Danna <[email protected]>
Co-authored-by: Michael Butler <[email protected]>
@craig craig bot closed this as completed in 7bbcf40 Nov 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-cleanup Tech debt, refactors, loose ends, etc. Solution not expected to significantly change behavior.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants