Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

backupccl: failed to write metadata SST during TestTenantBackupNemesis #113852

Closed
stevendanna opened this issue Nov 6, 2023 · 2 comments
Closed
Assignees
Labels
A-disaster-recovery C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-disaster-recovery

Comments

@stevendanna
Copy link
Collaborator

stevendanna commented Nov 6, 2023

Describe the problem

=== RUN   TestTenantBackupNemesis
    test_log_scope.go:170: test logs captured to: /tmp/_tmp/abc2733f63574a4c007456beb3ec251b/logTestTenantBackupNemesis1750470925
    test_log_scope.go:81: use -show-logs to present logs inline
    tenant_backup_nemesis_test.go:528: backup-nemesis: full backup started: BACKUP INTO $1 WITH include_all_secondary_tenants
    tenant_backup_nemesis_test.go:588: backup-nemesis: RENAME TENANT started
    tenant_backup_nemesis_test.go:379: no tenants to rename
    tenant_backup_nemesis_test.go:593: backup-nemesis: RENAME TENANT finished
    tenant_backup_nemesis_test.go:588: backup-nemesis: DELETE TENANT started
    tenant_backup_nemesis_test.go:366: no tenants to delete
    tenant_backup_nemesis_test.go:593: backup-nemesis: DELETE TENANT finished
    tenant_backup_nemesis_test.go:588: backup-nemesis: DELETE TENANT started
    tenant_backup_nemesis_test.go:366: no tenants to delete
    tenant_backup_nemesis_test.go:593: backup-nemesis: DELETE TENANT finished
    tenant_backup_nemesis_test.go:588: backup-nemesis: REPLICATE TENANT started
    tenant_backup_nemesis_test.go:593: backup-nemesis: REPLICATE TENANT finished
    tenant_backup_nemesis_test.go:588: backup-nemesis: DELETE TENANT started
    tenant_backup_nemesis_test.go:593: backup-nemesis: DELETE TENANT finished
    tenant_backup_nemesis_test.go:588: backup-nemesis: CREATE UNIQUE INDEX (will fail) started
    tenant_backup_nemesis_test.go:532: backup-nemesis: full backup finished
    tenant_backup_nemesis_test.go:546: backup-nemesis: incremental backup started: BACKUP INTO LATEST IN 'nodelocal://1/cluster-backup' AS OF SYSTEM TIME 1699264047155749000.0000000000 WITH include_all_secondary_tenants
    tenant_backup_nemesis_test.go:593: backup-nemesis: CREATE UNIQUE INDEX (will fail) finished
    tenant_backup_nemesis_test.go:588: backup-nemesis: CREATE INDEX started
    tenant_backup_nemesis_test.go:550: backup-nemesis: incremental backup finished
    tenant_backup_nemesis_test.go:546: backup-nemesis: incremental backup started: BACKUP INTO LATEST IN 'nodelocal://1/cluster-backup' AS OF SYSTEM TIME 1699264047310436000.0000000000 WITH include_all_secondary_tenants
    tenant_backup_nemesis_test.go:550: backup-nemesis: incremental backup finished
    tenant_backup_nemesis_test.go:546: backup-nemesis: incremental backup started: BACKUP INTO LATEST IN 'nodelocal://1/cluster-backup' AS OF SYSTEM TIME 1699264047460111000.0000000000 WITH include_all_secondary_tenants
    tenant_backup_nemesis_test.go:593: backup-nemesis: CREATE INDEX finished
    tenant_backup_nemesis_test.go:588: backup-nemesis: CREATE TENANT started
    tenant_backup_nemesis_test.go:593: backup-nemesis: CREATE TENANT finished
    tenant_backup_nemesis_test.go:588: backup-nemesis: CREATE TENANT started
    tenant_backup_nemesis_test.go:593: backup-nemesis: CREATE TENANT finished
    tenant_backup_nemesis_test.go:588: backup-nemesis: CANCELED IMPORT INTO started
    tenant_backup_nemesis_test.go:550: backup-nemesis: incremental backup finished
    tenant_backup_nemesis_test.go:546: backup-nemesis: incremental backup started: BACKUP INTO LATEST IN 'nodelocal://1/cluster-backup' AS OF SYSTEM TIME 1699264047615010000.0000000000 WITH include_all_secondary_tenants
    tenant_backup_nemesis_test.go:593: backup-nemesis: CANCELED IMPORT INTO finished
    tenant_backup_nemesis_test.go:248: 
                Error Trace:    github.com/cockroachdb/cockroach/pkg/ccl/backupccl/tenant_backup_nemesis_test.go:248
                Error:          Received unexpected error:
                                pq: failed to run backup: writing forward-compat metadata sst: pebble: keys must be added in strictly increasing order: "span/\x12\xfe\x96\x00\x01\x12\xfe\x98\x00\x01"/0,0#0,SET, "span/\x12\xfe\x96\x00\x01\x12\xfe\x97\x00\x01"/1699264047.460111000,0#0,SET
                Test:           TestTenantBackupNemesis
    testutils.go:289: no Invalid Descriptors
    tenant_backup_nemesis_test.go:132: leaving /tmp/_tmp/abc2733f63574a4c007456beb3ec251b/TestTenantBackupNemesis3837983616 for inspection
    panic.go:523: -- test log scope end --
test logs left over in: /tmp/_tmp/abc2733f63574a4c007456beb3ec251b/logTestTenantBackupNemesis1750470925

Jira issue: CRDB-33228

@stevendanna stevendanna added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Nov 6, 2023
Copy link

blathers-crl bot commented Nov 6, 2023

cc @cockroachdb/disaster-recovery

@jeffswenson
Copy link
Collaborator

Here's a minimal reproduction:

set cluster setting kv.bulkio.write_metadata_sst.enabled = true;

-- Create one tenant so the initial backup is not empty
create virtual cluster 'a';
alter virtual cluster 'a' start service shared;

backup into 'nodelocal://1/backup' WITH include_all_secondary_tenants;

-- Create two+ tenants before taking the incremental
create virtual cluster 'b';
alter virtual cluster 'b' start service shared;

create virtual cluster 'c';
alter virtual cluster 'c' start service shared;

backup into latest in 'nodelocal://1/backup' WITH include_all_secondary_tenants;

This doesn't reproduce the issue on master because it was fixed by
2420992.

Here's a unit test that reproduces the bug in the sst metadata writer:

func TestSpanComparator(t *testing.T) {
	st := cluster.MakeTestingClusterSettings()
	f := &storage.MemObject{}
	sst := storage.MakeBackupSSTWriter(context.Background(), st, &f.Buffer)
	defer sst.Close()

	// "span/\x12\xfe\x96\x00\x01\x12\xfe\x98\x00\x01"/0,0#0,SET
	// "span/\x12\xfe\x96\x00\x01\x12\xfe\x97\x00\x01"/1699264047.460111000,0#0,SET
	m := backuppb.BackupManifest {
		Spans: []roachpb.Span {{
			Key: roachpb.Key { 1 },
			EndKey: roachpb.Key { 2 },
		}},
		IntroducedSpans: []roachpb.Span {{
			Key: roachpb.Key { 1 },
			EndKey: roachpb.Key { 3 },
		}},
	}

	require.NoError(t, writeSpansToMetadata(context.Background(), sst, &m))
}

The bug was caused because the metadata sst writer only sorts spans and introduced spans by key. It will choke if an introduced span starts on the same key as a span and contains the whole span. This is the change that could be made to the SST writer to avoid the error:

image

This situation arose because of a quirk/bug in backup job. Backup job would merge adjacent tenants in the introduced spans because introduced spans are roundtripped through span group. Backup doesn't coalesce the tenant spans though, because they are directly inserted into a slice. This bug was fixed by @msbutler's change because tenant spans are now disjoint and won't merge in the span group.

I don't think its worth improving the sst metadata behavior. I have an in progress PR to delete it since its unused and was made obsolete by moving descriptors and files into their own SSTs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-disaster-recovery C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-disaster-recovery
Projects
No open projects
Development

No branches or pull requests

2 participants