backupccl: improve restore checkpointing with span frontier #92002

Closed

Conversation

baoalvin1
Contributor

Fixes: #81116, #87843

Release note (performance improvement): Previously, whenever a user resumed a paused RESTORE job the checkpointing mechanism would potentially not account for completed work. This change allows completed spans to be skipped over when restoring.

@baoalvin1 baoalvin1 force-pushed the alvinbao-restore-checkpointing branch 4 times, most recently from 95acd5b to 5c8c5fe Compare November 16, 2022 20:42
@baoalvin1 baoalvin1 force-pushed the alvinbao-restore-checkpointing branch from 5c8c5fe to aa1b49e Compare November 17, 2022 17:10
@baoalvin1 baoalvin1 force-pushed the alvinbao-restore-checkpointing branch 6 times, most recently from 906a877 to 3fbd451 Compare December 6, 2022 20:55
@baoalvin1 baoalvin1 force-pushed the alvinbao-restore-checkpointing branch from 3fbd451 to 7417137 Compare December 6, 2022 22:14
Collaborator

@msbutler msbutler left a comment

woops, did not mean to approve this heh

@baoalvin1 baoalvin1 force-pushed the alvinbao-restore-checkpointing branch 4 times, most recently from ec0d7ec to 44bf16b Compare December 7, 2022 00:02
Collaborator

@msbutler msbutler left a comment

Just left a few more comments. getting close!

Collaborator

@msbutler msbutler left a comment

getting closer!

@baoalvin1 baoalvin1 force-pushed the alvinbao-restore-checkpointing branch 4 times, most recently from 458c304 to f96c954 Compare December 19, 2022 19:29
Collaborator

@msbutler msbutler left a comment

flushing these before moving on to TestRestoreCheckpointing

@@ -221,6 +223,52 @@ func makeBackupLocalityMap(
return backupLocalityMap, nil
}

// filterCompletedImportSpans takes imported spans and filters them based on completed spans.
Collaborator

nit: it's better style to write comments such that the code reader doesn't need all the context of the variable names used in the function. I'd rewrite to:

filterCompletedImportSpans constructs a spanFrontier which tracks ingestion progress on the key space we seek to restore and a slice of spans we still need to restore. It constructs these objects using the passed-in importSpans, a set of key spans which represent the whole key space we're restoring, and the passed-in completedSpans, which represent a set of key spans that have already been restored.
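
The filtering idea in the suggested comment can be sketched as follows. This is a minimal illustration, not the PR's implementation: the `Span` type and the `filterCompleted` helper are hypothetical stand-ins, a plain map replaces the span frontier (which also tracks timestamps), and it assumes each completed span matches an import span exactly.

```go
package main

import "fmt"

// Span is a hypothetical half-open key range [Key, EndKey). It stands in for
// the real span type and is not the type used in the PR.
type Span struct {
	Key, EndKey string
}

// filterCompleted returns a progress map over every import span (true means
// already restored) and the slice of import spans that still need restoring.
// Assumption: a completed span matches an import span exactly.
func filterCompleted(completed, importSpans []Span) (map[Span]bool, []Span) {
	done := make(map[Span]bool, len(completed))
	for _, sp := range completed {
		done[sp] = true
	}
	progress := make(map[Span]bool, len(importSpans))
	var remaining []Span
	for _, sp := range importSpans {
		progress[sp] = done[sp]
		if !done[sp] {
			remaining = append(remaining, sp)
		}
	}
	return progress, remaining
}

func main() {
	importSpans := []Span{{"a", "b"}, {"b", "c"}, {"c", "d"}}
	completed := []Span{{"b", "c"}}
	progress, remaining := filterCompleted(completed, importSpans)
	fmt.Println("tracked spans:", len(progress))
	fmt.Println("still to restore:", remaining)
}
```

On resume, only the spans in `remaining` would be handed back to the restore workers, which is the behavior the release note describes.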

Collaborator

@msbutler msbutler left a comment

i think i understand the source of the stress failure. see comments.

// We create these tables to ensure there are enough spans to restore and that we have partial progress
// when stopping the job.
var numTables int
for char := 'a'; char <= 'g'; char++ {
Collaborator

nit: use more readable table names as we discussed.


restoreQuery := `RESTORE DATABASE r1 FROM 'nodelocal://0/test-root' WITH detached, new_db_name=r2`

backupTableID := sqlutils.QueryTableID(t, conn, "r1", "public", "a")
Collaborator

nit: declare the backupTableID and restoreQuery vars as close as possible to where they are used.

t.Logf("checking query %q", query)

var totalExpectedResponses int
if strings.Contains(query, "RESTORE") {
Collaborator

This looks copied from the old test. if strings.Contains(query, "RESTORE") will always be true in this test. Please clean up this do function if you don't plan to share common code with the other test.

}()

// Allow one of the total expected responses to proceed.
for i := 0; i < 1; i++ {
Collaborator

why is this in a loop?

})
})

// Close the channel to allow all remaining responses to proceed. We do this
Collaborator

Don't we want to pause the job before we do this? Closing the channel first allows the job to complete before you check the progress, which likely explains why stress is failing.
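
The ordering concern can be illustrated with a small sketch. It is an assumption-laden model, not the test from the PR: pausing is modeled with context cancellation, and `runSpans` and `pauseThenRelease` are hypothetical names. The point is only that the pause must happen before the gate channel is closed.

```go
package main

import (
	"context"
	"fmt"
)

// runSpans models a job that processes spans one at a time. Before each span
// it waits on allow (a token or a close), then stops if ctx was cancelled.
// It reports how many spans it completed on done.
func runSpans(ctx context.Context, total int, allow <-chan struct{}, done chan<- int) {
	completed := 0
	for i := 0; i < total; i++ {
		<-allow
		if ctx.Err() != nil {
			break // the job was paused; leave the rest for the resume
		}
		completed++
	}
	done <- completed
}

// pauseThenRelease lets exactly one span finish, pauses the job, and only
// then closes the gate so the blocked worker can observe the pause and exit.
func pauseThenRelease(total int) int {
	ctx, pause := context.WithCancel(context.Background())
	allow := make(chan struct{})
	done := make(chan int)
	go runSpans(ctx, total, allow, done)

	allow <- struct{}{} // allow one span to proceed
	pause()             // pause first ...
	close(allow)        // ... then release anything still blocked
	return <-done
}

func main() {
	fmt.Println("spans completed before pause:", pauseThenRelease(7))
}
```

If `close(allow)` ran before `pause()`, the worker could race through every remaining span and the job would finish before its partial progress was checked.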


do(restoreQuery, checkFraction)

sqlDB.QueryRow(t, `SELECT job_id FROM crdb_internal.jobs ORDER BY created DESC LIMIT 1`).Scan(&jobID)
Collaborator

why check all this stuff before the job is paused?

@baoalvin1 baoalvin1 force-pushed the alvinbao-restore-checkpointing branch 3 times, most recently from 5debf51 to 39981be Compare December 20, 2022 17:20

backupTableID := sqlutils.QueryTableID(t, conn, "r1", "public", "table1")
err := retry.ForDuration(testutils.DefaultSucceedsSoonDuration, func() error {
return check(ctx, inProgressState{
Collaborator

If this error is not nil, the test should fail. That's what happened in the stress race result: since the error was not handled, the test was able to proceed, leading to weirdness downstream.
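
A minimal sketch of handling the retry result, assuming a generic retry helper. `retryFor` and `flakyCheck` are hypothetical stand-ins written for this example, not the helpers used in the PR; in a real Go test the failure branch would typically be a fatal assertion on the error.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

const (
	timeout  = 50 * time.Millisecond
	interval = time.Millisecond
)

// retryFor calls fn until it returns nil or the duration d has elapsed.
// It returns the last error, wrapped, if fn never succeeded.
func retryFor(d, every time.Duration, fn func() error) error {
	deadline := time.Now().Add(d)
	for {
		err := fn()
		if err == nil {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("condition not met within %s: %w", d, err)
		}
		time.Sleep(every)
	}
}

// flakyCheck returns a check that fails until it has been called
// succeedAfter times, standing in for "job not yet in the expected state".
func flakyCheck(succeedAfter int) func() error {
	calls := 0
	return func() error {
		calls++
		if calls < succeedAfter {
			return errors.New("job not yet in the expected state")
		}
		return nil
	}
}

func main() {
	// The returned error must be checked; ignoring it lets the test continue
	// against a job that never reached the expected state.
	if err := retryFor(timeout, interval, flakyCheck(3)); err != nil {
		fmt.Println("test should fail here:", err)
		return
	}
	fmt.Println("expected state reached; safe to continue")
}
```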

@msbutler
Collaborator

@stevendanna @adityamaru I think the non test code is ready for a look. we're still sorting out a race in one of the tests.

@baoalvin1 baoalvin1 force-pushed the alvinbao-restore-checkpointing branch from 39981be to 9156ef2 Compare December 21, 2022 17:11
@baoalvin1 baoalvin1 force-pushed the alvinbao-restore-checkpointing branch from 9156ef2 to 455ec53 Compare December 21, 2022 17:18
checkpointingFrontier, newImportSpans, err := filterCompletedImportSpans(completedSpans, importSpans)
require.NoError(t, err)
// Construct span slice in order to check that the filtered span frontier has correct spans' timestamps forwarded.
expectedCompletedFrontierSpans := make([]roachpb.Span, 0, len(importSpans))
Collaborator

what coverage does this constructed frontier add, given that we pass in an expectedCheckpointingFrontier?

Contributor Author

This is just to make the contains call on line 547 easier. I wasn't sure if you could call contains directly on the completed spans slice.
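
For reference, a membership check over a slice of spans can be written as a small helper. This is a hedged sketch: `Span`, `containsSpan`, and `mkSpan` are hypothetical names for this example, and it assumes span keys are byte slices, which cannot be compared with `==` and so need an explicit equality method.

```go
package main

import (
	"bytes"
	"fmt"
)

// Span is a hypothetical key span whose keys are byte slices.
type Span struct {
	Key, EndKey []byte
}

// Equal reports whether both spans cover the same key range.
func (s Span) Equal(o Span) bool {
	return bytes.Equal(s.Key, o.Key) && bytes.Equal(s.EndKey, o.EndKey)
}

// containsSpan reports whether target appears in spans.
func containsSpan(spans []Span, target Span) bool {
	for _, sp := range spans {
		if sp.Equal(target) {
			return true
		}
	}
	return false
}

// mkSpan builds a Span from string keys for readability in examples.
func mkSpan(key, end string) Span {
	return Span{Key: []byte(key), EndKey: []byte(end)}
}

func main() {
	completed := []Span{mkSpan("a", "b"), mkSpan("c", "d")}
	fmt.Println(containsSpan(completed, mkSpan("c", "d")))
	fmt.Println(containsSpan(completed, mkSpan("b", "c")))
}
```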

@msbutler
Collaborator

msbutler commented Mar 2, 2023

closing in favor of #97862

@msbutler msbutler closed this Mar 2, 2023
Successfully merging this pull request may close these issues.

backupccl: investigate if restore on retry gradually slows down because of CheckForKeyCollisions