-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
backupccl: improve restore checkpointing with span frontier #92002
backupccl: improve restore checkpointing with span frontier #92002
Conversation
95acd5b
to
5c8c5fe
Compare
5c8c5fe
to
aa1b49e
Compare
906a877
to
3fbd451
Compare
3fbd451
to
7417137
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
woops, did not mean to approve this heh
ec0d7ec
to
44bf16b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just left a few more comments. getting close!
0a6ecae
to
45615c9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
getting closer!
458c304
to
f96c954
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
flushing these before moving on to TestRestoreCheckpointing
pkg/ccl/backupccl/restore_job.go
Outdated
@@ -221,6 +223,52 @@ func makeBackupLocalityMap( | |||
return backupLocalityMap, nil | |||
} | |||
|
|||
// filterCompletedImportSpans takes imported spans and filters them based on completed spans. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: it's better style to write comments such that the code reader doesn't need to have the all the context of the variable names used in the function. I'd rewrite to:
filterCompletedImportSpans constructs a spanFrontier which tracks ingestion progress on the key space we seek to restore and a slice of spans we still need to restore. It constructs these objects using the passed in importSpans, a set of key spans which represent the whole key space we're restoring, and the passed in completedSpans, which represents a set of key spans that have already been restored.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think i understand the source of the stress failure. see comments.
pkg/ccl/backupccl/backup_test.go
Outdated
// We create these tables to ensure there are enough spans to restore and that we have partial progress | ||
// when stopping the job. | ||
var numTables int | ||
for char := 'a'; char <= 'g'; char++ { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: use more readable table names as we discussed.
pkg/ccl/backupccl/backup_test.go
Outdated
|
||
restoreQuery := `RESTORE DATABASE r1 FROM 'nodelocal://0/test-root' WITH detached, new_db_name=r2` | ||
|
||
backupTableID := sqlutils.QueryTableID(t, conn, "r1", "public", "a") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: declare the backupTableID
and restoreQuery
vars as close as possible to where they are used.
pkg/ccl/backupccl/backup_test.go
Outdated
t.Logf("checking query %q", query) | ||
|
||
var totalExpectedResponses int | ||
if strings.Contains(query, "RESTORE") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks copied from the old test. if strings.Contains(query, "RESTORE")
will always be the case in this test. Please clean this do function up if you don't plan to use common code with the other test.
pkg/ccl/backupccl/backup_test.go
Outdated
}() | ||
|
||
// Allow one of the total expected responses to proceed. | ||
for i := 0; i < 1; i++ { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this in a loop?
}) | ||
}) | ||
|
||
// Close the channel to allow all remaining responses to proceed. We do this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't we want to pause the job before we do this? this allows the jobs to complete before you check the progress. This likely explains why stress is failing.
|
||
do(restoreQuery, checkFraction) | ||
|
||
sqlDB.QueryRow(t, `SELECT job_id FROM crdb_internal.jobs ORDER BY created DESC LIMIT 1`).Scan(&jobID) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why check all this stuff before the job is paused?
5debf51
to
39981be
Compare
|
||
backupTableID := sqlutils.QueryTableID(t, conn, "r1", "public", "table1") | ||
err := retry.ForDuration(testutils.DefaultSucceedsSoonDuration, func() error { | ||
return check(ctx, inProgressState{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if this error is not nil, the test should fail. That's what happened in the stress race result. since the error is not handled, the test was able to proceed, leading weirdness downstream.
@stevendanna @adityamaru I think the non test code is ready for a look. we're still sorting out a race in one of the tests. |
39981be
to
9156ef2
Compare
Fixes: cockroachdb#81116, cockroachdb#87843 Release note (performance improvement): Previously, whenever a user resumed a paused `RESTORE` job the checkpointing mechanism would potentially not account for completed work. This change allows completed spans to be skipped over when restoring.
9156ef2
to
455ec53
Compare
checkpointingFrontier, newImportSpans, err := filterCompletedImportSpans(completedSpans, importSpans) | ||
require.NoError(t, err) | ||
// Construct span slice in order to check that the filtered span frontier has correct spans' timestamps forwarded. | ||
expectedCompletedFrontierSpans := make([]roachpb.Span, 0, len(importSpans)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what coverage does this constructed frontier add, given that we pass in an expectedCheckpointingFrontier?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just to make the contains call easier on line 547, I wasn't sure if you could directly call a contains on the completed spans slice
closing in favor of #97862 |
Fixes: #81116, #87843
Release note (performance improvement): Previously, whenever a user resumed a paused
RESTORE
job the checkpointing mechanism would potentially not account for completed work. This change allows completed spans to be skipped over when restoring.