
backupccl: checkpoint restore progress with a span frontier #97862

Merged
Merged 1 commit into cockroachdb:master from butler-restore-checkpoint on Mar 13, 2023

Conversation

msbutler
Collaborator

@msbutler msbutler commented Mar 1, 2023

Previously, after a node restart or pause event, a resumed restore would have to
redo a significant amount of work due to our naive progress checkpointing
procedure described in #87843.

This patch upgrades our checkpointing procedure to use a span frontier,
significantly reducing wasted work. Here's the basic idea (a rough sketch
follows the list):

  1. Create a span frontier behind a lock on the coordinator.
  2. When the coordinator receives a progress update about a span that was
    ingested, forward that span's timestamp in the frontier to the hardcoded
    completedSpanTime.
  3. The frontier is periodically persisted to the job record, every 15 seconds
    or after 5% progress. Note that only the first N spans are persisted, to bound
    the amount of data written to the jobs table. The bound is set by the new
    restore.frontier_checkpoint_max_bytes private setting.
  4. If the job is resumed after some progress was persisted, the generative
    split and scatter processor will skip the completed spans in the persisted
    frontier. Note that if part of a requiredSpan was already complete, only the
    incomplete subset will be processed into a restoreSpanEntry.
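
Below is a minimal, self-contained Go sketch of this idea. It deliberately does not use CockroachDB's internal span.Frontier, jobs, or settings APIs; the frontier, the byte bound (standing in for restore.frontier_checkpoint_max_bytes), and every name in the snippet are simplified, hypothetical stand-ins for illustration only.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
	"sync"
)

// span is a half-open key range [Key, EndKey). All names here are hypothetical.
type span struct {
	Key, EndKey []byte
}

// checkpointFrontier tracks which required spans have been fully ingested.
// The real coordinator forwards timestamps in a span frontier behind a lock;
// this sketch simply records completed spans and merges adjacent ones.
type checkpointFrontier struct {
	mu        sync.Mutex
	completed []span // sorted, non-overlapping
}

// markCompleted records that sp was ingested (step 2 above).
func (f *checkpointFrontier) markCompleted(sp span) {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.completed = append(f.completed, sp)
	sort.Slice(f.completed, func(i, j int) bool {
		return bytes.Compare(f.completed[i].Key, f.completed[j].Key) < 0
	})
	// Merge overlapping or adjacent completed spans.
	merged := f.completed[:1]
	for _, s := range f.completed[1:] {
		last := &merged[len(merged)-1]
		if bytes.Compare(s.Key, last.EndKey) <= 0 {
			if bytes.Compare(s.EndKey, last.EndKey) > 0 {
				last.EndKey = s.EndKey
			}
			continue
		}
		merged = append(merged, s)
	}
	f.completed = merged
}

// checkpoint returns the completed spans to persist to the job record,
// truncated so the encoded size stays under maxBytes (step 3).
func (f *checkpointFrontier) checkpoint(maxBytes int) []span {
	f.mu.Lock()
	defer f.mu.Unlock()
	var out []span
	size := 0
	for _, s := range f.completed {
		size += len(s.Key) + len(s.EndKey)
		if size > maxBytes {
			break
		}
		out = append(out, s)
	}
	return out
}

func main() {
	f := &checkpointFrontier{}
	f.markCompleted(span{Key: []byte("a"), EndKey: []byte("c")})
	f.markCompleted(span{Key: []byte("c"), EndKey: []byte("f")})
	for _, s := range f.checkpoint(1 << 20) {
		fmt.Printf("persisted checkpoint span: [%s, %s)\n", s.Key, s.EndKey) // [a, f)
	}
}
```

In the actual patch, forwarding a span to completedSpanTime and periodically persisting the frontier to the job record play the roles of markCompleted and checkpoint here.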

As part of this work, many of the function signatures in the restore codebase
were refactored to avoid a large number of input parameters. Further, much of
the restore checkpointing code was moved into a separate file,
restore_progress.go.

Fixes #87843

Release note: None

@msbutler msbutler self-assigned this Mar 1, 2023
@cockroach-teamcity
Member

This change is Reviewable

@msbutler
Collaborator Author

msbutler commented Mar 1, 2023

First commit is being reviewed in #97800.

With better checkpointing, the restore-with-pause test completed in half the time, about 15 minutes.

@msbutler msbutler force-pushed the butler-restore-checkpoint branch 7 times, most recently from 662dec6 to 9544f54 Compare March 2, 2023 14:08
@msbutler msbutler requested review from rhu713 and adityamaru March 2, 2023 14:08
@msbutler msbutler marked this pull request as ready for review March 2, 2023 14:09
@msbutler msbutler requested review from a team as code owners March 2, 2023 14:09
@msbutler msbutler requested review from herkolategan, smg260 and rharding6373 and removed request for a team, herkolategan, smg260 and rharding6373 March 2, 2023 14:09
pkg/util/span/frontier.go (resolved review thread on an outdated diff)
@msbutler msbutler force-pushed the butler-restore-checkpoint branch from 9544f54 to 18ee093 Compare March 2, 2023 18:05
@msbutler
Collaborator Author

msbutler commented Mar 3, 2023

A heads up that I'm reworking this. Hold off on review for a moment!

@msbutler msbutler force-pushed the butler-restore-checkpoint branch from 18ee093 to 5fc70cb Compare March 6, 2023 14:36
@msbutler msbutler force-pushed the butler-restore-checkpoint branch from 5fc70cb to 3582a82 Compare March 6, 2023 17:12
@msbutler
Collaborator Author

msbutler commented Mar 6, 2023

@adityamaru @rhu713 RFAL!

@msbutler msbutler requested a review from dt March 6, 2023 17:16
@msbutler msbutler force-pushed the butler-restore-checkpoint branch from 3582a82 to 5bb1f8a Compare March 7, 2023 22:55
Contributor

@rhu713 rhu713 left a comment

LGTM!

if f.useFrontierCheckpointing {
	return f.findToDoSpans(requiredSpan)
}
if requiredSpan.EndKey.Compare(f.highWaterMark) < 0 {
Contributor

I think this can be <= 0, since the EndKey of a span is exclusive?

Collaborator Author

Huh, yeah, it can be <= 0. In other words, if requiredSpan.EndKey.Compare(f.highWaterMark) == 0, the last key in the required span was already persisted. This is a small bug in 22.2 and earlier.
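
To illustrate the off-by-one discussed in this thread with a tiny, hypothetical example (plain Go, not the restore code): when the persisted high-water mark equals the span's exclusive EndKey, every key the span can contain has already been persisted, so a strict < check misses the fully-completed case.

```go
package main

import (
	"bytes"
	"fmt"
)

func main() {
	// The required span is the half-open range [a, c); its EndKey "c" is exclusive.
	requiredEndKey := []byte("c")
	highWaterMark := []byte("c") // everything below "c" has already been persisted

	// A strict "< 0" check treats the span as incomplete even though every key
	// it can contain ("a" up to, but not including, "c") is below the mark.
	fmt.Println(bytes.Compare(requiredEndKey, highWaterMark) < 0)  // false
	fmt.Println(bytes.Compare(requiredEndKey, highWaterMark) <= 0) // true
}
```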

pkg/ccl/backupccl/restore_progress.go (resolved review threads)
pkg/ccl/backupccl/restore_span_covering.go (resolved review threads)
@msbutler msbutler force-pushed the butler-restore-checkpoint branch from 5bb1f8a to 1b81c23 Compare March 13, 2023 13:38
@adityamaru adityamaru self-requested a review March 13, 2023 13:52
@msbutler msbutler force-pushed the butler-restore-checkpoint branch from 1b81c23 to bd917b3 Compare March 13, 2023 14:16
@msbutler msbutler force-pushed the butler-restore-checkpoint branch from bd917b3 to 2ea46c8 Compare March 13, 2023 16:22
@msbutler
Collaborator Author

TFTRs!!

bors r=adityamaru, rhu713

@msbutler
Collaborator Author

Test run indicates all is well.

@craig
Contributor

craig bot commented Mar 13, 2023

Build succeeded:

@craig craig bot merged commit e71ac0f into cockroachdb:master Mar 13, 2023
@msbutler msbutler deleted the butler-restore-checkpoint branch March 13, 2023 18:19
msbutler added a commit to msbutler/cockroach that referenced this pull request Mar 14, 2023
PR cockroachdb#97862 introduced a subtle bug which allowed the new restore checkpointing
policy to take effect before the 23_1 migrations occurred. This patch ensures
the new policy only takes effect after all migrations occur.

Release note: None

Epic: None
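
For context, the pattern being applied is a cluster-version gate: the new checkpoint format is only written once every node is on a version that can read it. Below is a simplified, self-contained sketch of that idea; the type, gate value, and function names are hypothetical and do not reflect CockroachDB's actual clusterversion API.

```go
package main

import "fmt"

// clusterVersion is a simplified stand-in for the active cluster version.
type clusterVersion struct{ major, minor int }

func (v clusterVersion) atLeast(gate clusterVersion) bool {
	return v.major > gate.major || (v.major == gate.major && v.minor >= gate.minor)
}

// frontierCheckpointingGate is the (hypothetical) version at which every node
// understands the frontier-based checkpoint format.
var frontierCheckpointingGate = clusterVersion{major: 23, minor: 1}

// useFrontierCheckpointing returns true only once the whole cluster has been
// migrated, so older nodes never see a checkpoint format they cannot read.
func useFrontierCheckpointing(active clusterVersion) bool {
	return active.atLeast(frontierCheckpointingGate)
}

func main() {
	fmt.Println(useFrontierCheckpointing(clusterVersion{22, 2})) // false: legacy high-water checkpoint
	fmt.Println(useFrontierCheckpointing(clusterVersion{23, 1})) // true: span-frontier checkpoint
}
```
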
craig bot pushed a commit that referenced this pull request Mar 14, 2023
98368: multiregionccl,server: use cached sqlliveness.Reader, deflake ColdStartLatencyTest r=ajwerner a=ajwerner

#### multitenantccl: deflake ColdStartLatencyTest
This test was flaky due to the closed timestamp sometimes not leading far
enough for global tables due to overload, and due to a cached liveness reader
not being used in distsql. The former was fixed in previous commits. The
latter is fixed here.

Fixes: #96334


#### sql: use CachedReader for uses with sqlinstance and the sql builtins
The CachedReader won't block, which in multi-region clusters is good. It will
mean that in some cases, it'll state that a session is alive when it most
certainly is not. Currently, nobody needs synchronous semantics.

This is a major part of fixing TestColdStartLatency, as distsql planning would
sometimes block. That's not acceptable: query physical planning should not
need to wait on a cross-region RPC.

#### sqlliveness: re-arrange APIs to clarify when the API was blocking

By "default" the implementation of Reader was blocking and there was a method
to get a handle to a non-blocking CachedReader(). This asymmetry does not aid
understandability. The non-blocking reader came later, but it is generally the
more desirable interface.

Release note: None
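
As a rough illustration of the blocking-versus-cached distinction described above (a hypothetical sketch, not the sqlliveness API): the cached reader answers from local state immediately and refreshes asynchronously, so it can briefly report an expired session as alive, which is the trade-off the commit message accepts.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Reader reports whether a SQL session is alive. Hypothetical names throughout.
type Reader interface {
	IsAlive(sessionID string) bool
}

// blockingReader consults the authoritative store, which may require a
// (potentially cross-region) lookup.
type blockingReader struct{}

func (blockingReader) IsAlive(sessionID string) bool {
	time.Sleep(50 * time.Millisecond) // stand-in for a remote lookup
	return false
}

// cachedReader answers immediately from a local cache refreshed in the
// background; it may report a session as alive after it has actually expired.
type cachedReader struct {
	mu    sync.Mutex
	cache map[string]bool
	src   Reader
}

func (c *cachedReader) IsAlive(sessionID string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	alive, ok := c.cache[sessionID]
	if !ok {
		// Unknown sessions are optimistically treated as alive and refreshed
		// asynchronously, so the caller never blocks on the lookup.
		c.cache[sessionID] = true
		go c.refresh(sessionID)
		return true
	}
	return alive
}

func (c *cachedReader) refresh(sessionID string) {
	alive := c.src.IsAlive(sessionID) // the slow, authoritative answer
	c.mu.Lock()
	defer c.mu.Unlock()
	c.cache[sessionID] = alive
}

func main() {
	r := &cachedReader{cache: map[string]bool{}, src: blockingReader{}}
	fmt.Println(r.IsAlive("session-1")) // true immediately (possibly stale)
	time.Sleep(100 * time.Millisecond)
	fmt.Println(r.IsAlive("session-1")) // false once the refresh lands
}
```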

98519: storage: More CheckSSTConflicts fixes r=erikgrinaker a=itsbilal

A few additional fixes around CheckSSTConflicts, stats calculations, and Next()ing logic, caught by kvnemesis. Hopefully the last of their kind.

Also re-enable kvnemesis testing for range keys in AddSSTable, reverting #98475.

Fixes #94141.
Fixes #98473.
Informs #94876.

Epic: none

Release note: None

98567: backupccl: use correct version gate for restore checkpointing r=adityamaru a=msbutler

PR #97862 introduced a subtle bug which allowed the new restore checkpointing policy to take effect before the 23_1 migrations occurred. This patch ensures the new policy only takes effect after all migrations occur.

Release note: None

Epic: None

Co-authored-by: ajwerner <[email protected]>
Co-authored-by: Bilal Akhtar <[email protected]>
Co-authored-by: Michael Butler <[email protected]>
Successfully merging this pull request may close these issues.

backupccl: add finer grain checkpointing during restore