backupccl: add hidden cluster setting to split exports by TS #68981

aliher1911 · 2021-08-16T11:58:39Z

This diff enables usage of split keys by timestamp during export in
backup processor. This feature is enables by setting
bulkio.backup.split_keys_on_timestamps to true.
When enabled, generated backup SST files would be split on timestamps
of a key when backing up history.

Release note: None

cockroach-teamcity · 2021-08-16T11:58:45Z

This change is

aliher1911 · 2021-08-16T12:00:44Z

@dt Would this be sufficient for backup processor to use new split by timestamp functionality? I don't see any immediate tests within backupccl that test that.
Happy to update if you point me to something that we already have.

dt · 2021-08-16T20:51:20Z

I think we want to make sure that that the processor does not flush a file that ends mid-key, on this line:

cockroach/pkg/ccl/backupccl/backup_processor.go

Line 705 in becc3f9

if s.flushedSize > targetFileSize.Get(s.conf.settings) {

To do that, I think we'll need to add a parameter to returnedSST that indicates if it ended mid-key to not flush there?

aliher1911 · 2021-08-16T21:44:52Z

Technically, once we read the full range, the ending timestamp will be empty as well as the resume span. Otherwise there's no guarantee that we will ever hit the last timestamp of any key until the end of the range. Will that be a problem?

dt · 2021-08-16T22:11:57Z

We can't report progress until we flush a file. Ideally we'd do so before range-end, but I'm okay with waiting until a range-end if we have to (e.g. the whole range could be revisions of a single key in the worst case no matter what). If we decide we care, we could flush mid-iteration of the returned SST into the cloud file writer.

aliher1911 · 2021-08-17T10:22:42Z

I see what you mean! I added the flag first, but it looks like completedSpans is doing exactly the same thing. If response had no ResumeSpan we reached end of range and we use it as criteria to set completedSpans to 1.
Maybe turn that to bool instead and just increment completed spans and use it as criteria to flush? I think it depends if completedSpans has any plans to be more than 1 in future.

dt · 2021-08-17T10:52:15Z

So yeah, it is true that completedSpans will always be zero if returnedSST ends mid-key, so technically we could just add != 0 to the flush condition.

But today, we can still flush an in-progress cloud file and open a new one to obey our size limit on that line even when completedSpans is zero, i.e. we got a resumeSpan -- we just will not increase the number of "completed" spans when we update the progress, but we will record the span covered by that flushed file.

We only really want to not flush the file, and record its coverage, if its current coverage ends mid-key. I think I'd just add a new bool to returnedSST to indicate the mid-key ending, and then check that in the flush conditional.

(Indeed, as an optimization, I might even go back to pebbleExportToSst and add a small amount of leniency on targetSize when stopMidKey is true, e.g. hardTargetSize = targetSize * 1.05; ... if (newKey && accumulated > targetSize) || splitMitKey && accumulated > hardTargetSize) ... just since I think most callers slightly prefer spans that end not mid-key, since AFAIK, any caller that gets a mid-key file will be forced to wait for the resume before it can do anything with it, or at least the suffix of it. But anyway, that can be a separate thing)

aliher1911 · 2021-08-18T12:13:22Z

What I realised adding test is that if we request breaking mid key we start generating timestamps even for the case where we don't want history. I think that should be tackled inside kv/pebble where we send resume span and avoid timestamp if we stop at new key. I'll make a separate change for that as it is 1 liner but needs 100 lines of test.

This diff enables usage of split keys by timestamp during export in backup processor. This feature is enables by setting bulkio.backup.split_keys_on_timestamps to true. When enabled, generated backup SST files would be split on timestamps of a key when backing up history. Release note: None

aliher1911 · 2021-08-18T13:52:16Z

bors r=dt

craig · 2021-08-18T15:14:05Z

Build succeeded:

GitHub CI (Cockroach)

aliher1911 requested a review from dt August 16, 2021 11:58

aliher1911 force-pushed the add_split_key_to_backup_processor branch from 3820802 to e3ff1f9 Compare August 17, 2021 10:13

aliher1911 force-pushed the add_split_key_to_backup_processor branch 4 times, most recently from 236d8b7 to 7475388 Compare August 18, 2021 11:28

dt approved these changes Aug 18, 2021

View reviewed changes

aliher1911 marked this pull request as ready for review August 18, 2021 12:23

aliher1911 requested a review from a team August 18, 2021 12:23

aliher1911 force-pushed the add_split_key_to_backup_processor branch from 7475388 to 5526fc8 Compare August 18, 2021 12:39

craig bot merged commit d13d9cf into cockroachdb:master Aug 18, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

backupccl: add hidden cluster setting to split exports by TS #68981

backupccl: add hidden cluster setting to split exports by TS #68981

aliher1911 commented Aug 16, 2021

cockroach-teamcity commented Aug 16, 2021

aliher1911 commented Aug 16, 2021

dt commented Aug 16, 2021

aliher1911 commented Aug 16, 2021

dt commented Aug 16, 2021

aliher1911 commented Aug 17, 2021

dt commented Aug 17, 2021

aliher1911 commented Aug 18, 2021

aliher1911 commented Aug 18, 2021

craig bot commented Aug 18, 2021

backupccl: add hidden cluster setting to split exports by TS #68981

backupccl: add hidden cluster setting to split exports by TS #68981

Conversation

aliher1911 commented Aug 16, 2021

cockroach-teamcity commented Aug 16, 2021

aliher1911 commented Aug 16, 2021

dt commented Aug 16, 2021

aliher1911 commented Aug 16, 2021

dt commented Aug 16, 2021

aliher1911 commented Aug 17, 2021

dt commented Aug 17, 2021

aliher1911 commented Aug 18, 2021

aliher1911 commented Aug 18, 2021

craig bot commented Aug 18, 2021