Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

backupccl: add hidden cluster setting to split exports by TS #68981

Merged

Conversation

aliher1911
Copy link
Contributor

This diff enables usage of split keys by timestamp during export in
backup processor. This feature is enables by setting
bulkio.backup.split_keys_on_timestamps to true.
When enabled, generated backup SST files would be split on timestamps
of a key when backing up history.

Release note: None

@aliher1911 aliher1911 requested a review from dt August 16, 2021 11:58
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@aliher1911
Copy link
Contributor Author

@dt Would this be sufficient for backup processor to use new split by timestamp functionality? I don't see any immediate tests within backupccl that test that.
Happy to update if you point me to something that we already have.

@dt
Copy link
Member

dt commented Aug 16, 2021

I think we want to make sure that that the processor does not flush a file that ends mid-key, on this line:

if s.flushedSize > targetFileSize.Get(s.conf.settings) {

To do that, I think we'll need to add a parameter to returnedSST that indicates if it ended mid-key to not flush there?

@aliher1911
Copy link
Contributor Author

Technically, once we read the full range, the ending timestamp will be empty as well as the resume span. Otherwise there's no guarantee that we will ever hit the last timestamp of any key until the end of the range. Will that be a problem?

@dt
Copy link
Member

dt commented Aug 16, 2021

We can't report progress until we flush a file. Ideally we'd do so before range-end, but I'm okay with waiting until a range-end if we have to (e.g. the whole range could be revisions of a single key in the worst case no matter what). If we decide we care, we could flush mid-iteration of the returned SST into the cloud file writer.

@aliher1911 aliher1911 force-pushed the add_split_key_to_backup_processor branch from 3820802 to e3ff1f9 Compare August 17, 2021 10:13
@aliher1911
Copy link
Contributor Author

I see what you mean! I added the flag first, but it looks like completedSpans is doing exactly the same thing. If response had no ResumeSpan we reached end of range and we use it as criteria to set completedSpans to 1.
Maybe turn that to bool instead and just increment completed spans and use it as criteria to flush? I think it depends if completedSpans has any plans to be more than 1 in future.

@dt
Copy link
Member

dt commented Aug 17, 2021

So yeah, it is true that completedSpans will always be zero if returnedSST ends mid-key, so technically we could just add != 0 to the flush condition.

But today, we can still flush an in-progress cloud file and open a new one to obey our size limit on that line even when completedSpans is zero, i.e. we got a resumeSpan -- we just will not increase the number of "completed" spans when we update the progress, but we will record the span covered by that flushed file.

We only really want to not flush the file, and record its coverage, if its current coverage ends mid-key. I think I'd just add a new bool to returnedSST to indicate the mid-key ending, and then check that in the flush conditional.

(Indeed, as an optimization, I might even go back to pebbleExportToSst and add a small amount of leniency on targetSize when stopMidKey is true, e.g. hardTargetSize = targetSize * 1.05; ... if (newKey && accumulated > targetSize) || splitMitKey && accumulated > hardTargetSize) ... just since I think most callers slightly prefer spans that end not mid-key, since AFAIK, any caller that gets a mid-key file will be forced to wait for the resume before it can do anything with it, or at least the suffix of it. But anyway, that can be a separate thing)

@aliher1911 aliher1911 force-pushed the add_split_key_to_backup_processor branch 4 times, most recently from 236d8b7 to 7475388 Compare August 18, 2021 11:28
@aliher1911
Copy link
Contributor Author

What I realised adding test is that if we request breaking mid key we start generating timestamps even for the case where we don't want history. I think that should be tackled inside kv/pebble where we send resume span and avoid timestamp if we stop at new key. I'll make a separate change for that as it is 1 liner but needs 100 lines of test.

@aliher1911 aliher1911 marked this pull request as ready for review August 18, 2021 12:23
@aliher1911 aliher1911 requested a review from a team August 18, 2021 12:23
This diff enables usage of split keys by timestamp during export in
backup processor. This feature is enables by setting
bulkio.backup.split_keys_on_timestamps to true.
When enabled, generated backup SST files would be split on timestamps
of a key when backing up history.

Release note: None
@aliher1911 aliher1911 force-pushed the add_split_key_to_backup_processor branch from 7475388 to 5526fc8 Compare August 18, 2021 12:39
@aliher1911
Copy link
Contributor Author

bors r=dt

@craig
Copy link
Contributor

craig bot commented Aug 18, 2021

Build succeeded:

@craig craig bot merged commit d13d9cf into cockroachdb:master Aug 18, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants