backupccl: merge files returned to the processor during backup #65576

Merged: 1 commit merged into cockroachdb:master from stitch-returned on Jun 23, 2021

Conversation

@dt (Member) commented May 21, 2021

backupccl: merge KV-returned SSTs during BACKUP

BACKUP writes row data to cloud storage in two ways:
1) the range holding the data can be instructed to export it into a
file which it writes to cloud storage directly or
2) the range holding the data can export it into a file which it
returns to the BACKUP process which then writes it to cloud storage.

In the second case, we can be smarter than writing each file immediately and individually as it is returned to the SQL process: instead, we can buffer and merge many of those files before writing the merged file to the destination.

A future change could then build on this to opportunistically choose to
return files if they are smaller than some limit, even if we were initially
operating in case (1), to make use of this merging and avoid creating
large numbers of small files.

Release note: none.
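To make the buffering concrete, here is a minimal Go sketch under stated assumptions: `returnedFile`, `mergingSink`, and `flushSize` are names invented for this illustration, not the PR's actual `sstSink` API, which appends returned data into a `storage.SSTWriter` and uploads the result to cloud storage.

```go
package main

import (
	"bytes"
	"fmt"
)

// returnedFile is a hypothetical stand-in for an exported file that a
// range returns to the backup processor, covering the key span [start, end).
type returnedFile struct {
	start, end []byte
	data       []byte
}

// mergingSink buffers in-order returned files and concatenates their
// contents, flushing one merged output once the buffer passes flushSize.
type mergingSink struct {
	flushSize int
	buf       []byte
	lastEnd   []byte // end key of the last span appended to buf
	flushed   int    // count of merged files written so far
}

func (s *mergingSink) write(f returnedFile) {
	// An SST writer demands keys in order, so a file that starts before
	// the end of what is already buffered forces a flush first.
	if s.lastEnd != nil && bytes.Compare(f.start, s.lastEnd) < 0 {
		s.flush()
	}
	s.buf = append(s.buf, f.data...)
	s.lastEnd = f.end
	if len(s.buf) >= s.flushSize {
		s.flush()
	}
}

func (s *mergingSink) flush() {
	if len(s.buf) == 0 {
		return
	}
	s.flushed++
	// The real sink would write the merged file to cloud storage here.
	fmt.Printf("flushed merged file %d (%d bytes)\n", s.flushed, len(s.buf))
	s.buf = s.buf[:0]
	s.lastEnd = nil
}

func main() {
	sink := &mergingSink{flushSize: 128}
	// Three small, in-order files become one merged output instead of three.
	sink.write(returnedFile{start: []byte("a"), end: []byte("b"), data: make([]byte, 40)})
	sink.write(returnedFile{start: []byte("b"), end: []byte("c"), data: make([]byte, 40)})
	sink.write(returnedFile{start: []byte("c"), end: []byte("d"), data: make([]byte, 60)})
	sink.flush()
}
```

The size threshold trades memory for fewer, larger output files; the flush-on-overlap check in `write` mirrors the in-order requirement discussed in the review thread below.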

@dt requested review from pbardea, adityamaru and a team on May 21, 2021 20:57
@cockroach-teamcity (Member) commented: This change is Reviewable

@dt (Member, Author) commented May 21, 2021

(tried to break this one up for easier commit-by-commit review)

@dt force-pushed the stitch-returned branch 4 times, most recently from f6d98e3 to 3ffbab9 on May 25, 2021 19:00
@pbardea (Contributor) left a comment:

Still digesting the progress flow's interaction with the batching but the changes LGTM -- nice cleanup overall! Are there any tests that need to be added to make sure we exercise this buffering?

pkg/ccl/backupccl/backup_processor.go: 3 resolved review threads (outdated)
@dt force-pushed the stitch-returned branch 3 times, most recently from c71a76a to 0970220 on June 22, 2021 13:55
pkg/ccl/backupccl/backup_processor.go (outdated):

```go
// If this span starts before the last buffered span ended, we need to flush
// since it overlaps but SSTWriter demands writes in-order.
// TODO(dt): consider buffering resp until _next_ `write` to allow minimal
```
@pbardea (Contributor) commented:

It seems like the concurrency in the producer could make us miss many merging opportunities in the sstSink. This is also a problem that RESTORE faces with the introduced concurrency. I think this is different from what this TODO is saying, but I may have misunderstood.

I wonder if we could give the channel a bit more buffer so the producer has a lead, and then the sstWriter could pull the SSTs off and buffer them in a sorted data structure. This would certainly increase the memory overhead of the buffering, but I worry that we won't see much improvement in the merging otherwise.

Alternatively, maybe we can sort the buffered files by span in the sstSink during flush (they should all be non-overlapping) and merge them there rather than in write?
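As a rough sketch of that sort-at-flush alternative (with invented names `bufferedFile` and `flushSorted`, not the PR's code): since the buffered files cover non-overlapping spans, sorting by start key at flush time recovers a valid key order no matter how the concurrent producers interleave.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// bufferedFile is a hypothetical exported file covering the span [start, end).
type bufferedFile struct {
	start, end, data []byte
}

// flushSorted sorts the buffered files by start key (the spans are
// non-overlapping, so this yields a valid key order) and concatenates them
// into one merged output. Purely illustrative: the real sink would append
// each file into an SST writer and upload the result to cloud storage.
func flushSorted(files []bufferedFile) []byte {
	sort.Slice(files, func(i, j int) bool {
		return bytes.Compare(files[i].start, files[j].start) < 0
	})
	var merged []byte
	for _, f := range files {
		merged = append(merged, f.data...)
	}
	return merged
}

func main() {
	// Files arrive out of order from the concurrent producers.
	files := []bufferedFile{
		{start: []byte("c"), end: []byte("d"), data: []byte("C")},
		{start: []byte("a"), end: []byte("b"), data: []byte("A")},
		{start: []byte("b"), end: []byte("c"), data: []byte("B")},
	}
	fmt.Printf("merged: %s\n", flushSorted(files)) // merged: ABC
}
```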

@dt (Member, Author) replied:

That is what this TODO is about: right now, if a later file gets to the sink before an earlier one, the sink is forced to flush the later file and then add the earlier one to its buffer. The TODO suggests the sink buffer the received file before pushing its contents into the SST writer, so that if the next (or next n) files to show up are in fact earlier, we can flush them first instead of being forced to start a new file.

I just didn't want to do that in this change, since it looks a little different after the direct-write-SSTs change that is queued up next.
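A sketch of the lookahead that TODO describes, again with invented names (`file`, `reorderingSink`): holding back the most recently received file lets an earlier-sorting file that arrives next be written first, instead of forcing a flush.

```go
package main

import (
	"bytes"
	"fmt"
)

// file is a hypothetical exported file covering the span [start, end).
type file struct{ start, end, data []byte }

// reorderingSink holds back the most recently received file so that a
// slightly earlier file arriving next can be written first. Purely
// illustrative of the TODO's idea, not the PR's code.
type reorderingSink struct{ pending *file }

// receive buffers f, emitting whichever of {f, pending} sorts first.
func (s *reorderingSink) receive(f file, write func(file)) {
	if s.pending == nil {
		s.pending = &f
		return
	}
	if bytes.Compare(f.start, s.pending.start) < 0 {
		write(f) // f is earlier: write it now, keep holding pending
		return
	}
	write(*s.pending)
	s.pending = &f
}

// drain writes any held file once the producer is done.
func (s *reorderingSink) drain(write func(file)) {
	if s.pending != nil {
		write(*s.pending)
		s.pending = nil
	}
}

func main() {
	s := &reorderingSink{}
	write := func(f file) { fmt.Printf("write %s-%s\n", f.start, f.end) }
	// Files arrive slightly out of order: c-d before b-c.
	s.receive(file{start: []byte("a"), end: []byte("b")}, write)
	s.receive(file{start: []byte("c"), end: []byte("d")}, write)
	s.receive(file{start: []byte("b"), end: []byte("c")}, write)
	s.drain(write) // output is a-b, b-c, c-d: back in order
}
```

A one-file lookahead only repairs adjacent swaps; buffering the next n files, as the comment above suggests, would tolerate deeper reordering at the cost of more memory.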

@pbardea self-requested a review on June 22, 2021 14:27
@dt force-pushed the stitch-returned branch from 0970220 to b10d7d5 on June 22, 2021 18:22
@dt (Member, Author) commented Jun 23, 2021

TFTR!

@dt (Member, Author) commented Jun 23, 2021

bors r+

@craig (bot) commented Jun 23, 2021

Build succeeded.

@craig (bot) merged commit 231e1e5 into cockroachdb:master on Jun 23, 2021
@dt deleted the stitch-returned branch on June 23, 2021 14:25