backupccl: merge files returned to the processor during backup #65576
Conversation
(tried to break this one up for easier commit-by-commit review)
Force-pushed from f6d98e3 to 3ffbab9
Still digesting the progress flow's interaction with the batching but the changes LGTM -- nice cleanup overall! Are there any tests that need to be added to make sure we exercise this buffering?
Force-pushed from c71a76a to 0970220
// If this span starts before the last buffered span ended, we need to flush
// since it overlaps but SSTWriter demands writes in-order.
// TODO(dt): consider buffering resp until _next_ `write` to allow minimal
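For context, a toy sketch of the constraint this quoted comment describes (hypothetical names, not the actual sstSink code): the SST writer only accepts spans in ascending key order, so a file whose span starts before where the last buffered span ended forces a flush and a fresh output file.

```go
package main

import "fmt"

// span is a hypothetical key span covered by one exported file.
type span struct{ start, end string }

// inOrderSink mimics the constraint above: SSTWriter demands writes
// in key order, so an out-of-order arrival forces a flush.
type inOrderSink struct {
	lastEnd string
	flushes int
}

func (s *inOrderSink) write(sp span) {
	if s.lastEnd != "" && sp.start < s.lastEnd {
		// This span starts before the last buffered span ended:
		// flush what we have and start a new output file.
		s.flushes++
	}
	s.lastEnd = sp.end
}

func main() {
	var s inOrderSink
	s.write(span{"c", "d"})
	s.write(span{"a", "b"}) // earlier span arrives late -> forced flush
	fmt.Println(s.flushes)  // prints 1
}
```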
It seems like the concurrency in the producer could make us miss many merging opportunities in the sstSink. This is also a problem that RESTORE faces with the introduced concurrency. I think this is different from what this TODO is saying, but I may have misunderstood.
I wonder if we could give the channel a bit more buffer to give a bit of a lead to the producer, then the sstWriter could pull off the SSTs and buffer them in a sorted data structure. This would certainly increase the memory overhead of this buffering, but I worry that we won't see much improvement in the merging otherwise.
Alternatively, maybe we can sort the buffered files by span in the sstSink during flush (they should all be non-overlapping) and merge them there rather than in write?
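A toy sketch of that sort-at-flush alternative (hypothetical types and names, not the actual backupccl code): sort the buffered files by start key during flush — they should all be non-overlapping — and coalesce runs of contiguous spans into single outputs.

```go
package main

import (
	"fmt"
	"sort"
)

// bufferedFile is a hypothetical stand-in for an exported SST returned
// to the backup processor, identified here only by its key span.
type bufferedFile struct {
	startKey, endKey string
}

// mergeSortedSpans sorts the buffered files by start key at flush time
// and merges adjacent, contiguous files, rather than relying on them
// arriving in order at write().
func mergeSortedSpans(files []bufferedFile) []bufferedFile {
	sort.Slice(files, func(i, j int) bool {
		return files[i].startKey < files[j].startKey
	})
	var merged []bufferedFile
	for _, f := range files {
		n := len(merged)
		// Extend the previous span when this file starts exactly where
		// it ended; otherwise start a new output file.
		if n > 0 && merged[n-1].endKey == f.startKey {
			merged[n-1].endKey = f.endKey
		} else {
			merged = append(merged, f)
		}
	}
	return merged
}

func main() {
	// Files arrive out of order due to producer concurrency.
	out := mergeSortedSpans([]bufferedFile{
		{"c", "d"}, {"a", "b"}, {"b", "c"}, {"e", "f"},
	})
	fmt.Println(out) // prints [{a d} {e f}]
}
```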
That is what this TODO is about: right now, if a later file gets to the sink before an earlier one, the sink is forced to flush the later file and then add the earlier one to its buffer. The TODO suggests that the sink buffer the received file before pushing its contents into the SST writer, so that if the next (or next n) files show up and are indeed earlier, we can write them first instead of being forced to start a new file.
I just didn't want to do that in this change since it looks a little different after the direct-write-ssts change that is queued up next.
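A toy sketch of that buffer-before-write idea (hypothetical names, not the actual sink code): hold back the most recently received file until the next one arrives, and emit the pair in key order, so a single out-of-order arrival no longer forces a new file.

```go
package main

import "fmt"

// file is a hypothetical stand-in for an exported SST, identified by its span.
type file struct{ start, end string }

// reorderSink holds back one received file instead of pushing it
// straight into the SST writer; if the next file is actually earlier,
// it is written first, avoiding a forced flush.
type reorderSink struct {
	pending *file  // the one buffered file not yet written
	written []file // stand-in for what went into the SST writer
}

func (s *reorderSink) write(f file) {
	if s.pending == nil {
		s.pending = &f
		return
	}
	// Emit the earlier of the buffered and the newly received file.
	a, b := *s.pending, f
	if b.start < a.start {
		a, b = b, a
	}
	s.written = append(s.written, a)
	s.pending = &b
}

func (s *reorderSink) flush() {
	if s.pending != nil {
		s.written = append(s.written, *s.pending)
		s.pending = nil
	}
}

func main() {
	// The later span [c,d) arrives before the earlier [a,b): with a
	// one-file buffer the sink can still write them in order.
	var s reorderSink
	s.write(file{"c", "d"})
	s.write(file{"a", "b"})
	s.flush()
	fmt.Println(s.written) // prints [{a b} {c d}]
}
```

A real implementation would buffer the next n files rather than one, trading memory for more reordering headroom, which is the trade-off discussed above.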
TFTR!
bors r+
Build succeeded:
backupccl: merge KV-returned SSTs during BACKUP
BACKUP writes row data to cloud storage in two ways:
1) the range holding the data can be instructed to export it into a
file which it writes to cloud storage directly or
2) the range holding the data can export it into a file which it
returns to the BACKUP process which then writes it to cloud storage.
When doing the second, we can be smarter than writing each file that is
returned to the SQL process immediately and individually, and instead
buffer and merge many of those files before writing the merged file to
the destination.
A future change could then build on this to opportunistically choose to
return files if they are smaller than some limit even if we were initially
operating in case (1), to make use of this merging and avoid creating
large numbers of small files.

Release note: none.