backup: incremental backup of little data over many ranges makes many tiny files #44480
Labels
A-disaster-recovery
C-enhancement
T-disaster-recovery
Currently backup asks each range to export its contents (or, for an incremental backup, the changes to its contents) to a file (or multiple files) in the destination storage. Each individual file, however, carries some fixed overhead: even with just a single small key, the SST metadata trailer, potential encryption headers, etc. add up to a minimum SST size, plus any minimum object size enforced by the underlying filesystem/storage provider, e.g. S3 bills each object as if it has a minimum size of 128KB. Additionally, there is overhead in the backup/restore process in tracking/sorting/handling the metadata, which scales with the number of files.
In the push to run more frequent incremental backups, we expect the amount of data changed in any given range to get much smaller as we shrink the time window, so we could start seeing far more small files, and for clusters with large range counts the wasted overhead could become significant.
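For a rough sense of scale (purely illustrative numbers, using the minimum-object-size billing mentioned above): if a one-minute incremental touches 100,000 ranges and each range only changed ~2KB, the actual data is roughly 200MB, but 100,000 separate objects billed at a 128KB minimum would be charged as roughly 12.8GB, and the backup/restore metadata now has 100,000 file entries to track and sort.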
We may want to (a) try to quantify this (e.g. what does the output of 1min of backup from the 12th hour of a tpcc50k run look like?) and (b), if we determine it is a problem, investigate ways to combine the small change outputs from multiple ranges into fewer, larger files that minimize overhead. Currently ranges write their files directly to the destination (as there was no other way to do distributed work when backup was written), but we could add an intermediate DistSQL processor (see #40239), scheduling one on each node, which could aggregate the results of exporting multiple small ranges on that node before writing a file (potentially including a "return the SST in the response instead of writing it, if smaller than x" flag on ExportRequest to avoid the extra hop for big files). A rough sketch of such an aggregator follows below.
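To make the aggregation idea concrete, here is a minimal sketch, not actual CockroachDB code: all of the type and function names below are invented for illustration, and it assumes the hypothetical "return the SST in the response if smaller than x" ExportRequest behavior described above.

```go
// Illustrative sketch only: none of these names are real CockroachDB types.
// It shows the shape of a per-node step that buffers small per-range export
// results returned in-memory and flushes them as fewer, larger files.
package backupsketch

import (
	"bytes"
	"fmt"
)

// rangeExport is a hypothetical in-memory result of an ExportRequest whose
// output was small enough to return in the response instead of writing.
type rangeExport struct {
	span string // pretty-printed key span, for metadata/debugging
	sst  []byte // the range's exported SST bytes
}

// externalStorage stands in for the backup destination (nodelocal, S3, ...).
type externalStorage interface {
	WriteFile(name string, data []byte) error
}

// aggregator buffers small exports and writes them out in larger batches to
// amortize per-file overhead (SST trailers, minimum object sizes, metadata).
type aggregator struct {
	dest      externalStorage
	flushSize int // e.g. a few MB; the right threshold is an open question

	buf     []rangeExport
	bufSize int
	fileSeq int
}

// add buffers one range's export and flushes once the buffer is large enough.
func (a *aggregator) add(e rangeExport) error {
	a.buf = append(a.buf, e)
	a.bufSize += len(e.sst)
	if a.bufSize >= a.flushSize {
		return a.flush()
	}
	return nil
}

// flush writes the buffered exports as a single file. A real implementation
// would merge them into one well-formed SST and record per-span file
// metadata so RESTORE can find each range's data; here they are simply
// concatenated to keep the sketch short.
func (a *aggregator) flush() error {
	if len(a.buf) == 0 {
		return nil
	}
	var combined bytes.Buffer
	for _, e := range a.buf {
		combined.Write(e.sst)
	}
	a.fileSeq++
	name := fmt.Sprintf("agg-%06d.sst", a.fileSeq)
	if err := a.dest.WriteFile(name, combined.Bytes()); err != nil {
		return err
	}
	a.buf = a.buf[:0]
	a.bufSize = 0
	return nil
}
```

The flush threshold would have to be tuned against per-node memory budgets versus the per-file overhead we are trying to amortize, and a real processor would also need to emit the per-span file metadata that RESTORE depends on.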
Epic CRDB-7078