backup: incremental backup of little data over many ranges makes many tiny files #44480
Labels
A-disaster-recovery
C-enhancement
T-disaster-recovery
Currently backup asks each range to export its contents (or, for an incremental backup, the changes to its contents) to a file (or multiple files) in the destination storage. Each individual file, however, carries some fixed overhead: even with just a single small key, the SST metadata trailer, potential encryption headers, etc. add up to a minimum SST size, plus any minimum object size enforced by the underlying filesystem/storage provider, e.g. S3 bills each object as if it has a minimum size of 128KB. Additionally, there is overhead in the backup/restore process in tracking/sorting/handling the metadata, which scales with the number of files.
In the push to run more frequent incremental backups, we expect the amount of data changed in any given range to get much smaller as we shrink the time window, so we could start seeing far more small files, and for clusters with large range counts the wasted overhead could become significant.
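For a rough sense of scale (purely illustrative numbers, using the minimum-object-size billing mentioned above): if a one-minute incremental touches 100,000 ranges and each range only changed ~2KB, the actual data is roughly 200MB, but 100,000 separate objects billed at a 128KB minimum would be charged as roughly 12.8GB, and the backup/restore metadata now has 100,000 file entries to track and sort.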
We may want to (a) try to quantify this (e.g. what does the output of 1min of backup from the 12th hour of a tpcc50k run look like?) and (b), if we determine it is a problem, investigate ways to combine the small change outputs from multiple ranges into fewer, larger files that minimize overhead. Currently ranges write their files directly to the destination (as there was no other way to do distributed work when backup was written), but we could add an intermediate DistSQL processor (see #40239), scheduling one on each node, which could aggregate the results of exporting multiple small ranges on that node before writing a file (potentially including a "return the SST in the response instead of writing it, if smaller than x" flag on ExportRequest to avoid the extra hop for big files). A rough sketch of such an aggregator follows below.
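To make the aggregation idea concrete, here is a minimal sketch, not actual CockroachDB code: all of the type and function names below are invented for illustration, and it assumes the hypothetical "return the SST in the response if smaller than x" ExportRequest behavior described above.

```go
// Illustrative sketch only: none of these names are real CockroachDB types.
// It shows the shape of a per-node step that buffers small per-range export
// results returned in-memory and flushes them as fewer, larger files.
package backupsketch

import (
	"bytes"
	"fmt"
)

// rangeExport is a hypothetical in-memory result of an ExportRequest whose
// output was small enough to return in the response instead of writing.
type rangeExport struct {
	span string // pretty-printed key span, for metadata/debugging
	sst  []byte // the range's exported SST bytes
}

// externalStorage stands in for the backup destination (nodelocal, S3, ...).
type externalStorage interface {
	WriteFile(name string, data []byte) error
}

// aggregator buffers small exports and writes them out in larger batches to
// amortize per-file overhead (SST trailers, minimum object sizes, metadata).
type aggregator struct {
	dest      externalStorage
	flushSize int // e.g. a few MB; the right threshold is an open question

	buf     []rangeExport
	bufSize int
	fileSeq int
}

// add buffers one range's export and flushes once the buffer is large enough.
func (a *aggregator) add(e rangeExport) error {
	a.buf = append(a.buf, e)
	a.bufSize += len(e.sst)
	if a.bufSize >= a.flushSize {
		return a.flush()
	}
	return nil
}

// flush writes the buffered exports as a single file. A real implementation
// would merge them into one well-formed SST and record per-span file
// metadata so RESTORE can find each range's data; here they are simply
// concatenated to keep the sketch short.
func (a *aggregator) flush() error {
	if len(a.buf) == 0 {
		return nil
	}
	var combined bytes.Buffer
	for _, e := range a.buf {
		combined.Write(e.sst)
	}
	a.fileSeq++
	name := fmt.Sprintf("agg-%06d.sst", a.fileSeq)
	if err := a.dest.WriteFile(name, combined.Bytes()); err != nil {
		return err
	}
	a.buf = a.buf[:0]
	a.bufSize = 0
	return nil
}
```

The flush threshold would have to be tuned against per-node memory budgets versus the per-file overhead we are trying to amortize, and a real processor would also need to emit the per-span file metadata that RESTORE depends on.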
Epic CRDB-7078