backupccl: slow checkpointing could bring BACKUP to a crawl #83456
Labels
A-disaster-recovery
C-bug
T-disaster-recovery
In a running backup, a large number of backup processors are
responsible for exporting data from the database and writing it
out as files (SSTs) to S3.
Inside the backup processor, when we finish writing a file to S3,
we report that status to the job coordinator by sending it a
message. The coordinator receives that message, updates its local
view of the progress, and periodically checkpoints that progress
by writing a file to S3.
The checkpoint allows us to resume the backup in the case of a
transient failure.
If the backup manifest that we checkpoint is very large, then
constructing the checkpoint may take considerable time.
While the coordinator is constructing this checkpoint, it is not
reading progress messages sent from the backup processors. In the
happy case, this is fine. We will buffer up to 16 messages between
the backup processors and the coordinator. Further, we only
construct a checkpoint if it has been a minute or more since the
last one.
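The producer/consumer shape described above can be sketched with a bounded Go channel. This is a hypothetical illustration, not CockroachDB's actual code: `fillBuffer` stands in for the processors enqueueing progress updates while the coordinator is busy and not draining the channel.

```go
package main

import "fmt"

// fillBuffer models backup processors trying to enqueue progress
// messages into the bounded buffer between them and the coordinator,
// while the coordinator is busy and drains nothing. It returns how
// many messages fit before the buffer filled. (Illustrative sketch;
// names and structure are not from the real implementation.)
func fillBuffer(attempts, capacity int) int {
	progressCh := make(chan int, capacity)
	sent := 0
	for i := 0; i < attempts; i++ {
		select {
		case progressCh <- i:
			sent++
		default:
			// Buffer full: a real processor would block here and stop
			// exporting data until the coordinator frees a slot.
		}
	}
	return sent
}

func main() {
	// The buffer in the issue holds 16 messages; once those slots are
	// taken, every further update stalls behind the coordinator.
	fmt.Println("buffered:", fillBuffer(20, 16))
}
```

A real processor blocks on the send rather than dropping the message; the sketch uses a non-blocking send only so it runs to completion and shows the capacity limit.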
However, a recent change that was included in 22.1 altered how we
measure the time since the last checkpoint. Previously, the time
since the last checkpoint did not include the time it took to
construct the checkpoint -- now it does.
As a result, if the checkpoint takes close to a minute or more to
construct, then nearly every progress update results in a checkpoint
being constructed and written. The buffers between the backup
processors and the coordinator can easily fill up during this time.
Once full, the backup processor will only be able to do more work
after at least one slot in the buffer is freed -- which will only
happen once the checkpoint is written.
Jira issue: CRDB-17078