backupccl: slow checkpointing could bring BACKUP to a crawl #83456
Labels
A-disaster-recovery
C-bug
T-disaster-recovery
In a running backup, a large number of backup processors are
responsible for exporting data from the database and writing it
out as files (SSTs) to S3.
Inside the backup processor, when we finish writing a file to S3,
we report that status to the job coordinator by sending it a
message. The coordinator receives that message, updates its local
view of the progress, and periodically checkpoints that progress
by writing a file to S3.
The checkpoint allows us to resume the backup in the case of a
transient failure.
If the backup manifest that we checkpoint is very large, then
constructing the checkpoint may take considerable time.
While the coordinator is constructing this checkpoint, it is not
reading progress messages sent from the backup processors. In the
happy case, this is fine. We will buffer up to 16 messages between
the backup processors and the coordinator. Further, we only
construct a checkpoint if it has been a minute or more since the
last one.
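The producer/consumer shape described above can be sketched with a bounded Go channel. This is a hypothetical illustration, not CockroachDB's actual code: `fillBuffer` stands in for the processors enqueueing progress updates while the coordinator is busy and not draining the channel.

```go
package main

import "fmt"

// fillBuffer models backup processors trying to enqueue progress
// messages into the bounded buffer between them and the coordinator,
// while the coordinator is busy and drains nothing. It returns how
// many messages fit before the buffer filled. (Illustrative sketch;
// names and structure are not from the real implementation.)
func fillBuffer(attempts, capacity int) int {
	progressCh := make(chan int, capacity)
	sent := 0
	for i := 0; i < attempts; i++ {
		select {
		case progressCh <- i:
			sent++
		default:
			// Buffer full: a real processor would block here and stop
			// exporting data until the coordinator frees a slot.
		}
	}
	return sent
}

func main() {
	// The buffer in the issue holds 16 messages; once those slots are
	// taken, every further update stalls behind the coordinator.
	fmt.Println("buffered:", fillBuffer(20, 16))
}
```

A real processor blocks on the send rather than dropping the message; the sketch uses a non-blocking send only so it runs to completion and shows the capacity limit.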
However, a recent change that was included in 22.1 altered how we
measure the time since the last checkpoint. Previously, the time
since the last checkpoint did not include the time it took to
construct the checkpoint -- now it does.
As a result, if the checkpoint takes close to a minute or more to
construct, then nearly every progress update results in a checkpoint
being constructed and written. The buffers between the backup
processors and the coordinator can easily fill up during this time.
Once full, the backup processor will only be able to do more work
after at least one slot in the buffer is freed -- which will only
happen once the checkpoint is written.
Jira issue: CRDB-17078