backupccl: slow checkpointing could bring BACKUP to a crawl #83456

Closed

adityamaru opened this issue Jun 27, 2022 · 2 comments
Labels: A-disaster-recovery · C-bug (Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.) · T-disaster-recovery

Comments

adityamaru commented Jun 27, 2022

In a running backup, a large number of backup processors are
responsible for exporting data from the database and writing it
out to S3 as files (SSTs).

Inside the backup processor, when we have finished writing a file to
S3, we report that status to the job coordinator by sending it a
progress message. The coordinator receives that message, updates its
local view of the progress, and periodically checkpoints that
progress by writing a file to S3. The checkpoint allows us to resume
the backup after a transient failure.

If the backup manifest that we checkpoint is very large, constructing
the checkpoint can take considerable time.

While the coordinator is constructing this checkpoint, it is not
reading progress messages sent from the backup processors. In the
happy case, this is fine: we buffer up to 16 messages between the
backup processors and the coordinator, and we only construct a
checkpoint if it has been a minute or more since the last one.
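
To make the flow concrete, here is a minimal Go sketch of this processor-to-coordinator loop. The names (progressMsg, writeCheckpoint, checkpointInterval, runCoordinator) are made up for illustration rather than taken from backupccl; the 16-slot buffer and the one-minute interval are the values described above.

```go
package main

import "time"

// progressMsg stands in for the progress details a backup processor
// reports after writing a file (SST) to cloud storage.
type progressMsg struct {
	completedSpans int
}

const checkpointInterval = time.Minute

// writeCheckpoint stands in for constructing the (possibly very large)
// backup manifest checkpoint and writing it to S3.
func writeCheckpoint() {
	// ... build manifest, upload file to cloud storage ...
}

// runCoordinator drains progress messages and periodically checkpoints.
func runCoordinator(progressCh <-chan progressMsg) {
	lastCheckpoint := time.Now()
	for msg := range progressCh {
		_ = msg // update the local view of the backup's progress

		// Checkpoint at most once a minute. While writeCheckpoint runs,
		// nothing drains progressCh, so processor sends back up once the
		// buffer below is full.
		if time.Since(lastCheckpoint) >= checkpointInterval {
			writeCheckpoint()
			lastCheckpoint = time.Now() // reset only after the write (the pre-22.1 behavior described below)
		}
	}
}

func main() {
	// Processors send into a small buffered channel; once its 16 slots
	// are full, a processor blocks on send until the coordinator reads
	// again, and cannot start its next file.
	progressCh := make(chan progressMsg, 16)
	go func() {
		progressCh <- progressMsg{completedSpans: 1}
		close(progressCh)
	}()
	runCoordinator(progressCh)
}
```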

However, a recent change that was included in 22.1 altered how we
measure the time since the last checkpoint. Previously, the time
since the last checkpoint did not include the time it took to
construct the checkpoint -- now it does.

As a result, if the checkpoint takes close to a minute or more to
construct, then nearly every progress update results in a checkpoint
being constructed and written. The buffers between the backup
processors and the coordinator can easily fill up during this time.
Once full, the backup processor will only be able to do more work
after at least one slot in the buffer is freed -- which will only
happen once the checkpoint is written.
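
In terms of the sketch above, the 22.1 change amounts to where the last-checkpoint timestamp is recorded relative to writing the checkpoint. The following is only a schematic contrast under that assumption, reusing the illustrative names from the earlier sketch; it is not the actual diff.

```go
// coordinatorLoopVariant reuses progressMsg, checkpointInterval, and
// writeCheckpoint from the sketch above. resetBeforeWrite selects where
// the "last checkpoint" time is recorded.
func coordinatorLoopVariant(progressCh <-chan progressMsg, resetBeforeWrite bool) {
	lastCheckpoint := time.Now()
	for range progressCh {
		if time.Since(lastCheckpoint) < checkpointInterval {
			continue // too soon since the last checkpoint
		}
		if resetBeforeWrite {
			// 22.1 behavior: the measured interval now includes however
			// long writeCheckpoint takes. If that is close to a minute,
			// the interval check passes again almost as soon as the next
			// progress message is read, so nearly every message triggers
			// another checkpoint while the 16-slot buffer fills up.
			lastCheckpoint = time.Now()
			writeCheckpoint()
		} else {
			// Previous behavior: the clock is reset only after the
			// checkpoint is written, so construction time is excluded
			// and the next checkpoint starts at least a minute after
			// this one finishes.
			writeCheckpoint()
			lastCheckpoint = time.Now()
		}
	}
}
```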

Jira issue: CRDB-17078

@adityamaru adityamaru added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-disaster-recovery labels Jun 27, 2022
blathers-crl bot commented Jun 27, 2022

cc @cockroachdb/bulk-io

adityamaru (Contributor, Author) commented

Above is a summary written by @stevendanna when investigating a support issue. The immediate bug was fixed by #83151.

Future improvements will be tracked in #83184.
