backupccl: breakup the txn that inserts stats during cluster restore #75969

adityamaru · 2022-02-03T18:45:54Z

We have seen instances of restores with hundreds of tables getting
stuck on inserting the backed up table stats into the system.table_stats
table on the restoring cluster. Previously, we would issue insert
statements for each table stat row in a single, long-running txn. If this
txn were to be retried a few times, we would observe intent buildup
on the system.table_stats ranges. Once these intents exceeded the
max_intent_bytes on the cluster, every subsequent txn retry would fall
back to the much more expensive ranged intent resolution. The only
remedy at this point would be to delete the BACKUP-STATISTICS file from
the bucket where the backup resides, and restore the tables with no
stats, relying on the AUTO STATS job to rebuild them gradually.

This change "batches" the insertion of the table stats to prevent the
above situation.

Fixes: #69207

Release note: None

cockroach-teamcity · 2022-02-03T18:46:01Z

This change is

adityamaru · 2022-02-03T18:50:31Z

I can add some checkpointing to make sure we only insert table stats once, but that would involve a proto change. I wanted to get y'all's temperature on backporting this form of the fix.

adityamaru · 2022-02-04T17:39:34Z

Hmmm for a second I thought that breaking up the txn means that we could have a PK collision if the job were to be re-resumed. The stats table has a PK on statsID, tableID, but fortunately, the InsertNewStat method does not insert the backed up statsID, and relies on its default unique_row_id. So this change is still safe.
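
To make the reasoning above concrete, here is a toy in-memory model (an assumption-laden sketch, not the real `InsertNewStat`). The hypothetical `statsTable` type keys rows on a generated ID plus the table ID, with a simple counter standing in for the `unique_row_id` default. Replaying the same backed-up stats, as a re-resumed job might, never hits a key collision because the backed-up ID is never reused.

```go
package main

import "fmt"

// statKey mirrors the idea of a primary key on (statsID, tableID).
type statKey struct {
	StatsID int64
	TableID int
}

// statsTable is a hypothetical in-memory stand-in for system.table_stats.
type statsTable struct {
	rows   map[statKey]string
	nextID int64 // counter standing in for the unique_row_id default
}

func newStatsTable() *statsTable {
	return &statsTable{rows: make(map[statKey]string)}
}

// insertNewStat ignores any ID the stat had in the backup and assigns a
// freshly generated one, so re-inserting the same stat cannot collide.
func (t *statsTable) insertNewStat(tableID int, name string) error {
	t.nextID++
	k := statKey{StatsID: t.nextID, TableID: tableID}
	if _, exists := t.rows[k]; exists {
		return fmt.Errorf("duplicate key %+v", k)
	}
	t.rows[k] = name
	return nil
}

func main() {
	t := newStatsTable()
	// Insert the same two stats twice, as if the job were resumed again.
	for attempt := 0; attempt < 2; attempt++ {
		for _, tableID := range []int{52, 53} {
			if err := t.insertNewStat(tableID, "__auto__"); err != nil {
				panic(err)
			}
		}
	}
	fmt.Println("rows after replay:", len(t.rows))
}
```

In this model the replay succeeds but leaves duplicate rows for each table, which is the situation the checkpointing idea mentioned earlier would avoid.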

stevendanna

Overall it looks good to me.

pkg/ccl/backupccl/backup_test.go

pkg/ccl/backupccl/restore_job.go

adityamaru · 2022-02-12T21:51:20Z

TFTR!

bors r=stevendanna

craig · 2022-02-12T22:42:52Z

Build succeeded:

GitHub CI (Cockroach)

adityamaru · 2022-02-15T15:23:59Z

blathers backport 21.2

blathers-crl · 2022-02-15T15:24:04Z

Encountered an error creating backports. Some common things that can go wrong:

The backport branch might have already existed.
There was a merge conflict.
The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.

error creating merge commit from 568c463 to blathers/backport-release-21.2-75969: POST https://api.github.com/repos/cockroachlabs/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 21.2 failed. See errors above.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

adityamaru requested review from dt, stevendanna and a team February 3, 2022 18:45

adityamaru force-pushed the split-stats-txn branch from 74ed9bd to 9f5843e Compare February 3, 2022 21:24

stevendanna added the T-disaster-recovery label Feb 7, 2022

stevendanna approved these changes Feb 11, 2022

adityamaru force-pushed the split-stats-txn branch from 9f5843e to 568c463 Compare February 12, 2022 15:16

craig bot merged commit 260be01 into cockroachdb:master Feb 12, 2022

adityamaru added the backport-21.2.x label Feb 12, 2022

adityamaru deleted the split-stats-txn branch February 14, 2022 17:11

adityamaru mentioned this pull request May 29, 2022

release-21.2: backupccl: breakup the txn that inserts stats during cluster restore #82049

Merged
