Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

backupccl: breakup the txn that inserts stats during cluster restore #75969

Merged
merged 1 commit into from
Feb 12, 2022

Conversation

adityamaru
Copy link
Contributor

We have seen instances of restores with hundreds of tables getting
stuck on inserting the backed up table stats into the system.table_stats
table on the restoring cluster. Previously, we would issue insert
statements for each table stat row in a single, long-running txn. If this
txn were to be retried a few times, we would observe intent buildup
on the system.table_stats ranges. Once these intents exceeded the
max_intent_bytes on the cluster, every subsequent txn retry would fall
back to the much more expensive ranged intent resolution. The only
remedy at this point would be to delete the BACKUP-STATISTICS file from
the bucket where the backup resides, and restore the tables with no
stats, relying on the AUTO STATS job to rebuild them gradually.

This change "batches" the insertion of the table stats to prevent the
above situation.

Fixes: #69207

Release note: None

@adityamaru adityamaru requested review from dt, stevendanna and a team February 3, 2022 18:45
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@adityamaru
Copy link
Contributor Author

I can add some checkpointing to make sure we only insert table stats once, but that would involve a proto change. I wanted to get yall temperature on backporting this form of the fix.

@adityamaru
Copy link
Contributor Author

Hmmm for a second I thought that breaking up the txn means that we could have a PK collision if the job were to be re-resumed. The stats table has a PK on statsID, tableID, but fortunately, the InsertNewStat method does not insert the backed up statsID, and relies on its default unique_row_id. So this change is still safe.

Copy link
Collaborator

@stevendanna stevendanna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall it looks good to me.

pkg/ccl/backupccl/backup_test.go Show resolved Hide resolved
pkg/ccl/backupccl/restore_job.go Outdated Show resolved Hide resolved
We have seen instances of restores with hundreds of tables getting
stuck on inserting the backed up table stats into the system.table_stats
table on the restoring cluster. Previously, we would issue insert
statements for each table stat row in a single, long-running txn. If this
txn were to be retried a few times, we would observe intent buildup
on the system.table_stats ranges. Once these intents exceeded the
`max_intent_bytes` on the cluster, every subsequent txn retry would fall
back to the much more expensive ranged intent resolution. The only
remedy at this point would be delete the BACKUP-STATISTICS file from
the bucket where the backup resides, and restore the tables with no
stats, relying on the AUTO STATS job to rebuild them gradually.

This change "batches" the insertion of the table stats to prevent the
above situation.

Fixes: cockroachdb#69207

Release note: None
@adityamaru
Copy link
Contributor Author

TFTR!

bors r=stevendanna

@craig
Copy link
Contributor

craig bot commented Feb 12, 2022

Build succeeded:

@craig craig bot merged commit 260be01 into cockroachdb:master Feb 12, 2022
@NikhilSharma-NS

This comment was marked as off-topic.

@adityamaru adityamaru deleted the split-stats-txn branch February 14, 2022 17:11
@adityamaru
Copy link
Contributor Author

blathers backport 21.2

@blathers-crl
Copy link

blathers-crl bot commented Feb 15, 2022

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 568c463 to blathers/backport-release-21.2-75969: POST https://api.github.com/repos/cockroachlabs/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 21.2 failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

RESTORE: Restoring TableStats is slow/stuck
4 participants