backupccl: fingerprint mismatch on some 22.2 releases #105900

Closed
renatolabs opened this issue Jun 30, 2023 · 3 comments
Labels
A-disaster-recovery C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-disaster-recovery

Comments

@renatolabs
Contributor

renatolabs commented Jun 30, 2023

Restoring certain backups in certain 22.2 releases might lead to mismatched fingerprints, i.e., the fingerprint of a table after restoring is different from the fingerprint of the table at the time the backup was taken.

Take the cluster backup in gs://cockroach-tmp/backup_issue_22_2_fingerprint_mismatch/backups/3_22.2.6-to-current_cluster_full-planned-and-executed-on-current-incremental-planned-and-executed-on-22.2.6_JtG9 for example. I used the script below to restore this backup on all patch releases of 22.2 and then to compute the fingerprints of the bank.bank table.

Script (set CLUSTER to your liking)

# Restore the same backup on every 22.2 patch release, then on one 23.1
# release, recording the fingerprint of bank.bank after each restore.
set -o xtrace
set -e

# Fail early if CLUSTER (the roachprod cluster name) is not set.
: "${CLUSTER:?set CLUSTER to your roachprod cluster name}"

OUT=fingerprints.txt
BACKUP="gs://cockroach-tmp/backup_issue_22_2_fingerprint_mismatch/backups/3_22.2.6-to-current_cluster_full-planned-and-executed-on-current-incremental-planned-and-executed-on-22.2.6_JtG9?AUTH=implicit"
roachprod create -n1 "$CLUSTER"

# v22.2.0 through v22.2.11; wipe the cluster after each release so every
# restore starts from an empty cluster.
for p in $(seq 0 11); do
        roachprod stage "$CLUSTER" release "v22.2.$p"
        roachprod start "$CLUSTER"
        roachprod sql "$CLUSTER:1" -- -e "RESTORE FROM LATEST IN '$BACKUP'"
        echo "v22.2.$p" | tee -a "$OUT"
        roachprod sql "$CLUSTER:1" -- -e "SELECT index_name, fingerprint FROM [SHOW EXPERIMENTAL_FINGERPRINTS FROM TABLE bank.bank] ORDER BY index_name" | tee -a "$OUT"
        roachprod wipe "$CLUSTER"
done

# Same steps on a 23.1 release for comparison.
LATEST=v23.1.4
roachprod stage "$CLUSTER" release "$LATEST"
roachprod start "$CLUSTER"
roachprod sql "$CLUSTER:1" -- -e "RESTORE FROM LATEST IN '$BACKUP'"
echo "$LATEST" | tee -a "$OUT"
roachprod sql "$CLUSTER:1" -- -e "SELECT index_name, fingerprint FROM [SHOW EXPERIMENTAL_FINGERPRINTS FROM TABLE bank.bank] ORDER BY index_name" | tee -a "$OUT"

The output of the script above is a fingerprints.txt file that displays the fingerprints of the bank.bank table after restoring that backup for each patch release. Running it on a gceworker, I get the following result:

v22.2.0
  index_name |     fingerprint
-------------+----------------------
  bank_pkey  | 9057563089494184689
(1 row)


Time: 6.139s

v22.2.1
  index_name |    fingerprint
-------------+---------------------
  bank_pkey  | 316766736030870651
(1 row)


Time: 6.057s

v22.2.2
  index_name |    fingerprint
-------------+---------------------
  bank_pkey  | 316766736030870651
(1 row)


Time: 6.181s

v22.2.3
  index_name |    fingerprint
-------------+---------------------
  bank_pkey  | 316766736030870651
(1 row)


Time: 6.170s

v22.2.4
  index_name |    fingerprint
-------------+---------------------
  bank_pkey  | 316766736030870651
(1 row)


Time: 6.171s

v22.2.5
  index_name |    fingerprint
-------------+---------------------
  bank_pkey  | 316766736030870651
(1 row)


Time: 6.372s

v22.2.6
  index_name |     fingerprint
-------------+-----------------------
  bank_pkey  | -7969207273340626758
(1 row)


Time: 6.344s

v22.2.7
  index_name |    fingerprint
-------------+---------------------
  bank_pkey  | 316766736030870651
(1 row)


Time: 6.163s

v22.2.8
  index_name |    fingerprint
-------------+---------------------
  bank_pkey  | 316766736030870651
(1 row)


Time: 6.343s

v22.2.9
  index_name |     fingerprint
-------------+-----------------------
  bank_pkey  | -5791102351434769513
(1 row)


Time: 6.268s

v22.2.10
  index_name |     fingerprint
-------------+-----------------------
  bank_pkey  | -5791102351434769513
(1 row)


Time: 6.213s

v22.2.11
  index_name |     fingerprint
-------------+-----------------------
  bank_pkey  | -5791102351434769513
(1 row)


Time: 6.342s

v23.1.4
  index_name |     fingerprint
-------------+-----------------------
  bank_pkey  | -5791102351434769513
(1 row)

Time: 6.339s

Notes

  • See how 22.2.0 and 22.2.6 stand out, each having a fingerprint that only occurs in that patch release (9057563089494184689 and -7969207273340626758 respectively)
  • 22.2.1 through 22.2.5, 22.2.7, and 22.2.8 share the same fingerprint, 316766736030870651
  • 22.2.9 and above restore a fingerprint of -5791102351434769513. This is the expected fingerprint, as it's the fingerprint computed when the backup was taken.
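
The grouping in the notes above can be reproduced mechanically from fingerprints.txt. A minimal sketch, assuming the file has the layout shown in the output above (a version line such as v22.2.0, followed by a bank_pkey row); group_by_fingerprint is a hypothetical helper name, not part of roachprod:

```shell
# Group release versions by the fingerprint they restored.
# Assumption: input looks like the fingerprints.txt output above, i.e. a
# "vX.Y.Z" line followed later by a "bank_pkey | <fingerprint>" row.
# Reads the files given as arguments, or stdin when none are given.
group_by_fingerprint() {
  awk '
    # Remember the most recent version line.
    /^v[0-9]+\.[0-9]+\.[0-9]+$/ { version = $1; next }
    # A result row: attach the current version to its fingerprint.
    $1 == "bank_pkey" && $2 == "|" {
      fp = $3 ""   # force string handling; the values exceed 2^53
      if (fp in versions) {
        versions[fp] = versions[fp] " " version
      } else {
        order[++n] = fp
        versions[fp] = version
      }
    }
    # Print fingerprints in order of first appearance.
    END { for (i = 1; i <= n; i++) print order[i] ": " versions[order[i]] }
  ' "$@"
}
```

Running `group_by_fingerprint fingerprints.txt` should print one line per distinct fingerprint, in order of first appearance, each followed by the versions that produced it.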

Reproduction

I haven't had time to verify that this issue is reproducible (I first saw it last night). This issue was uncovered by running the backup-restore/mixed-versions roachtest while I attempted to backport some mixedversion improvements (#105454 and #105231) to release-23.1.

There is a chance the test failure is reproducible by running the test on https://github.com/renatolabs/cockroach/tree/rc/secure-random-backports with random seed -6450315306042512095, though I haven't tried yet.

Final notes

  • There were actually two backups that had a mismatched fingerprint in that test. I haven't verified if we get the same variance in fingerprint for the other backup 6_22.2.6-to-current_cluster_all-planned-and-executed-on-random-node.
  • Needless to say at this point: both problematic backups were taken in mixed-version; unclear if this is relevant to the issue or just incidental.
  • Both backups, along with the roachtest artifacts for the test failure, are in this backup_issue_22_2_fingerprint_mismatch bucket.

Jira issue: CRDB-29263

@renatolabs renatolabs added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-disaster-recovery T-disaster-recovery labels Jun 30, 2023
@blathers-crl

blathers-crl bot commented Jun 30, 2023

cc @cockroachdb/disaster-recovery

@rhu713
Contributor

rhu713 commented Jul 5, 2023

After looking into this last week, the TL;DR is that this doesn't seem to reveal more than what was discussed in the TA that prompted this fix, which was backported in v22.2.9.

For this particular backup, the Backup Manifest has a file entry data/878205684195885060.sst with span /Table/177/1/234/0{-}. Before v22.2.9, it seems that the make import spans algorithm just ignored this file. Peeking into the file, it has at least one key, /Table/177/1/234/0/1688076801.137761586,0, that is not included in the restore.

As for the unique fingerprints on v22.2.0 and v22.2.6, I didn't look too deeply into this, but both versions also skipped this file entry. In addition:

  • v22.2.0 is the only version that has backup.restore_span.target_size set to 0.
  • v22.2.6 had the backport that included generative import spans, which is a different import span generation algorithm. This backport was reverted in the next version.

@renatolabs
Contributor Author

Thanks for looking into this!

Since the TA states that users should be running 22.2.9+ to address the manifest issue mentioned above, I believe we are good. Just wanted to make sure there wasn't something else to look into here, as I was surprised by the different fingerprints on 22.2.0 and 22.2.6.
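
For scripts that iterate over patch releases, the 22.2.9+ threshold mentioned above can be encoded as a small gate. A minimal sketch, assuming GNU `sort -V` is available; version_at_least is a hypothetical helper, not part of roachprod or the roachtest framework, and it only compares version strings:

```shell
# Succeeds (exit status 0) if version $1 is greater than or equal to
# version $2. A leading "v" is stripped from both arguments.
# Assumption: GNU sort with -V (version sort) is on the PATH.
version_at_least() {
  local have="${1#v}" want="${2#v}"
  # After a version sort, the smaller version comes first; $have is at
  # least $want exactly when $want sorts first (or the two are equal).
  [ "$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n1)" = "$want" ]
}
```

For example, `version_at_least v22.2.6 v22.2.9` fails while `version_at_least v22.2.10 v22.2.9` succeeds, so a loop over 22.2 patch releases could skip restores on the releases below the threshold.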

I'm going to close -- feel free to reopen if you think there's anything more to be done.

renatolabs added a commit to renatolabs/cockroach that referenced this issue Jul 17, 2023
The `backup-restore/mixed-version` roachtest needed a few changes to
work with the use of randomized predecessor patch releases;
specifically, we don't attempt to restore backups taken in
mixed-version in certain older releases because of known issues -- for
more details, see cockroachdb#105900.

Epic: none

Release note: None