roachtest: backup-restore/online-restore failed [cluster backup missing sequence] #130778
Comments
Looks like a full cluster restore failed to restore a full backup with an endtime of
Reproduced this failure with a regular restore. We've got a bug in our descriptor rewrite rules somewhere. Let's figure out what kind of descriptor we're trying to restore.
We can dump all the descriptor ids in the backup via:
which reveals:
I see we added a sequence to this table in schema change logs:
Looks like the owning column is indeed the 5th column in this table:
In addition we altered this sequence at the client timestamp of
I may need input from foundations to proceed, but I can explain what I'm seeing:
On the cluster backup side, we simply grab all cluster descriptors with this helper function:
Here's a timeline:
At some time later, a full cluster restore of the backup failed with: i.e. at the time of the backup, a schema change associated with the sequence failed (id 255).
i.e. right here: So the |
@msbutler So, the sequence wasn't backed up because it's in a dropped state, so we aren't able to rewrite it. The plan for DROP COLUMN with a sequence owner will put the sequence in a dropped state and then clean up the sequence later. The backup process, I think, skips dropped descriptors, so we can't repair the schema changer state. What do you think of allowing backup of dropped descriptors if they have the schema changer state set? We don't have to back up their data, but it would fix weird cases like this. The only other alternative is somehow repairing the schema changer state to drop these references.
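The failure mode described above can be modeled in a few lines. This is only an illustrative sketch in Python, not CockroachDB's Go code: the `Descriptor` type, the helper names, and the IDs 100, 101, and 200 are all hypothetical. It assumes, as the comment says, that backup skips dropped descriptors while a surviving descriptor's schema changer state still refers to the dropped one.

```python
from dataclasses import dataclass, field


@dataclass
class Descriptor:
    # Hypothetical, simplified stand-in for a catalog descriptor.
    id: int
    name: str
    dropped: bool = False
    # IDs that this descriptor's schema changer state still refers to.
    referenced_ids: list = field(default_factory=list)


def backup(descriptors):
    """Keep only descriptors that are not in a dropped state."""
    return [d for d in descriptors if not d.dropped]


def build_rewrite_map(backed_up, first_new_id):
    """Assign each backed-up descriptor a fresh ID for the restoring cluster."""
    return {d.id: first_new_id + i for i, d in enumerate(backed_up)}


def rewrite_references(desc, rewrites):
    """Rewrite every referenced ID; fail if a referenced ID was never backed up."""
    rewritten = []
    for ref in desc.referenced_ids:
        if ref not in rewrites:
            raise ValueError(f"missing rewrite for {ref} in {desc.name}")
        rewritten.append(rewrites[ref])
    return rewritten
```

With a table (ID 100) whose state references a dropped sequence (ID 101), `backup` keeps only the table, the rewrite map has no entry for 101, and `rewrite_references` raises an error analogous to the one seen in the restore.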
@fqazi thanks for looking into this! The backup process does indeed skip backing up dropped descriptors.
Would we need to restore these dropped descriptors as well? If so, do you envision any weirdness around restoring a dropped table descriptor, but none of its data? Also, out of curiosity, how did you infer that the sequence was in a dropped state?
I generated a similar plan to a DROP COLUMN when a sequence owner exists. i.e.
It's safer to restore the dropped descriptors. The main weirdness will be that the object in question will be empty if the schema change plan fails with a rollback. If the plan completes successfully, the user can't access the data anymore anyway.
Oh wait, you're saying that once a table descriptor's state is If a dropped descriptor's schema changer state is set, will all of its child descriptors also have its schema change state set as well? I'm thinking about your proposal
And if we could end up restoring a broken schema graph if the schema change rolls back after the restore.
Explain for this scenario:
Handing over to foundations, as the fix requires manual changes to reconstruct the DROP COLUMN schema change on restore.
Previously, when restoring a backup taken in the middle of a DROP COLUMN, where a column had a sequence owner assigned, it was possible for the backup to be unrestorable. This would happen because the sequence reference would have been dropped in the plan, but the sequence owner element was still within the state. To address this, this patch updates the rewrite logic to clean up any SequenceOwner elements which have the referenced sequence already removed. Fixes: cockroachdb#130778 Release note (bug fix): Addressed a rare bug that could prevent backups taken during a DROP COLUMN operation with a sequence owner from restoring with the error: "rewriting descriptor ids: missing rewrite for <id> in SequenceOwner..."
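The cleanup described in this commit message can be sketched roughly as follows. This is a simplified Python model under stated assumptions, not the actual Go change in `sql/schemachanger`: the `SequenceOwner` fields, both function names, and the example IDs are hypothetical. The idea shown is to discard any SequenceOwner element whose sequence has no entry in the rewrite map before rewriting the rest.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SequenceOwner:
    # Hypothetical, simplified stand-in for the schema changer element.
    table_id: int
    column_id: int
    sequence_id: int


def drop_dangling_sequence_owners(elements, rewrites):
    """Return elements without SequenceOwner entries whose sequence has no rewrite."""
    kept = []
    for el in elements:
        if isinstance(el, SequenceOwner) and el.sequence_id not in rewrites:
            # The referenced sequence was already removed, so it was not backed up.
            continue
        kept.append(el)
    return kept


def rewrite_sequence_owner(el, rewrites):
    """Map the table and sequence IDs to their new values; column IDs are unchanged."""
    return SequenceOwner(rewrites[el.table_id], el.column_id, rewrites[el.sequence_id])
```

For example, with `rewrites = {100: 200, 102: 201}`, an element pointing at sequence 101 is dropped while one pointing at sequence 102 is kept and rewritten, so the rewrite step no longer hits a missing ID.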
132202: sql/schemachanger: clean up SequenceOwner elements during restore r=fqazi a=fqazi Previously, when restoring a backup taken in the middle of a DROP COLUMN, where a column had a sequence owner assigned, it was possible for the backup to be unrestorable. This would happen because the sequence reference would have been dropped in the plan, but the sequence owner element was still within the state. To address this, this patch updates the rewrite logic to clean up any SequenceOwner elements which have the referenced sequence already removed. Fixes: #130778 Release note (bug fix): Addressed a rare bug that could prevent backups taken during a DROP COLUMN operation with a sequence owner from restoring with the error: "rewriting descriptor ids: missing rewrite for <id> in SequenceOwner..." Co-authored-by: Faizan Qazi <[email protected]>
Based on the specified backports for linked PR #132202, I applied the following new label(s) to this issue: branch-release-23.1, branch-release-23.2, branch-release-24.1, branch-release-24.2. Please adjust the labels as needed to match the branches actually affected by this issue, including adding any known older branches. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
roachtest.backup-restore/online-restore failed with artifacts on master @ 128fcab4c07413513a05aea1d1494943f4bc3092:
Parameters:
ROACHTEST_arch=amd64
ROACHTEST_cloud=gce
ROACHTEST_coverageBuild=false
ROACHTEST_cpu=4
ROACHTEST_encrypted=false
ROACHTEST_fs=ext4
ROACHTEST_localSSD=true
ROACHTEST_runtimeAssertionsBuild=false
ROACHTEST_ssd=0
Jira issue: CRDB-42230