Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

backupccl: mark type descs as dropped if there is a failure in restore #62278

Merged
merged 1 commit into from
Mar 23, 2021

Conversation

arulajmani
Copy link
Collaborator

@arulajmani arulajmani commented Mar 19, 2021

Previously, when cleaning up type descriptors from a failed restore, we
would directly delete the system.descriptor entry. This is bad because
writing directly to the system.descriptor table means we bypass the
descriptor collection, which doesn't know these descriptors have been
dropped.

The descriptor collection validates all uncommitted descriptors before
writing them. As the descriptor collection doesn't know type
descriptors have been dropped, validating cross references was always
bound to fail. Put another way, if a restore from a backup which
contained user defined types failed for any reason, we were bound to
require manual cleanup.

This patch fixes this problem by going through the descriptor
collection and writing the descriptor in DROPPED state in addition to
deleting the system.descriptor entry.

Release note (bug fix): Fixed a bug where a failed restore from a
backup including user defined types would require manual cleanup.

@arulajmani arulajmani requested review from ajwerner, adityamaru and a team March 19, 2021 21:44
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@arulajmani
Copy link
Collaborator Author

Found this when trying to investigate the cause of #60773 with @ajstorm

@arulajmani arulajmani force-pushed the restore-type-desc branch 4 times, most recently from 5defdf1 to 9900c91 Compare March 19, 2021 23:03
Copy link
Contributor

@pbardea pbardea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! LGTM with a minor testing comment.

// Back up the database, drop it, and restore into it.
sqlDB.Exec(t, `BACKUP DATABASE db TO 'nodelocal://0/test/'`)
sqlDB.Exec(t, `DROP DATABASE db`)
sqlDB.ExpectErr(t, "boom", `RESTORE DATABASE db FROM 'nodelocal://0/test/'`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add an assertion that the relevant descriptors are in the DROP state?

Copy link
Collaborator Author

@arulajmani arulajmani Mar 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clarify, they're only written in the DROP state for the purpose of the desc collection, we still have to overwrite this and delete the descriptor entry from system.descriptor (as before).

Not quite sure how to test the descriptor entry has been removed successfully, as the ID is unknown and the namespace entry has been removed as well. Are there any existing tests I could take inspiration from?

Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this sort of thing should have been caught with testing at a different level. Like for all of the tests where we restore from real backups, it feels like we should inject failures at various stages and make sure that those all roll back correctly. For another PR.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru and @pbardea)


pkg/ccl/backupccl/backup_test.go, line 6849 at r1 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

To clarify, they're only written in the DROP state for the purpose of the desc collection, we still have to overwrite this and delete the descriptor entry from system.descriptor (as before).

Not quite sure how to test the descriptor entry has been removed successfully, as the ID is unknown and the namespace entry has been removed as well. Are there any existing tests I could take inspiration from?

Can you just check that crdb_internal.invalid_objects is empty?

Maybe also validate that without your change that it isn't.

Previously, when cleaning up type descriptors from a failed restore, we
would directly delete the system.descriptor entry. This is bad because
writing directly to the system.descriptor table means we bypass the
descriptor collection, which doesn't know these descriptors have been
dropped.

The descriptor collection validates all uncommitted descriptors before
writing them. As the descriptor collection doesn't know type
descriptors have been dropped, validating cross references was always
bound to fail. Put another way, if a restore from a backup which
contained user defined types failed for any reason, we were bound to
require manual cleanup.

This patch fixes this problem by going through the descriptor
collection and writing the descriptor in DROPPED state in addition to
deleting the system.descriptor entry.

Release note (bug fix): Fixed a bug where a failed restore from a
backup including user defined types would require manual cleanup.
Copy link
Collaborator Author

@arulajmani arulajmani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru and @pbardea)


pkg/ccl/backupccl/backup_test.go, line 6849 at r1 (raw file):

Previously, ajwerner wrote…

Can you just check that crdb_internal.invalid_objects is empty?

Maybe also validate that without your change that it isn't.

crdb_internal.invalid_objects was empty without my changes as well because when the restore job failed it left behind all descriptor entries (not just the type descriptor), so the descriptor could be validated. I realized system.namespace is fair game though, so that's what I'm doing now.

@arulajmani
Copy link
Collaborator Author

TFTRs!

bors r=ajwerner,pbardea

@craig
Copy link
Contributor

craig bot commented Mar 23, 2021

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants