sql,backupccl: failed to WaitForNoVersion on deleted descriptor during restore cancellation #88913
Comments
cc @cockroachdb/disaster-recovery
We could solve this in the lease manager without too much pain.
Something like this:

```diff
diff --git a/pkg/sql/catalog/lease/lease.go b/pkg/sql/catalog/lease/lease.go
index 13ebcc0a1f8..e63b3d71162 100644
--- a/pkg/sql/catalog/lease/lease.go
+++ b/pkg/sql/catalog/lease/lease.go
@@ -1072,10 +1072,19 @@ func (m *Manager) findDescriptorState(id descpb.ID, create bool) *descriptorStat
 // RangefeedLeases is not active.
 func (m *Manager) RefreshLeases(ctx context.Context, s *stop.Stopper, db *kv.DB) {
 	descUpdateCh := make(chan *descpb.Descriptor)
+	descDeleteCh := make(chan descpb.ID)
-	m.watchForUpdates(ctx, descUpdateCh)
+	m.watchForUpdates(ctx, descUpdateCh, descDeleteCh)
 	_ = s.RunAsyncTask(ctx, "refresh-leases", func(ctx context.Context) {
 		for {
 			select {
+			case id := <-descDeleteCh:
+				// Purge any leased versions of this now-deleted descriptor.
+				log.VEventf(ctx, 2, "purging old versions of deleted descriptor %d",
+					id)
+				if err := purgeOldVersions(ctx, db, id, true /* dropped */, 0, m); err != nil {
+					log.Warningf(ctx, "error purging leases for descriptor %d: %s",
+						id, err)
+				}
 			case desc := <-descUpdateCh:
 				// NB: We allow nil descriptors to be sent to synchronize the updating of
 				// descriptors.
@@ -1117,7 +1126,9 @@ func (m *Manager) RefreshLeases(ctx context.Context, s *stop.Stopper, db *kv.DB)
 // watchForUpdates will watch a rangefeed on the system.descriptor table for
 // updates.
-func (m *Manager) watchForUpdates(ctx context.Context, descUpdateCh chan<- *descpb.Descriptor) {
+func (m *Manager) watchForUpdates(
+	ctx context.Context, descUpdateCh chan<- *descpb.Descriptor, descDeleteCh chan<- descpb.ID,
+) {
 	if log.V(1) {
 		log.Infof(ctx, "using rangefeeds for lease manager updates")
 	}
@@ -1130,6 +1141,11 @@ func (m *Manager) watchForUpdates(ctx context.Context, descUpdateCh chan<- *desc
 		ctx context.Context, ev *roachpb.RangeFeedValue,
 	) {
 		if len(ev.Value.RawBytes) == 0 {
+			var id descpb.ID // TODO: decode the descriptor ID from ev.Key
+			select {
+			case <-ctx.Done():
+			case descDeleteCh <- id:
+			}
 			return
 		}
 		b, err := descbuilder.FromSerializedValue(&ev.Value)
```
I think that solution would probably also resolve at least the main symptom of #89079, but there it still seems wrong to not put the descriptors offline first.
Was the descriptor ever online? I think part of what's going on here is that we used to not lease descriptors in the |
Unfortunately, I think technically it can be. Because this happens in OnFailOrCancel, we might be responding to a failure that happened after we published the descriptors. |
In that case, I'd prefer we go to dropped if we can muster the will to pull that off |
119342: backupccl: deflake TestDataDriven_restore_on_fail_or_cancel_retry r=msbutler a=stevendanna

Full cluster restore drops the default DB. The test driver caches connections that may have originally connected to a database that is now dropped. This causes problems for queries issued after the full cluster restore. Here, (1) I change the query we use to get job IDs to one that doesn't depend on doing any search path lookups and (2) reset all of our connections after the first restore we do.

See also #88913

Fixes #119079

Release note: None

Co-authored-by: Steven Danna <[email protected]>
Describe the problem
When attempting to write a datadriven test that runs a query after a cancelled cluster restore, I see the following under stress:
Details
It's unclear to me if this is a bug on the schema side or if we are holding it wrong (I suppose sometime soon both sides of this will be the schema side).
During OnFailOrCancel, we go through all the databases and delete them, and along the way we try to make sure to add them to uncommitted descriptors:
cockroach/pkg/ccl/backupccl/restore_job.go
Lines 2525 to 2542 in 85b29ae
But for 2 of these databases, we are going to recreate them later:
cockroach/pkg/ccl/backupccl/restore_job.go
Lines 2232 to 2244 in 85b29ae
The descriptor collection tracks uncommitted descriptors in an `nstree.NameMap`. When you upsert into this structure, items get replaced based on both name and ID. I've confirmed via injected logging that the create statements for `defaultdb` and `postgres` end up removing the deleted descriptors with the same names from the uncommitted descriptors. This results in us not calling `WaitForNoVersion` on the deleted descriptors. As a result, when a query I am trying to run after the restore is cancelled runs, the test occasionally ends up doing function resolution using an erroneously still-leased database descriptor, which ends in us trying to look up a non-existent schema.
Jira issue: CRDB-20034