
backupccl,multi-tenant: BACKUP fails when node is down when using shared process multi-tenancy #111319

Closed
stevendanna opened this issue Sep 26, 2023 · 1 comment · Fixed by #111337
Assignees
Labels
A-disaster-recovery · branch-master (Failures and bugs on the master branch.) · C-bug (Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.) · release-blocker (Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.) · T-disaster-recovery

Comments


stevendanna commented Sep 26, 2023

Describe the problem

A user reported being unable to run BACKUP inside a secondary tenant using shared (mixed) process mode while one of the nodes in the cluster was down.

This is trivially reproducible with cockroach demo. After shutting down a node, BACKUPs often fail even well after the SQL instance table should have been cleaned up, with errors such as:

[email protected]:26257/demoapp/movr> backup into 'userfile:///foo';                                                                                                                                              
ERROR: failed to run backup: exporting 28 ranges: failed to resolve n3: unable to look up descriptor for n3: non existent SQL instance

This is my slightly informed speculation about what is going on. BACKUP uses PartitionSpans to distribute work.

PartitionSpans iterates over the given span of work, assigning portions of the span to different instance IDs. It does this via a SpanResolver. The SpanResolver uses an Oracle that is responsible for choosing a replica of a given range; that replica may not be healthy.

When run in mixed-process mode, PartitionSpans uses an instance resolver that simply returns the node ID as a SQLInstanceID without doing any sort of health check:

resolver, err := dsp.makeInstanceResolver(ctx, planCtx.localityFilter)

func instanceIDForKVNodeHostedInstance(nodeID roachpb.NodeID) base.SQLInstanceID {
	return base.SQLInstanceID(nodeID)
}

Note that this is different from other code paths, in which we do explicitly check the health of a node before using it:

status := dsp.checkInstanceHealthAndVersionSystem(ctx, planCtx, sqlInstanceID)
// If the node is unhealthy or its DistSQL version is incompatible, use the
// gateway to process this span instead of the unhealthy host. An empty
// address indicates an unhealthy host.
if status != NodeOK {
	log.Eventf(ctx, "not planning on node %d: %s", sqlInstanceID, status)
	sqlInstanceID = dsp.gatewaySQLInstanceID
}

If I modify the code to do such a health check, the problem goes away.
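
For illustration, here is a rough sketch of what that modification could look like, reusing the existing checkInstanceHealthAndVersionSystem check shown above; the wrapper function name and its wiring are hypothetical, not the fix that ultimately landed in #111337:

// Hypothetical sketch: validate the co-located SQL instance before using
// it, mirroring the system-tenant path above. The function name is made
// up for illustration.
func (dsp *DistSQLPlanner) healthCheckedInstanceForKVNode(
	ctx context.Context, planCtx *PlanningCtx, nodeID roachpb.NodeID,
) base.SQLInstanceID {
	sqlInstanceID := base.SQLInstanceID(nodeID)
	status := dsp.checkInstanceHealthAndVersionSystem(ctx, planCtx, sqlInstanceID)
	if status != NodeOK {
		// Fall back to planning on the gateway instead of an unhealthy host.
		log.Eventf(ctx, "not planning on node %d: %s", sqlInstanceID, status)
		return dsp.gatewaySQLInstanceID
	}
	return sqlInstanceID
}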

Jira issue: CRDB-31854

stevendanna added the C-bug, branch-master, and release-blocker labels Sep 26, 2023
stevendanna added a commit to stevendanna/cockroach that referenced this issue Sep 27, 2023
Previously, when running in mixed-process mode, the DistSQLPlanner's
PartitionSpans method would assume that it could directly assign a
given span to the SQLInstanceID that matches the NodeID of whatever
replica the current replica oracle returned, without regard to whether
the SQL instance was available.

This is different from the system tenant code paths, which proactively
check node health, and the non-mixed-process MT code paths, which use
an eventually consistent view of healthy nodes.

As a result, operations that use PartitionSpans, such as BACKUP, may
fail when a node is down.

Here, we make the mixed-process case work more like the separate-process
case, in which we only use nodes returned by the instance reader. This
list should eventually exclude any down nodes.

An alternative (or perhaps an addition) would be to allow MT planning
to do direct status checks more similar to how they are done for the
system tenant.

When reading this code, I also noted that we don't do DistSQL version
compatibility checks like we do in the SystemTenant case. I am not
sure about the impact of that.

Finally, this also adds another error to our list of non-permanent
errors. Namely, if we fail to find a SQL instance, we don't treat that
as permanent.

Fixes cockroachdb#111319

Release note (bug fix): When using a private preview of physical
cluster replication, in some circumstances the source cluster would be
unable to take backups when a source cluster node was unavailable.

blathers-crl bot commented Sep 27, 2023

cc @cockroachdb/disaster-recovery

craig bot pushed a commit that referenced this issue Oct 4, 2023
111337: sql: PartitionSpan should only use healthy nodes in mixed-process mode r=yuzefovich a=stevendanna

111675: backupccl: deflake TestShowBackup r=stevendanna a=msbutler

This patch simplifies how TestShowBackup parses the stringified timestamp: it
removes the manual splitting of date and time and parses the timestamp
in one call.

Fixes: #111015

Release note: none

Co-authored-by: Steven Danna <[email protected]>
Co-authored-by: Michael Butler <[email protected]>
craig bot closed this as completed in 44fac37 Oct 4, 2023
adityamaru pushed a commit to adityamaru/cockroach that referenced this issue Oct 26, 2023
adityamaru pushed a commit to adityamaru/cockroach that referenced this issue Oct 27, 2023