kvcoord: secondary tenants do not take network latency into account when routing batch requests #81000
Comments
A request when investigating this: be careful to highlight that the problem exists with "secondary tenants running as separate SQL-only servers". For dedicated/self-hosted deployments we will have secondary tenants running in-process with KV, with a 1:1 relationship with the KV node. In that case, we will be able to use the base algorithm.
By this do you mean that these secondary tenants will be aware of a local `NodeDescriptor`?
We don't need the entire node descriptor though? Just the ID and locality attributes? We're already planning to include those in the `sql_liveness` table.
We'll definitely need this fixed before we can make MR Serverless a reality. It's a "ship-stopper" issue. |
This looks straightforward. The fix will involve a bit of plumbing and a generalization of

```go
func (rs ReplicaSlice) OptimizeReplicaOrder(nodeDesc *roachpb.NodeDescriptor, latencyFn LatencyFunc)
```

to

```go
// nodeID can be 0, in which case it is ignored
// latencyFn can be nil, in which case it will not be used
func (rs ReplicaSlice) OptimizeReplicaOrder(nodeID roachpb.NodeID, latencyFn LatencyFunc, locality roachpb.Locality)
```

Once we have that, the rest is the plumbing mentioned above.
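To make the proposal concrete, here is a minimal, self-contained sketch of what that generalization could look like. All of the types below (`NodeID`, `Locality`, `ReplicaInfo`, and so on) are simplified stand-ins for the `roachpb`/`kvcoord` types, and the ranking logic and helpers such as `matchingTiers` are assumptions for illustration, not the actual CockroachDB implementation.

```go
package main

import (
	"fmt"
	"math/rand"
	"sort"
	"time"
)

type NodeID int32

type Tier struct{ Key, Value string }

type Locality struct{ Tiers []Tier }

type ReplicaInfo struct {
	NodeID   NodeID
	Locality Locality
	Addr     string
}

type ReplicaSlice []ReplicaInfo

// LatencyFunc returns a measured latency to addr, if one is available.
type LatencyFunc func(addr string) (time.Duration, bool)

// matchingTiers counts how many leading locality tiers two localities share,
// mirroring the diversity heuristic mentioned elsewhere in this issue.
func matchingTiers(a, b Locality) int {
	n := 0
	for i := range a.Tiers {
		if i >= len(b.Tiers) || a.Tiers[i] != b.Tiers[i] {
			break
		}
		n++
	}
	return n
}

// OptimizeReplicaOrder follows the proposed signature: nodeID may be 0
// (unknown, as on a SQL-only tenant pod) and latencyFn may be nil, in which
// case locality tier matching decides the order.
func (rs ReplicaSlice) OptimizeReplicaOrder(
	nodeID NodeID, latencyFn LatencyFunc, locality Locality,
) {
	// With no information at all, shuffle to spread load rather than always
	// hitting replicas in the same arbitrary order.
	if nodeID == 0 && latencyFn == nil && len(locality.Tiers) == 0 {
		rand.Shuffle(len(rs), func(i, j int) { rs[i], rs[j] = rs[j], rs[i] })
		return
	}
	sort.SliceStable(rs, func(i, j int) bool {
		// A replica on our own node always sorts first; this never fires
		// when nodeID is 0.
		if rs[i].NodeID == nodeID {
			return rs[j].NodeID != nodeID
		}
		if rs[j].NodeID == nodeID {
			return false
		}
		// Prefer measured latency when it is known for both replicas.
		if latencyFn != nil {
			li, oki := latencyFn(rs[i].Addr)
			lj, okj := latencyFn(rs[j].Addr)
			if oki && okj {
				return li < lj
			}
		}
		// Otherwise prefer the replica sharing more locality tiers with us.
		return matchingTiers(locality, rs[i].Locality) > matchingTiers(locality, rs[j].Locality)
	})
}

func main() {
	rs := ReplicaSlice{
		{NodeID: 1, Locality: Locality{[]Tier{{"region", "us-east1"}}}, Addr: "n1"},
		{NodeID: 2, Locality: Locality{[]Tier{{"region", "us-west1"}}}, Addr: "n2"},
	}
	// A tenant SQL pod in us-west1: no node ID, no latency samples, but a
	// locality, which is enough to pick the nearby replica.
	rs.OptimizeReplicaOrder(0, nil, Locality{Tiers: []Tier{{"region", "us-west1"}}})
	fmt.Println(rs[0].Addr) // n2
}
```

The key property is that the node ID is optional: a SQL-only tenant pod can pass 0 and still get a locality-aware ordering instead of a random one.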
The dist sender uses node locality information to rank replicas of a range by latency. Previously, this node locality information was read off a node descriptor available in Gossip. Unfortunately, secondary tenants do not have access to Gossip, and as such, would end up randomizing this list of replicas. This manifested itself through unpredictable latencies when running follower reads.

We're no longer susceptible to this hazard with this patch. This is done by eschewing the need for a node descriptor from Gossip in the DistSender; instead, we now instantiate the DistSender with locality information. However, we do still use Gossip to get the current node's ID when ranking replicas. This is done to ascertain if there is a local replica, and if there is, to always route to it. Unfortunately, because secondary tenants don't have access to Gossip, they can't conform to these semantics. They're susceptible to a hazard where a request may be routed to another replica in the same locality tier as the client even though the client has a local replica as well. This shouldn't be a concern in practice given the diversity heuristic. It also shouldn't be a concern given that tenant SQL pods don't run in-process with KV nodes.

Resolves cockroachdb#81000

Release note (bug fix): fix an issue where secondary tenants could route follower reads to a random, far away replica instead of one closer.
85853: kv: ensure secondary tenants route follower reads to the closest replica r=arulajmani a=arulajmani

(PR description as above.)

85878: gcjob: issue DeleteRange tombstones and then wait for GC r=ajwerner a=ajwerner

Note that this does not change anything about tenant GC.

Fixes #70427

Release note (sql change): The asynchronous garbage collection process has been changed such that very soon after dropping a table, index, or database, or after refreshing a materialized view, the system will issue range deletion tombstones over the dropped data. These tombstones will result in the KV statistics properly counting these bytes as garbage. Before this change, the asynchronous "gc job" would wait out the TTL and then issue a lower-level operation to clear out the data. That meant that while the job was waiting out the TTL, the data would appear in the statistics to still be live. This was confusing.

Co-authored-by: Arul Ajmani <[email protected]>
Co-authored-by: Andrew Werner <[email protected]>
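As a rough illustration of the construction-time plumbing the PR description implies, here is a sketch reusing the simplified types from the earlier code block. `distSenderConfigSketch` and its fields are hypothetical stand-ins; the real DistSender configuration is richer and its field names may differ.

```go
// distSenderConfigSketch is a hypothetical, simplified analogue of the
// DistSender's construction config. The point of the fix is that locality
// becomes an explicit constructor input instead of being read off a Gossip
// node descriptor at request time.
type distSenderConfigSketch struct {
	NodeID   NodeID   // stays 0 on SQL-only tenant pods
	Locality Locality // always known, even without access to Gossip
	Latency  LatencyFunc
	// RPC context, range descriptor cache, etc. elided.
}

var (
	usEast = Locality{Tiers: []Tier{{Key: "region", Value: "us-east1"}}}
	usWest = Locality{Tiers: []Tier{{Key: "region", Value: "us-west1"}}}

	// A KV node can supply both its node ID and its locality...
	kvNodeCfg = distSenderConfigSketch{NodeID: 1, Locality: usEast}
	// ...while a tenant SQL pod supplies only its locality, so replica
	// ordering falls back to locality matching rather than randomization.
	tenantCfg = distSenderConfigSketch{Locality: usWest}
)
```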
Describe the problem
The `DistSender` sends batch requests that interact with a range to the range's replicas. In general, it does so by ordering replicas based on the latency between the requesting node and the node on which each replica lives. If the request must be sent to the leaseholder (and the leaseholder is known), the leaseholder is moved to the front of the queue. See `cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go`, lines 1951 to 1975 at `4d1f40b`.
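As a rough illustration of those two steps (rank replicas by proximity, then force the leaseholder to the front), here is a sketch building on the simplified types from the earlier code block; `routeOrder` and `moveToFront` are illustrative names, not the actual DistSender identifiers.

```go
// moveToFront shifts the replica on the given node to position 0 while
// preserving the relative order of the others.
func (rs ReplicaSlice) moveToFront(target NodeID) {
	for i := range rs {
		if rs[i].NodeID == target {
			r := rs[i]
			copy(rs[1:i+1], rs[:i])
			rs[0] = r
			return
		}
	}
}

// routeOrder sketches the routing step described above: rank the replicas
// by proximity to this process, then, if the request cannot be served by a
// follower and the leaseholder is known, put the leaseholder first.
func routeOrder(
	rs ReplicaSlice,
	nodeID NodeID, latencyFn LatencyFunc, locality Locality,
	leaseholder NodeID, needLeaseholder bool,
) ReplicaSlice {
	rs.OptimizeReplicaOrder(nodeID, latencyFn, locality)
	if needLeaseholder && leaseholder != 0 {
		rs.moveToFront(leaseholder)
	}
	return rs
}
```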
For follower-read requests, this has the effect that they are always routed to the nearest replica. Unfortunately, this doesn't work quite as intended for secondary tenants. Because a secondary tenant's `DistSender` is unaware of which node it is running on, it ends up randomly ordering the replica slice because of `cockroach/pkg/kv/kvclient/kvcoord/replica_slice.go`, lines 205 to 208 at `4d1f40b`.
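The cited lines are the fallback this issue is about: with no node descriptor, the replica slice is simply shuffled. A simplified sketch of that pre-fix behavior, again building on the types from the earlier code block (`NodeDescriptor` here is a stand-in for `roachpb.NodeDescriptor`):

```go
// NodeDescriptor is a simplified stand-in for roachpb.NodeDescriptor.
type NodeDescriptor struct {
	NodeID   NodeID
	Locality Locality
}

// preFixOptimizeReplicaOrder sketches the old, descriptor-based entry point.
// A secondary tenant has no node descriptor, so it always takes the shuffle
// branch and follower reads end up on a random replica.
func preFixOptimizeReplicaOrder(rs ReplicaSlice, nodeDesc *NodeDescriptor, latencyFn LatencyFunc) {
	if nodeDesc == nil {
		rand.Shuffle(len(rs), func(i, j int) { rs[i], rs[j] = rs[j], rs[i] })
		return
	}
	// Otherwise rank by latency/locality using the node descriptor, as in
	// the generalized sketch earlier in this issue.
	rs.OptimizeReplicaOrder(nodeDesc.NodeID, latencyFn, nodeDesc.Locality)
}
```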
The reason a secondary tenant's `DistSender` is unaware of which node it is running on is that this cast fails: `cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go`, lines 564 to 568 at `cc155a1`.
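For illustration, the shape of that failing assertion looks roughly like the following; `nodeDescStore`, `gossipSketch`, and `getNodeDescriptor` are stand-ins for the real kvcoord/gossip identifiers, not the exact code. Only on a KV node is the concrete value behind the interface a `*gossip.Gossip`, so on a SQL-only tenant pod the assertion fails and no local node descriptor is ever found.

```go
// nodeDescStore stands in for the narrow interface the DistSender is handed
// for looking up node descriptors; tenants receive a non-Gossip
// implementation of it.
type nodeDescStore interface{}

// gossipSketch stands in for *gossip.Gossip, which can report the local
// node's descriptor.
type gossipSketch interface {
	LocalNodeDescriptor() *NodeDescriptor
}

type distSenderSketch struct {
	nodeDescs nodeDescStore
}

// getNodeDescriptor mirrors the failing cast: if the descriptor source is
// not Gossip (as on a secondary tenant), we learn nothing about the local
// node and replica ordering degrades to the shuffle shown earlier.
func (ds *distSenderSketch) getNodeDescriptor() *NodeDescriptor {
	g, ok := ds.nodeDescs.(gossipSketch) // fails for secondary tenants
	if !ok {
		return nil
	}
	return g.LocalNodeDescriptor()
}
```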
@nvanbenschoten was this the hazard you had in mind in the TODO in those lines, or is there more here?
To Reproduce
I discovered this when trying to convert our `follower_reads` roachtest to run as secondary tenants. TODO(arul): link a WIP PR.
Expected behavior
Secondary tenants should take network latency into account when routing requests. Consequently, follower reads issued by secondary tenants should be served from the nearest replica (instead of a random one).
Additional context
We'd want to solve this to harness many of the benefits listed in #72593 for secondary tenants.
cc @cockroachdb/kv
Jira issue: CRDB-15402
Epic: CRDB-14202