kvcoord: secondary tenants do not take network latency into account when routing batch requests #81000
Comments
A request when investigating this: be careful to highlight that the problem exists with "secondary tenants running as separate SQL-only servers". For dedicated/self-hosted deployments we will have secondary tenants running in-process with KV, with a 1:1 relationship with the KV node. In that case, we will be able to use the base algorithm.
By this do you mean that these secondary tenants will be aware of a local `NodeDescriptor`?
We don't need the entire node descriptor though? Just the ID and locality attributes? We're already planning to include those in the `sql_liveness` table.
We'll definitely need this fixed before we can make MR Serverless a reality. It's a "ship-stopper" issue. |
This looks straightforward. The fix will involve a bit of plumbing and a generalization of

```go
func (rs ReplicaSlice) OptimizeReplicaOrder(nodeDesc *roachpb.NodeDescriptor, latencyFn LatencyFunc)
```

to

```go
// nodeID can be 0, in which case it is ignored
// latencyFn can be nil, in which case it will not be used
func (rs ReplicaSlice) OptimizeReplicaOrder(nodeID roachpb.NodeID, latencyFn LatencyFunc, locality roachpb.Locality)
```

Once we have that, the rest is the plumbing mentioned above.
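To make the proposal concrete, here is a minimal, self-contained sketch of what that generalization could look like. All of the types below (`NodeID`, `Locality`, `ReplicaInfo`, and so on) are simplified stand-ins for the `roachpb`/`kvcoord` types, and the ranking logic and helpers such as `matchingTiers` are assumptions for illustration, not the actual CockroachDB implementation.

```go
package main

import (
	"fmt"
	"math/rand"
	"sort"
	"time"
)

type NodeID int32

type Tier struct{ Key, Value string }

type Locality struct{ Tiers []Tier }

type ReplicaInfo struct {
	NodeID   NodeID
	Locality Locality
	Addr     string
}

type ReplicaSlice []ReplicaInfo

// LatencyFunc returns a measured latency to addr, if one is available.
type LatencyFunc func(addr string) (time.Duration, bool)

// matchingTiers counts how many leading locality tiers two localities share,
// mirroring the diversity heuristic mentioned elsewhere in this issue.
func matchingTiers(a, b Locality) int {
	n := 0
	for i := range a.Tiers {
		if i >= len(b.Tiers) || a.Tiers[i] != b.Tiers[i] {
			break
		}
		n++
	}
	return n
}

// OptimizeReplicaOrder follows the proposed signature: nodeID may be 0
// (unknown, as on a SQL-only tenant pod) and latencyFn may be nil, in which
// case locality tier matching decides the order.
func (rs ReplicaSlice) OptimizeReplicaOrder(
	nodeID NodeID, latencyFn LatencyFunc, locality Locality,
) {
	// With no information at all, shuffle to spread load rather than always
	// hitting replicas in the same arbitrary order.
	if nodeID == 0 && latencyFn == nil && len(locality.Tiers) == 0 {
		rand.Shuffle(len(rs), func(i, j int) { rs[i], rs[j] = rs[j], rs[i] })
		return
	}
	sort.SliceStable(rs, func(i, j int) bool {
		// A replica on our own node always sorts first; this never fires
		// when nodeID is 0.
		if rs[i].NodeID == nodeID {
			return rs[j].NodeID != nodeID
		}
		if rs[j].NodeID == nodeID {
			return false
		}
		// Prefer measured latency when it is known for both replicas.
		if latencyFn != nil {
			li, oki := latencyFn(rs[i].Addr)
			lj, okj := latencyFn(rs[j].Addr)
			if oki && okj {
				return li < lj
			}
		}
		// Otherwise prefer the replica sharing more locality tiers with us.
		return matchingTiers(locality, rs[i].Locality) > matchingTiers(locality, rs[j].Locality)
	})
}

func main() {
	rs := ReplicaSlice{
		{NodeID: 1, Locality: Locality{[]Tier{{"region", "us-east1"}}}, Addr: "n1"},
		{NodeID: 2, Locality: Locality{[]Tier{{"region", "us-west1"}}}, Addr: "n2"},
	}
	// A tenant SQL pod in us-west1: no node ID, no latency samples, but a
	// locality, which is enough to pick the nearby replica.
	rs.OptimizeReplicaOrder(0, nil, Locality{Tiers: []Tier{{"region", "us-west1"}}})
	fmt.Println(rs[0].Addr) // n2
}
```

The key property is that the node ID is optional: a SQL-only tenant pod can pass 0 and still get a locality-aware ordering instead of a random one.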
The dist sender uses node locality information to rank replicas of a range by latency. Previously, this node locality information was read off a node descriptor available in Gossip. Unfortunately, secondary tenants do not have access to Gossip, and as such, would end up randomizing this list of replicas. This manifested itself through unpredictable latencies when running follower reads.

We're no longer susceptible to this hazard with this patch. This is done by eschewing the need for a node descriptor from Gossip in the DistSender; instead, we now instantiate the DistSender with locality information. However, we do still use Gossip to get the current node's ID when ranking replicas. This is done to ascertain if there is a local replica, and if there is, to always route to it. Unfortunately, because secondary tenants don't have access to Gossip, they can't conform to these semantics. They're susceptible to a hazard where a request may be routed to another replica in the same locality tier as the client even though the client has a local replica as well. This shouldn't be a concern in practice given the diversity heuristic. It also shouldn't be a concern given that tenant SQL pods don't run in-process with KV nodes.

Resolves cockroachdb#81000

Release note (bug fix): fix an issue where secondary tenants could route follower reads to a random, far away replica instead of one closer.
85853: kv: ensure secondary tenants route follower reads to the closest replica r=arulajmani a=arulajmani

(PR description as above.)

85878: gcjob: issue DeleteRange tombstones and then wait for GC r=ajwerner a=ajwerner

Note that this does not change anything about tenant GC.

Fixes #70427

Release note (sql change): The asynchronous garbage collection process has been changed such that very soon after dropping a table, index, or database, or after refreshing a materialized view, the system will issue range deletion tombstones over the dropped data. These tombstones will result in the KV statistics properly counting these bytes as garbage. Before this change, the asynchronous "gc job" would wait out the TTL and then issue a lower-level operation to clear out the data. That meant that while the job was waiting out the TTL, the data would appear in the statistics to still be live. This was confusing.

Co-authored-by: Arul Ajmani <[email protected]>
Co-authored-by: Andrew Werner <[email protected]>
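As a rough illustration of the construction-time plumbing the PR description implies, here is a sketch reusing the simplified types from the earlier code block. `distSenderConfigSketch` and its fields are hypothetical stand-ins; the real DistSender configuration is richer and its field names may differ.

```go
// distSenderConfigSketch is a hypothetical, simplified analogue of the
// DistSender's construction config. The point of the fix is that locality
// becomes an explicit constructor input instead of being read off a Gossip
// node descriptor at request time.
type distSenderConfigSketch struct {
	NodeID   NodeID   // stays 0 on SQL-only tenant pods
	Locality Locality // always known, even without access to Gossip
	Latency  LatencyFunc
	// RPC context, range descriptor cache, etc. elided.
}

var (
	usEast = Locality{Tiers: []Tier{{Key: "region", Value: "us-east1"}}}
	usWest = Locality{Tiers: []Tier{{Key: "region", Value: "us-west1"}}}

	// A KV node can supply both its node ID and its locality...
	kvNodeCfg = distSenderConfigSketch{NodeID: 1, Locality: usEast}
	// ...while a tenant SQL pod supplies only its locality, so replica
	// ordering falls back to locality matching rather than randomization.
	tenantCfg = distSenderConfigSketch{Locality: usWest}
)
```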
Describe the problem
The `DistSender` sends batch requests that interact with a range to the range's replicas. In general, it does so by ordering replicas based on the latency between the requesting node and the node on which each replica lives. If the request must be sent to the leaseholder (and the leaseholder is known), the leaseholder is moved to the front of the queue. See `cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go`, lines 1951 to 1975 at `4d1f40b`.
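As a rough illustration of those two steps (rank replicas by proximity, then force the leaseholder to the front), here is a sketch building on the simplified types from the earlier code block; `routeOrder` and `moveToFront` are illustrative names, not the actual DistSender identifiers.

```go
// moveToFront shifts the replica on the given node to position 0 while
// preserving the relative order of the others.
func (rs ReplicaSlice) moveToFront(target NodeID) {
	for i := range rs {
		if rs[i].NodeID == target {
			r := rs[i]
			copy(rs[1:i+1], rs[:i])
			rs[0] = r
			return
		}
	}
}

// routeOrder sketches the routing step described above: rank the replicas
// by proximity to this process, then, if the request cannot be served by a
// follower and the leaseholder is known, put the leaseholder first.
func routeOrder(
	rs ReplicaSlice,
	nodeID NodeID, latencyFn LatencyFunc, locality Locality,
	leaseholder NodeID, needLeaseholder bool,
) ReplicaSlice {
	rs.OptimizeReplicaOrder(nodeID, latencyFn, locality)
	if needLeaseholder && leaseholder != 0 {
		rs.moveToFront(leaseholder)
	}
	return rs
}
```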
For follower-read requests, this has the effect that they are always routed to the nearest replica. Unfortunately, this doesn't work quite as intended for secondary tenants. Because a secondary tenant's `DistSender` is unaware of which node it is running on, it ends up randomly ordering the replica slice because of `cockroach/pkg/kv/kvclient/kvcoord/replica_slice.go`, lines 205 to 208 at `4d1f40b`.
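The cited lines are the fallback this issue is about: with no node descriptor, the replica slice is simply shuffled. A simplified sketch of that pre-fix behavior, again building on the types from the earlier code block (`NodeDescriptor` here is a stand-in for `roachpb.NodeDescriptor`):

```go
// NodeDescriptor is a simplified stand-in for roachpb.NodeDescriptor.
type NodeDescriptor struct {
	NodeID   NodeID
	Locality Locality
}

// preFixOptimizeReplicaOrder sketches the old, descriptor-based entry point.
// A secondary tenant has no node descriptor, so it always takes the shuffle
// branch and follower reads end up on a random replica.
func preFixOptimizeReplicaOrder(rs ReplicaSlice, nodeDesc *NodeDescriptor, latencyFn LatencyFunc) {
	if nodeDesc == nil {
		rand.Shuffle(len(rs), func(i, j int) { rs[i], rs[j] = rs[j], rs[i] })
		return
	}
	// Otherwise rank by latency/locality using the node descriptor, as in
	// the generalized sketch earlier in this issue.
	rs.OptimizeReplicaOrder(nodeDesc.NodeID, latencyFn, nodeDesc.Locality)
}
```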
The reason a secondary tenant's `DistSender` is unaware of which node it is running on is that this cast fails: `cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go`, lines 564 to 568 at `cc155a1`.
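For illustration, the shape of that failing assertion looks roughly like the following; `nodeDescStore`, `gossipSketch`, and `getNodeDescriptor` are stand-ins for the real kvcoord/gossip identifiers, not the exact code. Only on a KV node is the concrete value behind the interface a `*gossip.Gossip`, so on a SQL-only tenant pod the assertion fails and no local node descriptor is ever found.

```go
// nodeDescStore stands in for the narrow interface the DistSender is handed
// for looking up node descriptors; tenants receive a non-Gossip
// implementation of it.
type nodeDescStore interface{}

// gossipSketch stands in for *gossip.Gossip, which can report the local
// node's descriptor.
type gossipSketch interface {
	LocalNodeDescriptor() *NodeDescriptor
}

type distSenderSketch struct {
	nodeDescs nodeDescStore
}

// getNodeDescriptor mirrors the failing cast: if the descriptor source is
// not Gossip (as on a secondary tenant), we learn nothing about the local
// node and replica ordering degrades to the shuffle shown earlier.
func (ds *distSenderSketch) getNodeDescriptor() *NodeDescriptor {
	g, ok := ds.nodeDescs.(gossipSketch) // fails for secondary tenants
	if !ok {
		return nil
	}
	return g.LocalNodeDescriptor()
}
```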
@nvanbenschoten was this the hazard you had in mind in the TODO in those lines, or is there more here?
To Reproduce
I discovered this when trying to convert our `follower_reads` roachtest to run as secondary tenants. TODO(arul): link a WIP PR.
Expected behavior
Secondary tenants should take network latency into account when routing requests. Consequently, follower reads issued by secondary tenants should be served from the nearest replica (instead of a random one).
Additional context
We'd want to solve this to harness many of the benefits listed in #72593 for secondary tenants.
cc @cockroachdb/kv
Jira issue: CRDB-15402
Epic: CRDB-14202