sql: latency stats cause large jumps in tail latency in multi-region clusters #59093
Labels
A-multiregion
Related to multi-region
A-sql-execution
Relating to SQL execution.
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
In #55705, we added network latency stats to EXPLAIN ANALYZE diagrams, which was a very cool observability improvement! With that change, users of EXPLAIN ANALYZE can now better understand the latency between the nodes in their dataflow graph. This is a win for single-region clusters and an even larger win for multi-region clusters.
Unfortunately, the change also seems to be causing some issues. In recent multi-region demos, we've seen that follower reads periodically jump from 1-2ms up to 80-160ms. We tracked this down to the call that the newly introduced
LatencyGetter.GetLatency
performs through theNodesStatusServer.Nodes
every 5 seconds. This operation performs a synchronous, consistent read of the/System/StatusNode
table to update cached latency figures.This will clearly not work well in a multi-region cluster, where this
/System/StatusNode
table may be stored in a different region. In fact, it's a flawed approach even for single-region clusters. We should not be performing any expensive work synchronously during the operation of a query that isn't strictly necessary to service the query. Additionally, we certainly shouldn't be doing so while holding an exclusive lock that will block every other statement on the expensive work. Doing so will cause a spike in latency for all queries every 5 seconds, driving up tail latencies.If we want to keep this latency map relatively up to date, I'd suggest we introduce some background process that periodically polls the
NodesStatusServer
. I'd also suggest we avoid any locking here. Let's treat the map as copy-on-write and create a new one on each update and set it with anatomic.StorePointer
.LatencyGetter.GetLatency
can then just load a reference to the map using anatomic.LoadPointer
and performs whatever computation is needed.It also strikes me as odd that any of this is needed. We already track latency in the RPC context (see
rpc.Context.RemoteClocks.Latency
), which is whatDistSender
uses to rank replicas by latency. Did we consider using that for this use case instead of hitting theStatusServer
?Until we make one of these changes, I suggest that we disable these latency statistics.
The text was updated successfully, but these errors were encountered: