sql: latency stats cause large jumps in tail latency in multi-region clusters #59093

nvanbenschoten · 2021-01-17T07:04:16Z

In #55705, we added network latency stats to EXPLAIN ANALYZE diagrams, which was a very cool observability improvement! With that change, users of EXPLAIN ANALYZE can now better understand the latency between the nodes in their dataflow graph. This is a win for single-region clusters and an even larger win for multi-region clusters.

Unfortunately, the change also seems to be causing some issues. In recent multi-region demos, we've seen that follower reads periodically jump from 1-2ms up to 80-160ms. We tracked this down to the call that the newly introduced LatencyGetter.GetLatency performs through the NodesStatusServer.Nodes every 5 seconds. This operation performs a synchronous, consistent read of the /System/StatusNode table to update cached latency figures.

This will clearly not work well in a multi-region cluster, where this /System/StatusNode table may be stored in a different region. In fact, it's a flawed approach even for single-region clusters. We should not be performing any expensive work synchronously during the operation of a query that isn't strictly necessary to service the query. Additionally, we certainly shouldn't be doing so while holding an exclusive lock that will block every other statement on the expensive work. Doing so will cause a spike in latency for all queries every 5 seconds, driving up tail latencies.

If we want to keep this latency map relatively up to date, I'd suggest we introduce some background process that periodically polls the NodesStatusServer. I'd also suggest we avoid any locking here. Let's treat the map as copy-on-write and create a new one on each update and set it with an atomic.StorePointer. LatencyGetter.GetLatency can then just load a reference to the map using an atomic.LoadPointer and performs whatever computation is needed.

It also strikes me as odd that any of this is needed. We already track latency in the RPC context (see rpc.Context.RemoteClocks.Latency), which is what DistSender uses to rank replicas by latency. Did we consider using that for this use case instead of hitting the StatusServer?

Until we make one of these changes, I suggest that we disable these latency statistics.

The text was updated successfully, but these errors were encountered:

asubiotto · 2021-01-19T10:46:46Z

Thanks for pointing this out! PR incoming to use the RPCContext since it's the cleanest solution.

nvanbenschoten added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-sql-execution Relating to SQL execution. A-multiregion Related to multi-region labels Jan 17, 2021

nvanbenschoten assigned yuzefovich and asubiotto Jan 17, 2021

asubiotto mentioned this issue Jan 19, 2021

server: replace LatencyGetter with nodedialer.Dialer.Latency #59131

Merged

asubiotto unassigned yuzefovich Jan 19, 2021

craig bot closed this as completed in 7df5918 Jan 20, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sql: latency stats cause large jumps in tail latency in multi-region clusters #59093

sql: latency stats cause large jumps in tail latency in multi-region clusters #59093

nvanbenschoten commented Jan 17, 2021

asubiotto commented Jan 19, 2021

sql: latency stats cause large jumps in tail latency in multi-region clusters #59093

sql: latency stats cause large jumps in tail latency in multi-region clusters #59093

Comments

nvanbenschoten commented Jan 17, 2021

asubiotto commented Jan 19, 2021