
sql: latency stats cause large jumps in tail latency in multi-region clusters #59093

Closed
nvanbenschoten opened this issue Jan 17, 2021 · 1 comment · Fixed by #59131
Labels: A-multiregion (Related to multi-region) · A-sql-execution (Relating to SQL execution.) · C-bug (Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.)

Comments

@nvanbenschoten (Member)

In #55705, we added network latency stats to EXPLAIN ANALYZE diagrams, which was a very cool observability improvement! With that change, users of EXPLAIN ANALYZE can now better understand the latency between the nodes in their dataflow graph. This is a win for single-region clusters and an even larger win for multi-region clusters.

Unfortunately, the change also seems to be causing some issues. In recent multi-region demos, we've seen follower-read latency periodically jump from 1-2ms up to 80-160ms. We tracked this down to the call that the newly introduced LatencyGetter.GetLatency makes through NodesStatusServer.Nodes every 5 seconds. That call performs a synchronous, consistent read of the /System/StatusNode table to refresh the cached latency figures.

This will clearly not work well in a multi-region cluster, where the /System/StatusNode table may live in a different region. In fact, it's a flawed approach even for single-region clusters. We should not perform expensive work synchronously during query execution unless it is strictly necessary to service the query, and we certainly shouldn't do so while holding an exclusive lock that blocks every other statement on that work. Doing so causes a latency spike for all queries every 5 seconds, driving up tail latencies.
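To make the failure mode concrete, here's a schematic of the pattern being described (hypothetical names, not the actual LatencyGetter code): any query that arrives while the refresh is in flight queues behind the mutex, so the cross-region round trip gets charged to all of them.

```go
// A schematic of the pattern described above, with hypothetical names
// (latencyGetter, statusServerNodes); this is not the actual CockroachDB code.
package sketch

import (
	"sync"
	"time"
)

// statusServerNodes stands in for the synchronous, consistent read of the
// /System/StatusNode table via NodesStatusServer.Nodes.
func statusServerNodes() map[int32]time.Duration {
	time.Sleep(80 * time.Millisecond) // simulate a cross-region round trip
	return map[int32]time.Duration{}
}

type latencyGetter struct {
	mu              sync.Mutex
	lastUpdate      time.Time
	cachedLatencies map[int32]time.Duration
}

// getLatency refreshes the cache at most once every 5 seconds, but it does so
// synchronously and while holding mu, so the query that triggers the refresh
// (and every query queued behind it) pays the full round trip.
func (lg *latencyGetter) getLatency(targetNodeID int32) time.Duration {
	lg.mu.Lock()
	defer lg.mu.Unlock()
	if time.Since(lg.lastUpdate) > 5*time.Second {
		lg.cachedLatencies = statusServerNodes()
		lg.lastUpdate = time.Now()
	}
	return lg.cachedLatencies[targetNodeID]
}
```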

If we want to keep this latency map relatively up to date, I'd suggest we introduce a background process that periodically polls the NodesStatusServer. I'd also suggest we avoid any locking here. Let's treat the map as copy-on-write: build a new one on each update and publish it with an atomic.StorePointer. LatencyGetter.GetLatency can then just load a reference to the map with an atomic.LoadPointer and perform whatever computation is needed.
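A minimal sketch of that shape (again with hypothetical names, not a proposed final implementation): a background goroutine rebuilds the map on its own timer and publishes it with atomic.StorePointer, while readers only ever do an atomic.LoadPointer, so no query path ever blocks on the refresh.

```go
// Copy-on-write latency map: one writer publishes fresh snapshots, readers
// never lock. Names here are illustrative only.
package sketch

import (
	"sync/atomic"
	"time"
	"unsafe"
)

type nodeLatencies map[int32]time.Duration

type latencyMap struct {
	ptr unsafe.Pointer // *nodeLatencies
}

// store publishes a freshly built snapshot. Old snapshots remain valid for
// readers that already loaded them and are garbage collected afterwards.
func (m *latencyMap) store(nl nodeLatencies) {
	atomic.StorePointer(&m.ptr, unsafe.Pointer(&nl))
}

// load returns the current snapshot without any locking.
func (m *latencyMap) load() nodeLatencies {
	if p := (*nodeLatencies)(atomic.LoadPointer(&m.ptr)); p != nil {
		return *p
	}
	return nil
}

// refreshLoop is the background process: it polls the (hypothetical) fetch
// function on its own timer and swaps in the result, off the query path.
func refreshLoop(m *latencyMap, interval time.Duration, fetch func() nodeLatencies, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			m.store(fetch())
		case <-stop:
			return
		}
	}
}
```

GetLatency would then read the snapshot via load() and report an unknown/zero latency for nodes not yet present, rather than ever reaching out to the status server inline.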

It also strikes me as odd that any of this is needed. We already track latency in the RPC context (see rpc.Context.RemoteClocks.Latency), which is what DistSender uses to rank replicas by latency. Did we consider using that for this use case instead of hitting the StatusServer?
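For comparison, a minimal sketch of what that would look like, assuming the RPC context exposes a per-node latency lookup the way DistSender's latency-based replica ordering consumes it (the interface below is a stand-in, not the actual rpc.Context API):

```go
// Reuse the latency measurements the node already maintains in memory: no KV
// reads and no locking on the query path. latencySource is a hypothetical
// stand-in for the lookup that rpc.Context.RemoteClocks provides.
package sketch

import "time"

type latencySource interface {
	// Latency reports the measured round-trip time to the given node
	// address, if one has been recorded.
	Latency(addr string) (time.Duration, bool)
}

// latenciesTo collects round-trip times to a set of node addresses for the
// EXPLAIN ANALYZE diagram; nodes without a measurement are simply omitted
// rather than blocking to find out.
func latenciesTo(src latencySource, addrs []string) map[string]time.Duration {
	out := make(map[string]time.Duration, len(addrs))
	for _, addr := range addrs {
		if lat, ok := src.Latency(addr); ok {
			out[addr] = lat
		}
	}
	return out
}
```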

Until we make one of these changes, I suggest that we disable these latency statistics.

@nvanbenschoten nvanbenschoten added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-sql-execution Relating to SQL execution. A-multiregion Related to multi-region labels Jan 17, 2021
@asubiotto (Contributor)

Thanks for pointing this out! PR incoming to use the RPCContext since it's the cleanest solution.
