
kvclient: follower reads can be sent to slow node resulting in high latency #120519

Open
andrewbaptist opened this issue Mar 14, 2024 · 0 comments
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.

Comments


andrewbaptist commented Mar 14, 2024

Describe the problem

There are some mitigations in place to prevent follower reads from being sent to decommissioned or draining nodes as part of #112351, but this is insufficient. Specifically, there are two additional scenarios in which we should prevent sending follower reads.

  1. A node that has recently restarted. The allocator treats a recently restarted node as suspect as of kvserver: Always treat restarted nodes as suspect #97263, but these checks don't apply to follower reads. We should prevent sending follower reads to recently restarted nodes until they have had a chance to fully recover after being offline.
  2. A node that is overloaded. We prevent overloaded nodes from receiving leases as part of kvserver: consider io overload for lease transfers #96508, but again this doesn't apply to follower reads. If we stop sending follower reads to overloaded nodes, we allow the node to recover faster and additionally remove the latency impact of reads on an IO-overloaded node.

A complementary solution would be to implement #109320, which would mitigate some of the impact, but the two features will work better together. There is also some handling of how we sort replicas to reduce sending requests to nodes with high RTT, but this only handles very extreme problems and doesn't address the typical issues we see.

To reproduce

  1. Start a large cluster with both a heavy write load and a significant number of follower reads.
  2. Introduce a fault to make one of the nodes slower.
  3. Notice that the node is still receiving as many follower reads as the other nodes in the system, resulting in high P99 latency for these follower reads.

There are various faults that can be induced in step 2 above. Some options are stopping a node for an extended outage, slowing disk IO throughput, creating an index (which creates uneven load on the system), or introducing general network flakiness to a node.
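One way to induce the network-flakiness variant of step 2 is Linux traffic shaping with `tc`/`netem`. This is an illustrative ops fragment, not a prescribed procedure; it assumes root on the target node and that the relevant interface is `eth0`:

```shell
# Add 50ms of latency with 20ms jitter to all outbound traffic on the
# target node, making it a slow follower without taking it down.
tc qdisc add dev eth0 root netem delay 50ms 20ms

# ...run the workload and watch follower-read P99 latency...

# Remove the fault when done.
tc qdisc del dev eth0 root
```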

Jira issue: CRDB-36726

@andrewbaptist andrewbaptist added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Mar 14, 2024