
kvclient: follower reads can be sent to slow node resulting in high latency #120519

Open
andrewbaptist opened this issue Mar 14, 2024 · 0 comments
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.

Comments


andrewbaptist commented Mar 14, 2024

Describe the problem

There are some mitigations in place to prevent follower reads from being sent to decommissioned or draining nodes as part of #112351, but this is insufficient. Specifically, there are two additional scenarios in which we should prevent sending follower reads.

  1. A node that has recently restarted. The allocator treats a recently restarted node as suspect as of kvserver: Always treat restarted nodes as suspect #97263, but these checks don't apply to follower reads. We should prevent sending follower reads to recently restarted nodes until they have had a chance to fully recover after being offline.
  2. A node that is overloaded. We prevent overloaded nodes from receiving leases as part of kvserver: consider io overload for lease transfers #96508, but again this doesn't apply to follower reads. If we stop sending follower reads to overloaded nodes, we allow the node to recover faster and additionally remove the latency impact of reads on an IO-overloaded node.

A complementary solution would be to implement #109320, which would mitigate some of the impact, but the two features will work better together. There is also some handling of how we sort replicas to reduce sending requests to nodes with high RTT, but this only handles very extreme problems and doesn't address the typical issues we see.

To reproduce

  1. Start a large cluster with both a heavy write load and a significant number of follower reads.
  2. Introduce a fault to make one of the nodes slower.
  3. Notice that the node is still receiving as many follower reads as the other nodes in the system, resulting in high P99 latency for these follower reads.

There are various faults that can be induced in step 2 above. Some options are stopping a node for an extended outage, slowing disk IO throughput, creating an index (which creates uneven load on the system), or introducing general network flakiness to a node.
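One way to induce the network-flakiness variant of step 2 is Linux traffic shaping with `tc`/`netem`. This is an illustrative ops fragment, not a prescribed procedure; it assumes root on the target node and that the relevant interface is `eth0`:

```shell
# Add 50ms of latency with 20ms jitter to all outbound traffic on the
# target node, making it a slow follower without taking it down.
tc qdisc add dev eth0 root netem delay 50ms 20ms

# ...run the workload and watch follower-read P99 latency...

# Remove the fault when done.
tc qdisc del dev eth0 root
```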

Jira issue: CRDB-36726

@andrewbaptist andrewbaptist added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Mar 14, 2024