kvclient: don't send follower reads to unhealthy nodes #112351
Comments
Excluding the draining node might be an easier approach compared to tweaking the replica sorting. If we are adding node health or IO overload to the sorting equation, should we make modifications to …

This is an interesting issue - I'm excited to read over the future PR that fixes it.
Stop follower reads on draining, decommissioning or unhealthy nodes. Epic: none Fixes: cockroachdb#112351 Release note (performance improvement): This change prevents requests from being issued to followers that are draining, decommissioning or unhealthy, which avoids latency spikes if those nodes later go offline.
This issue was discussed on the Roblox monthly call and has been flagged as a blocker for which we need to provide an ETA and/or a target release in 23.x.
This PR replaces ReplicaSlice with ReplicaSet outside of the DistSender. ReplicaSlice is an internal implementation detail that carries information required only for sorting the replicas; outside of the DistSender, that additional sorting information is unnecessary and unused. Epic: none Informs: cockroachdb#112351 Release note: None
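The boundary described above can be sketched roughly as follows. The types here are simplified stand-ins (the real ReplicaSlice/ReplicaSet hold replica descriptors, not bare node IDs): a slice type that carries per-replica sort metadata internally, and a conversion that strips that metadata before the collection leaves the DistSender.

```go
package main

import "fmt"

// ReplicaSet is the plain collection handed to code outside the DistSender.
// Hypothetical, simplified representation using bare node IDs.
type ReplicaSet struct {
	nodeIDs []int
}

// sortEntry pairs a replica with sorting-only metadata that is meaningful
// only inside the DistSender.
type sortEntry struct {
	nodeID       int
	localityDist int // used only while ordering replicas
}

// ReplicaSlice is the DistSender-internal form that carries sort metadata.
type ReplicaSlice []sortEntry

// AsReplicaSet drops the sorting metadata at the DistSender boundary,
// preserving the replica order.
func (s ReplicaSlice) AsReplicaSet() ReplicaSet {
	ids := make([]int, len(s))
	for i, e := range s {
		ids[i] = e.nodeID
	}
	return ReplicaSet{nodeIDs: ids}
}

func main() {
	s := ReplicaSlice{{nodeID: 2, localityDist: 1}, {nodeID: 5, localityDist: 0}}
	fmt.Println(s.AsReplicaSet().nodeIDs) // [2 5]
}
```

Keeping the metadata out of the exported type means callers cannot accidentally depend on sort-internal state.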
113942: kvclient: optimize and clean up sorting computation r=nvanbenschoten a=andrewbaptist

Previously the locality distance and the latency function were computed multiple times for each node in the sort.Slice method. This change computes the values once when the ReplicaSlice is created and uses simple comparisons within the sorting loop. Epic: none Informs: #112351 Release note: None

114240: rangefeed: fix scheduler catchup iterator race r=erikgrinaker a=erikgrinaker

It was possible for the scheduled processor to hand ownership of the catchup iterator over to the registration, but claim that it didn't by returning `false` from `Register()`. This can happen if the registration request is queued concurrently with a processor shutdown: the registration will execute the catchup scan and close the iterator, but the caller will think it wasn't registered and double-close the iterator. This patch fixes the race, and also documents the necessary invariant along with a runtime assertion. Resolves #114192. Epic: none Release note: None

114309: logictest: add test for mixed-version configs r=RaduBerinde a=RaduBerinde

This commit adds a test that verifies that for each supported previous release we have a logictest config that bootstraps the cluster at that version. Informs: #112629 Release note: None

Co-authored-by: Andrew Baptist <[email protected]>
Co-authored-by: Erik Grinaker <[email protected]>
Co-authored-by: Radu Berinde <[email protected]>
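The optimization in 113942 follows a general pattern: a sort.Slice comparator runs O(n log n) times, so an expensive lookup inside it is repeated many times per element, while computing a sort key once per replica costs only O(n) lookups. A minimal sketch of that idea, with hypothetical names (measureLatency stands in for the real locality-distance/latency computation):

```go
package main

import (
	"fmt"
	"sort"
)

// replicaInfo caches the sort key so the comparator stays cheap.
type replicaInfo struct {
	nodeID  int
	latency float64 // precomputed once, not inside the comparator
}

// measureLatency is a stand-in for an expensive locality-distance or
// RPC-latency lookup; deterministic here for demonstration.
func measureLatency(nodeID int) float64 {
	return float64((nodeID * 7919) % 13)
}

// sortByLatency computes each node's latency exactly once, then sorts
// using simple comparisons of the cached values.
func sortByLatency(nodeIDs []int) []replicaInfo {
	infos := make([]replicaInfo, len(nodeIDs))
	for i, id := range nodeIDs {
		infos[i] = replicaInfo{nodeID: id, latency: measureLatency(id)} // once per replica
	}
	sort.Slice(infos, func(i, j int) bool {
		return infos[i].latency < infos[j].latency // cheap comparison only
	})
	return infos
}

func main() {
	fmt.Println(sortByLatency([]int{4, 1, 3, 2}))
}
```

Besides saving work, precomputing the key also makes the comparator trivially consistent, which sort.Slice requires for a correct ordering.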
Describe the problem
A draining node is likely to shut down soon. By moving the leaseholders off it we stop all non-follower reads to that node, but we don't prevent follower reads from using it. Since we already know that it is draining, we should either exclude draining or suspect nodes, or sort them last. This could be enhanced to also exclude nodes with IO overload, or more generally to take the health of the node into account when deciding which node to send a follower read to.
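The exclusion proposed above can be sketched roughly as follows, under simplified assumptions: the Replica and NodeHealth types here are hypothetical (the real DistSender works with replica descriptors and gossiped liveness/health state, not these structs).

```go
package main

import "fmt"

// NodeHealth is a hypothetical summary of the health signals the issue
// mentions: draining, decommissioning, and general unhealthiness
// (e.g. failed heartbeats or IO overload).
type NodeHealth struct {
	Draining        bool
	Decommissioning bool
	Unhealthy       bool
}

// Replica is a simplified follower-read candidate.
type Replica struct {
	NodeID int
	Health NodeHealth
}

// filterHealthy drops followers on draining, decommissioning, or unhealthy
// nodes before a follower read is routed, as the issue proposes.
func filterHealthy(replicas []Replica) []Replica {
	out := make([]Replica, 0, len(replicas))
	for _, r := range replicas {
		h := r.Health
		if h.Draining || h.Decommissioning || h.Unhealthy {
			continue // never send follower reads here
		}
		out = append(out, r)
	}
	return out
}

func main() {
	replicas := []Replica{
		{NodeID: 1, Health: NodeHealth{Draining: true}},
		{NodeID: 2},
		{NodeID: 3, Health: NodeHealth{Unhealthy: true}},
	}
	fmt.Println(filterHealthy(replicas)) // only node 2 remains
}
```

An alternative to outright exclusion, also mentioned above, is to keep such replicas but sort them last, so they remain available as a fallback if every healthy follower fails.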
To Reproduce
Expected behavior
From a client perspective, draining a node should allow a clean shutdown without a performance impact. Follower reads break this model: while all the leases are moved off, so normal reads and writes do not require the node, follower reads don't take draining into account.
Additional context
This was noticed in a customer investigation where they had a 500ms client timeout on read requests and saw an elevated number of failures when they were cycling nodes.
Jira issue: CRDB-32365