*: set health status to unknown when raftstore gets stuck #12411
Conversation
[REVIEW NOTIFICATION] This pull request has been approved by:

To complete the pull request process, please ask the reviewers in the list to review. The full list of commands accepted by this bot can be found here. Reviewers can indicate their review by submitting an approval review.
Force-pushed from daea447 to 286876b (Compare)
I'm not sure whether this is a good approach. If there are better ways to do this, I'm happy to change it.
Signed-off-by: Yilin Chen <[email protected]>
Force-pushed from 286876b to 684fec5 (Compare)
When will the slow score be recalculated? If the store is stuck for 5 seconds and then recovers, how can the client notice the recovery in time?
The slow score calculation is driven by the PD worker. It sends … As for client-go, if it finds that a store is not healthy, it will recheck its status every 1 second.
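The client-side recheck described above can be sketched in Rust for consistency with the rest of this page, although the real logic lives in client-go. The `StoreHealth` type, the `needs_recheck` method, and the exact placement of the 1-second interval are illustrative assumptions, not the actual client-go code:

```rust
use std::time::{Duration, Instant};

// Hypothetical model: an unhealthy store is re-probed about once per
// second, while a healthy store is used directly without rechecking.
struct StoreHealth {
    healthy: bool,
    last_check: Instant,
}

impl StoreHealth {
    const RECHECK_INTERVAL: Duration = Duration::from_secs(1);

    fn needs_recheck(&self, now: Instant) -> bool {
        // Only unhealthy stores are rechecked, at most once per interval.
        !self.healthy && now.duration_since(self.last_check) >= Self::RECHECK_INTERVAL
    }
}

fn main() {
    let t0 = Instant::now();
    let store = StoreHealth { healthy: false, last_check: t0 };
    // Immediately after a check it is too early; after the interval it is due.
    assert!(!store.needs_recheck(t0));
    assert!(store.needs_recheck(t0 + Duration::from_secs(2)));
    println!("ok");
}
```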
So in this case, using the health status will cause a regression, right?
Personally, I think it is an acceptable regression. The update interval is 15 seconds, so the health status does not change unless the raftstore stays stuck for more than 15 seconds. And in the extreme situation where it has already been stuck for 15 seconds, waiting another 15 seconds to recover does not seem like a big problem.
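The timing argument above can be checked with simple arithmetic. This sketch only encodes the numbers from the discussion (the 15-second interval comes from the comment; the function name and the exact bound are illustrative):

```rust
// The 15-second update interval is taken from the discussion above.
const UPDATE_INTERVAL_SECS: u64 = 15;

// Hypothetical bound: a stall shorter than one interval never flips the
// health status at all; a longer stall costs at most one interval to
// detect and one more interval for the recovery to be observed.
fn worst_case_round_trip_secs(stuck_secs: u64) -> Option<u64> {
    if stuck_secs < UPDATE_INTERVAL_SECS {
        None // status never changes, no extra recovery cost
    } else {
        Some(2 * UPDATE_INTERVAL_SECS)
    }
}

fn main() {
    // A 5-second stall (the earlier question) never flips the status.
    assert_eq!(worst_case_round_trip_secs(5), None);
    // A 20-second stall is detected and recovered within 30 seconds.
    assert_eq!(worst_case_round_trip_secs(20), Some(30));
    println!("ok");
}
```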
How about, if UNKNOWN is returned, scheduling the requests in a random fashion?
In the current client-go implementation, if the liveness of a store is not … So, if the liveness is still not …
@BusyJay Do you have any other concerns about the change? |
```rust
    ServingStatus::Serving
};
if let Some(health_service) = &self.health_service {
    health_service.set_serving_status("", health_status);
}
```
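The snippet above pushes a status into the gRPC health service. A minimal sketch of how a slow score might be mapped to a serving status, assuming the score saturates at a maximum when the raftstore is stuck (the local `ServingStatus` enum, the function, and the threshold of 100 are assumptions for illustration, not TiKV's actual code):

```rust
// Local stand-in for the gRPC health ServingStatus; in TiKV this comes
// from the health-checking protocol types.
#[derive(Debug, PartialEq)]
enum ServingStatus {
    Serving,
    ServiceUnknown,
}

// Hypothetical mapping: report ServiceUnknown only when the slow score
// has saturated, i.e. the raftstore looks completely stuck rather than
// merely slow. The threshold of 100.0 is an assumption.
fn health_status_from_slow_score(slow_score: f64) -> ServingStatus {
    if slow_score >= 100.0 {
        ServingStatus::ServiceUnknown
    } else {
        ServingStatus::Serving
    }
}

fn main() {
    assert_eq!(health_status_from_slow_score(1.0), ServingStatus::Serving);
    assert_eq!(health_status_from_slow_score(100.0), ServingStatus::ServiceUnknown);
    println!("ok");
}
```

This matches the conservative stance argued below: a merely elevated score keeps the store `Serving`, so only a total stall triggers forwarding.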
How about updating the health status according to the slow score or the ratio of timeout records? For instance, if the ratio of timeout records exceeds `ratio_thresh`, then set the status to `ServiceUnknown`, in the same way we update the slow score.
I'm inclined to be conservative. This PR only intends to solve the problem when the raftstore is totally stuck, not just slow.
Because being slow does not mean the leader will certainly transfer, setting the status to unknown too easily would cause unexpected request forwarding.
And if the leader transfers due to mere slowness, TiKV can still return `NotLeader` to the client; that does not need the involvement of the health status.
This seems wrong to me. The new behavior may cause one slow store to slow down other nodes, as forwarding is not cost-free.
Signed-off-by: Yilin Chen <[email protected]>
Now I have changed the unknown status to be "strict enter, easy exit". The status is changed to unknown only if no …
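The "strict enter, easy exit" policy described above can be modeled as a small state machine: the status flips to unknown only after a whole detection window passes with no observed raftstore progress, but any observed progress restores serving immediately. All names and the window mechanics here are illustrative assumptions, not the PR's actual implementation:

```rust
#[derive(Debug, PartialEq)]
enum Status {
    Serving,
    Unknown,
}

// Hypothetical tracker for the "strict enter, easy exit" policy.
struct HealthTracker {
    saw_progress: bool,
    status: Status,
}

impl HealthTracker {
    fn new() -> Self {
        HealthTracker { saw_progress: true, status: Status::Serving }
    }

    // "Easy exit": any observed raftstore progress restores Serving at once.
    fn record_progress(&mut self) {
        self.saw_progress = true;
        self.status = Status::Serving;
    }

    // "Strict enter": called at the end of each detection window; only a
    // window with zero observed progress flips the status to Unknown.
    fn on_window_end(&mut self) {
        if !self.saw_progress {
            self.status = Status::Unknown;
        }
        self.saw_progress = false;
    }
}

fn main() {
    let mut t = HealthTracker::new();
    t.on_window_end(); // progress was seen initially: still Serving
    assert_eq!(t.status, Status::Serving);
    t.on_window_end(); // a full window with no progress: Unknown
    assert_eq!(t.status, Status::Unknown);
    t.record_progress(); // easy exit: back to Serving immediately
    assert_eq!(t.status, Status::Serving);
    println!("ok");
}
```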
/merge
@sticnarf: It seems you want to merge this PR. I will help you trigger all the tests: /run-all-tests. You only need to trigger …

If you have any questions about the PR merge process, please refer to pr process. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.
This pull request has been accepted and is ready to merge. Commit hash: 3bd0924
Signed-off-by: Yilin Chen <[email protected]>
…12447) Signed-off-by: Yilin Chen <[email protected]>
/run-cherry-picker
Signed-off-by: ti-srebot <[email protected]>
cherry pick to release-6.1 in PR #12816
Signed-off-by: ti-srebot <[email protected]>
cherry pick to release-5.3 in PR #12817
Signed-off-by: ti-srebot <[email protected]>
cherry pick to release-5.4 in PR #12818
tikv#12447) Signed-off-by: Yilin Chen <[email protected]>
…12817) close #12398, ref #12411

The client uses the health service to determine whether it should still send requests to this TiKV or whether it should refresh the related region cache. But if the raftstore alone becomes unavailable because of an IO hang or bugs, the health service still returns the Serving status. This may mislead the TiKV client and increase the recovery time even if the leader has already been transferred to other TiKV instances. This commit reuses the slow score calculation mechanism to detect whether the raftstore is working normally, so the client can refresh its region cache in time.

Signed-off-by: Yilin Chen <[email protected]> Co-authored-by: Yilin Chen <[email protected]> Co-authored-by: Ti Chi Robot <[email protected]>
Signed-off-by: ti-srebot <[email protected]> Signed-off-by: Yilin Chen <[email protected]>
Signed-off-by: ti-srebot <[email protected]>
Signed-off-by: ti-srebot <[email protected]> Signed-off-by: Yilin Chen <[email protected]>
Signed-off-by: ti-srebot <[email protected]> Signed-off-by: Yilin Chen <[email protected]>
…12818) close #12398, ref #12411

The client uses the health service to determine whether it should still send requests to this TiKV or whether it should refresh the related region cache. But if the raftstore alone becomes unavailable because of an IO hang or bugs, the health service still returns the Serving status. This may mislead the TiKV client and increase the recovery time even if the leader has already been transferred to other TiKV instances. This commit reuses the slow score calculation mechanism to detect whether the raftstore is working normally, so the client can refresh its region cache in time.

Signed-off-by: ti-srebot <[email protected]> Signed-off-by: Yilin Chen <[email protected]> Co-authored-by: Yilin Chen <[email protected]> Co-authored-by: Ti Chi Robot <[email protected]>
What is changed and how it works?
Issue Number: Close #12398
What's Changed:
Check List
Tests
Release note