*: set health status to unknown when raftstore gets stuck #12411

Merged
merged 3 commits into tikv:master from the health-check-raftstore branch on Apr 29, 2022

Conversation

sticnarf
Contributor

What is changed and how it works?

Issue Number: Close #12398

What's Changed:

The client uses the health service to determine whether it should still send requests to this TiKV or refresh the related region cache.

But if the raftstore alone becomes unavailable because of an IO hang or a bug, the health service still returns the Serving status. This may mislead the TiKV client and increase the recovery time even if the leader has already been transferred to other TiKV instances.

This commit reuses the slow score calculation mechanism to detect whether the raftstore is working normally, so the client can refresh its region cache in time.
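
For illustration, here is a minimal Rust sketch of the idea (the function and the is_raftstore_stuck flag are made up for this sketch, not taken from the actual patch):

use grpcio_health::{HealthService, ServingStatus};

// Called periodically by the PD worker. If the stall detector decides the
// raftstore is stuck, report ServiceUnknown so clients stop trusting this
// store and refresh their region cache; otherwise keep reporting Serving.
// (Import paths may differ slightly depending on the grpcio-health version.)
fn update_health_status(health_service: &HealthService, is_raftstore_stuck: bool) {
    let health_status = if is_raftstore_stuck {
        ServingStatus::ServiceUnknown
    } else {
        ServingStatus::Serving
    };
    // An empty service name refers to the overall server health.
    health_service.set_serving_status("", health_status);
}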

Check List

Tests

  • Integration test

Release note

Report bad health status if raftstore stops working.

@ti-chi-bot
Member

ti-chi-bot commented Apr 21, 2022

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • 5kbpers
  • BusyJay

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot ti-chi-bot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Apr 21, 2022
@sticnarf sticnarf force-pushed the health-check-raftstore branch from daea447 to 286876b on April 21, 2022 12:22
@ti-chi-bot ti-chi-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 21, 2022
@sticnarf
Contributor Author

I'm not sure whether this is a good approach. If there are better ways to do this, I'm happy to change it.

@sticnarf sticnarf force-pushed the health-check-raftstore branch from 286876b to 684fec5 on April 22, 2022 02:16
@sticnarf sticnarf requested review from 5kbpers and BusyJay April 22, 2022 05:49
@BusyJay
Member

BusyJay commented Apr 22, 2022

When will the slow score be recalculated? If the store is stuck for 5 seconds and then recovers, how can the client notice the recovery in time?

@sticnarf
Contributor Author

When will the slow score be recalculated? If the store is stuck for 5 seconds and then recovers, how can the client notice the recovery in time?

The slow score calculation is driven by the PD worker. It sends StoreMsg::LatencyInspect every inspect_interval (500 ms by default), and the slow score is updated every 30 ticks. So the health status will recover within 15 seconds (30 × 500 ms).

As for client-go, if it finds a store is not healthy, it rechecks the status every second.
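
To make the arithmetic concrete, a minimal sketch of that timing (the SlowScoreTicker type and its fields are invented for this sketch, not the actual PD worker code):

use std::time::Duration;

// The slow score (and hence the health status) is only re-evaluated once
// every UPDATE_TICKS latency-inspection ticks.
const UPDATE_TICKS: u32 = 30;

struct SlowScoreTicker {
    inspect_interval: Duration, // 500 ms by default
    tick: u32,
}

impl SlowScoreTicker {
    // Returns true when the slow score should be re-evaluated on this tick.
    fn on_tick(&mut self) -> bool {
        self.tick += 1;
        self.tick % UPDATE_TICKS == 0
    }

    // 30 ticks * 500 ms = 15 s between health status updates.
    fn update_interval(&self) -> Duration {
        self.inspect_interval * UPDATE_TICKS
    }
}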

@BusyJay
Member

BusyJay commented Apr 22, 2022

So in this case, using the health status will cause a regression, right?

@sticnarf
Contributor Author

sticnarf commented Apr 25, 2022

So in this case, using the health status will cause a regression, right?

Personally, I think it is an acceptable regression. The update interval is 15 seconds, so the health status does not change until the raftstore stays stuck for more than 15 seconds. And in such an extreme situation, where it has already been stuck for 15 seconds, waiting another 15 seconds to recover does not seem like a big problem.

@BusyJay
Member

BusyJay commented Apr 27, 2022

How about, if UNKNOWN is returned, scheduling the requests in a random fashion?

@sticnarf
Contributor Author

How about, if UNKNOWN is returned, scheduling the requests in a random fashion?

In the current client-go implementation, if the liveness of a store is not reachable, the store is not filtered out. If request forwarding is enabled, the client tries forwarding; otherwise, it only refreshes the cache related to that store.

So, if the liveness is still not reachable but the latest region info from PD indicates the leader is on that store, the client will still send requests to that store. The only affected case is when request forwarding is enabled: forwarding will keep being used until the liveness status becomes healthy.

@sticnarf
Contributor Author

@BusyJay Do you have any other concerns about the change?

    ServingStatus::Serving
};
if let Some(health_service) = &self.health_service {
    health_service.set_serving_status("", health_status);
}
Member

How about updating the health status according to the slow score or the ratio of timeout records? For instance, if the ratio of timeout records exceeds ratio_thresh, set the status to ServiceUnknown, in the same way we update the slow score.

Contributor Author

I'm inclined to be conservative. This PR only intends to solve the problem of the raftstore being totally stuck, not just slow.

Because being slow does not mean the leader will certainly be transferred, setting the status to unknown too easily would cause unexpected request forwarding.

And if the leader is transferred due to mere slowness, TiKV can still return NotLeader to the client; the health status does not need to be involved.

components/raftstore/src/store/worker/pd.rs (outdated review comment, resolved)
@BusyJay
Member

BusyJay commented Apr 29, 2022

, if the liveness of a store is not reachable, the store is not filtered out. If request forwarding is enabled, the client tries forwarding

This seems wrong to me. The new behavior may cause one slow store to slow down other nodes, as forwarding is not cost-free.

@sticnarf
Contributor Author

, if the liveness of a store is not reachable, the store is not filtered out. If request forwarding is enabled, the client tries forwarding

This seems wrong to me. The new behavior may cause one slow store to slow down other nodes, as forwarding is not cost-free.

Now, I have changed the unknown status to be "strict enter, easy exit".

The status is changed to unknown only if no StoreMsg has been handled within 15 seconds (in this case, the leader should have been dropped already). And as long as one tick does not time out, the status changes back to serving.
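
As a hedged sketch of this "strict enter, easy exit" rule (the StuckDetector type and field names are invented here, not the exact implementation):

use std::time::Duration;

use grpcio_health::ServingStatus;

// Strict enter: report ServiceUnknown only after consecutive timed-out ticks
// cover at least 15 seconds of unhandled StoreMsgs. Easy exit: a single
// on-time tick resets the counter and the status goes back to Serving.
struct StuckDetector {
    inspect_interval: Duration, // 500 ms by default
    consecutive_timeouts: u32,
}

impl StuckDetector {
    fn on_tick(&mut self, tick_timed_out: bool) -> ServingStatus {
        if tick_timed_out {
            self.consecutive_timeouts += 1;
        } else {
            self.consecutive_timeouts = 0;
        }
        if self.inspect_interval * self.consecutive_timeouts >= Duration::from_secs(15) {
            ServingStatus::ServiceUnknown
        } else {
            ServingStatus::Serving
        }
    }
}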

@ti-chi-bot ti-chi-bot added the status/LGT1 Indicates that a PR has LGTM 1. label Apr 29, 2022
@ti-chi-bot ti-chi-bot added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels Apr 29, 2022
@sticnarf
Contributor Author

/merge

@ti-chi-bot
Member

@sticnarf: It seems you want to merge this PR, I will help you trigger all the tests:

/run-all-tests

You only need to trigger /merge once, and if the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes.

If you have any questions about the PR merge process, please refer to pr process.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@ti-chi-bot
Member

This pull request has been accepted and is ready to merge.

Commit hash: 3bd0924

@ti-chi-bot ti-chi-bot added the status/can-merge Indicates a PR has been approved by a committer. label Apr 29, 2022
@sticnarf sticnarf merged commit 2552144 into tikv:master Apr 29, 2022
sticnarf added a commit to sticnarf/tikv that referenced this pull request Apr 29, 2022
sticnarf added a commit that referenced this pull request Apr 30, 2022
@sticnarf sticnarf added needs-cherry-pick-release-5.3 Type: Need cherry pick to release-5.3 needs-cherry-pick-release-5.4 Should cherry pick this PR to release-5.4 branch. needs-cherry-pick-release-6.1 Should cherry pick this PR to release-6.1 branch. labels Jun 14, 2022
@sticnarf
Contributor Author

/run-cherry-picker

ti-srebot pushed a commit to ti-srebot/tikv that referenced this pull request Jun 14, 2022
@ti-srebot
Contributor

cherry pick to release-6.1 in PR #12816

ti-srebot pushed a commit to ti-srebot/tikv that referenced this pull request Jun 14, 2022
@ti-srebot
Contributor

cherry pick to release-5.3 in PR #12817

ti-srebot pushed a commit to ti-srebot/tikv that referenced this pull request Jun 14, 2022
@ti-srebot
Contributor

cherry pick to release-5.4 in PR #12818

sticnarf added a commit to ti-srebot/tikv that referenced this pull request Jun 14, 2022
ti-chi-bot added a commit that referenced this pull request Jun 14, 2022
…12817)

close #12398, ref #12411

Signed-off-by: Yilin Chen <[email protected]>

Co-authored-by: Yilin Chen <[email protected]>
Co-authored-by: Ti Chi Robot <[email protected]>
sticnarf added a commit to ti-srebot/tikv that referenced this pull request Jun 16, 2022
Signed-off-by: ti-srebot <[email protected]>
Signed-off-by: Yilin Chen <[email protected]>
@sticnarf sticnarf removed the needs-cherry-pick-release-6.1 Should cherry pick this PR to release-6.1 branch. label Jun 16, 2022
sticnarf added a commit to ti-srebot/tikv that referenced this pull request Jun 16, 2022
sticnarf added a commit to ti-srebot/tikv that referenced this pull request Jun 16, 2022
Signed-off-by: ti-srebot <[email protected]>
Signed-off-by: Yilin Chen <[email protected]>
sticnarf added a commit to ti-srebot/tikv that referenced this pull request Jun 16, 2022
Signed-off-by: ti-srebot <[email protected]>
Signed-off-by: Yilin Chen <[email protected]>
ti-chi-bot added a commit that referenced this pull request Jun 21, 2022
…12818)

close #12398, ref #12411

Signed-off-by: ti-srebot <[email protected]>
Signed-off-by: Yilin Chen <[email protected]>

Co-authored-by: Yilin Chen <[email protected]>
Co-authored-by: Ti Chi Robot <[email protected]>