*: set health status to unknown when raftstore gets stuck #12411

Merged
merged 3 commits into tikv:master from the health-check-raftstore branch on Apr 29, 2022

Conversation

sticnarf
Contributor

What is changed and how it works?

Issue Number: Close #12398

What's Changed:

The client uses the health service to determine whether it should still send requests to this TiKV or refresh the related region cache.

But if the raftstore alone becomes unavailable because of an IO hang or a bug, the health service still returns the Serving status. This may mislead the TiKV client and increase the recovery time even if the leader has already been transferred to other TiKV instances.

This commit reuses the slow score calculation mechanism to detect whether the raftstore is working normally, so the client can refresh its region cache in time.
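
For illustration, here is a minimal Rust sketch of the idea (the function and the is_raftstore_stuck flag are made up for this sketch, not taken from the actual patch):

use grpcio_health::{HealthService, ServingStatus};

// Called periodically by the PD worker. If the stall detector decides the
// raftstore is stuck, report ServiceUnknown so clients stop trusting this
// store and refresh their region cache; otherwise keep reporting Serving.
// (Import paths may differ slightly depending on the grpcio-health version.)
fn update_health_status(health_service: &HealthService, is_raftstore_stuck: bool) {
    let health_status = if is_raftstore_stuck {
        ServingStatus::ServiceUnknown
    } else {
        ServingStatus::Serving
    };
    // An empty service name refers to the overall server health.
    health_service.set_serving_status("", health_status);
}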

Check List

Tests

  • Integration test

Release note

Report bad health status if raftstore stops working.

@ti-chi-bot
Member

ti-chi-bot commented Apr 21, 2022

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • 5kbpers
  • BusyJay

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot ti-chi-bot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Apr 21, 2022
@sticnarf sticnarf force-pushed the health-check-raftstore branch from daea447 to 286876b on April 21, 2022 12:22
@ti-chi-bot ti-chi-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 21, 2022
@sticnarf
Contributor Author

I'm not sure whether this is a good approach. If there are better ways to do this, I'm happy to change it.

@sticnarf sticnarf force-pushed the health-check-raftstore branch from 286876b to 684fec5 on April 22, 2022 02:16
@sticnarf sticnarf requested review from 5kbpers and BusyJay April 22, 2022 05:49
@BusyJay
Member

BusyJay commented Apr 22, 2022

When will the slow score be recalculated? If the store is stuck for 5 seconds and then recovers, how can the client notice the recovery in time?

@sticnarf
Contributor Author

When will the slow score be recalculated? If the store is stuck for 5 seconds and then recovers, how can the client notice the recovery in time?

The slow score calculation is driven by the PD worker. It sends StoreMsg::LatencyInspect every inspect_interval (500 ms by default), and the slow score is updated every 30 ticks. So the health status will recover within 15 seconds (30 × 500 ms).

As for client-go, if it finds a store is not healthy, it rechecks the status every second.
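
To make the arithmetic concrete, a minimal sketch of that timing (the SlowScoreTicker type and its fields are invented for this sketch, not the actual PD worker code):

use std::time::Duration;

// The slow score (and hence the health status) is only re-evaluated once
// every UPDATE_TICKS latency-inspection ticks.
const UPDATE_TICKS: u32 = 30;

struct SlowScoreTicker {
    inspect_interval: Duration, // 500 ms by default
    tick: u32,
}

impl SlowScoreTicker {
    // Returns true when the slow score should be re-evaluated on this tick.
    fn on_tick(&mut self) -> bool {
        self.tick += 1;
        self.tick % UPDATE_TICKS == 0
    }

    // 30 ticks * 500 ms = 15 s between health status updates.
    fn update_interval(&self) -> Duration {
        self.inspect_interval * UPDATE_TICKS
    }
}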

@BusyJay
Member

BusyJay commented Apr 22, 2022

So in this case, using the health status will cause a regression, right?

@sticnarf
Contributor Author

sticnarf commented Apr 25, 2022

So in this case, using the health status will cause a regression, right?

Personally, I think it is an acceptable regression. The update interval is 15 seconds, so the health status does not change until the raftstore stays stuck for more than 15 seconds. And in such an extreme situation, where it has already been stuck for 15 seconds, waiting another 15 seconds to recover does not seem like a big problem.

@BusyJay
Member

BusyJay commented Apr 27, 2022

How about, if UNKNOWN is returned, scheduling the requests in a random fashion?

@sticnarf
Contributor Author

How about, if UNKNOWN is returned, scheduling the requests in a random fashion?

In the current client-go implementation, if the liveness of a store is not reachable, the store is not filtered out. If request forwarding is enabled, the client tries forwarding; otherwise, it only refreshes the cache related to that store.

So, if the liveness is still not reachable but the latest region info from PD indicates the leader is on that store, the client will still send requests to that store. The only affected case is when request forwarding is enabled: forwarding will keep being used until the liveness status becomes healthy.

@sticnarf
Contributor Author

@BusyJay Do you have any other concerns about the change?

    ServingStatus::Serving
};
if let Some(health_service) = &self.health_service {
    health_service.set_serving_status("", health_status);
}
Member

How about updating the health status according to the slow score or the ratio of timeout records? For instance, if the ratio of timeout records exceeds ratio_thresh, set the status to ServiceUnknown, in the same way we update the slow score.

Contributor Author

I'm inclined to be conservative. This PR only intends to solve the problem of the raftstore being totally stuck, not just slow.

Because being slow does not mean the leader will certainly be transferred, setting the status to unknown too easily would cause unexpected request forwarding.

And if the leader is transferred due to mere slowness, TiKV can still return NotLeader to the client; the health status does not need to be involved.

components/raftstore/src/store/worker/pd.rs (outdated review comment, resolved)
@BusyJay
Member

BusyJay commented Apr 29, 2022

, if the liveness of a store is not reachable, the store is not filtered out. If request forwarding is enabled, the client tries forwarding

This seems wrong to me. The new behavior may cause one slow store to slow down other nodes, as forwarding is not cost-free.

@sticnarf
Contributor Author

, if the liveness of a store is not reachable, the store is not filtered out. If request forwarding is enabled, the client tries forwarding

This seems wrong to me. The new behavior may cause one slow store to slow down other nodes, as forwarding is not cost-free.

Now, I have changed the unknown status to be "strict enter, easy exit".

The status is changed to unknown only if no StoreMsg has been handled within 15 seconds (in this case, the leader should have been dropped already). And as long as one tick does not time out, the status changes back to serving.
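
As a hedged sketch of this "strict enter, easy exit" rule (the StuckDetector type and field names are invented here, not the exact implementation):

use std::time::Duration;

use grpcio_health::ServingStatus;

// Strict enter: report ServiceUnknown only after consecutive timed-out ticks
// cover at least 15 seconds of unhandled StoreMsgs. Easy exit: a single
// on-time tick resets the counter and the status goes back to Serving.
struct StuckDetector {
    inspect_interval: Duration, // 500 ms by default
    consecutive_timeouts: u32,
}

impl StuckDetector {
    fn on_tick(&mut self, tick_timed_out: bool) -> ServingStatus {
        if tick_timed_out {
            self.consecutive_timeouts += 1;
        } else {
            self.consecutive_timeouts = 0;
        }
        if self.inspect_interval * self.consecutive_timeouts >= Duration::from_secs(15) {
            ServingStatus::ServiceUnknown
        } else {
            ServingStatus::Serving
        }
    }
}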

@ti-chi-bot ti-chi-bot added the status/LGT1 Indicates that a PR has LGTM 1. label Apr 29, 2022
@ti-chi-bot ti-chi-bot added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels Apr 29, 2022
@sticnarf
Contributor Author

/merge

@ti-chi-bot
Member

@sticnarf: It seems you want to merge this PR, I will help you trigger all the tests:

/run-all-tests

You only need to trigger /merge once, and if the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes.

If you have any questions about the PR merge process, please refer to pr process.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@ti-chi-bot
Member

This pull request has been accepted and is ready to merge.

Commit hash: 3bd0924

@ti-chi-bot ti-chi-bot added the status/can-merge Indicates a PR has been approved by a committer. label Apr 29, 2022
@sticnarf sticnarf merged commit 2552144 into tikv:master Apr 29, 2022
sticnarf added a commit to sticnarf/tikv that referenced this pull request Apr 29, 2022
sticnarf added a commit that referenced this pull request Apr 30, 2022
@sticnarf sticnarf added needs-cherry-pick-release-5.3 Type: Need cherry pick to release-5.3 needs-cherry-pick-release-5.4 Should cherry pick this PR to release-5.4 branch. needs-cherry-pick-release-6.1 Should cherry pick this PR to release-6.1 branch. labels Jun 14, 2022
@sticnarf
Contributor Author

/run-cherry-picker

ti-srebot pushed a commit to ti-srebot/tikv that referenced this pull request Jun 14, 2022
@ti-srebot
Contributor

cherry pick to release-6.1 in PR #12816

ti-srebot pushed a commit to ti-srebot/tikv that referenced this pull request Jun 14, 2022
@ti-srebot
Contributor

cherry pick to release-5.3 in PR #12817

ti-srebot pushed a commit to ti-srebot/tikv that referenced this pull request Jun 14, 2022
@ti-srebot
Contributor

cherry pick to release-5.4 in PR #12818

sticnarf added a commit to ti-srebot/tikv that referenced this pull request Jun 14, 2022
ti-chi-bot added a commit that referenced this pull request Jun 14, 2022
…12817)

close #12398, ref #12411

Signed-off-by: Yilin Chen <[email protected]>

Co-authored-by: Yilin Chen <[email protected]>
Co-authored-by: Ti Chi Robot <[email protected]>
sticnarf added a commit to ti-srebot/tikv that referenced this pull request Jun 16, 2022
Signed-off-by: ti-srebot <[email protected]>
Signed-off-by: Yilin Chen <[email protected]>
@sticnarf sticnarf removed the needs-cherry-pick-release-6.1 Should cherry pick this PR to release-6.1 branch. label Jun 16, 2022
sticnarf added a commit to ti-srebot/tikv that referenced this pull request Jun 16, 2022
sticnarf added a commit to ti-srebot/tikv that referenced this pull request Jun 16, 2022
Signed-off-by: ti-srebot <[email protected]>
Signed-off-by: Yilin Chen <[email protected]>
sticnarf added a commit to ti-srebot/tikv that referenced this pull request Jun 16, 2022
Signed-off-by: ti-srebot <[email protected]>
Signed-off-by: Yilin Chen <[email protected]>
ti-chi-bot added a commit that referenced this pull request Jun 21, 2022
…12818)

close #12398, ref #12411

Signed-off-by: ti-srebot <[email protected]>
Signed-off-by: Yilin Chen <[email protected]>

Co-authored-by: Yilin Chen <[email protected]>
Co-authored-by: Ti Chi Robot <[email protected]>