
schedule async reload for region that has unavailable tiflash peers to avoid load un-balance issue #1029

Merged: 6 commits into tikv:master on Oct 25, 2023

Conversation

@windtalker (Contributor) commented Oct 19, 2023

Describe

Ref pingcap/tidb#35418 for details.
The basic idea is that for a region with unavailable TiFlash peers, we should reload the region so TiDB becomes aware when the related TiFlash node comes back.

This PR

  • adds two more flags in Region to indicate whether this region has unavailable TiFlash peers and to record the last load time of this region
  • checks such regions and schedules a reload in GetTiFlashRPCContext
  • uses an async reload to reload the region (a sketch of this flow follows the list)
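
A minimal sketch of this flow, under assumptions: only hasUnavailableTiFlashStore appears in this PR's diff; the other names (lastLoad, reloadCh, maybeScheduleReload, the 30s threshold) are invented for illustration and are not the real client-go API.

package regioncache // sketch package, not the real client-go layout

import "time"

type Region struct {
	hasUnavailableTiFlashStore bool      // set at load time if a TiFlash peer's store is down
	lastLoad                   time.Time // when this region was last loaded from PD (assumed field)
}

type RegionCache struct {
	reloadCh chan *Region // assumed queue drained by a background reloader goroutine
}

// Would be called from GetTiFlashRPCContext: if the region was loaded
// while a TiFlash peer was unavailable and has not been refreshed
// recently, queue an async reload so TiDB notices when the node is back.
func (c *RegionCache) maybeScheduleReload(r *Region) {
	const asyncReloadInterval = 30 * time.Second // assumed threshold
	if r.hasUnavailableTiFlashStore && time.Since(r.lastLoad) > asyncReloadInterval {
		select {
		case c.reloadCh <- r: // hand off; the reloader fetches fresh peers from PD
		default: // queue full: skip for now, a later request will retry
		}
	}
}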

Note: in order to resolve the cyclic dependency between the client-go integration tests and TiDB, I have to use a local TiDB fork in this PR; I will update it once the TiDB code is merged.

Test

Test step

  1. Setup a TiDB cluster with 2 TiFlash nodes, load TPCH 100 data with 2 TiFlash replicas
  2. Stop a TiFlash node and wait about 20 minutes
  3. Start a mysql client to run TPCH query 1 continuously
  4. Start the TiFlash node

Test result

Without this pr
[Screenshot from 2023-10-19 17-22-10]

When the TiFlash node comes back around 15:22, the load stays extremely unbalanced between the 2 TiFlash nodes.
With this pr
[Screenshot from 2023-10-19 17-22-33]

When the TiFlash node comes back around 16:45, the load automatically rebalances at ~16:55.

@cfzjywxk requested review from zyguan, you06, and ekexium on October 20, 2023 01:47
@cfzjywxk (Contributor):

/cc @crazycs520

@cfzjywxk self-requested a review on October 20, 2023 01:47
windtalker and others added 2 commits October 20, 2023 13:28
Signed-off-by: xufei <[email protected]>
Signed-off-by: xufei <[email protected]>
Signed-off-by: xufei <[email protected]>
@@ -119,3 +119,5 @@ replace (
github.com/go-ldap/ldap/v3 => github.com/YangKeao/ldap/v3 v3.4.5-0.20230421065457-369a3bab1117
github.com/tikv/client-go/v2 => ../
)

replace github.com/pingcap/tidb => github.com/windtalker/tidb v1.1.0-beta.0.20231020063218-4d1c15539f3f
Contributor:

Why is this needed?

Contributor (Author):

Because there is a cyclic dependency between TiDB and the client-go integration tests: I changed the interface of Cluster, which makes the integration tests fail, so I have to use a local TiDB fork to make the tests pass.

}

if store.storeType == tikvrpc.TiFlash {
	// set at load time; GetTiFlashRPCContext uses this flag to schedule an async reload
	r.hasUnavailableTiFlashStore = true
Contributor:

It seems the unavailable flag is set when loading the region from PD.
If all the TiFlash stores are up when the region is loaded and one of them goes down afterwards, hasUnavailableTiFlashStore will stay false, the cached region might keep being used, and it may never be loaded again. Then the reload will also be skipped.
Correct me if I understand something wrong.

Contributor (Author):

Yes, your understanding is right. If all the TiFlash nodes are up and there are continuous queries on TiFlash, the region will not become outdated, and if one of the TiFlash nodes goes down, the region cache is not aware of it. But even though the region cache is not aware of the down node, TiDB MPP can handle this correctly: for each MPP query, TiDB sends an isAlive RPC to all the candidate TiFlash nodes, and if it fails to get a response, or the response is false, TiDB will not send tasks to that TiFlash node.
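
A minimal sketch of that per-query liveness filter on the TiDB side, assuming a hypothetical isAlive probe function; this illustrates the behavior described above, not the actual TiDB code.

package mppsketch // sketch only, not the real TiDB package

import "context"

// pickAliveTiFlashStores keeps only the candidate nodes whose isAlive
// probe succeeds; nodes that fail to respond, or respond false, get no
// MPP tasks for this query.
func pickAliveTiFlashStores(
	ctx context.Context,
	candidates []string,
	isAlive func(ctx context.Context, addr string) (bool, error), // hypothetical RPC wrapper
) []string {
	alive := make([]string, 0, len(candidates))
	for _, addr := range candidates {
		if ok, err := isAlive(ctx, addr); err == nil && ok {
			alive = append(alive, addr)
		}
	}
	return alive
}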

Contributor:

> TiDB will not send tasks to that TiFlash node.

Even after the node is recovered, TiDB still does not send tasks to it?

Contributor (Author):

Once the region cache can "see" the TiFlash node, TiDB will send tasks to it when it is back. In the case we discussed above, the region cache can always "see" the TiFlash node, so TiDB will send tasks to it after it is recovered.

@windtalker (Contributor, Author):

/run-unit-tests

@disksing merged commit cad3142 into tikv:master on Oct 25, 2023
9 of 10 checks passed