-
Notifications
You must be signed in to change notification settings - Fork 225
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
schedule async reload for region that has unavailable tiflash peers to avoid load un-balance issue #1029
Conversation
/cc @crazycs520 |
Signed-off-by: xufei <[email protected]>
Signed-off-by: xufei <[email protected]>
fe14eb0
to
2d20e8f
Compare
Signed-off-by: xufei <[email protected]>
@@ -119,3 +119,5 @@ replace ( | |||
github.com/go-ldap/ldap/v3 => github.com/YangKeao/ldap/v3 v3.4.5-0.20230421065457-369a3bab1117 | |||
github.com/tikv/client-go/v2 => ../ | |||
) | |||
|
|||
replace github.com/pingcap/tidb => github.com/windtalker/tidb v1.1.0-beta.0.20231020063218-4d1c15539f3f |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why need this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because there is a cycle dependency between tidb and client-go integration test. I change the interface of Cluster
, intergration test will fail, I have use a local tidb to make the test pass.
Signed-off-by: xufei <[email protected]>
} | ||
|
||
if store.storeType == tikvrpc.TiFlash { | ||
r.hasUnavailableTiFlashStore = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems the unavailable flag is set when loading region from PD.
If all the TiFlash stores are up when loading the region, then one of them goes down, the hasUnavailableTiFlashStore
will be false, and the cached region might be used continuingly and may never loading again. Then the reload will also be skipped.
Correct me if I understand something wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, your understanding is right. If all the TiFlash nodes are up, and there is continuous query on TiFlash, the region will not be out-dated, and if one of the TiFlash goes down, the region cache is not aware of it. But although region cache is not aware of the down node, TiDB mpp can handle this correctly because for each mpp query, TiDB will send isAlive
rpc to all the candidate TiFlash nodes, and if fail to get response or the response is false
, TiDB will not send task to that TiFlash node.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TiDB will not send task to that TiFlash node.
Even after the node is recovered, TiDB still do not send task to it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once region cache can "see" the TiFlash node, TiDB will send task to it if it is back. In the case we discussed above, region cache can always "see" the TiFlash node, so TiDB will send task to it after it is recovered.
/run-unit-tests |
Describe
Ref pingcap/tidb#35418 for details.
The basic idea is for region that has unavailable TiFlash peers, we should reload the region so TiDB will be aware if related TiFlash is back.
This pr
Region
to indicate whether this region has unavailable TiFlash peers and log the last load time for this regionGetTiFlashRPCContext
Note inorder to resolve the cycle dependency of client-go integration tests and tidb, I have to use a local tidb in this pr, will update it once TiDB code is merged.
Test
Test step
Test result
Without this pr
When TiFlash node is back around 15:22, the load keep extremely unbalanced between the 2 TiFlash nodes
With this pr
When TiFlash node is back around 16:45, the load automatically balanced at ~16:55