fix unexpected slow query during GC running after stop 1 tikv-server #899

Merged
merged 5 commits into from
Jul 24, 2023

@crazycs520 crazycs520 commented Jul 20, 2023

close #898

Why does issue #898 happen?

After one tikv-server is stopped, some region replicas are marked with replica.isEpochStale set to true, and accessFollower no longer chooses those replicas.

But when the TiDB GC leader starts running GC, it reloads all regions. The reload updates the epoch of every region replica, so each replica's isEpochStale changes back to false, and accessFollower may again choose a replica on the down tikv-server. A kv request sent to the down tikv-server fails with a context deadline exceeded error, after which TiDB re-sends the request to the region leader. Waiting for that timeout before the retry is what causes the slow queries.
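The sequence above can be illustrated with a small model. This is only a sketch and not the client-go code: the `replica` and `region` types and the `reloadRegion` helper are hypothetical names, and the only behavior assumed from the description is that a reload produces replicas whose isEpochStale flag is false again.

```go
package main

import "fmt"

// replica is a simplified, hypothetical model of a cached region replica.
type replica struct {
	storeAddr    string
	isEpochStale bool
}

// region is a simplified, hypothetical model of a cached region.
type region struct {
	replicas []*replica
}

// reloadRegion mimics a region cache reload: the replicas are rebuilt
// from the peer list, so any previously set isEpochStale flag is lost.
func reloadRegion(old *region) *region {
	fresh := &region{}
	for _, r := range old.replicas {
		fresh.replicas = append(fresh.replicas, &replica{storeAddr: r.storeAddr})
	}
	return fresh
}

func main() {
	r := &region{replicas: []*replica{{storeAddr: "tikv-0"}, {storeAddr: "tikv-1"}}}

	// tikv-1 was stopped; a failed request marked its replica as stale.
	r.replicas[1].isEpochStale = true

	// GC triggers a reload of the region, which clears the flag even
	// though tikv-1 is still down.
	r = reloadRegion(r)
	fmt.Println(r.replicas[1].isEpochStale)
}
```

In this model nothing in the reloaded region records that tikv-1 is still down, so a selector that looks only at isEpochStale treats the replica as usable again.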

How does this fix work?

In short, accessFollower needs to check the LivenessState of the replica's store when choosing the target replica.

Before This PR:

image

This PR:

image

@crazycs520 crazycs520 marked this pull request as ready for review July 20, 2023 11:13
Signed-off-by: crazycs520 <[email protected]>
@you06
Contributor

you06 commented Jul 21, 2023

This problem reminds me of the issue from the TiDB forum: https://ask.pingcap.com/t/every-10-minutes-my-in-flight-stale-reads-fail/518

Signed-off-by: crazycs520 <[email protected]>
@cfzjywxk cfzjywxk requested review from zyguan and you06 July 24, 2023 03:22
@cfzjywxk
Contributor

@you06 @zyguan
PTAL

Signed-off-by: crazycs520 <[email protected]>
@crazycs520
Contributor Author

/hold since the test failed.

Signed-off-by: crazycs520 <[email protected]>
@MyonKeminta MyonKeminta merged commit 59adec2 into tikv:tidb-6.5 Jul 24, 2023
cfzjywxk pushed a commit that referenced this pull request Jul 26, 2023
…899) (#909)

* fix unexpected slow query during GC running after stop 1 tikv-server

Signed-off-by: crazycs520 <[email protected]>

* fix test

Signed-off-by: crazycs520 <[email protected]>

---------

Signed-off-by: crazycs520 <[email protected]>
crazycs520 added a commit to crazycs520/client-go that referenced this pull request Aug 7, 2023
iosmanthus added a commit that referenced this pull request Aug 11, 2023
* client-go: add some key range info to error when PD returned no region (#862)

Signed-off-by: Chao Wang <[email protected]>

* *: refine non-global stale-read request retry logic (#863)

Signed-off-by: crazycs520 <[email protected]>

* Fix the issue that primary pessimistic lock may be left not cleared after GC (#866)

* Fix the issue that primary pessimistic lock may be left not cleared after GC

Signed-off-by: MyonKeminta <[email protected]>

* Fix mysteriously shown up thing that makes compilation failed

Signed-off-by: MyonKeminta <[email protected]>

* Fix test effectiveness (forgot to set txn2 to pessimistic txn); add more strict checks

Signed-off-by: MyonKeminta <[email protected]>

* Address comments

Signed-off-by: MyonKeminta <[email protected]>

---------

Signed-off-by: MyonKeminta <[email protected]>
Co-authored-by: MyonKeminta <[email protected]>

* add explicit request source type to label the external request like lightning/br (#868)

Signed-off-by: nolouch <[email protected]>

* use '%d' instead of '%q' for some int values in error message (#875)

Signed-off-by: Chao Wang <[email protected]>

* format key in error message in method `scanRegions` (#876)

Signed-off-by: Chao Wang <[email protected]>

* make cop request timeout a config paramter (#865)

* update

Signed-off-by: Spade A <[email protected]>

* update

Signed-off-by: Spade A <[email protected]>

* update

Signed-off-by: Spade A <[email protected]>

* update

Signed-off-by: Spade A <[email protected]>

---------

Signed-off-by: Spade A <[email protected]>

* region_cache: support check pending tiflash peer (#821)

Signed-off-by: guo-shaoge <[email protected]>
Co-authored-by: disksing <[email protected]>

* *: add `SnapshotIterReverse` and make `iterReverse` supports `lowerBound` (#883)

Signed-off-by: Jason Mo <[email protected]>

* *: fix stale read ops metric (#878) (#889)

Signed-off-by: crazycs520 <[email protected]>
Co-authored-by: disksing <[email protected]>

* add gc options (#828)

Signed-off-by: weedge <[email protected]>
Co-authored-by: disksing <[email protected]>

* reload region cache when store is resolved from invalid status (#843)

Signed-off-by: you06 <[email protected]>
Co-authored-by: disksing <[email protected]>

* ci: update setup-go action (#904)

Signed-off-by: disksing <[email protected]>

* fix unexpected slow query during GC running after stop 1 tikv-server (#899) (#909)

* fix unexpected slow query during GC running after stop 1 tikv-server

Signed-off-by: crazycs520 <[email protected]>

* fix test

Signed-off-by: crazycs520 <[email protected]>

---------

Signed-off-by: crazycs520 <[email protected]>

* resource_manager: ignore ru metrics for background request (#872)

Signed-off-by: husharp <[email protected]>
Co-authored-by: disksing <[email protected]>

* add more log for diagnose (#915)

* add more log for diagnose

Signed-off-by: crazycs520 <[email protected]>

* fix

Signed-off-by: crazycs520 <[email protected]>

* add more log for diagnose

Signed-off-by: crazycs520 <[email protected]>

* add more log

Signed-off-by: crazycs520 <[email protected]>

* address comment

Signed-off-by: crazycs520 <[email protected]>

---------

Signed-off-by: crazycs520 <[email protected]>

* use context logger as much as possible (#908)

* use context logger as much as possible

Signed-off-by: crazycs520 <[email protected]>

* refine

Signed-off-by: crazycs520 <[email protected]>

---------

Signed-off-by: crazycs520 <[email protected]>

* Resume max retry time check for stale read retry with leader option(#903) (#911)

* Resume max retry time check for stale read retry with leader option

Signed-off-by: cfzjywxk <[email protected]>

* add cancel

Signed-off-by: cfzjywxk <[email protected]>

---------

Signed-off-by: cfzjywxk <[email protected]>

* request_source: remove default label (#890)

* request_source: remove default label

Signed-off-by: nolouch <[email protected]>

* add a function to set request source task type (#925)

* add a function to set request source task type

Signed-off-by: glorv <[email protected]>

* ci: update go version (#936)

* ci: update go version

Signed-off-by: crazycs520 <[email protected]>

* fix test

Signed-off-by: crazycs520 <[email protected]>

---------

Signed-off-by: crazycs520 <[email protected]>

* use tidb_kv_read_timeout as first kv request timeout (#919)

* support tidb_kv_read_timeout as first round kv request timeout

Signed-off-by: crazycs520 <[email protected]>

* fix ci

Signed-off-by: crazycs520 <[email protected]>

* fix ci

Signed-off-by: crazycs520 <[email protected]>

* fix ci

Signed-off-by: crazycs520 <[email protected]>

* fix ci

Signed-off-by: crazycs520 <[email protected]>

* fix ci

Signed-off-by: crazycs520 <[email protected]>

* update comment

Signed-off-by: crazycs520 <[email protected]>

* refine test

Signed-off-by: crazycs520 <[email protected]>

---------

Signed-off-by: crazycs520 <[email protected]>

* [pick] resource_control: bypass some internal urgent request (#938)

* resource_control: bypass some internal urgent request (#884)

Signed-off-by: nolouch <[email protected]>

* resourcecontrol: fix nil pointer (#900)

Signed-off-by: nolouch <[email protected]>

---------

Signed-off-by: nolouch <[email protected]>

---------

Signed-off-by: Chao Wang <[email protected]>
Signed-off-by: crazycs520 <[email protected]>
Signed-off-by: MyonKeminta <[email protected]>
Signed-off-by: nolouch <[email protected]>
Signed-off-by: Spade A <[email protected]>
Signed-off-by: guo-shaoge <[email protected]>
Signed-off-by: Jason Mo <[email protected]>
Signed-off-by: weedge <[email protected]>
Signed-off-by: you06 <[email protected]>
Signed-off-by: disksing <[email protected]>
Signed-off-by: husharp <[email protected]>
Signed-off-by: cfzjywxk <[email protected]>
Signed-off-by: glorv <[email protected]>
Signed-off-by: iosmanthus <[email protected]>
Co-authored-by: 王超 <[email protected]>
Co-authored-by: crazycs <[email protected]>
Co-authored-by: MyonKeminta <[email protected]>
Co-authored-by: MyonKeminta <[email protected]>
Co-authored-by: ShuNing <[email protected]>
Co-authored-by: Spade  A <[email protected]>
Co-authored-by: guo-shaoge <[email protected]>
Co-authored-by: disksing <[email protected]>
Co-authored-by: Hangjie Mo <[email protected]>
Co-authored-by: weedge <[email protected]>
Co-authored-by: you06 <[email protected]>
Co-authored-by: Hu# <[email protected]>
Co-authored-by: cfzjywxk <[email protected]>
Co-authored-by: glorv <[email protected]>
cfzjywxk pushed a commit that referenced this pull request Aug 15, 2023