TiKV slow initialization scan and stuck cause ticdc replication stuck #3110

Closed
amyangfei opened this issue Oct 20, 2021 · 2 comments
Labels: area/ticdc, component/tikv, severity/critical, type/bug, type/enhancement

Comments


amyangfei commented Oct 20, 2021

What did you do?

  1. Set up a TiDB cluster (6 TiKV nodes) with one large table that has more than 100k leader regions.
  2. Create a TiCDC changefeed to replicate the table from step 1.
  3. Start a sysbench script (oltp_insert) in the upstream to simulate a workload of 200 QPS.
  4. Use systemctl restart to restart one of the TiKV nodes (172.16.6.139 in this case, at 2021/10/20 16:13:38.581 +08:00).
  5. Observe the TiCDC replication lag.
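
For step 5, one way to read the lag outside of the dashboards is to decode the changefeed's checkpoint-ts into wall-clock time. The sketch below is not part of the original report. It relies only on the TSO layout used by TiDB/PD (physical milliseconds in the high bits, an 18-bit logical counter in the low bits). The function names are made up for illustration, and how the checkpoint-ts value is obtained is left to the reader.

```python
from datetime import datetime, timedelta, timezone

# TiDB/PD TSO layout: (physical milliseconds << 18) | logical counter.
TSO_LOGICAL_BITS = 18
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def tso_to_datetime(tso: int) -> datetime:
    """Return the UTC wall-clock time encoded in the physical part of a TSO."""
    physical_ms = tso >> TSO_LOGICAL_BITS
    return EPOCH + timedelta(milliseconds=physical_ms)


def replication_lag_seconds(checkpoint_ts: int, now: datetime) -> float:
    """Seconds between `now` and the time encoded in `checkpoint_ts`.

    Both helper names are hypothetical; they are not TiCDC APIs.
    """
    return (now - tso_to_datetime(checkpoint_ts)).total_seconds()
```

With this, the expectation below ("less than 10 minutes") becomes a check that `replication_lag_seconds(...)` falls back under 600 some time after the restart.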

What did you expect to see?

TiCDC replication returns to normal in less than 10 minutes.

What did you see instead?

One of the TiKV nodes suffered a slow initialization scan.

image

What's more, two TiKV nodes seem to be stuck in region initialization.

image

Replication had not recovered after 1 hour.

cdc log:

cdc.log.tar.gz

TiKV logs:
issue-3110-tikv.log.tar.gz

TiCDC metrics:

Test-Cluster-TiCDC-master-20211020_2021-10-20T09_12_04.064Z.zip

Versions of the cluster

Upstream TiDB cluster version (execute SELECT tidb_version(); in a MySQL client):

TiKV v5.2.1

TiCDC version (execute cdc version):

Release Version: v5.2.0-master
Git Commit Hash: 1504c489e88b18a4391167885a858fc318363456
Git Branch: master
UTC Build Time: 2021-10-20 02:28:46
Go Version: go version go1.16.4 linux/amd64
Failpoint Build: false
amyangfei added the type/bug and component/tikv labels Oct 20, 2021
amyangfei changed the title from "TiKV slow initialization scan" to "TiKV slow initialization scan and stuck cause replication interruption" Oct 20, 2021
amyangfei changed the title from "TiKV slow initialization scan and stuck cause replication interruption" to "TiKV slow initialization scan and stuck cause replication block" Oct 20, 2021
amyangfei changed the title from "TiKV slow initialization scan and stuck cause replication block" to "TiKV slow initialization scan and stuck cause ticdc replication stuck" Oct 20, 2021

amyangfei commented Oct 21, 2021

I have observed another stuck case: with 200 QPS and no other operations in the upstream, the checkpoint-ts of TiCDC stops advancing, and some regions are reconnected and suffer slow initialization, for example:
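
A checkpoint-ts that stops advancing can also be flagged from periodic samples. This is a minimal sketch under the assumption that something outside the snippet records (observation time, checkpoint-ts) pairs at a regular interval. `checkpoint_stalled` and its parameters are hypothetical names, not TiCDC APIs.

```python
from typing import Sequence, Tuple


def checkpoint_stalled(samples: Sequence[Tuple[float, int]], window_s: float) -> bool:
    """Report whether the newest checkpoint-ts has been unchanged for window_s seconds.

    samples: (observation time in seconds, checkpoint_ts) pairs, oldest first.
    """
    if not samples:
        return False
    latest_time, latest_ts = samples[-1]
    stalled_since = latest_time
    # Walk backwards while the checkpoint-ts is identical to the newest one.
    for observed_at, ts in reversed(samples):
        if ts != latest_ts:
            break
        stalled_since = observed_at
    return latest_time - stalled_since >= window_s
```

For example, samples taken once a minute that show the same checkpoint-ts for the last three observations would be reported as stalled with `window_s=120`.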

image
20211021-103740

@amyangfei

closed by #3118

AkiraXie added the area/ticdc label Mar 9, 2022