Two of the TiCDC nodes restart repeatedly after injecting a network partition between the PD leader and PD followers #9565

Closed
Lily2025 opened this issue Aug 14, 2023 · 4 comments · Fixed by #9661
Assignees
Labels: affects-6.5 (This bug affects the 6.5.x (LTS) versions), affects-7.1 (This bug affects the 7.1.x (LTS) versions), area/ticdc (Issues or PRs related to TiCDC), severity/major, type/bug (The issue is confirmed as a bug)

Comments


Lily2025 commented Aug 14, 2023

What did you do?

1. Run TPC-C with 1000 warehouses and 10 threads.
2. Inject a network partition between the PD leader and the PD followers, lasting for 10 minutes (one possible injection method is sketched below).
3. After 10 minutes, recover the fault.
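
The report does not state which fault-injection tool was used; the tc-pd:2379 address suggests a TiDB Operator deployment on Kubernetes, so one plausible way to inject step 2 is a Chaos Mesh NetworkChaos partition. All names and selectors below are illustrative assumptions, not taken from the report:

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: pd-leader-partition
spec:
  action: partition            # drop traffic entirely rather than delaying it
  mode: one                    # one pod from the selector; hitting the actual
                               # leader may need a narrower selector
  selector:
    labelSelectors:
      app.kubernetes.io/component: pd
  direction: both              # cut packets in both directions
  target:
    mode: all                  # partition it from all remaining PD pods
    selector:
      labelSelectors:
        app.kubernetes.io/component: pd
  duration: "10m"              # matches the 10-minute fault window in step 2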

What did you expect to see?

All TiCDC nodes are normal.

What did you see instead?

One of the TiCDC nodes restarts repeatedly, even after the fault is recovered.


[2023/08/11 19:46:05.183 +08:00] [INFO] [capture.go:308] ["the capture routine has exited"]
[2023/08/11 19:46:05.183 +08:00] [INFO] [tso_dispatcher.go:214] ["exit tso dispatcher loop"]
[2023/08/11 19:46:05.183 +08:00] [INFO] [tso_dispatcher.go:375] ["[tso] stop fetching the pending tso requests due to context canceled"] [dc-location=global]
[2023/08/11 19:46:05.183 +08:00] [INFO] [tso_dispatcher.go:162] ["exit tso requests cancel loop"]
[2023/08/11 19:46:05.183 +08:00] [INFO] [tso_dispatcher.go:162] ["exit tso requests cancel loop"]
[2023/08/11 19:46:05.183 +08:00] [INFO] [tso_dispatcher.go:311] ["[tso] exit tso dispatcher"] [dc-location=global]
[2023/08/11 19:46:05.183 +08:00] [INFO] [tso_dispatcher.go:214] ["exit tso dispatcher loop"]
[2023/08/11 19:46:05.183 +08:00] [INFO] [tso_dispatcher.go:375] ["[tso] stop fetching the pending tso requests due to context canceled"] [dc-location=global]
[2023/08/11 19:46:05.183 +08:00] [INFO] [tso_dispatcher.go:311] ["[tso] exit tso dispatcher"] [dc-location=global]
[2023/08/11 19:46:05.183 +08:00] [ERROR] [server.go:286] ["http server error"] [error="[CDC:ErrServeHTTP]serve http error: mux: server closed"] [errorVerbose="[CDC:ErrServeHTTP]serve http error: mux: server closed\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/[email protected]/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\tgithub.com/pingcap/[email protected]/normalize.go:164\ngithub.com/pingcap/tiflow/pkg/errors.WrapError\n\tgithub.com/pingcap/tiflow/pkg/errors/helper.go:34\ngithub.com/pingcap/tiflow/cdc/server.(*server).startStatusHTTP.func1\n\tgithub.com/pingcap/tiflow/cdc/server/server.go:286\nruntime.goexit\n\truntime/asm_amd64.s:1650"]
[2023/08/11 19:46:05.183 +08:00] [WARN] [server.go:140] ["cdc server exits with error"] [error="[CDC:ErrReachMaxTry]reach maximum try: 10, error: [CDC:ErrCheckClusterVersionFromPD]failed to request PD 503 Service Unavailable no leader\n, please try again later: [CDC:ErrCheckClusterVersionFromPD]failed to request PD 503 Service Unavailable no leader\n, please try again later"] [errorVerbose="[CDC:ErrReachMaxTry]reach maximum try: 10, error: [CDC:ErrCheckClusterVersionFromPD]failed to request PD 503 Service Unavailable no leader\n, please try again later: [CDC:ErrCheckClusterVersionFromPD]failed to request PD 503 Service Unavailable no leader\n, please try again later\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/[email protected]/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\tgithub.com/pingcap/[email protected]/normalize.go:164\ngithub.com/pingcap/tiflow/pkg/retry.run\n\tgithub.com/pingcap/tiflow/pkg/retry/retry_with_opt.go:69\ngithub.com/pingcap/tiflow/pkg/retry.Do\n\tgithub.com/pingcap/tiflow/pkg/retry/retry_with_opt.go:34\ngithub.com/pingcap/tiflow/pkg/version.CheckClusterVersion\n\tgithub.com/pingcap/tiflow/pkg/version/check.go:93\ngithub.com/pingcap/tiflow/pkg/upstream.initUpstream\n\tgithub.com/pingcap/tiflow/pkg/upstream/upstream.go:157\ngithub.com/pingcap/tiflow/pkg/upstream.(*Manager).AddDefaultUpstream\n\tgithub.com/pingcap/tiflow/pkg/upstream/manager.go:92\ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).reset\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:230\ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).run\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:313\ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).Run\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:288\ngithub.com/pingcap/tiflow/cdc/server.(*server).run.func1\n\tgithub.com/pingcap/tiflow/cdc/server/server.go:345\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/[email protected]/errgroup/errgroup.go:75\nruntime.goexit\n\truntime/asm_amd64.s:1650"]
[2023/08/11 19:46:05.183 +08:00] [INFO] [capture.go:688] ["message router closed"] [captureID=6fba15d5-480c-43a7-9dd0-121b539fd12f]
[2023/08/11 19:46:05.183 +08:00] [INFO] [server.go:412] ["sort engine manager closed"] [duration=7.455µs]
[2023/08/11 19:47:35.650 +08:00] [INFO] [helper.go:54] ["init log"] [file=/var/lib/ticdc/log/ticdc.log] [level=info]

Versions of the cluster

[2023/08/11 19:47:35.650 +08:00] [INFO] [version.go:47] ["Welcome to Change Data Capture (CDC)"] [release-version=v7.4.0-alpha] [git-hash=0eef200142f4922736cf9e5ea1067a4d6c617329] [git-branch=heads/refs/tags/v7.4.0-alpha] [utc-build-time="2023-08-11 11:36:27"] [go-version="go version go1.21.0 linux/amd64"] [failpoint-build=false]

current status of DM cluster (execute query-status <task-name> in dmctl)

No response

@Lily2025 Lily2025 added area/dm Issues or PRs related to DM. type/bug The issue is confirmed as a bug. labels Aug 14, 2023
@Lily2025 (Author)

/remove-area dm
/area ticdc

@ti-chi-bot ti-chi-bot bot added area/ticdc Issues or PRs related to TiCDC. and removed area/dm Issues or PRs related to DM. labels Aug 14, 2023
@Lily2025 (Author)

/severity critical

@Lily2025 (Author)

/assign asddongmen

@Lily2025 changed the title from "one of ticdc restart repeatedly after injection network partition between pdleader and pd followers" to "two of ticdc restart repeatedly after injection network partition between pdleader and pd followers" Aug 14, 2023
asddongmen (Contributor) commented Aug 14, 2023

Investigation

TiCDC can’t create a PDClient.

[2023/08/11 19:41:23.768 +08:00] [INFO] [pd_service_discovery.go:435] ["[pd] cannot update member from this address"] [address=http://tc-pd:2379] [error="[PD:client:ErrClientGetLeader]get leader from leader address don't exist error"]
[2023/08/11 19:45:10.893 +08:00] [INFO] [pd_service_discovery.go:435] ["[pd] cannot update member from this address"] [address=http://tc-pd:2379] [error="[PD:client:ErrClientGetLeader]get leader from leader address don't exist error"]
[2023/08/11 19:45:14.898 +08:00] [ERROR] [capture.go:315] ["reset capture failed"] [error="[PD:client:ErrClientGetMember]get member failed"] [errorVerbose="[PD:client:ErrClientGetMember]get member failed\ngithub.com/tikv/pd/client.(*pdServiceDiscovery).initRetry\n\tgithub.com/tikv/pd/[email protected]/pd_service_discovery.go:199\ngithub.com/tikv/pd/client.(*pdServiceDiscovery).Init\n\tgithub.com/tikv/pd/[email protected]/pd_service_discovery.go:171\ngithub.com/tikv/pd/client.(*client).setup\n\tgithub.com/tikv/pd/[email protected]/client.go:350\ngithub.com/tikv/pd/client.NewClientWithKeyspace\n\tgithub.com/tikv/pd/[email protected]/client.go:340\ngithub.com/tikv/pd/client.NewClientWithContext\n\tgithub.com/tikv/pd/[email protected]/client.go:306\ngithub.com/pingcap/tiflow/pkg/upstream.initUpstream\n\tgithub.com/pingcap/tiflow/pkg/upstream/upstream.go:123\ngithub.com/pingcap/tiflow/pkg/upstream.(*Manager).AddDefaultUpstream\n\tgithub.com/pingcap/tiflow/pkg/upstream/manager.go:92\ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).reset\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:230\ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).run\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:313\ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).Run\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:288\ngithub.com/pingcap/tiflow/cdc/server.(*server).run.func1\n\tgithub.com/pingcap/tiflow/cdc/server/server.go:345\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/[email protected]/errgroup/errgroup.go:75\nruntime.goexit\n\truntime/asm_amd64.s:1650"]

The error log seems to be reported by: https://github.com/tikv/pd/blob/f1d1a80feb955f0521c011bb133076c012873e85/client/pd_service_discovery.go#L435
This is the place where TiCDC tries to create a PDClient (upstream.go:123 in the stack trace above):

up.PDClient, err = pd.NewClientWithContext(
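
For reference, a minimal sketch of such a call against the tikv/pd client API, with the full endpoint list passed explicitly. The endpoints and the surrounding code are illustrative, not the actual tiflow code:

package main

import (
	"context"
	"log"

	pd "github.com/tikv/pd/client"
)

func main() {
	// Passing every PD endpoint (instead of a single proxy address) lets the
	// client fall back to other members when one address cannot serve the
	// leader, which is exactly what fails in the logs above.
	cli, err := pd.NewClientWithContext(
		context.Background(),
		[]string{"http://pd-0:2379", "http://pd-1:2379", "http://pd-2:2379"},
		pd.SecurityOption{},
	)
	if err != nil {
		log.Fatalf("create PD client: %v", err)
	}
	defer cli.Close()
}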

Workaround

Pass all PD addresses when creating the cdc server, instead of a single PD proxy address; an example invocation is sketched below.
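
A sketch of such an invocation, assuming the standard cdc server flags and illustrative addresses:

cdc server --pd="http://pd-0:2379,http://pd-1:2379,http://pd-2:2379" --addr="0.0.0.0:8300" --data-dir=/var/lib/ticdc

With the full endpoint list, the embedded PD client can still discover the current leader when any single member (or a proxy in front of them) becomes unreachable.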
