Two of the TiCDC instances restart repeatedly after injecting a network partition between the PD leader and the PD followers #9565
Comments
/remove-area dm
/severity critical
/assign asddongmen
Investigation
TiCDC can't create a PDClient.
[2023/08/11 19:41:23.768 +08:00] [INFO] [pd_service_discovery.go:435] ["[pd] cannot update member from this address"] [address=http://tc-pd:2379] [error="[PD:client:ErrClientGetLeader]get leader from leader address don't exist error"]
[2023/08/11 19:45:10.893 +08:00] [INFO] [pd_service_discovery.go:435] ["[pd] cannot update member from this address"] [address=http://tc-pd:2379] [error="[PD:client:ErrClientGetLeader]get leader from leader address don't exist error"]
[2023/08/11 19:45:14.898 +08:00] [ERROR] [capture.go:315] ["reset capture failed"] [error="[PD:client:ErrClientGetMember]get member failed"] [errorVerbose="[PD:client:ErrClientGetMember]get member failed\ngithub.com/tikv/pd/client.(*pdServiceDiscovery).initRetry\n\tgithub.com/tikv/pd/[email protected]/pd_service_discovery.go:199\ngithub.com/tikv/pd/client.(*pdServiceDiscovery).Init\n\tgithub.com/tikv/pd/[email protected]/pd_service_discovery.go:171\ngithub.com/tikv/pd/client.(*client).setup\n\tgithub.com/tikv/pd/[email protected]/client.go:350\ngithub.com/tikv/pd/client.NewClientWithKeyspace\n\tgithub.com/tikv/pd/[email protected]/client.go:340\ngithub.com/tikv/pd/client.NewClientWithContext\n\tgithub.com/tikv/pd/[email protected]/client.go:306\ngithub.com/pingcap/tiflow/pkg/upstream.initUpstream\n\tgithub.com/pingcap/tiflow/pkg/upstream/upstream.go:123\ngithub.com/pingcap/tiflow/pkg/upstream.(*Manager).AddDefaultUpstream\n\tgithub.com/pingcap/tiflow/pkg/upstream/manager.go:92\ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).reset\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:230\ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).run\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:313\ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).Run\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:288\ngithub.com/pingcap/tiflow/cdc/server.(*server).run.func1\n\tgithub.com/pingcap/tiflow/cdc/server/server.go:345\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/[email protected]/errgroup/errgroup.go:75\nruntime.goexit\n\truntime/asm_amd64.s:1650"]
The error log seems to be reported by https://github.com/tikv/pd/blob/f1d1a80feb955f0521c011bb133076c012873e85/client/pd_service_discovery.go#L435 (see also tiflow/pkg/upstream/upstream.go, line 126 at commit 7f42fce).
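For context, the stack trace later in this report shows the PD access being wrapped in a bounded retry (pkg/retry.Do reporting "reach maximum try: 10"); once every attempt fails, the capture reset gives up and the whole cdc server exits. Below is a minimal, generic Go sketch of that retry pattern only, not the actual tiflow pkg/retry code; the maxTries and backoff values are assumptions for illustration:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// retryDo retries fn with a fixed backoff until it succeeds or maxTries is
// reached. This only mirrors the behaviour visible in the logs ("reach
// maximum try: 10"); it is an illustrative sketch, not the real implementation.
func retryDo(ctx context.Context, maxTries int, backoff time.Duration, fn func() error) error {
	var lastErr error
	for i := 0; i < maxTries; i++ {
		err := fn()
		if err == nil {
			return nil
		}
		lastErr = err
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
		}
	}
	return fmt.Errorf("reach maximum try: %d, last error: %w", maxTries, lastErr)
}

func main() {
	err := retryDo(context.Background(), 10, 500*time.Millisecond, func() error {
		// Stand-in for the PD request that keeps failing while the proxy
		// address answers "503 Service Unavailable no leader".
		return errors.New("failed to request PD 503 Service Unavailable no leader")
	})
	fmt.Println(err)
}
```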
Workaround
Pass all PD addresses when creating the cdc server, instead of a single PD proxy address.
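The stack trace above shows the PD client being created through pd.NewClientWithContext. A minimal sketch of what the workaround means at that level, assuming placeholder hostnames for the three PD members (illustrative only, not the actual tiflow upstream code):

```go
package main

import (
	"context"
	"log"
	"time"

	pd "github.com/tikv/pd/client"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// Workaround: hand the client every PD member instead of a single proxy
	// address. If one address cannot answer "who is the leader", the client
	// can still discover the leader through the other members.
	// The hostnames below are placeholders for the real PD endpoints.
	pdEndpoints := []string{
		"http://tc-pd-0.tc-pd-peer:2379",
		"http://tc-pd-1.tc-pd-peer:2379",
		"http://tc-pd-2.tc-pd-peer:2379",
	}

	cli, err := pd.NewClientWithContext(ctx, pdEndpoints, pd.SecurityOption{})
	if err != nil {
		log.Fatalf("create PD client: %v", err)
	}
	defer cli.Close()

	log.Printf("connected, cluster id: %d", cli.GetClusterID(ctx))
}
```

At the deployment level this corresponds to pointing the cdc server's PD endpoint list at every member (for example via a comma-separated --pd value) rather than at the single tc-pd proxy address.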
What did you do?
1. Run TPC-C with 1000 warehouses and 10 threads.
2. Inject a network partition between the PD leader and the PD followers, lasting for 10 minutes (one possible injection approach is sketched after this list).
3. After 10 minutes, recover the fault.
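The report does not say which tool was used to inject the partition. One possible approach is sketched below: a Go program that shells out to iptables on the PD leader host to drop inbound traffic from the follower IPs for 10 minutes and then removes the rules. The follower IPs, running as root on the leader host, and iptables being the injection mechanism are all assumptions.

```go
package main

import (
	"log"
	"os/exec"
	"time"
)

// followerIPs are placeholders for the real PD follower addresses.
var followerIPs = []string{"10.0.0.2", "10.0.0.3"}

// iptablesRule adds ("-A") or deletes ("-D") a DROP rule for traffic from ip.
// Run on the PD leader host with root privileges.
func iptablesRule(action, ip string) error {
	out, err := exec.Command("iptables", action, "INPUT", "-s", ip, "-j", "DROP").CombinedOutput()
	if err != nil {
		log.Printf("iptables %s %s failed: %v: %s", action, ip, err, out)
	}
	return err
}

func main() {
	// Step 2: partition the PD leader from its followers.
	for _, ip := range followerIPs {
		_ = iptablesRule("-A", ip)
	}
	log.Println("network partition injected, waiting 10 minutes")

	time.Sleep(10 * time.Minute)

	// Step 3: recover the fault.
	for _, ip := range followerIPs {
		_ = iptablesRule("-D", ip)
	}
	log.Println("network partition recovered")
}
```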
What did you expect to see?
All TiCDC instances are normal.
What did you see instead?
One of the TiCDC instances restarts repeatedly even after the fault is recovered.
[2023/08/11 19:46:05.183 +08:00] [INFO] [capture.go:308] ["the capture routine has exited"]
[2023/08/11 19:46:05.183 +08:00] [INFO] [tso_dispatcher.go:214] ["exit tso dispatcher loop"]
[2023/08/11 19:46:05.183 +08:00] [INFO] [tso_dispatcher.go:375] ["[tso] stop fetching the pending tso requests due to context canceled"] [dc-location=global]
[2023/08/11 19:46:05.183 +08:00] [INFO] [tso_dispatcher.go:162] ["exit tso requests cancel loop"]
[2023/08/11 19:46:05.183 +08:00] [INFO] [tso_dispatcher.go:162] ["exit tso requests cancel loop"]
[2023/08/11 19:46:05.183 +08:00] [INFO] [tso_dispatcher.go:311] ["[tso] exit tso dispatcher"] [dc-location=global]
[2023/08/11 19:46:05.183 +08:00] [INFO] [tso_dispatcher.go:214] ["exit tso dispatcher loop"]
[2023/08/11 19:46:05.183 +08:00] [INFO] [tso_dispatcher.go:375] ["[tso] stop fetching the pending tso requests due to context canceled"] [dc-location=global]
[2023/08/11 19:46:05.183 +08:00] [INFO] [tso_dispatcher.go:311] ["[tso] exit tso dispatcher"] [dc-location=global]
[2023/08/11 19:46:05.183 +08:00] [ERROR] [server.go:286] ["http server error"] [error="[CDC:ErrServeHTTP]serve http error: mux: server closed"] [errorVerbose="[CDC:ErrServeHTTP]serve http error: mux: server closed\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/[email protected]/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\tgithub.com/pingcap/[email protected]/normalize.go:164\ngithub.com/pingcap/tiflow/pkg/errors.WrapError\n\tgithub.com/pingcap/tiflow/pkg/errors/helper.go:34\ngithub.com/pingcap/tiflow/cdc/server.(*server).startStatusHTTP.func1\n\tgithub.com/pingcap/tiflow/cdc/server/server.go:286\nruntime.goexit\n\truntime/asm_amd64.s:1650"]
[2023/08/11 19:46:05.183 +08:00] [WARN] [server.go:140] ["cdc server exits with error"] [error="[CDC:ErrReachMaxTry]reach maximum try: 10, error: [CDC:ErrCheckClusterVersionFromPD]failed to request PD 503 Service Unavailable no leader\n, please try again later: [CDC:ErrCheckClusterVersionFromPD]failed to request PD 503 Service Unavailable no leader\n, please try again later"] [errorVerbose="[CDC:ErrReachMaxTry]reach maximum try: 10, error: [CDC:ErrCheckClusterVersionFromPD]failed to request PD 503 Service Unavailable no leader\n, please try again later: [CDC:ErrCheckClusterVersionFromPD]failed to request PD 503 Service Unavailable no leader\n, please try again later\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/[email protected]/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\tgithub.com/pingcap/[email protected]/normalize.go:164\ngithub.com/pingcap/tiflow/pkg/retry.run\n\tgithub.com/pingcap/tiflow/pkg/retry/retry_with_opt.go:69\ngithub.com/pingcap/tiflow/pkg/retry.Do\n\tgithub.com/pingcap/tiflow/pkg/retry/retry_with_opt.go:34\ngithub.com/pingcap/tiflow/pkg/version.CheckClusterVersion\n\tgithub.com/pingcap/tiflow/pkg/version/check.go:93\ngithub.com/pingcap/tiflow/pkg/upstream.initUpstream\n\tgithub.com/pingcap/tiflow/pkg/upstream/upstream.go:157\ngithub.com/pingcap/tiflow/pkg/upstream.(*Manager).AddDefaultUpstream\n\tgithub.com/pingcap/tiflow/pkg/upstream/manager.go:92\ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).reset\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:230\ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).run\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:313\ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).Run\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:288\ngithub.com/pingcap/tiflow/cdc/server.(*server).run.func1\n\tgithub.com/pingcap/tiflow/cdc/server/server.go:345\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/[email protected]/errgroup/errgroup.go:75\nruntime.goexit\n\truntime/asm_amd64.s:1650"]
[2023/08/11 19:46:05.183 +08:00] [INFO] [capture.go:688] ["message router closed"] [captureID=6fba15d5-480c-43a7-9dd0-121b539fd12f]
[2023/08/11 19:46:05.183 +08:00] [INFO] [server.go:412] ["sort engine manager closed"] [duration=7.455µs]
[2023/08/11 19:47:35.650 +08:00] [INFO] [helper.go:54] ["init log"] [file=/var/lib/ticdc/log/ticdc.log] [level=info]
Versions of the cluster
[2023/08/11 19:47:35.650 +08:00] [INFO] [version.go:47] ["Welcome to Change Data Capture (CDC)"] [release-version=v7.4.0-alpha] [git-hash=0eef200142f4922736cf9e5ea1067a4d6c617329] [git-branch=heads/refs/tags/v7.4.0-alpha] [utc-build-time="2023-08-11 11:36:27"] [go-version="go version go1.21.0 linux/amd64"] [failpoint-build=false]
current status of DM cluster (execute query-status <task-name> in dmctl): No response