Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

changefeed stucks "redo manager fails to close writer" #9172

Closed
fubinzh opened this issue Jun 12, 2023 · 3 comments · Fixed by #9276 or #9281
Closed

changefeed stucks "redo manager fails to close writer" #9172

fubinzh opened this issue Jun 12, 2023 · 3 comments · Fixed by #9276 or #9281
Assignees
Labels
affects-6.5 This bug affects the 6.5.x(LTS) versions. affects-7.1 This bug affects the 7.1.x(LTS) versions. affects-7.2 area/ticdc Issues or PRs related to TiCDC. found/automation Bugs found by automation cases severity/critical type/bug The issue is confirmed as a bug.

Comments

@fubinzh
Copy link

fubinzh commented Jun 12, 2023

What did you do?

  1. run test case: redo_apply_multiple_times, which do following steps:
  • create redo log changefeed
  • run workload "/go-tpc tpcc --db workload --warehouses 200 -T 50 --host upstream-tidb.cdc-redo-long-duration-tps-1806299-1-173 --port 4000 --parts 1 prepare"
  • wait changefeed advnance

What did you expect to see?

  1. changefeed should advance and no panic

What did you see instead?

  1. changefeed stucks, "redo manager fails to close writer" for 1.5 hours, and then "The changefeed won't be restarted as it has been experiencing failures for an extended duration"] [maxElapsedTime=1h30m0s]"
[2023/06/10 14:44:28.995 +00:00] [ERROR] [feed_state_manager.go:403] ["processor reports an error"] [namespace=default] [changefeed=redo-apply-multiple-ci28n65tojtur782h43g] [captureID=ec0c65e2-c365-4436-a863-c551613ef791] [error="{\"time\":\"2023-06-10T14:44:28.994331503Z\",\"addr\":\"upstream-ticdc-0.upstream-ticdc-peer.cdc-redo-long-duration-tps-1806299-1-173.svc:8301\",\"code\":\"CDC:ErrInvalidCheckpointTs\",\"message\":\"[CDC:ErrInvalidCheckpointTs]checkpointTs(442081774061486082) should not larger than resolvedTs(442081769683681281)\"}"]
[2023/06/10 14:44:28.995 +00:00] [INFO] [ddl_puller.go:667] ["close the ddl puller"] [namespace=default] [changefeed=redo-apply-multiple-ci28n65tojtur782h43g_ddl_puller]
[2023/06/10 14:44:28.995 +00:00] [INFO] [meta_manager.go:423] ["redo metaManager bgFlushMeta exits"] [namespace=default] [changefeed=redo-apply-multiple-ci28n65tojtur782h43g] [error="context canceled"] [errorVerbose="context canceled\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/[email protected]/errors.go:174\ngithub.com/pingcap/errors.Trace\n\tgithub.com/pingcap/[email protected]/juju_adaptor.go:15\ngithub.com/pingcap/tiflow/cdc/redo.(*metaManager).bgFlushMeta\n\tgithub.com/pingcap/tiflow/cdc/redo/meta_manager.go:433\ngithub.com/pingcap/tiflow/cdc/redo.(*metaManager).Run.func1\n\tgithub.com/pingcap/tiflow/cdc/redo/meta_manager.go:158\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/[email protected]/errgroup/errgroup.go:75\nruntime.goexit\n\truntime/asm_amd64.s:1598"]
[2023/06/10 14:44:28.995 +00:00] [WARN] [encoding_worker.go:128] ["redo fileWorkerGroup closed with error"] [error="context canceled"] [errorVerbose="context canceled\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/[email protected]/errors.go:174\ngithub.com/pingcap/errors.Trace\n\tgithub.com/pingcap/[email protected]/juju_adaptor.go:15\ngithub.com/pingcap/tiflow/cdc/redo/writer/memory.(*encodingWorkerGroup).runWorker\n\tgithub.com/pingcap/tiflow/cdc/redo/writer/memory/encoding_worker.go:153\ngithub.com/pingcap/tiflow/cdc/redo/writer/memory.(*encodingWorkerGroup).Run.func2\n\tgithub.com/pingcap/tiflow/cdc/redo/writer/memory/encoding_worker.go:135\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/[email protected]/errgroup/errgroup.go:75\nruntime.goexit\n\truntime/asm_amd64.s:1598"]
[2023/06/10 14:44:28.995 +00:00] [WARN] [file_worker.go:142] ["redo file workers closed with error"] [error="context canceled"] [errorVerbose="context canceled\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/[email protected]/errors.go:174\ngithub.com/pingcap/errors.Trace\n\tgithub.com/pingcap/[email protected]/juju_adaptor.go:15\ngithub.com/pingcap/tiflow/cdc/redo/writer/memory.(*fileWorkerGroup).bgWriteLogs\n\tgithub.com/pingcap/tiflow/cdc/redo/writer/memory/file_worker.go:193\ngithub.com/pingcap/tiflow/cdc/redo/writer/memory.(*fileWorkerGroup).Run.func2\n\tgithub.com/pingcap/tiflow/cdc/redo/writer/memory/file_worker.go:148\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/[email protected]/errgroup/errgroup.go:75\nruntime.goexit\n\truntime/asm_amd64.s:1598"]
[2023/06/10 14:44:28.995 +00:00] [ERROR] [manager.go:522] ["redo manager fails to close writer"] [namespace=default] [changefeed=redo-apply-multiple-ci28n65tojtur782h43g] [error="context canceled"] [errorVerbose="context canceled\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/[email protected]/errors.go:174\ngithub.com/pingcap/errors.Trace\n\tgithub.com/pingcap/[email protected]/juju_adaptor.go:15\ngithub.com/pingcap/tiflow/cdc/redo/writer/memory.(*encodingWorkerGroup).runWorker\n\tgithub.com/pingcap/tiflow/cdc/redo/writer/memory/encoding_worker.go:153\ngithub.com/pingcap/tiflow/cdc/redo/writer/memory.(*encodingWorkerGroup).Run.func2\n\tgithub.com/pingcap/tiflow/cdc/redo/writer/memory/encoding_worker.go:135\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/[email protected]/errgroup/errgroup.go:75\nruntime.goexit\n\truntime/asm_amd64.s:1598"]

...

[2023/06/10 15:28:34.950 +00:00] [ERROR] [manager.go:522] ["redo manager fails to close writer"] [namespace=default] [changefeed=redo-apply-multiple-ci28n65tojtur782h43g] [error="context canceled"] [errorVerbose="context canceled\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/[email protected]/errors.go:174\ngithub.com/pingcap/errors.Trace\n\tgithub.com/pingcap/[email protected]/juju_adaptor.go:15\ngithub.com/pingcap/tiflow/cdc/redo/writer/memory.(*fileWorkerGroup).bgFlushFileCache\n\tgithub.com/pingcap/tiflow/cdc/redo/writer/memory/file_worker.go:170\ngithub.com/pingcap/tiflow/cdc/redo/writer/memory.(*fileWorkerGroup).Run.func3\n\tgithub.com/pingcap/tiflow/cdc/redo/writer/memory/file_worker.go:152\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/[email protected]/errgroup/errgroup.go:75\nruntime.goexit\n\truntime/asm_amd64.s:1598"]
[2023/06/10 15:28:34.950 +00:00] [INFO] [manager.go:527] ["redo manager closed"] [namespace=default] [changefeed=redo-apply-multiple-ci28n65tojtur782h43g]
[2023/06/10 15:28:34.950 +00:00] [INFO] [processor.go:1037] ["processor sub-component is in stopping"] [namespace=default] [changefeed=redo-apply-multiple-ci28n65tojtur782h43g] [name=MounterGroup]
[2023/06/10 15:28:34.950 +00:00] [INFO] [processor.go:1037] ["processor sub-component is in stopping"] [namespace=default] [changefeed=redo-apply-multiple-ci28n65tojtur782h43g] [name=ddlHandler]
[2023/06/10 15:28:34.996 +00:00] [INFO] [client.go:987] ["stream to store closed"] [namespace=default] [changefeed=redo-apply-multiple-ci28n65tojtur782h43g] [addr=upstream-tikv-1.upstream-tikv-peer.cdc-redo-long-duration-tps-1806299-1-173.svc:20160] [storeID=12]
[2023/06/10 15:28:34.997 +00:00] [INFO] [client.go:987] ["stream to store closed"] [namespace=default] [changefeed=redo-apply-multiple-ci28n65tojtur782h43g] [addr=upstream-tikv-1.upstream-tikv-peer.cdc-redo-long-duration-tps-1806299-1-173.svc:20160] [storeID=12]
[2023/06/10 15:28:35.001 +00:00] [INFO] [client.go:987] ["stream to store closed"] [namespace=default] [changefeed=redo-apply-multiple-ci28n65tojtur782h43g] [addr=upstream-tikv-1.upstream-tikv-peer.cdc-redo-long-duration-tps-1806299-1-173.svc:20160] [storeID=12]
[2023/06/10 15:28:35.001 +00:00] [INFO] [processor.go:927] ["Processor drop sort engine successfully"] [namespace=default] [changefeed=redo-apply-multiple-ci28n65tojtur782h43g]
[2023/06/10 15:28:35.001 +00:00] [INFO] [processor.go:933] ["Processor try to close agent"] [namespace=default] [changefeed=redo-apply-multiple-ci28n65tojtur782h43g]
[2023/06/10 15:28:35.001 +00:00] [WARN] [server.go:478] ["topic handler returned error"] [error="[CDC:ErrWorkerPoolHandleCancelled]workerpool handle is cancelled"] [errorVerbose="[CDC:ErrWorkerPoolHandleCancelled]workerpool handle is cancelled\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/[email protected]/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\tgithub.com/pingcap/[email protected]/normalize.go:164\ngithub.com/pingcap/tiflow/pkg/workerpool.(*defaultEventHandle).GracefulUnregister.func1\n\tgithub.com/pingcap/tiflow/pkg/workerpool/pool_impl.go:230\ngithub.com/pingcap/tiflow/pkg/workerpool.(*defaultEventHandle).GracefulUnregister\n\tgithub.com/pingcap/tiflow/pkg/workerpool/pool_impl.go:252\ngithub.com/pingcap/tiflow/pkg/p2p.(*MessageServer).run.func2\n\tgithub.com/pingcap/tiflow/pkg/p2p/server.go:262\nruntime.goexit\n\truntime/asm_amd64.s:1598"]
[2023/06/10 15:28:35.001 +00:00] [INFO] [processor.go:939] ["Processor closed agent successfully"] [namespace=default] [changefeed=redo-apply-multiple-ci28n65tojtur782h43g]
[2023/06/10 15:28:35.001 +00:00] [INFO] [processor.go:949] ["processor closed"] [namespace=default] [changefeed=redo-apply-multiple-ci28n65tojtur782h43g]
[2023/06/10 15:28:35.001 +00:00] [WARN] [server.go:284] ["handler not found"] [topic=changefeed/default/redo-apply-multiple-ci28n65tojtur782h43g/agent]
[2023/06/10 15:30:16.873 +00:00] [INFO] [grpc_pool_impl.go:244] ["recycle connections in grpc pool"] [address=upstream-tikv-2.upstream-tikv-peer.cdc-redo-long-duration-tps-1806299-1-173.svc:20160]
[2023/06/10 15:30:16.873 +00:00] [INFO] [grpc_pool_impl.go:244] ["recycle connections in grpc pool"] [address=upstream-tikv-0.upstream-tikv-peer.cdc-redo-long-duration-tps-1806299-1-173.svc:20160]
[2023/06/10 15:30:16.874 +00:00] [INFO] [grpc_pool_impl.go:244] ["recycle connections in grpc pool"] [address=upstream-tikv-1.upstream-tikv-peer.cdc-redo-long-duration-tps-1806299-1-173.svc:20160]
[2023/06/10 15:58:26.094 +00:00] [WARN] [feed_state_manager.go:547] ["The changefeed won't be restarted as it has been experiencing failures for an extended duration"] [maxElapsedTime=1h30m0s]
  1. lots of error "sinkManager: resolved ts should not less than checkpoint ts" seen
[2023/06/10 14:44:40.778 +00:00] [ERROR] [manager.go:941] ["sinkManager: resolved ts should not less than checkpoint ts"] [namespace=default] [changefeed=redo-apply-multiple-ci28n65tojtur782h43g] [span={table_id:91,start_key:7480000000000000ff5b5f720000000000fa,end_key:7480000000000000ff5b5f730000000000fa}] [resolvedTs=442081774048378883] [checkpointTs="{\"Mode\":0,\"Ts\":442081774152974342,\"BatchID\":18446744073709551615}"]

Uploading ticdc_1_1806299.log…

Versions of the cluster

cdc version:
["Welcome to Change Data Capture (CDC)"] [release-version=v7.2.0-alpha] [git-hash=c6bebdeb756344564aa140ada7f8db9206e9745a] [git-branch=heads/refs/tags/v7.2.0-alpha] [utc-build-time="2023-06-10 11:33:58"] [go-version="go version go1.20.3 linux/amd64"] [failpoint-build=false]

@fubinzh fubinzh added area/ticdc Issues or PRs related to TiCDC. type/bug The issue is confirmed as a bug. labels Jun 12, 2023
@fubinzh
Copy link
Author

fubinzh commented Jun 12, 2023

/severity critical

@nongfushanquan
Copy link
Contributor

/assign @CharlesCheung96

@fubinzh
Copy link
Author

fubinzh commented Jun 16, 2023

/found automation

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affects-6.5 This bug affects the 6.5.x(LTS) versions. affects-7.1 This bug affects the 7.1.x(LTS) versions. affects-7.2 area/ticdc Issues or PRs related to TiCDC. found/automation Bugs found by automation cases severity/critical type/bug The issue is confirmed as a bug.
Projects
None yet
4 participants