Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

region not receiving event from tikv for too long time cause the batch resolvedts event is dropped #1599

Closed
zier-one opened this issue Apr 1, 2021 · 9 comments
Assignees
Labels
area/ticdc Issues or PRs related to TiCDC. severity/critical subject/replication-interruption Denotes an issue or pull request is related to replication interruption. type/bug The issue is confirmed as a bug.

Comments

@zier-one
Copy link
Contributor

zier-one commented Apr 1, 2021

Bug Report

[2021/04/01 15:19:42.978 +08:00] [WARN] [client.go:1299] ["region not found"] [regionID=1069410]
[2021/04/01 15:19:43.440 +08:00] [WARN] [client.go:1400] ["region not receiving event from tikv for too long time"] [regionID=1069410] [span="[7480000000000000ff4b5f728000000001ff6e32d50000000000fa, 7480000000000000ff4b5f728000000001ff6e4cdf0000000000fa)"] [duration=2m47.71997726s]
[2021/04/01 15:19:43.443 +08:00] [WARN] [client.go:1411] ["region not receiving resolved event from tikv or resolved ts is not pushing for too long time, try to resolve lock"] [regionID=1069410] [span="[7480000000000000ff4b5f728000000001ff6e32d50000000000fa, 7480000000000000ff4b5f728000000001ff6e4cdf0000000000fa)"] [duration=21m50.8s] [resolvedTs=423955076771741710]
[2021/04/01 15:19:43.445 +08:00] [INFO] [lock_resolver.go:122] ["resolve lock successfully"] [regionID=1069410] [maxVersion=423955417768656896]
[2021/04/01 15:19:48.439 +08:00] [WARN] [client.go:1400] ["region not receiving event from tikv for too long time"] [regionID=1069410] [span="[7480000000000000ff4b5f728000000001ff6e32d50000000000fa, 7480000000000000ff4b5f728000000001ff6e4cdf0000000000fa)"] [duration=2m52.719338532s]
[2021/04/01 15:19:48.439 +08:00] [WARN] [client.go:1411] ["region not receiving resolved event from tikv or resolved ts is not pushing for too long time, try to resolve lock"] [regionID=1069410] [span="[7480000000000000ff4b5f728000000001ff6e32d50000000000fa, 7480000000000000ff4b5f728000000001ff6e4cdf0000000000fa)"] [duration=21m54.8s] [resolvedTs=423955076771741710]
[2021/04/01 15:19:48.441 +08:00] [INFO] [lock_resolver.go:122] ["resolve lock successfully"] [regionID=1069410] [maxVersion=423955418817232896]
[2021/04/01 15:19:53.440 +08:00] [WARN] [client.go:1400] ["region not receiving event from tikv for too long time"] [regionID=1069410] [span="[7480000000000000ff4b5f728000000001ff6e32d50000000000fa, 7480000000000000ff4b5f728000000001ff6e4cdf0000000000fa)"] [duration=2m57.719621785s]
[2021/04/01 15:19:53.440 +08:00] [WARN] [client.go:1411] ["region not receiving resolved event from tikv or resolved ts is not pushing for too long time, try to resolve lock"] [regionID=1069410] [span="[7480000000000000ff4b5f728000000001ff6e32d50000000000fa, 7480000000000000ff4b5f728000000001ff6e4cdf0000000000fa)"] [duration=22m0.85s] [resolvedTs=423955076771741710]
[2021/04/01 15:19:53.441 +08:00] [INFO] [lock_resolver.go:122] ["resolve lock successfully"] [regionID=1069410] [maxVersion=423955420403204096]
[2021/04/01 15:19:55.873 +08:00] [WARN] [client.go:1299] ["region not found"] [regionID=1069410]
@zier-one zier-one added the type/bug The issue is confirmed as a bug. label Apr 1, 2021
@amyangfei
Copy link
Contributor

index c723262..1631785 100644
--- a/cdc/kv/client.go
+++ b/cdc/kv/client.go
@@ -73,7 +73,7 @@ const (
 // hard code switch
 // true: use kv client v2, which has a region worker for each stream
 // false: use kv client v1, which runs a goroutine for every single region
-var enableKVClientV2 = true
+var enableKVClientV2 = false

 type singleRegionInfo struct {
        verID  tikv.RegionVerID
@@ -1277,6 +1277,7 @@ func (s *eventFeedSession) sendResolvedTs(
        pendingRegions *syncRegionFeedStateMap,
        addr string,
 ) error {
+       log.Debug("batch resolved ts", zap.Reflect("regions", resolvedTs.Regions), zap.Uint64("ts", resolvedTs.Ts))
        for _, regionID := range resolvedTs.Regions {
                state, ok := regionStates[regionID]
                if ok {
@@ -1294,6 +1295,8 @@ func (s *eventFeedSession) sendResolvedTs(
                        case <-ctx.Done():
                                return ctx.Err()
                        }
+               } else {
+                       log.Warn("region not found", zap.Uint64("regionID", regionID), zap.Uint64("ts", resolvedTs.Ts), zap.String("addr", fmt.Sprintf("%p", s)))
                }
        }
        return nil
@@ -1408,6 +1411,7 @@ func (s *eventFeedSession) singleEventFeed(
                                log.Warn("region not receiving resolved event from tikv or resolved ts is not pushing for too long time, try to resolve lock",
                                        zap.Uint64("regionID", regionID), zap.Stringer("span", span),
                                        zap.Duration("duration", sinceLastResolvedTs),
+                                        zap.String("addr", fmt.Sprintf("%p", s)),
                                        zap.Uint64("resolvedTs", lastResolvedTs))
                                maxVersion := oracle.ComposeTS(oracle.GetPhysical(currentTimeFromPD.Add(-10*time.Second)), 0)
                                err = s.lockResolver.Resolve(ctx, regionID, maxVersion)

@amyangfei
Copy link
Contributor

➜  curl -s http://127.0.0.1:8301/metrics |grep ticdc_kvclient_channel_size
# HELP ticdc_kvclient_channel_size size of each channel in kv client
# TYPE ticdc_kvclient_channel_size gauge
ticdc_kvclient_channel_size{channel="err",id="1"} 0
ticdc_kvclient_channel_size{channel="err",id="11"} 16
ticdc_kvclient_channel_size{channel="err",id="11407"} 16
ticdc_kvclient_channel_size{channel="err",id="2"} 0
ticdc_kvclient_channel_size{channel="err",id="38538"} 0
ticdc_kvclient_channel_size{channel="err",id="38539"} 0
ticdc_kvclient_channel_size{channel="err",id="38542"} 16
ticdc_kvclient_channel_size{channel="err",id="38603"} 16
ticdc_kvclient_channel_size{channel="err",id="49925"} 0
ticdc_kvclient_channel_size{channel="err",id="49926"} 0
ticdc_kvclient_channel_size{channel="err",id="49929"} 16
ticdc_kvclient_channel_size{channel="err",id="49970"} 16
ticdc_kvclient_channel_size{channel="err",id="5"} 16
ticdc_kvclient_channel_size{channel="range",id="1"} 0
ticdc_kvclient_channel_size{channel="range",id="11"} 15
ticdc_kvclient_channel_size{channel="range",id="11407"} 0
ticdc_kvclient_channel_size{channel="range",id="2"} 0
ticdc_kvclient_channel_size{channel="range",id="38538"} 0
ticdc_kvclient_channel_size{channel="range",id="38539"} 0
ticdc_kvclient_channel_size{channel="range",id="38542"} 0
ticdc_kvclient_channel_size{channel="range",id="38603"} 0
ticdc_kvclient_channel_size{channel="range",id="49925"} 0
ticdc_kvclient_channel_size{channel="range",id="49926"} 0
ticdc_kvclient_channel_size{channel="range",id="49929"} 16
ticdc_kvclient_channel_size{channel="range",id="49970"} 16
ticdc_kvclient_channel_size{channel="range",id="5"} 0
ticdc_kvclient_channel_size{channel="region",id="1"} 0
ticdc_kvclient_channel_size{channel="region",id="11"} 0
ticdc_kvclient_channel_size{channel="region",id="11407"} 0
ticdc_kvclient_channel_size{channel="region",id="2"} 0
ticdc_kvclient_channel_size{channel="region",id="38538"} 0
ticdc_kvclient_channel_size{channel="region",id="38539"} 0
ticdc_kvclient_channel_size{channel="region",id="38542"} 8
ticdc_kvclient_channel_size{channel="region",id="38603"} 0
ticdc_kvclient_channel_size{channel="region",id="49925"} 0
ticdc_kvclient_channel_size{channel="region",id="49926"} 0
ticdc_kvclient_channel_size{channel="region",id="49929"} 0
ticdc_kvclient_channel_size{channel="region",id="49970"} 0
ticdc_kvclient_channel_size{channel="region",id="5"} 0

@amyangfei
Copy link
Contributor

goroutines:

goroutines.log.tar.gz

@amyangfei
Copy link
Contributor

amyangfei commented Apr 1, 2021

goroutine 56356 [select, 11 minutes]:
github.com/pingcap/ticdc/pkg/regionspan.(*RegionRangeLock).LockRange.func1(0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
        github.com/pingcap/ticdc/pkg/regionspan/region_range_lock.go:302 +0x1b2
github.com/pingcap/ticdc/cdc/kv.(*eventFeedSession).scheduleRegionRequest(0xc05ef2c5a0, 0x2fe9ac0, 0xc011ef7280, 0x3530b3, 0x26, 0x3f74, 0xc03d437a40, 0x1b, 0x20, 0xc03d437a60, ...)
        github.com/pingcap/ticdc/cdc/kv/client.go:651 +0x1a6
github.com/pingcap/ticdc/cdc/kv.(*eventFeedSession).divideAndSendEventFeedToRegions(0xc05ef2c5a0, 0x2fe9ac0, 0xc011ef7280, 0xc02792b8c0, 0x1b, 0x20, 0xc02792b900, 0x1b, 0x20, 0x5e23239a2ec0002, ...)
        github.com/pingcap/ticdc/cdc/kv/client.go:1015 +0x6d8
github.com/pingcap/ticdc/cdc/kv.(*eventFeedSession).eventFeed.func2(0xc0236fcfc8, 0x8bee5b)
        github.com/pingcap/ticdc/cdc/kv/client.go:578 +0x1e6
golang.org/x/sync/errgroup.(*Group).Go.func1(0xc05b739dd0, 0xc029823560)
        golang.org/x/[email protected]/errgroup/errgroup.go:57 +0x59
created by golang.org/x/sync/errgroup.(*Group).Go
        golang.org/x/[email protected]/errgroup/errgroup.go:54 +0x66
goroutine 56097 [select, 22 minutes]:
github.com/pingcap/ticdc/pkg/regionspan.(*RegionRangeLock).LockRange.func1(0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
        github.com/pingcap/ticdc/pkg/regionspan/region_range_lock.go:302 +0x1b2
github.com/pingcap/ticdc/cdc/kv.(*eventFeedSession).scheduleRegionRequest(0xc05e53ec60, 0x2fe9ac0, 0xc067f15100, 0x35f3cc, 0x11, 0x3a49, 0xc012132c60, 0x1b, 0x20, 0xc012132c80, ...)
        github.com/pingcap/ticdc/cdc/kv/client.go:651 +0x1a6
github.com/pingcap/ticdc/cdc/kv.(*eventFeedSession).divideAndSendEventFeedToRegions(0xc05e53ec60, 0x2fe9ac0, 0xc067f15100, 0xc031363960, 0x1b, 0x20, 0xc031363980, 0x1b, 0x20, 0x5e23243f4540001, ...)
        github.com/pingcap/ticdc/cdc/kv/client.go:1015 +0x6d8
github.com/pingcap/ticdc/cdc/kv.(*eventFeedSession).eventFeed.func2(0xc0063f87c8, 0x8beea5)
        github.com/pingcap/ticdc/cdc/kv/client.go:578 +0x1e6
golang.org/x/sync/errgroup.(*Group).Go.func1(0xc05b738b40, 0xc029822a00)
        golang.org/x/[email protected]/errgroup/errgroup.go:57 +0x59
created by golang.org/x/sync/errgroup.(*Group).Go
        golang.org/x/[email protected]/errgroup/errgroup.go:54 +0x66

@amyangfei
Copy link
Contributor

I suspect the block of region span LockRange caused some deadlock

@amyangfei
Copy link
Contributor

This is fixed by above PRs

@dbakit
Copy link

dbakit commented Dec 7, 2021

[2021/12/06 17:48:22.423 +08:00] [INFO] [client.go:868] ["start new request"] [request="{\"header\":{\"cluster_id\":6984216311785899551,\"ticdc_version\":\"5.1.0\"},\"region_id\":416013,\"region_epoch\":{\"conf_ver\":5,\"version\":2835},\"checkpoint_ts\":429597134820212740,\"start_key\":\"dIAAAAAAAAL//l9yA4AAAAD/ABbViAEwMTf/MDI5ODn/AAD/AAAAAAAA9wH/MDAxMzg2AAD//QQZqeQAAAD/AAAAAAAAAAD5\",\"end_key\":\"dIAAAAAAAAL//l9yA4AAAAD/ABcXoQEwMTf/NDAzMzT/AAD/AAAAAAAA9wH/MDAxNDEwAAD//QQZqkgAAAD/AAAAAAAAAAD5\",\"request_id\":1600,\"extra_op\":1,\"Request\":null}"] [addr=172.18.244.24:20160]
[2021/12/06 17:49:31.979 +08:00] [WARN] [region_worker.go:307] ["region not receiving resolved event from tikv or resolved ts is not pushing for too long time, try to resolve lock"] [regionID=281145] [span="[7480000000000006ffb75f720380000000ff001262a101303131ff3036393130ff0000ff000000000000f701ff3030323431340000fffd00000000000000f8, 7480000000000006ffb75f720380000000ff0014386701303132ff3936323731ff0000ff000000000000f701ff3136333430320000fffd00000000000000f8)"] [duration=19m16.401s] [lastEvent=71.416849ms] [resolvedTs=429597134820212740]
[2021/12/06 17:49:31.982 +08:00] [INFO] [lock_resolver.go:123] ["resolve lock successfully"] [regionID=281145] [maxVersion=429597435342356480]
[2021/12/06 17:49:32.090 +08:00] [WARN] [region_worker.go:307] ["region not receiving resolved event from tikv or resolved ts is not pushing for too long time, try to resolve lock"] [regionID=268917] [span="[7480000000000006ff8a5f720000000000fa, 7480000000000006ff8a5f720130313135ff30333339ff000000ff0000000000f70419ffaa9a000000000000fe)"] [duration=19m16.401s] [lastEvent=177.779243ms] [resolvedTs=429597134820212740]
[2021/12/06 17:49:32.093 +08:00] [INFO] [lock_resolver.go:123] ["resolve lock successfully"] [regionID=268917] [maxVersion=429597435342356480]
[2021/12/06 17:49:37.101 +08:00] [WARN] [region_worker.go:307] ["region not receiving resolved event from tikv or resolved ts is not pushing for too long time, try to resolve lock"] [regionID=363045] [span="[7480000000000008ff755f720380000000ff000e152901313136ff3600000000fb0130ff30353638333934ffff0000000000000000fff700000000000000f8, 7480000000000008ff755f720380000000ff001535e501303030ff3800000000fb0130ff31343530363034ffff0000000000000000fff700000000000000f8)"] [duration=19m21.4s] [lastEvent=114.778792ms] [resolvedTs=429597134820212740]
[2021/12/06 17:49:37.103 +08:00] [INFO] [lock_resolver.go:123] ["resolve lock successfully"] [regionID=363045] [maxVersion=429597436652814336]

@amyangfei seems still has warnings, I also meet this problem, and my ticdc_version is 5.1.0.

@dbakit
Copy link

dbakit commented Dec 7, 2021

image
according to the monitor, only one of three TiCDC node appears this warning, and its sink write duration is very very large!!

@amyangfei
Copy link
Contributor

Sink write duration reflects transaction execution to downstream, the resolve lock warning is not relevant to it. You can check the following metrics

  • checkpoint-ts to verify whether changefeed still moves forward.
  • table resolved ts to verify whether TiCDC internal data puller works normal.

@AkiraXie AkiraXie added the area/ticdc Issues or PRs related to TiCDC. label Mar 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/ticdc Issues or PRs related to TiCDC. severity/critical subject/replication-interruption Denotes an issue or pull request is related to replication interruption. type/bug The issue is confirmed as a bug.
Projects
None yet
Development

No branches or pull requests

5 participants