kv: Rangefeeds appear to be stuck #86818

Closed

miretskiy opened this issue Aug 24, 2022 · 5 comments
Labels
A-cdc (Change Data Capture) · C-bug (Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.) · T-cdc

Comments

@miretskiy
Contributor

miretskiy commented Aug 24, 2022

This has been observed in large-scale deployments;
see https://github.com/cockroachlabs/support/issues/1729

The cause of the changefeed/rangefeed stuckness is not understood. However, stacks like the
following should never happen:

goroutine 2980779476 [select, 107 minutes]:
google.golang.org/grpc/internal/transport.(*recvBufferReader).readClient(0xc06d9f8cd0, {0xc04cda7a98, 0x5, 0x5})
    google.golang.org/grpc/internal/transport/external/org_golang_google_grpc/internal/transport/transport.go:190 +0xaa
google.golang.org/grpc/internal/transport.(*recvBufferReader).Read(0xc06d9f8cd0, {0xc04cda7a98, 0xc06bf566f0, 0xc08a7d3128})
    google.golang.org/grpc/internal/transport/external/org_golang_google_grpc/internal/transport/transport.go:170 +0x147
google.golang.org/grpc/internal/transport.(*transportReader).Read(0xc016293e60, {0xc04cda7a98, 0xc08a7d31a0, 0xa4f2c7})
    google.golang.org/grpc/internal/transport/external/org_golang_google_grpc/internal/transport/transport.go:484 +0x32
io.ReadAtLeast({0x62a59c0, 0xc016293e60}, {0xc04cda7a98, 0x5, 0x5}, 0x5)
    GOROOT/src/io/io.go:328 +0x9a
io.ReadFull(...)
    GOROOT/src/io/io.go:347
google.golang.org/grpc/internal/transport.(*Stream).Read(0xc00cf6cfc0, {0xc04cda7a98, 0x5, 0x5})
    google.golang.org/grpc/internal/transport/external/org_golang_google_grpc/internal/transport/transport.go:468 +0xa5
google.golang.org/grpc.(*parser).recvMsg(0xc04cda7a88, 0x7fffffff)
    google.golang.org/grpc/external/org_golang_google_grpc/rpc_util.go:559 +0x47
google.golang.org/grpc.recvAndDecompress(0x58, 0xc00cf6cfc0, {0x0, 0x0}, 0x7fffffff, 0xc08a7d3458, {0x62e5030, 0x9b3b708})
    google.golang.org/grpc/external/org_golang_google_grpc/rpc_util.go:690 +0x66
google.golang.org/grpc.recv(0x62c4688, {0x7f92ab1e4980, 0xc000483a90}, 0x7f92a1156cf8, {0x0, 0x0}, {0x4d32d40, 0xc06ce1db60}, 0xb, 0xc08a7d3458, ...)
    google.golang.org/grpc/external/org_golang_google_grpc/rpc_util.go:756 +0x6e
google.golang.org/grpc.(*csAttempt).recvMsg(0xc025850840, {0x4d32d40, 0xc06ce1db60}, 0x0)
    google.golang.org/grpc/external/org_golang_google_grpc/stream.go:975 +0x2b0
google.golang.org/grpc.(*clientStream).RecvMsg.func1(0x0)
    google.golang.org/grpc/external/org_golang_google_grpc/stream.go:826 +0x25
google.golang.org/grpc.(*clientStream).withRetry(0xc020214b00, 0xc08a7d3590, 0xc08a7d3560)
    google.golang.org/grpc/external/org_golang_google_grpc/stream.go:680 +0x2f6
google.golang.org/grpc.(*clientStream).RecvMsg(0xc020214b00, {0x4d32d40, 0xc06ce1db60})
    google.golang.org/grpc/external/org_golang_google_grpc/stream.go:825 +0x11f
github.com/cockroachdb/cockroach/pkg/util/tracing.(*tracingClientStream).RecvMsg(0xc0398cfc60, {0x4d32d40, 0xc06ce1db60})
    github.com/cockroachdb/cockroach/pkg/util/tracing/grpc_interceptor.go:440 +0x37
github.com/cockroachdb/cockroach/pkg/roachpb.(*internalRangeFeedClient).Recv(0xc07aff97c0)
    github.com/cockroachdb/cockroach/pkg/roachpb/bazel-out/k8-opt/bin/pkg/roachpb/roachpb_go_proto_/github.com/cockroachdb/cockroach/pkg/roachpb/api.pb.go:9284 +0x4c
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).singleRangeFeed(0xc00067ed80, {0x6345010, 0xc054e938c0}, {{0xc03f46e980, 0xf, 0x10}, {0xc0148a2580, 0xf, 0x10}}, {0x170dfa486f5eae51, ...}, ...)
    github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go:465 +0xae3
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).partialRangeFeed(0xc00067ed80, {0x6345010, 0xc054e938c0}, 0xc08d0462a0, {{0xc03f46e980, 0xf, 0x10}, {0xc0148a2580, 0xf, 0x10}}, ...)
    github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go:315 +0x6fb
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).RangeFeed.func1.1({0x6345010, 0xc054e938c0})
    github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go:110 +0xbe
github.com/cockroachdb/cockroach/pkg/util/ctxgroup.Group.GoCtx.func1()
    github.com/cockroachdb/cockroach/pkg/util/ctxgroup/ctxgroup.go:169 +0x25
golang.org/x/sync/errgroup.(*Group).Go.func1()
    golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:57 +0x67
created by golang.org/x/sync/errgroup.(*Group).Go
    golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:54 +0x92

The DistSender should never be blocked in (gRPC) Recv for 107 minutes, since each range should be producing either
events or range checkpoints (every kv.closed_timestamp.side_transport_interval).

This is a repeat of an issue we observed about a year ago for the same customer.
It seems to happen when there is significant activity, with ranges getting split/moved (possibly to different
nodes and/or stores). There appears to be some sort of race where the rangefeed is not disconnected; it remains in a zombie state in which matching goroutines exist on the server side but nothing is emitted, thus causing the stuckness.

We should add a defense-in-depth mechanism to the DistSender (being worked on); a sketch of the idea follows below.
We should also figure out what is actually going on.
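
As a minimal sketch of the defense-in-depth idea (this is not the actual DistSender code; the eventStream interface, recvWithWatchdog, and stuckThreshold are hypothetical names), the client-side receive loop could wrap the blocking Recv in a watchdog and give up when nothing arrives within a generous multiple of the side-transport interval:

```go
// Hypothetical sketch of a client-side watchdog around a blocking Recv.
// Not CockroachDB's implementation; names are illustrative only.
package watchdog

import (
	"context"
	"errors"
	"time"
)

// eventStream abstracts a gRPC RangeFeed client stream (hypothetical).
type eventStream interface {
	Recv() (interface{}, error)
}

var errStuck = errors.New("rangefeed appears stuck: no events or checkpoints received")

// recvWithWatchdog wraps a blocking Recv with a timeout. Each range is
// expected to emit events or checkpoints at least every
// kv.closed_timestamp.side_transport_interval, so a stream that stays
// silent for much longer than that is presumed wedged.
func recvWithWatchdog(ctx context.Context, s eventStream, stuckThreshold time.Duration) (interface{}, error) {
	type result struct {
		ev  interface{}
		err error
	}
	resC := make(chan result, 1)
	go func() {
		ev, err := s.Recv()
		resC <- result{ev, err}
	}()

	timer := time.NewTimer(stuckThreshold)
	defer timer.Stop()

	select {
	case r := <-resC:
		return r.ev, r.err
	case <-timer.C:
		// The caller is expected to cancel the stream's context (which
		// unblocks the Recv goroutine) and restart the single-range feed.
		return nil, errStuck
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}
```

The caller would treat errStuck like any other retryable rangefeed error: tear down the single-range stream and re-establish it, rather than sitting in Recv indefinitely.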

Jira issue: CRDB-18946

@miretskiy miretskiy added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-cdc Change Data Capture T-cdc T-kv KV Team labels Aug 24, 2022
@blathers-crl

blathers-crl bot commented Aug 24, 2022

cc @cockroachdb/cdc

@blathers-crl

blathers-crl bot commented Aug 29, 2022

cc @cockroachdb/replication

@erikgrinaker
Contributor

We're hopeful this may have been addressed by #106053, since we've seen this on multi-store nodes. However, we'll keep this issue open until we're confident it has been resolved. We've since added a stuck rangefeed watcher, which restarts rangefeeds that don't emit events, masking the problem. The way to confirm the fix would be to inspect the telemetry for stuck rangefeed restarts (rangefeed.stuck.after-catchup-scan). Unfortunately, we don't have corresponding metrics; we should consider adding those too.

@miretskiy Is the above telemetry sufficient here, or do we need metrics as well?

@miretskiy
Contributor Author

We have "distsender.rangefeed.restart_stuck" metric too.
I think we're good to go. It should be noted, that stuck watcher was recently removed from mux rangefeed (but that's another story)
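
As a quick hedged sketch of how one might spot-check that counter on a running node (assuming it is exported on CockroachDB's Prometheus endpoint at /_status/vars; the exact exported spelling may differ from the dotted metric name above):

```go
// Hypothetical helper: scan a node's Prometheus endpoint for the
// stuck-rangefeed restart counter. The endpoint address and metric
// spelling are assumptions, not taken from this issue.
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

func main() {
	// CockroachDB nodes serve Prometheus-style metrics at /_status/vars.
	resp, err := http.Get("http://localhost:8080/_status/vars")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	sc := bufio.NewScanner(resp.Body)
	for sc.Scan() {
		line := sc.Text()
		// Print any non-comment metric line mentioning stuck restarts.
		if strings.Contains(line, "restart_stuck") && !strings.HasPrefix(line, "#") {
			fmt.Println(line)
		}
	}
	if err := sc.Err(); err != nil {
		panic(err)
	}
}
```

A value above zero would indicate that the stuck-rangefeed restart path has fired at least once.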

@erikgrinaker
Contributor

Given the number of changes in this area recently, I'm closing this issue until we have further reports of stuck rangefeeds.

@github-project-automation github-project-automation bot moved this to Closed in KV Aug 28, 2024