kv: Rangefeeds appear to be stuck #86818

Closed

miretskiy opened this issue Aug 24, 2022 · 5 comments
Labels
A-cdc (Change Data Capture) · C-bug (Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.) · T-cdc

Comments

@miretskiy
Contributor

miretskiy commented Aug 24, 2022

This has been observed in large-scale deployments;
see https://github.com/cockroachlabs/support/issues/1729

The cause of the changefeed/rangefeed stuckness is not understood. However, stacks like the
following should never happen:

goroutine 2980779476 [select, 107 minutes]:
google.golang.org/grpc/internal/transport.(*recvBufferReader).readClient(0xc06d9f8cd0, {0xc04cda7a98, 0x5, 0x5})
    google.golang.org/grpc/internal/transport/external/org_golang_google_grpc/internal/transport/transport.go:190 +0xaa
google.golang.org/grpc/internal/transport.(*recvBufferReader).Read(0xc06d9f8cd0, {0xc04cda7a98, 0xc06bf566f0, 0xc08a7d3128})
    google.golang.org/grpc/internal/transport/external/org_golang_google_grpc/internal/transport/transport.go:170 +0x147
google.golang.org/grpc/internal/transport.(*transportReader).Read(0xc016293e60, {0xc04cda7a98, 0xc08a7d31a0, 0xa4f2c7})
    google.golang.org/grpc/internal/transport/external/org_golang_google_grpc/internal/transport/transport.go:484 +0x32
io.ReadAtLeast({0x62a59c0, 0xc016293e60}, {0xc04cda7a98, 0x5, 0x5}, 0x5)
    GOROOT/src/io/io.go:328 +0x9a
io.ReadFull(...)
    GOROOT/src/io/io.go:347
google.golang.org/grpc/internal/transport.(*Stream).Read(0xc00cf6cfc0, {0xc04cda7a98, 0x5, 0x5})
    google.golang.org/grpc/internal/transport/external/org_golang_google_grpc/internal/transport/transport.go:468 +0xa5
google.golang.org/grpc.(*parser).recvMsg(0xc04cda7a88, 0x7fffffff)
    google.golang.org/grpc/external/org_golang_google_grpc/rpc_util.go:559 +0x47
google.golang.org/grpc.recvAndDecompress(0x58, 0xc00cf6cfc0, {0x0, 0x0}, 0x7fffffff, 0xc08a7d3458, {0x62e5030, 0x9b3b708})
    google.golang.org/grpc/external/org_golang_google_grpc/rpc_util.go:690 +0x66
google.golang.org/grpc.recv(0x62c4688, {0x7f92ab1e4980, 0xc000483a90}, 0x7f92a1156cf8, {0x0, 0x0}, {0x4d32d40, 0xc06ce1db60}, 0xb, 0xc08a7d3458, ...)
    google.golang.org/grpc/external/org_golang_google_grpc/rpc_util.go:756 +0x6e
google.golang.org/grpc.(*csAttempt).recvMsg(0xc025850840, {0x4d32d40, 0xc06ce1db60}, 0x0)
    google.golang.org/grpc/external/org_golang_google_grpc/stream.go:975 +0x2b0
google.golang.org/grpc.(*clientStream).RecvMsg.func1(0x0)
    google.golang.org/grpc/external/org_golang_google_grpc/stream.go:826 +0x25
google.golang.org/grpc.(*clientStream).withRetry(0xc020214b00, 0xc08a7d3590, 0xc08a7d3560)
    google.golang.org/grpc/external/org_golang_google_grpc/stream.go:680 +0x2f6
google.golang.org/grpc.(*clientStream).RecvMsg(0xc020214b00, {0x4d32d40, 0xc06ce1db60})
    google.golang.org/grpc/external/org_golang_google_grpc/stream.go:825 +0x11f
github.com/cockroachdb/cockroach/pkg/util/tracing.(*tracingClientStream).RecvMsg(0xc0398cfc60, {0x4d32d40, 0xc06ce1db60})
    github.com/cockroachdb/cockroach/pkg/util/tracing/grpc_interceptor.go:440 +0x37
github.com/cockroachdb/cockroach/pkg/roachpb.(*internalRangeFeedClient).Recv(0xc07aff97c0)
    github.com/cockroachdb/cockroach/pkg/roachpb/bazel-out/k8-opt/bin/pkg/roachpb/roachpb_go_proto_/github.com/cockroachdb/cockroach/pkg/roachpb/api.pb.go:9284 +0x4c
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).singleRangeFeed(0xc00067ed80, {0x6345010, 0xc054e938c0}, {{0xc03f46e980, 0xf, 0x10}, {0xc0148a2580, 0xf, 0x10}}, {0x170dfa486f5eae51, ...}, ...)
    github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go:465 +0xae3
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).partialRangeFeed(0xc00067ed80, {0x6345010, 0xc054e938c0}, 0xc08d0462a0, {{0xc03f46e980, 0xf, 0x10}, {0xc0148a2580, 0xf, 0x10}}, ...)
    github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go:315 +0x6fb
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).RangeFeed.func1.1({0x6345010, 0xc054e938c0})
    github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go:110 +0xbe
github.com/cockroachdb/cockroach/pkg/util/ctxgroup.Group.GoCtx.func1()
    github.com/cockroachdb/cockroach/pkg/util/ctxgroup/ctxgroup.go:169 +0x25
golang.org/x/sync/errgroup.(*Group).Go.func1()
    golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:57 +0x67
created by golang.org/x/sync/errgroup.(*Group).Go
    golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:54 +0x92

The DistSender should never be blocked in (gRPC) Recv for 107 minutes, since each range should be producing either
events or range checkpoints (every kv.closed_timestamp.side_transport_interval).

This is a repeat of an issue we observed about a year ago for the same customer.
It seems to happen when there is significant activity, with ranges getting split/moved (possibly to different
nodes and/or stores). There appears to be some sort of race where the rangefeed is not disconnected; it remains in a zombie state in which matching goroutines exist on the server side but nothing is emitted, thus causing the stuckness.

We should add a defense-in-depth mechanism to the DistSender (being worked on); a sketch of the idea follows below.
We should also figure out what is actually going on.
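
As a minimal sketch of the defense-in-depth idea (this is not the actual DistSender code; the eventStream interface, recvWithWatchdog, and stuckThreshold are hypothetical names), the client-side receive loop could wrap the blocking Recv in a watchdog and give up when nothing arrives within a generous multiple of the side-transport interval:

```go
// Hypothetical sketch of a client-side watchdog around a blocking Recv.
// Not CockroachDB's implementation; names are illustrative only.
package watchdog

import (
	"context"
	"errors"
	"time"
)

// eventStream abstracts a gRPC RangeFeed client stream (hypothetical).
type eventStream interface {
	Recv() (interface{}, error)
}

var errStuck = errors.New("rangefeed appears stuck: no events or checkpoints received")

// recvWithWatchdog wraps a blocking Recv with a timeout. Each range is
// expected to emit events or checkpoints at least every
// kv.closed_timestamp.side_transport_interval, so a stream that stays
// silent for much longer than that is presumed wedged.
func recvWithWatchdog(ctx context.Context, s eventStream, stuckThreshold time.Duration) (interface{}, error) {
	type result struct {
		ev  interface{}
		err error
	}
	resC := make(chan result, 1)
	go func() {
		ev, err := s.Recv()
		resC <- result{ev, err}
	}()

	timer := time.NewTimer(stuckThreshold)
	defer timer.Stop()

	select {
	case r := <-resC:
		return r.ev, r.err
	case <-timer.C:
		// The caller is expected to cancel the stream's context (which
		// unblocks the Recv goroutine) and restart the single-range feed.
		return nil, errStuck
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}
```

The caller would treat errStuck like any other retryable rangefeed error: tear down the single-range stream and re-establish it, rather than sitting in Recv indefinitely.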

Jira issue: CRDB-18946

@miretskiy miretskiy added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-cdc Change Data Capture T-cdc T-kv KV Team labels Aug 24, 2022
@blathers-crl

blathers-crl bot commented Aug 24, 2022

cc @cockroachdb/cdc

@blathers-crl

blathers-crl bot commented Aug 29, 2022

cc @cockroachdb/replication

@erikgrinaker
Contributor

We're hopeful this may have been addressed by #106053, since we've seen this on multi-store nodes. However, we'll keep this issue open until we're confident it has been resolved. We've since added a stuck rangefeed watcher, which restarts rangefeeds that don't emit events, masking the problem. The way to confirm the fix would be to inspect the telemetry for stuck rangefeed restarts (rangefeed.stuck.after-catchup-scan). Unfortunately, we don't have corresponding metrics; we should consider adding those too.

@miretskiy Is the above telemetry sufficient here, or do we need metrics as well?

@miretskiy
Contributor Author

We have "distsender.rangefeed.restart_stuck" metric too.
I think we're good to go. It should be noted, that stuck watcher was recently removed from mux rangefeed (but that's another story)
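
As a quick hedged sketch of how one might spot-check that counter on a running node (assuming it is exported on CockroachDB's Prometheus endpoint at /_status/vars; the exact exported spelling may differ from the dotted metric name above):

```go
// Hypothetical helper: scan a node's Prometheus endpoint for the
// stuck-rangefeed restart counter. The endpoint address and metric
// spelling are assumptions, not taken from this issue.
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

func main() {
	// CockroachDB nodes serve Prometheus-style metrics at /_status/vars.
	resp, err := http.Get("http://localhost:8080/_status/vars")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	sc := bufio.NewScanner(resp.Body)
	for sc.Scan() {
		line := sc.Text()
		// Print any non-comment metric line mentioning stuck restarts.
		if strings.Contains(line, "restart_stuck") && !strings.HasPrefix(line, "#") {
			fmt.Println(line)
		}
	}
	if err := sc.Err(); err != nil {
		panic(err)
	}
}
```

A value above zero would indicate that the stuck-rangefeed restart path has fired at least once.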

@erikgrinaker
Contributor

Given the number of changes in this area recently, I'm closing this issue until we have further reports of stuck rangefeeds.

@github-project-automation github-project-automation bot moved this to Closed in KV Aug 28, 2024