kvserver: potential closed timestamp deadlock #100468

Closed
erikgrinaker opened this issue Apr 3, 2023 · 7 comments
Assignees
Labels
branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered).

Comments

@erikgrinaker
Contributor

erikgrinaker commented Apr 3, 2023

Seen on release-23.1 in #99560 (comment). Unclear if it's an issue with mux rangefeeds or something else. The closed timestamp smearing in #98192 hasn't been backported to 23.1 yet.

POTENTIAL DEADLOCK:
Previous place where the lock was grabbed
goroutine 435165905 lock 0xc002352d50
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_rangefeed.go:202 kvserver.(*Replica).RangeFeed ??? <<<<<
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_rangefeed.go:201 kvserver.(*Replica).RangeFeed ???
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store.go:2828 kvserver.(*Store).RangeFeed ???
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/stores.go:227 kvserver.(*Stores).RangeFeed ???
github.com/cockroachdb/cockroach/pkg/server/node.go:1531 server.(*Node).MuxRangeFeed ???
github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:1304 rpc.internalClientAdapter.MuxRangeFeed.func3.1 ???
github.com/cockroachdb/cockroach/pkg/util/tracing/grpcinterceptor/grpc_interceptor.go:164 grpcinterceptor.StreamServerInterceptor.func1 ???
github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:977 rpc.serverStreamInterceptorsChain.run.func1 ???
github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:284 rpc.NewServerEx.func4 ???
github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:980 rpc.serverStreamInterceptorsChain.run.func1 ???
github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/auth.go:157 rpc.kvAuth.streamInterceptor ???
github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:980 rpc.serverStreamInterceptorsChain.run.func1 ???
github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:251 rpc.NewServerEx.func2.1 ???
github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:321 stop.(*Stopper).RunTaskWithErr ???
github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:250 rpc.NewServerEx.func2 ???
github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:980 rpc.serverStreamInterceptorsChain.run.func1 ???
github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:982 rpc.serverStreamInterceptorsChain.run ???
github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:1309 rpc.internalClientAdapter.MuxRangeFeed.func3 ???

Have been trying to lock it again for more than 5m0s
goroutine 427112695 lock 0xc002352d50
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_rangefeed.go:677 kvserver.(*Replica).handleClosedTimestampUpdate ??? <<<<<
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_rangefeed.go:676 kvserver.(*Replica).handleClosedTimestampUpdate ???
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store.go:2246 kvserver.(*Store).startRangefeedUpdater.func1 ???
github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:470 stop.(*Stopper).RunAsyncTaskEx.func2 ???

Here is what goroutine 435165905 doing now
goroutine 435165905 [select, 5 minutes]:
github.com/cockroachdb/cockroach/pkg/kv/kvserver/rangefeed.(*Processor).syncEventC(0xc00994f800)
github.com/cockroachdb/cockroach/pkg/kv/kvserver/rangefeed/processor.go:690 +0x1d6
github.com/cockroachdb/cockroach/pkg/kv/kvserver/rangefeed.(*Processor).Register(0xc00994f800, {{0xc024682d60, 0x2, 0x8}, {0xc024682d68, 0x2, 0x8}}, {0x17516ea86c54f09c, 0x0, 0x0}, ...)
github.com/cockroachdb/cockroach/pkg/kv/kvserver/rangefeed/processor.go:487 +0x76
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).registerWithRangefeedRaftMuLocked(0xc002352c80, {0x7181be8, 0xc021927890}, {{0xc024682d60, 0x2, 0x8}, {0xc024682d68, 0x2, 0x8}}, {0x17516ea86c54f09c, ...}, ...)
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_rangefeed.go:337 +0x24e
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).RangeFeed(0xc002352c80, 0xc01b5c9680, {0x7155b08, 0xc021927860}, 0xc0336bd8f0)
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_rangefeed.go:224 +0x595
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).RangeFeed(0xc02df1ca80, 0xc01b5c9680, {0x7155b08, 0xc021927860})
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store.go:2829 +0x105
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Stores).RangeFeed(0x7181be8?, 0xc01b5c9680, {0x7155b08, 0xc021927860})
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/stores.go:228 +0xf6
github.com/cockroachdb/cockroach/pkg/server.(*Node).MuxRangeFeed(0xc0068b7500, {0x71ca7c0?, 0xc01e2e3d80})
github.com/cockroachdb/cockroach/pkg/server/node.go:1532 +0x267
github.com/cockroachdb/cockroach/pkg/rpc.internalClientAdapter.MuxRangeFeed.func3.1({0x0?, 0x7181be8?}, {0x71b74e8?, 0xc019df63a0?})
github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:1305 +0x62
github.com/cockroachdb/cockroach/pkg/util/tracing/grpcinterceptor.StreamServerInterceptor.func1({0x5b0c500, 0xc0068b7500}, {0x71b3828?, 0xc005a380c0?}, 0xa0ac2d0, 0xc01a01dbc0)
github.com/cockroachdb/cockroach/pkg/util/tracing/grpcinterceptor/grpc_interceptor.go:164 +0x6c4
github.com/cockroachdb/cockroach/pkg/rpc.serverStreamInterceptorsChain.run.func1({0x5b0c500?, 0xc0068b7500?}, {0x71b3828?, 0xc005a380c0?})
github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:978 +0x5b
github.com/cockroachdb/cockroach/pkg/rpc.NewServerEx.func4({0x5b0c500, 0xc0068b7500}, {0x71b3828, 0xc005a380c0}, 0xc005a38090?, 0xc03f6bb6c0)
github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:285 +0x74
github.com/cockroachdb/cockroach/pkg/rpc.serverStreamInterceptorsChain.run.func1({0x5b0c500?, 0xc0068b7500?}, {0x71b3828?, 0xc005a380c0?})
github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:981 +0x83
github.com/cockroachdb/cockroach/pkg/rpc.kvAuth.streamInterceptor({0xc024ef4000?, {{0x55d4c60?}, {0x71a3ce0?, 0xc01f17f5c0?}}}, {0x5b0c500, 0xc0068b7500}, {0x71b3828, 0xc005a380c0}, 0xa0ac2d0, 0xc03f6bb6c0)
github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/auth.go:158 +0x373
github.com/cockroachdb/cockroach/pkg/rpc.serverStreamInterceptorsChain.run.func1({0x5b0c500?, 0xc0068b7500?}, {0x71b3828?, 0xc005a380c0?})
github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:981 +0x83
github.com/cockroachdb/cockroach/pkg/rpc.NewServerEx.func2.1({0xc023a6a680?, 0x0?})
github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:252 +0x2d
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunTaskWithErr(0xc023a6a680, {0x7181be8, 0xc005a38090}, {0xc0024fdec8?, 0x462ebf?}, 0xc0024fde80)
github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:322 +0xd1
github.com/cockroachdb/cockroach/pkg/rpc.NewServerEx.func2({0x5b0c500, 0xc0068b7500}, {0x71b3828?, 0xc005a380c0?}, 0xa0ac2d0, 0xc03f6bb6c0)
github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:251 +0xef
github.com/cockroachdb/cockroach/pkg/rpc.serverStreamInterceptorsChain.run.func1({0x5b0c500?, 0xc0068b7500?}, {0x71b3828?, 0xc005a380c0?})
github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:981 +0x83
github.com/cockroachdb/cockroach/pkg/rpc.serverStreamInterceptorsChain.run({0xc01f0c3a00, 0x4, 0x4}, {0x5b0c500, 0xc0068b7500}, {0x71b3828, 0xc005a380c0}, 0xa0ac2d0, 0xc01a01dbc0)
github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:983 +0x142
github.com/cockroachdb/cockroach/pkg/rpc.internalClientAdapter.MuxRangeFeed.func3()
github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:1310 +0xfe
created by github.com/cockroachdb/cockroach/pkg/rpc.internalClientAdapter.MuxRangeFeed
github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:1295 +0x3aa
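
The shape of the report above (one goroutine holds the lock while parked in a select, a second goroutine cannot acquire the same lock) can be reproduced in isolation. The following is a minimal sketch only, not the CockroachDB code: the `replica` type, `register`, `tryUpdate`, and all channel names are hypothetical stand-ins, and the timeouts are arbitrary.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// replica is a hypothetical stand-in for a structure guarded by one mutex.
type replica struct {
	mu sync.Mutex
}

// register mimics a registration path: it takes the lock and then waits for
// an acknowledgement while still holding it. If nothing ever signals ackC
// (or stopC), the lock is never released.
func (r *replica) register(ackC, stopC <-chan struct{}, holding chan<- struct{}) {
	r.mu.Lock()
	defer r.mu.Unlock()
	close(holding) // tell the caller that the lock is now held
	select {
	case <-ackC:
	case <-stopC:
	}
}

// tryUpdate mimics a periodic updater that needs the same lock. It polls
// TryLock until the timeout expires and reports whether it got the lock.
func (r *replica) tryUpdate(timeout time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if r.mu.TryLock() {
			r.mu.Unlock()
			return true
		}
		time.Sleep(time.Millisecond)
	}
	return false
}

// demo reports whether the updater was blocked while the ack was missing,
// and whether it could proceed once the ack arrived.
func demo() (blocked, recovered bool) {
	r := &replica{}
	ackC := make(chan struct{})
	stopC := make(chan struct{})
	holding := make(chan struct{})
	done := make(chan struct{})

	go func() {
		defer close(done)
		r.register(ackC, stopC, holding)
	}()

	<-holding // register now holds the lock and is parked in its select
	blocked = !r.tryUpdate(50 * time.Millisecond)

	close(ackC) // deliver the acknowledgement
	<-done      // register has returned and released the lock
	recovered = r.tryUpdate(50 * time.Millisecond)
	return blocked, recovered
}

func main() {
	blocked, recovered := demo()
	fmt.Println("updater blocked while ack missing:", blocked)
	fmt.Println("updater proceeds after ack:", recovered)
}
```

In this sketch the updater only makes progress once the acknowledgement is delivered, which is the same dependency the detector output above points at.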

Jira issue: CRDB-26453

@erikgrinaker erikgrinaker added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. GA-blocker labels Apr 3, 2023
@blathers-crl

blathers-crl bot commented Apr 3, 2023

Hi @erikgrinaker, please add branch-* labels to identify which branch(es) this release-blocker affects.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@erikgrinaker erikgrinaker added branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-test-failure Broken test (automatically or manually discovered). labels Apr 3, 2023
@blathers-crl

blathers-crl bot commented Apr 3, 2023

cc @cockroachdb/replication

miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Apr 3, 2023
Make sure mux rangefeed uses correct context if it needs
to restart range feeds.

Fix catchup reservation metric accounting in mux rangefeed.

Informs cockroachdb#99560
Informs cockroachdb#99640
Informs cockroachdb#99214
Informs cockroachdb#98925
Informs cockroachdb#99092
Informs cockroachdb#99212
Informs cockroachdb#99910
Informs cockroachdb#99560
Informs cockroachdb#100468

Release note: None
@pav-kv
Collaborator

pav-kv commented Apr 3, 2023

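The goroutine that holds the lock is stuck reading from a channel: per the trace above, it has been parked in a select in rangefeed.(*Processor).syncEventC (processor.go:690) for 5 minutes.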

Potentially some event consumer doesn't "ack" the response channel, or returns early before doing so (e.g. due to a timeout or something).

miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Apr 4, 2023
Make sure mux rangefeed uses correct context if it needs
to restart range feeds.

Fix catchup reservation metric accounting in mux rangefeed.

Informs cockroachdb#99560
Informs cockroachdb#99640
Informs cockroachdb#99214
Informs cockroachdb#98925
Informs cockroachdb#99092
Informs cockroachdb#99212
Informs cockroachdb#99910
Informs cockroachdb#99560
Informs cockroachdb#100468

Release note: None
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Apr 4, 2023
Restart ranges on a dedicated goroutine (if needed).
Fix logic bug in stuck range handling.
Increase verbosity of logging to help debug mux rangefeed issues.

Informs cockroachdb#99560
Informs cockroachdb#99640
Informs cockroachdb#99214
Informs cockroachdb#98925
Informs cockroachdb#99092
Informs cockroachdb#99212
Informs cockroachdb#99910
Informs cockroachdb#99560
Informs cockroachdb#100468

Release note: None
@nicktrav
Collaborator

nicktrav commented Apr 5, 2023

@miretskiy - assigning to you, as discussed.

@erikgrinaker
Contributor Author

@miretskiy Has this been addressed by the recent fixes?

@miretskiy
Contributor

I think we can close this particular issue, if for no other reason than that mux rangefeed is disabled in tests on 23.1.
I have not seen this error, or any new deadlock-like errors, on master in a while, so fingers crossed.

Do you feel comfortable closing this issue, or would you prefer to keep it around?

@erikgrinaker
Contributor Author

Sure, let's do it.
