Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql/flowinfra: panic in rangefeed.(*CatchUpIterator) #114192

Closed
cockroach-teamcity opened this issue Nov 9, 2023 · 8 comments · Fixed by #114240
Closed

sql/flowinfra: panic in rangefeed.(*CatchUpIterator) #114192

cockroach-teamcity opened this issue Nov 9, 2023 · 8 comments · Fixed by #114240
Assignees
Labels
branch-release-23.2 Used to mark GA and release blockers, technical advisories, and bugs for 23.2 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). GA-blocker O-robot Originated from a bot.
Milestone

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Nov 9, 2023

sql/flowinfra.TestEvalCtxTxnOnRemoteNodes failed with artifacts on release-23.2 @ 5c59f412f63925f930a58392d0a95980b37e0655:

Fatal error:

panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x19a pc=0x1cb6389]

Stack:

goroutine 4989621 [running]:
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).recover(0xc00e3327f0?, {0x649f108, 0xc001173110})
	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:230 +0x6a
panic({0x49d0f80, 0x8fcedc0})
	GOROOT/src/runtime/panic.go:884 +0x213
github.com/cockroachdb/cockroach/pkg/storage.(*pebbleIterator).Close(0xc00e332b60?)
	github.com/cockroachdb/cockroach/pkg/storage/pebble_iterator.go:338 +0x29
github.com/cockroachdb/cockroach/pkg/storage.(*intentInterleavingIter).Close(0xc01318fb00)
	github.com/cockroachdb/cockroach/pkg/storage/intent_interleaving_iter.go:1068 +0x3d
github.com/cockroachdb/cockroach/pkg/storage.(*MVCCIncrementalIterator).Close(0xc0068d96c0)
	github.com/cockroachdb/cockroach/pkg/storage/mvcc_incremental_iterator.go:278 +0x29
github.com/cockroachdb/cockroach/pkg/kv/kvserver/rangefeed.(*CatchUpIterator).Close(0xc009976540)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/rangefeed/catchup_scan.go:114 +0x2d
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).registerWithRangefeedRaftMuLocked.func1()
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_rangefeed.go:395 +0x2a
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).registerWithRangefeedRaftMuLocked(0xc012936c80, {0x649f108, 0xc01093a420}, {{0xc00daf16d0, 0x3, 0x8}, {0xc00daf16d8, 0x3, 0x8}}, {0x179611e806e4d03b, ...}, ...)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_rangefeed.go:485 +0xbaf
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).RangeFeed(0xc012936c80, 0xc0197cb1e0, {0x6475a48, 0xc01093a2d0}, 0x3?)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_rangefeed.go:288 +0x6b0
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).RangeFeed(0xc004097500, 0xc0197cb1e0, {0x6475a48, 0xc01093a2d0})
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store.go:3005 +0x105
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Stores).RangeFeed(0xc000e12c80?, 0xc0197cb1e0, {0x6475a48, 0xc01093a2d0})
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/stores.go:227 +0xf6
github.com/cockroachdb/cockroach/pkg/server.(*Node).MuxRangeFeed(0xc0093cb000, {0x64ddb68?, 0xc01c227f80})
	github.com/cockroachdb/cockroach/pkg/server/node.go:1846 +0x5a6
github.com/cockroachdb/cockroach/pkg/kv/kvpb._Internal_MuxRangeFeed_Handler({0x5066540?, 0xc0093cb000}, {0x64d1e70?, 0xc000e129a0})
	github.com/cockroachdb/cockroach/pkg/kv/kvpb/bazel-out/k8-fastbuild/bin/pkg/kv/kvpb/kvpb_go_proto_/github.com/cockroachdb/cockroach/pkg/kv/kvpb/api.pb.go:10457 +0x9f
github.com/cockroachdb/cockroach/pkg/util/tracing/grpcinterceptor.StreamServerInterceptor.func1({0x5066540, 0xc0093cb000}, {0x64ce2d0?, 0xc0011739e0?}, 0xc0164e8ff0, 0x5433c90)
	github.com/cockroachdb/cockroach/pkg/util/tracing/grpcinterceptor/grpc_interceptor.go:165 +0x6c4
google.golang.org/grpc.getChainStreamHandler.func1({0x5066540, 0xc0093cb000}, {0x64ce2d0, 0xc0011739e0})
	google.golang.org/grpc/external/org_golang_google_grpc/server.go:1478 +0xb9
github.com/cockroachdb/cockroach/pkg/rpc.NewServerEx.func4({0x5066540, 0xc0093cb000}, {0x64ce2d0, 0xc0011739e0}, 0xc0164e8ff0?, 0xc00df13480)
	github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:178 +0x74
google.golang.org/grpc.getChainStreamHandler.func1({0x5066540, 0xc0093cb000}, {0x64ce2d0, 0xc0011739e0})
	google.golang.org/grpc/external/org_golang_google_grpc/server.go:1478 +0xb9
github.com/cockroachdb/cockroach/pkg/rpc.kvAuth.streamInterceptor({0xc00c990000?, {{0x4398c7?}, {0x64d6b70?, 0xc01a006450?}}}, {0x5066540, 0xc0093cb000}, {0x64d24e8, 0xc004951c00}, 0xc0164e8ff0, 0xc00df13340)
	github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/auth.go:167 +0x3b1
google.golang.org/grpc.getChainStreamHandler.func1({0x5066540, 0xc0093cb000}, {0x64d24e8, 0xc004951c00})
	google.golang.org/grpc/external/org_golang_google_grpc/server.go:1478 +0xb9
github.com/cockroachdb/cockroach/pkg/rpc.NewServerEx.func2.1({0xc00c7c2480?, 0x441970?})
	github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:145 +0x2d
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunTaskWithErr(0xc00c7c2480, {0x649f108, 0xc001173110}, {0xc00043e000?, 0xc00df12cc0?}, 0xc00bc9daa0)
	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:336 +0xd1
github.com/cockroachdb/cockroach/pkg/rpc.NewServerEx.func2({0x5066540, 0xc0093cb000}, {0x64d24e8?, 0xc004951c00?}, 0xc0164e8ff0, 0xc00df12cc0)
	github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:144 +0xef
google.golang.org/grpc.chainStreamInterceptors.func1({0x5066540, 0xc0093cb000}, {0x64d24e8, 0xc004951c00}, 0x47a0300?, 0xc01c298000?)
	google.golang.org/grpc/external/org_golang_google_grpc/server.go:1469 +0x8f
google.golang.org/grpc.(*Server).processStreamingRPC(0xc00fcde1e0, {0x64e98a0, 0xc019189520}, 0xc0160d27e0, 0xc005a48780, 0x9011f00, 0x0)
	google.golang.org/grpc/external/org_golang_google_grpc/server.go:1627 +0x1206
google.golang.org/grpc.(*Server).handleStream(0xc00fcde1e0, {0x64e98a0, 0xc019189520}, 0xc0160d27e0, 0x0)
	google.golang.org/grpc/external/org_golang_google_grpc/server.go:1708 +0x9f0
google.golang.org/grpc.(*Server).serveStreams.func1.2()
	google.golang.org/grpc/external/org_golang_google_grpc/server.go:965 +0x98
created by google.golang.org/grpc.(*Server).serveStreams.func1
	google.golang.org/grpc/external/org_golang_google_grpc/server.go:963 +0x28a
Log preceding fatal error

=== RUN   TestEvalCtxTxnOnRemoteNodes
    test_log_scope.go:170: test logs captured to: /artifacts/tmp/_tmp/fd023fd6c8beefd8db32fbb739554fec/logTestEvalCtxTxnOnRemoteNodes2567333182
    test_log_scope.go:81: use -show-logs to present logs inline
    test_server_shim.go:165: automatically injected a shared process virtual cluster under test; see comment at top of test_server_shim.go for details.
=== RUN   TestEvalCtxTxnOnRemoteNodes/vectorize=false
=== RUN   TestEvalCtxTxnOnRemoteNodes/vectorize=true

Parameters: TAGS=bazel,gss,deadlock , stress=true

Help

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/sql-queries

This test on roachdash | Improve this report!

Jira issue: CRDB-33388

@cockroach-teamcity cockroach-teamcity added branch-release-23.2 Used to mark GA and release blockers, technical advisories, and bugs for 23.2 C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-sql-queries SQL Queries Team labels Nov 9, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.2 milestone Nov 9, 2023
@github-project-automation github-project-automation bot moved this to Triage in SQL Queries Nov 9, 2023
@yuzefovich yuzefovich changed the title sql/flowinfra: TestEvalCtxTxnOnRemoteNodes failed sql/flowinfra: panic in rangefeed.(*CatchUpIterator) Nov 10, 2023
@yuzefovich yuzefovich added T-kv-replication and removed T-sql-queries SQL Queries Team labels Nov 10, 2023
Copy link

blathers-crl bot commented Nov 10, 2023

cc @cockroachdb/replication

@yuzefovich yuzefovich removed this from SQL Queries Nov 10, 2023
@blathers-crl blathers-crl bot added the T-kv KV Team label Nov 10, 2023
@erikgrinaker erikgrinaker self-assigned this Nov 10, 2023
@erikgrinaker
Copy link
Contributor

erikgrinaker commented Nov 10, 2023

Must have called Close() on a nil iterator:

We check that catchupIter != nil higher up the stack, but not that catchupIter.simpleCatchupIter is:

if catchUpIter != nil {
catchUpIter.Close()
}

However, it is set non-nil during construction and never written to anywhere else:

simpleCatchupIter: iter,

The test doesn't do anything with server internals to provoke this either.

@erikgrinaker
Copy link
Contributor

Relevant metamorphic parameters:

initialized metamorphic constant "kv.rangefeed.use_dedicated_connection_class.enabled" with value true 
initialized metamorphic constant "use-mux-rangefeed" with value true 
initialized metamorphic constant "kv_rangefeed_scheduler_enabled" with value true 

The full set is:

initialized metamorphic constant "parse-json-impl" with value 1
initialized metamorphic constant "coldata-batch-size" with value 3824 
initialized metamorphic constant "span-reuse-rate" with value 51
initialized metamorphic constant "lock-table-iters-before-seek" with value 0
initialized metamorphic constant "storage.ingest_as_flushable.enabled" with value false
initialized metamorphic constant "mvcc-max-iters-before-seek" with value 0
initialized metamorphic constant "kv.rangefeed.use_dedicated_connection_class.enabled" with value true 
initialized metamorphic constant "use-mux-rangefeed" with value true 
initialized metamorphic constant "default-batch-bytes-limit" with value 7840 
initialized metamorphic constant "addsst-rewrite-concurrency" with value 1
initialized metamorphic constant "addsst-rewrite-forced" with value true 
initialized metamorphic constant "split-scans-right-for-stats-first" with value true 
initialized metamorphic constant "raft-log-truncation-clearrange-threshold" with value 637823
initialized metamorphic constant "kv.snapshot.ingest_as_write_threshold" with value 0
initialized metamorphic constant "kv_rangefeed_scheduler_enabled" with value true 
initialized metamorphic constant "kv-batch-size" with value 1
initialized metamorphic constant "datum-row-converter-batch-size" with value 1
initialized metamorphic constant "row-container-rows-per-chunk-shift" with value 1
initialized metamorphic constant "max-batch-size" with value 4746 
initialized metamorphic constant "max-batch-byte-size" with value 16300215
initialized metamorphic constant "inverted-joiner-batch-size" with value 1
initialized metamorphic constant "zig-zag-joiner-batch-size" with value 1
initialized metamorphic constant "spilling-queue-initial-len" with value 4
initialized metamorphic constant "merge-joiner-groups-buffer" with value 6
initialized metamorphic constant "ColIndexJoin-using-streamer-batch-size" with value 7838319
initialized metamorphic constant "parallel-scan-result-threshold" with value 1448 
initialized metamorphic constant "resolve-membership-single-scan-enabled" with value false
initialized metamorphic constant "copy-batch-size" with value 9246 
initialized metamorphic constant "async-IE-result-channel-buffer-size" with value 30

@erikgrinaker
Copy link
Contributor

Of course, the iterator that was actually nil was intentInterleavingIter.iter:

@erikgrinaker
Copy link
Contributor

Could be a double-close -- we nil out the iterator and put it back into the pool on the first close:

func (i *intentInterleavingIter) Close() {
i.iter.Close()
i.intentIter.Close()
*i = intentInterleavingIter{
intentKeyAsNoTimestampMVCCKeyBacking: i.intentKeyAsNoTimestampMVCCKeyBacking,
intentKeyBuf: i.intentKeyBuf,
intentLimitKeyBuf: i.intentLimitKeyBuf,
}
intentInterleavingIterPool.Put(i)
}

@erikgrinaker
Copy link
Contributor

erikgrinaker commented Nov 10, 2023

Yeah, there's a race here:

// Always defer closing iterator to cover old and new failure cases.
// On successful path where registration succeeds reset catchUpIter to prevent
// closing it.
defer func() {
if catchUpIter != nil {
catchUpIter.Close()
}
}()

reg, filter := p.Register(span, startTS, catchUpIter, withDiff, stream, func() { r.maybeDisconnectEmptyRangefeed(p) }, done)
if reg {
// Registered successfully with an existing processor.
// Update the rangefeed filter to avoid filtering ops
// that this new registration might be interested in.
r.setRangefeedFilterLocked(filter)
r.rangefeedMu.Unlock()
catchUpIter = nil
return p
}

The semantics are that Register() assumes ownership of the iterator if it returns true, otherwise we have to close it. But this is racy. The catchup scan may complete and close the iterator here:

catchUpIter := r.detachCatchUpIter()
if catchUpIter == nil {
return nil
}
start := timeutil.Now()
defer func() {
catchUpIter.Close()
r.metrics.RangeFeedCatchUpScanNanos.Inc(timeutil.Since(start).Nanoseconds())
}()

But before that returns to the caller, if the stopper fires, we return false back up the stack while waiting for the registration here, which will also have the caller close the iterator:

case r = <-result:
return r
case <-p.stoppedC:
return r

@erikgrinaker
Copy link
Contributor

I'm going to downgrade this to a GA blocker. This only happens during node shutdown, rarely, so we don't have to hold up the beta over it.

@erikgrinaker erikgrinaker added GA-blocker and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Nov 10, 2023
@erikgrinaker erikgrinaker added T-kv-replication and removed T-kv KV Team labels Nov 13, 2023
Copy link

blathers-crl bot commented Nov 13, 2023

cc @cockroachdb/replication

@kvoli kvoli added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Nov 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-release-23.2 Used to mark GA and release blockers, technical advisories, and bugs for 23.2 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). GA-blocker O-robot Originated from a bot.
Projects
No open projects
Archived in project
Development

Successfully merging a pull request may close this issue.

4 participants