sql: stack overflow in InternalExecutor.Exec{Ex} #109197

Closed
yuzefovich opened this issue Aug 21, 2023 · 5 comments · Fixed by #114398
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-investigation Further steps needed to qualify. C-label will change. T-sql-queries SQL Queries Team

yuzefovich commented Aug 21, 2023

Over in https://cockroachlabs.slack.com/archives/C04U1BTF8/p1692639696886179, Lidor and Steven observed that when running cluster-to-cluster replication, after two out of five nodes crashed due to out-of-disk, the remaining three nodes soon crashed with the following:

goroutine 127729672 [running]:
runtime.sellock({0xc0a1a0a4b8, 0x3, 0x3?}, {0xc0a1a0a4b2, 0x2, 0x3?})
        GOROOT/src/runtime/select.go:34 +0xc5 fp=0xc0a1a0a338 sp=0xc0a1a0a330 pc=0x4ab425
runtime.selectgo(0xc0a1a0a4b8, 0xc0a1a0a4ac, 0x0?, 0x0, 0x0?, 0x1)
        GOROOT/src/runtime/select.go:231 +0x2d0 fp=0xc0a1a0a478 sp=0xc0a1a0a338 pc=0x4ab930
github.com/cockroachdb/cockroach/pkg/sql.(*ieResultChannel).firstResult(0xc043ffca80, {0x78d2230, 0xc070ba0e10})
        github.com/cockroachdb/cockroach/pkg/sql/internal_result_channel.go:123 +0xf9 fp=0xc0a1a0a5d0 sp=0xc0a1a0a478 pc=0x3765ed9
github.com/cockroachdb/cockroach/pkg/sql.(*ieResultChannel).nextResult(0x0?, {0x78d2230, 0xc070ba0e10})
        github.com/cockroachdb/cockroach/pkg/sql/internal_result_channel.go:162 +0xfe fp=0xc0a1a0a6f0 sp=0xc0a1a0a5d0 pc=0x376659e
github.com/cockroachdb/cockroach/pkg/sql.(*rowsIterator).Next(0xc01b95ad80, {0x78d2230?, 0xc070ba0e10?})
        github.com/cockroachdb/cockroach/pkg/sql/internal.go:532 +0x21a fp=0xc0a1a0a878 sp=0xc0a1a0a6f0 pc=0x375e71a
github.com/cockroachdb/cockroach/pkg/sql.(*rowsIterator).Next.func2({{0x0, 0x0, 0x0}, 0xc03a182cb8, {0x0, 0x0, 0x0}, {0x0, 0x0}})
        github.com/cockroachdb/cockroach/pkg/sql/internal.go:500 +0x26c fp=0xc0a1a0a918 sp=0xc0a1a0a878 pc=0x375ebec
github.com/cockroachdb/cockroach/pkg/sql.(*rowsIterator).Next(0xc01b95ad80, {0x78d2230?, 0xc070ba0e10?})
        github.com/cockroachdb/cockroach/pkg/sql/internal.go:536 +0x39f fp=0xc0a1a0aaa0 sp=0xc0a1a0a918 pc=0x375e89f
github.com/cockroachdb/cockroach/pkg/sql.(*rowsIterator).Next.func2({{0x0, 0x0, 0x0}, 0xc03a182c38, {0x0, 0x0, 0x0}, {0x0, 0x0}})
        github.com/cockroachdb/cockroach/pkg/sql/internal.go:500 +0x26c fp=0xc0a1a0ab40 sp=0xc0a1a0aaa0 pc=0x375ebec
github.com/cockroachdb/cockroach/pkg/sql.(*rowsIterator).Next(0xc01b95ad80, {0x78d2230?, 0xc070ba0e10?})
        github.com/cockroachdb/cockroach/pkg/sql/internal.go:536 +0x39f fp=0xc0a1a0acc8 sp=0xc0a1a0ab40 pc=0x375e89f
github.com/cockroachdb/cockroach/pkg/sql.(*rowsIterator).Next.func2({{0x0, 0x0, 0x0}, 0xc03a182be8, {0x0, 0x0, 0x0}, {0x0, 0x0}})
        github.com/cockroachdb/cockroach/pkg/sql/internal.go:500 +0x26c fp=0xc0a1a0ad68 sp=0xc0a1a0acc8 pc=0x375ebec
github.com/cockroachdb/cockroach/pkg/sql.(*rowsIterator).Next(0xc01b95ad80, {0x78d2230?, 0xc070ba0e10?})
        github.com/cockroachdb/cockroach/pkg/sql/internal.go:536 +0x39f fp=0xc0a1a0aef0 sp=0xc0a1a0ad68 pc=0x375e89f
github.com/cockroachdb/cockroach/pkg/sql.(*rowsIterator).Next.func2({{0x0, 0x0, 0x0}, 0xc03a182b98, {0x0, 0x0, 0x0}, {0x0, 0x0}})
        github.com/cockroachdb/cockroach/pkg/sql/internal.go:500 +0x26c fp=0xc0a1a0af90 sp=0xc0a1a0aef0 pc=0x375ebec
github.com/cockroachdb/cockroach/pkg/sql.(*rowsIterator).Next(0xc01b95ad80, {0x78d2230?, 0xc070ba0e10?})
        github.com/cockroachdb/cockroach/pkg/sql/internal.go:536 +0x39f fp=0xc0a1a0b118 sp=0xc0a1a0af90 pc=0x375e89f
...

This was on 9849680, and it seems likely that c09860b has some bug in it.

Jira issue: CRDB-30817

@yuzefovich yuzefovich added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-investigation Further steps needed to qualify. C-label will change. labels Aug 21, 2023
@github-project-automation github-project-automation bot moved this to Triage in SQL Queries Aug 21, 2023
@yuzefovich yuzefovich added the T-sql-queries SQL Queries Team label Aug 25, 2023
yuzefovich commented

Closing as unactionable.

@github-project-automation github-project-automation bot moved this from Triage to Done in SQL Queries Sep 12, 2023
@yuzefovich yuzefovich self-assigned this Nov 14, 2023
@yuzefovich yuzefovich reopened this Nov 14, 2023
@github-project-automation github-project-automation bot moved this from Done to Triage in SQL Queries Nov 14, 2023
@yuzefovich yuzefovich removed this from SQL Queries Nov 14, 2023
@github-project-automation github-project-automation bot moved this to Triage in SQL Queries Nov 14, 2023
@yuzefovich yuzefovich moved this from Triage to Active in SQL Queries Nov 14, 2023
yuzefovich commented

We saw this problem again on the CCT 23.2 cluster, running alpha.6. Here are the stack dumps of the 7 nodes that crashed due to the stack overflow: overflow.zip. Interestingly, on n1 there are 5 goroutines whose stacks were elided, on node 6 two (one of which was unrelated), and on all others just one.

The unfortunate thing is that only in go 1.21 we get the tail of the stacks, so we currently don't know what was the query that triggered this, and 23.2 uses 1.20. My hypothesis is that it's the Exec{Ex} method and that c09860b is to blame.

yuzefovich commented

I have a somewhat far-fetched theory that the stack overflow is caused not by an infinite recursion but by a bounded recursion whose depth depends on how many retries we hit.
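
To make the theory concrete, here is a minimal sketch of recursion that terminates but whose depth grows with the number of retries. It is not the CockroachDB code: `iter`, `fetch`, `next`, `errRetry`, and `depthAfterRetries` are hypothetical names, and the only thing borrowed from the trace above is the shape of the alternating `Next` / `Next.func2` frames, where a closure handles an error by calling back into `Next`.

```go
package main

import (
	"errors"
	"fmt"
)

// errRetry stands in for a retriable error; it is a hypothetical placeholder.
var errRetry = errors.New("retriable")

// iter is a toy iterator that fails with errRetry a fixed number of times.
type iter struct {
	failuresLeft int // retriable errors still to be returned by fetch
	maxDepth     int // deepest recursion level observed so far
}

// fetch returns errRetry until failuresLeft reaches zero, then succeeds.
func (it *iter) fetch() error {
	if it.failuresLeft > 0 {
		it.failuresLeft--
		return errRetry
	}
	return nil
}

// next handles a retriable error by calling itself through a closure,
// so every retry nests two more frames (closure + next) on the stack.
func (it *iter) next(depth int) error {
	if depth > it.maxDepth {
		it.maxDepth = depth
	}
	handle := func(err error) error {
		if errors.Is(err, errRetry) {
			return it.next(depth + 1) // recursion instead of a loop
		}
		return err
	}
	if err := it.fetch(); err != nil {
		return handle(err)
	}
	return nil
}

// depthAfterRetries reports the deepest recursion level reached when the
// operation hits n retriable errors before succeeding.
func depthAfterRetries(n int) int {
	it := &iter{failuresLeft: n}
	if err := it.next(0); err != nil {
		return -1
	}
	return it.maxDepth
}

func main() {
	for _, n := range []int{0, 10, 1000} {
		fmt.Printf("retries=%d maxDepth=%d\n", n, depthAfterRetries(n))
	}
}
```

In this toy model the recursion always finishes, yet the stack depth is proportional to the retry count, so a query that is retried often enough could exhaust the stack without any infinite loop.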

I have been focusing on the node 1 crash since it was the first one. In the stack trace I fished out the following

goroutine
goroutine 265767935 [select]:
runtime.gopark(0xc191228418?, 0x2?, 0xf8?, 0x74?, 0xc191228414?)
	GOROOT/src/runtime/proc.go:381 +0xd6 fp=0xc1912282a0 sp=0xc191228280 pc=0x49ce76
runtime.selectgo(0xc191228418, 0xc191228410, 0x7fc5ff135d28?, 0x0, 0xc000180000?, 0x1)
	GOROOT/src/runtime/select.go:327 +0x7be fp=0xc1912283e0 sp=0xc1912282a0 pc=0x4ad57e
google.golang.org/grpc/internal/transport.(*Stream).waitOnHeader(0xc008cae900)
	google.golang.org/grpc/internal/transport/external/org_golang_google_grpc/internal/transport/transport.go:328 +0x7c fp=0xc191228448 sp=0xc1912283e0 pc=0xc084fc
google.golang.org/grpc/internal/transport.(*Stream).RecvCompress(...)
	google.golang.org/grpc/internal/transport/external/org_golang_google_grpc/internal/transport/transport.go:343
google.golang.org/grpc.(*csAttempt).recvMsg(0xc0e653a410, {0x60dd020?, 0xc0e650e900}, 0x7fc5cec83f28?)
	google.golang.org/grpc/external/org_golang_google_grpc/stream.go:1046 +0xc5 fp=0xc191228578 sp=0xc191228448 pc=0xc3ce05
google.golang.org/grpc.(*clientStream).RecvMsg.func1(0x0?)
	google.golang.org/grpc/external/org_golang_google_grpc/stream.go:900 +0x25 fp=0xc1912285a8 sp=0xc191228578 pc=0xc3c285
google.golang.org/grpc.(*clientStream).withRetry(0xc008cae000, 0xc1912286b0, 0xc191228680)
	google.golang.org/grpc/external/org_golang_google_grpc/stream.go:751 +0x144 fp=0xc191228618 sp=0xc1912285a8 pc=0xc3aca4
google.golang.org/grpc.(*clientStream).RecvMsg(0xc008cae000, {0x60dd020?, 0xc0e650e900?})
	google.golang.org/grpc/external/org_golang_google_grpc/stream.go:899 +0x125 fp=0xc1912286e0 sp=0xc191228618 pc=0xc3bf05
google.golang.org/grpc.invoke({0x7962098?, 0xc0e650d5c0?}, {0x6378919?, 0x2?}, {0x62449a0, 0xc06bc9bd40}, {0x60dd020, 0xc0e650e900}, 0xc1912287c8?, {0xc0219abac0, ...})
	google.golang.org/grpc/external/org_golang_google_grpc/call.go:73 +0xd7 fp=0xc191228748 sp=0xc1912286e0 pc=0xc196d7
github.com/cockroachdb/cockroach/pkg/util/tracing/grpcinterceptor.ClientInterceptor.func2({0x7962098, 0xc0e650d5c0}, {0x6378919, 0x21}, {0x62449a0, 0xc06bc9bd40}, {0x60dd020, 0xc0e650e900}, 0xc095c72000?, 0x66c9a48, ...)
	github.com/cockroachdb/cockroach/pkg/util/tracing/grpcinterceptor/grpc_interceptor.go:249 +0x424 fp=0xc191228838 sp=0xc191228748 pc=0x1bff044
google.golang.org/grpc.(*ClientConn).Invoke(0x44f5b2729366322b?, {0x7962098?, 0xc0e650d5c0?}, {0x6378919?, 0x0?}, {0x62449a0?, 0xc06bc9bd40?}, {0x60dd020?, 0xc0e650e900?}, {0x0, ...})
	google.golang.org/grpc/external/org_golang_google_grpc/call.go:35 +0x223 fp=0xc1912288d8 sp=0xc191228838 pc=0xc19523
github.com/cockroachdb/cockroach/pkg/kv/kvpb.(*internalClient).Batch(0xc0e6202240, {0x7962098, 0xc0e650d5c0}, 0x0?, {0x0, 0x0, 0x0})
	github.com/cockroachdb/cockroach/pkg/kv/kvpb/bazel-out/k8-opt/bin/pkg/kv/kvpb/kvpb_go_proto_/github.com/cockroachdb/cockroach/pkg/kv/kvpb/api.pb.go:10074 +0xc9 fp=0xc191228958 sp=0xc1912288d8 pc=0x10e3929
github.com/cockroachdb/cockroach/pkg/rpc/nodedialer.TracingInternalClient.Batch({{0x79cbfb0?, 0xc0e6202240?}}, {0x7962098, 0xc0e650d5c0}, 0xc06bc9bb00, {0x0, 0x0, 0x0})
	github.com/cockroachdb/cockroach/pkg/rpc/nodedialer/nodedialer.go:282 +0x1fa fp=0xc191228a40 sp=0xc191228958 pc=0x1c32b7a
github.com/cockroachdb/cockroach/pkg/rpc/nodedialer.(*TracingInternalClient).Batch(0x0?, {0x7962098?, 0xc0e650d5c0?}, 0x7962098?, {0x0?, 0x0?, 0x1?})
	<autogenerated>:1 +0x65 fp=0xc191228a90 sp=0xc191228a40 pc=0x1c32ca5
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*grpcTransport).sendBatch(0xc00d60dda0, {0x7962098, 0xc0e650d5c0}, 0xc6?, {0x795fcb0, 0xc0e5f0dc10?}, 0xc06bc9bb00)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/transport.go:211 +0x1dc fp=0xc191228d40 sp=0xc191228a90 pc=0x1c9c81c
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*grpcTransport).SendNext(0xc00d60dda0, {0x7962098, 0xc0e650d5c0}, 0xc00d60dda0?)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/transport.go:189 +0x92 fp=0xc191228d98 sp=0xc191228d40 pc=0x1c9c5f2
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).sendToReplicas(0xc000885b80, {0x7962098, 0xc0e650d5c0}, 0xc06bc9b680?, {0xc000bdb770, 0xc02bb0bee0, 0xc02bb0bf50, 0x0, 0x0}, 0x0)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:2386 +0x11ea fp=0xc191229498 sp=0xc191228d98 pc=0x1c8b3ca
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).sendPartialBatch(0xc000885b80, {0x7962098?, 0xc0e650d5c0}, 0xc06bc9b680, {{0xc0cfe16108, 0x3, 0x3}, {0xc0cfe16128, 0x2, 0x2}}, ...)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1887 +0x7b4 fp=0xc191229b80 sp=0xc191229498 pc=0x1c88074
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).divideAndSendBatchToRanges(0xc000885b80, {0x7962098, 0xc0e650d5c0}, 0xc06bc9b680, {{0xc0cfe16108, 0x3, 0x3}, {0xc0cfe16128, 0x2, 0x2}}, ...)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1455 +0x3c9 fp=0xc19122a180 sp=0xc191229b80 pc=0x1c85529
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).Send(0xc000885b80, {0x7962098, 0xc0e650d560}, 0xc06bc9b680)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1071 +0x67a fp=0xc19122a3f0 sp=0xc19122a180 pc=0x1c836ba
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnLockGatekeeper).SendLocked(0xc01b6815a8, {0x7962098, 0xc0e650d560}, 0xc06bc9b680)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_lock_gatekeeper.go:82 +0x1e2 fp=0xc19122a490 sp=0xc19122a3f0 pc=0x1cb3c22
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnMetricRecorder).SendLocked(0xc01b681570, {0x7962098?, 0xc0e650d560?}, 0xc0e650d4d0?)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_metric_recorder.go:47 +0xd2 fp=0xc19122a4e0 sp=0xc19122a490 pc=0x1cab112
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnSpanRefresher).sendLockedWithRefreshAttempts(0xc01b681490, {0x7962098, 0xc0e650d560}, 0xc06bc9b680, 0x5)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:222 +0x203 fp=0xc19122a900 sp=0xc19122a4e0 pc=0x1cb06a3
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnSpanRefresher).SendLocked(0xc01b681490, {0x7962098, 0xc0e650d560}, 0xc19122a9c8?)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:150 +0xb3 fp=0xc19122a948 sp=0xc19122a900 pc=0x1cb0053
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnCommitter).SendLocked(0xc01b681458, {0x7962098, 0xc0e650d560}, 0xc06bc9b680)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_committer.go:146 +0x48c fp=0xc19122a9c0 sp=0xc19122a948 pc=0x1ca6f0c
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnPipeliner).SendLocked(0xc01b681328, {0x7962098, 0xc0e650d560}, 0x30?)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_pipeliner.go:295 +0x125 fp=0xc19122aa38 sp=0xc19122a9c0 pc=0x1cab765
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnSeqNumAllocator).SendLocked(0xc01b681308, {0x7962098, 0xc0e650d560}, 0xc06bc9b680)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_seq_num_allocator.go:114 +0x23f fp=0xc19122ab20 sp=0xc19122aa38 pc=0x1cafc7f
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnHeartbeater).SendLocked(0xc01b681258, {0x7962098, 0xc0e650d560}, 0xc06bc9b680)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_heartbeater.go:246 +0x4a7 fp=0xc19122ac68 sp=0xc19122ab20 pc=0x1ca8ea7
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*TxnCoordSender).Send(0xc01b681080, {0x7962098, 0xc0e650d4d0}, 0xc06bc9b680)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_coord_sender.go:533 +0x5b2 fp=0xc19122ae78 sp=0xc19122ac68 pc=0x1c9ef92
github.com/cockroachdb/cockroach/pkg/kv.(*DB).sendUsingSender(0xc000b77d60, {0x7962098, 0xc0e650d4d0}, 0xc06bc9b680, {0x7fc5b29afba8, 0xc01b681080})
	github.com/cockroachdb/cockroach/pkg/kv/db.go:1090 +0xe7 fp=0xc19122aef0 sp=0xc19122ae78 pc=0x17b4607
github.com/cockroachdb/cockroach/pkg/kv.(*Txn).Send(0xc0e5f1f680, {0x7962098, 0xc0e650d4d0}, 0xc06bc9b680)
	github.com/cockroachdb/cockroach/pkg/kv/txn.go:1218 +0x1ef fp=0xc19122b130 sp=0xc19122aef0 pc=0x17be7af
github.com/cockroachdb/cockroach/pkg/sql/row.makeTxnKVFetcherDefaultSendFunc.func1({0x7962098?, 0xc0e650d4d0?}, 0xc0e650d4d0?)
	github.com/cockroachdb/cockroach/pkg/sql/row/kv_batch_fetcher.go:278 +0x3c fp=0xc19122b168 sp=0xc19122b130 pc=0x243315c
github.com/cockroachdb/cockroach/pkg/sql/row.(*txnKVFetcher).fetch(0xc12ecbf2c0, {0x7962098, 0xc0e650d4d0})
	github.com/cockroachdb/cockroach/pkg/sql/row/kv_batch_fetcher.go:583 +0x353 fp=0xc19122b268 sp=0xc19122b168 pc=0x2434293
github.com/cockroachdb/cockroach/pkg/sql/row.(*txnKVFetcher).nextBatch(0xc12ecbf2c0, {0x7962098, 0xc0e650d4d0})
	github.com/cockroachdb/cockroach/pkg/sql/row/kv_batch_fetcher.go:859 +0x1115 fp=0xc19122b548 sp=0xc19122b268 pc=0x2435975
github.com/cockroachdb/cockroach/pkg/sql/row.(*txnKVFetcher).nextBatch-fm({0x7962098?, 0xc0e650d4d0?})
	<autogenerated>:1 +0xa5 fp=0xc19122b620 sp=0xc19122b548 pc=0x244ad25
github.com/cockroachdb/cockroach/pkg/sql/row.(*kvBatchFetcherHelper).NextBatch(0xc12ecbf2c0, {0x7962098?, 0xc0e650d4d0?})
	github.com/cockroachdb/cockroach/pkg/sql/row/kv_batch_fetcher.go:1008 +0xa9 fp=0xc19122b750 sp=0xc19122b620 pc=0x24368c9
github.com/cockroachdb/cockroach/pkg/sql/row.(*txnKVFetcher).NextBatch(0x30?, {0x7962098?, 0xc0e650d4d0?})
	<autogenerated>:1 +0x9c fp=0xc19122b828 sp=0xc19122b750 pc=0x244a07c
github.com/cockroachdb/cockroach/pkg/sql/row.(*KVFetcher).nextKV(0xc07377d3b0, {0x7962098, 0xc0e650d4d0}, 0x0)
	github.com/cockroachdb/cockroach/pkg/sql/row/kv_fetcher.go:281 +0x10d fp=0xc19122b9c8 sp=0xc19122b828 pc=0x243888d
github.com/cockroachdb/cockroach/pkg/sql/row.(*KVFetcher).NextKV(0xf3d32a?, {0x7962098?, 0xc0e650d4d0?}, 0x2?)
	github.com/cockroachdb/cockroach/pkg/sql/row/kv_fetcher.go:302 +0x7d fp=0xc19122ba78 sp=0xc19122b9c8 pc=0x2438edd
github.com/cockroachdb/cockroach/pkg/sql/colfetcher.(*cFetcher).NextBatch(0xc00cb5a000, {0x7962098, 0xc0e650d4d0})
	github.com/cockroachdb/cockroach/pkg/sql/colfetcher/cfetcher.go:693 +0xdf fp=0xc19122bd98 sp=0xc19122ba78 pc=0x32ce8ff
github.com/cockroachdb/cockroach/pkg/sql/colfetcher.(*ColBatchScan).Next(0xc0e5f0dad0)
	github.com/cockroachdb/cockroach/pkg/sql/colfetcher/colbatch_scan.go:258 +0x33 fp=0xc19122bde8 sp=0xc19122bd98 pc=0x32d7513
github.com/cockroachdb/cockroach/pkg/sql/colexec/colexecutils.(*CancelChecker).Next(0xc0e6205100)
	github.com/cockroachdb/cockroach/pkg/sql/colexec/colexecutils/cancel_checker.go:59 +0x30 fp=0xc19122be00 sp=0xc19122bde8 pc=0x28c26b0
github.com/cockroachdb/cockroach/pkg/sql/colexec/colexecutils.(*vectorTypeEnforcer).Next(0xc0e6574500)
	github.com/cockroachdb/cockroach/pkg/sql/colexec/colexecutils/operator.go:157 +0x2e fp=0xc19122be48 sp=0xc19122be00 pc=0x28c360e
github.com/cockroachdb/cockroach/pkg/sql/colexec.(*defaultBuiltinFuncOperator).Next(0xc0e6570820)
	github.com/cockroachdb/cockroach/pkg/sql/colexec/builtin_funcs.go:46 +0x3e fp=0xc19122bf10 sp=0xc19122be48 pc=0x2a2507e
github.com/cockroachdb/cockroach/pkg/sql/colexec.(*notExprSelOp).Next(0xc0e650d2c0)
	github.com/cockroachdb/cockroach/pkg/sql/colexec/not_expr_ops.go:147 +0x34 fp=0xc19122bf78 sp=0xc19122bf10 pc=0x2a2e454
github.com/cockroachdb/cockroach/pkg/sql/colexec/colexecbase.(*simpleProjectOp).Next(0xc0e517dbc0)
	github.com/cockroachdb/cockroach/pkg/sql/colexec/colexecbase/simple_project.go:124 +0x3f fp=0xc19122c028 sp=0xc19122bf78 pc=0x29b0aff
github.com/cockroachdb/cockroach/pkg/sql/colexec/colexecutils.(*vectorTypeEnforcer).Next(0xc0e6574550)
	github.com/cockroachdb/cockroach/pkg/sql/colexec/colexecutils/operator.go:157 +0x2e fp=0xc19122c070 sp=0xc19122c028 pc=0x28c360e
github.com/cockroachdb/cockroach/pkg/sql/colexec/colexecbase.constNullOp.Next({{{{0x797d1b8, 0xc0e6574550}}, {{0x7962098, 0xc0e650d470}}}, 0x7})
	github.com/cockroachdb/cockroach/bazel-out/k8-opt/bin/pkg/sql/colexec/colexecbase/const.eg.go:611 +0x42 fp=0xc19122c0c0 sp=0xc19122c070 pc=0x29f8122
github.com/cockroachdb/cockroach/pkg/sql/colexec/colexecbase.(*constNullOp).Next(0x7962098?)
	<autogenerated>:1 +0x5d fp=0xc19122c120 sp=0xc19122c0c0 pc=0x2a16e1d
github.com/cockroachdb/cockroach/pkg/sql/colexec/colexecutils.(*vectorTypeEnforcer).Next(0xc0e65745a0)
	github.com/cockroachdb/cockroach/pkg/sql/colexec/colexecutils/operator.go:157 +0x2e fp=0xc19122c168 sp=0xc19122c120 pc=0x28c360e
github.com/cockroachdb/cockroach/pkg/sql/colexec/colexecbase.(*castOpNullAny).Next(0xc08a3aa620)
	github.com/cockroachdb/cockroach/bazel-out/k8-opt/bin/pkg/sql/colexec/colexecbase/cast.eg.go:1199 +0x3f fp=0xc19122c210 sp=0xc19122c168 pc=0x29b513f
github.com/cockroachdb/cockroach/pkg/sql/colexec/colexecutils.(*vectorTypeEnforcer).Next(0xc0e65745f0)
	github.com/cockroachdb/cockroach/pkg/sql/colexec/colexecutils/operator.go:157 +0x2e fp=0xc19122c258 sp=0xc19122c210 pc=0x28c360e
github.com/cockroachdb/cockroach/pkg/sql/colexec.(*projectInOpBytes).Next(0xc0e517dc20)
	github.com/cockroachdb/cockroach/bazel-out/k8-opt/bin/pkg/sql/colexec/select_in.eg.go:734 +0x3e fp=0xc19122c358 sp=0xc19122c258 pc=0x2a7ef9e
github.com/cockroachdb/cockroach/pkg/sql/colexec/colexecutils.(*vectorTypeEnforcer).Next(0xc0e6574640)
	github.com/cockroachdb/cockroach/pkg/sql/colexec/colexecutils/operator.go:157 +0x2e fp=0xc19122c3a0 sp=0xc19122c358 pc=0x28c360e
github.com/cockroachdb/cockroach/pkg/sql/colexec.(*projectInOpBytes).Next(0xc0e517dce0)
	github.com/cockroachdb/cockroach/bazel-out/k8-opt/bin/pkg/sql/colexec/select_in.eg.go:734 +0x3e fp=0xc19122c4a0 sp=0xc19122c3a0 pc=0x2a7ef9e
github.com/cockroachdb/cockroach/pkg/sql/colexec/colexecbase.(*simpleProjectOp).Next(0xc0e517dda0)
	github.com/cockroachdb/cockroach/pkg/sql/colexec/colexecbase/simple_project.go:124 +0x3f fp=0xc19122c550 sp=0xc19122c4a0 pc=0x29b0aff
github.com/cockroachdb/cockroach/pkg/sql/colexec.(*Materializer).next(0xc081c3bdc0)
	github.com/cockroachdb/cockroach/pkg/sql/colexec/materializer.go:247 +0x73 fp=0xc19122c578 sp=0xc19122c550 pc=0x2a2d573
github.com/cockroachdb/cockroach/pkg/sql/colexec.(*Materializer).nextAdapter(...)
	github.com/cockroachdb/cockroach/pkg/sql/colexec/materializer.go:272
github.com/cockroachdb/cockroach/pkg/sql/colexec.(*Materializer).nextAdapter-fm()
	<autogenerated>:1 +0x2b fp=0xc19122c598 sp=0xc19122c578 pc=0x2abfd4b
github.com/cockroachdb/cockroach/pkg/sql/colexecerror.CatchVectorizedRuntimeError(0xc018959000?)
	github.com/cockroachdb/cockroach/pkg/sql/colexecerror/error.go:92 +0x62 fp=0xc19122c5d8 sp=0xc19122c598 pc=0xf3fd42
github.com/cockroachdb/cockroach/pkg/sql/colexec.(*Materializer).Next(0xc081c3bdc0)
	github.com/cockroachdb/cockroach/pkg/sql/colexec/materializer.go:278 +0x4c fp=0xc19122c610 sp=0xc19122c5d8 pc=0x2a2d74c
github.com/cockroachdb/cockroach/pkg/sql.(*rowSourceToPlanNode).Next(0xc01925af00, {{0xc018959000?, 0xc0c9e97fe8?}, 0x7965588?, 0xc19122c6d8?})
	github.com/cockroachdb/cockroach/pkg/sql/row_source_to_plan_node.go:79 +0x45 fp=0xc19122c688 sp=0xc19122c610 pc=0x38dd2e5
github.com/cockroachdb/cockroach/pkg/sql.(*updateNode).BatchedNext(0xc018958000, {{0x7962098, 0xc0e650d410}, 0xc018959000, 0xc0c9e97fe8})
	github.com/cockroachdb/cockroach/pkg/sql/update.go:168 +0xc4 fp=0xc19122c6e8 sp=0xc19122c688 pc=0x396b684
github.com/cockroachdb/cockroach/pkg/sql.(*rowCountNode).startExec(0xc006069a40, {{0x7962098?, 0xc0e650d410?}, 0xc018959000?, 0xc0c9e97fe8?})
	github.com/cockroachdb/cockroach/pkg/sql/plan_batch.go:173 +0xce fp=0xc19122c740 sp=0xc19122c6e8 pc=0x389824e
github.com/cockroachdb/cockroach/pkg/sql.startExec.func2({0xc19122c940?, 0x7965c50?}, {0x7965550, 0xc006069a40})
	github.com/cockroachdb/cockroach/pkg/sql/plan.go:520 +0x143 fp=0xc19122c7f0 sp=0xc19122c740 pc=0x38977e3
github.com/cockroachdb/cockroach/pkg/sql.(*planVisitor).visitInternal.func1()
	github.com/cockroachdb/cockroach/pkg/sql/walk.go:112 +0x3e fp=0xc19122c828 sp=0xc19122c7f0 pc=0x398e73e
github.com/cockroachdb/cockroach/pkg/sql.(*planVisitor).visitInternal(0xc19122c940, {0x7965550?, 0xc006069a40?}, {0x62c03cf?, 0x5?})
	github.com/cockroachdb/cockroach/pkg/sql/walk.go:299 +0xfee fp=0xc19122c8b8 sp=0xc19122c828 pc=0x398e64e
github.com/cockroachdb/cockroach/pkg/sql.(*planVisitor).visit(0xc19122c940, {0x7965550, 0xc006069a40})
	github.com/cockroachdb/cockroach/pkg/sql/walk.go:79 +0xf7 fp=0xc19122c908 sp=0xc19122c8b8 pc=0x398d557
github.com/cockroachdb/cockroach/pkg/sql.walkPlan(...)
	github.com/cockroachdb/cockroach/pkg/sql/walk.go:43
github.com/cockroachdb/cockroach/pkg/sql.startExec({{0x7962098?, 0xc0e650d410?}, 0xc018959000?, 0xc0c9e97fe8?}, {0x7965550, 0xc006069a40})
	github.com/cockroachdb/cockroach/pkg/sql/plan.go:523 +0x119 fp=0xc19122c988 sp=0xc19122c908 pc=0x3897639
github.com/cockroachdb/cockroach/pkg/sql.(*planNodeToRowSource).Start(0xc0916ccdc0, {0x7962098, 0xc0e650d3e0})
	github.com/cockroachdb/cockroach/pkg/sql/plan_node_to_row_source.go:175 +0x13b fp=0xc19122c9e8 sp=0xc19122c988 pc=0x389a89b
github.com/cockroachdb/cockroach/pkg/sql/colflow.(*FlowCoordinator).Start.func1()
	github.com/cockroachdb/cockroach/pkg/sql/colflow/flow_coordinator.go:120 +0x34 fp=0xc19122ca10 sp=0xc19122c9e8 pc=0x32fa1b4
github.com/cockroachdb/cockroach/pkg/sql/colexecerror.CatchVectorizedRuntimeError(0xc0248f4900?)
	github.com/cockroachdb/cockroach/pkg/sql/colexecerror/error.go:92 +0x62 fp=0xc19122ca50 sp=0xc19122ca10 pc=0xf3fd42
github.com/cockroachdb/cockroach/pkg/sql/colflow.(*FlowCoordinator).Start(0xc0248f4900, {0x7961ff0?, 0xc0e65744b0?})
	github.com/cockroachdb/cockroach/pkg/sql/colflow/flow_coordinator.go:119 +0x7c fp=0xc19122cac0 sp=0xc19122ca50 pc=0x32fa11c
github.com/cockroachdb/cockroach/pkg/sql/execinfra.(*ProcessorBaseNoHelper).Run(0xc0248f4900, {0x7961ff0?, 0xc0e65744b0?}, {0x79326a0?, 0xc0e6201c00})
	github.com/cockroachdb/cockroach/pkg/sql/execinfra/processorsbase.go:725 +0x50 fp=0xc19122cb08 sp=0xc19122cac0 pc=0x2408430
github.com/cockroachdb/cockroach/pkg/sql/colflow.(*FlowCoordinator).Run(0xc19122cbd8?, {0x7961ff0?, 0xc0e65744b0?}, {0x79326a0?, 0xc0e6201c00?})
	<autogenerated>:1 +0x38 fp=0xc19122cb40 sp=0xc19122cb08 pc=0x33090b8
github.com/cockroachdb/cockroach/pkg/sql/flowinfra.(*FlowBase).Run(0xc0b489a000, {0x7961ff0?, 0xc0e65744b0}, 0x0?)
	github.com/cockroachdb/cockroach/pkg/sql/flowinfra/flow.go:579 +0x238 fp=0xc19122cc10 sp=0xc19122cb40 pc=0x27e1618
github.com/cockroachdb/cockroach/pkg/sql/colflow.(*vectorizedFlow).Run(0x79db800?, {0x7961ff0?, 0xc0e65744b0?}, 0x80?)
	github.com/cockroachdb/cockroach/pkg/sql/colflow/vectorized_flow.go:305 +0x245 fp=0xc19122ccb0 sp=0xc19122cc10 pc=0x32ffd25
github.com/cockroachdb/cockroach/pkg/sql.(*DistSQLPlanner).Run(0xc000e31cc0, {0x7962098, 0xc0e650ced0}, 0xc0e6570780, 0xc0e5f1f680, 0xc0e650e700, 0xc0e6201c00, 0xc0c9e984b0, 0xc0e5f0d9f0)
	github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:906 +0xb73 fp=0xc19122d600 sp=0xc19122ccb0 pc=0x37c8793
github.com/cockroachdb/cockroach/pkg/sql.(*DistSQLPlanner).PlanAndRun(0xc000e31cc0, {0x7962098, 0xc0e650ced0}, 0xc0c9e984b0, 0xc0e6570780, 0x475c45?, {{0x7965550, 0xc006069a40}, 0x0}, 0xc0e6201c00, ...)
	github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:1946 +0x230 fp=0xc19122d718 sp=0xc19122d600 pc=0x37cd070
github.com/cockroachdb/cockroach/pkg/sql.(*DistSQLPlanner).PlanAndRunAll.func3(0xc0c9e97fe8, 0x7fbebf12e7c8?, {0x7962098, 0xc0e650ced0}, 0x46ed9f?, 0xc0e5f0d9e0?, 0x10?)
	github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:1661 +0xcb fp=0xc19122d7b0 sp=0xc19122d718 pc=0x37cb70b
github.com/cockroachdb/cockroach/pkg/sql.(*DistSQLPlanner).PlanAndRunAll(0xc0e6570780?, {0x7962098?, 0xc0e650ced0}, 0xc0c9e984b0, 0xc0e6570780, 0xc0c9e97fe8, 0xc0e6201c00, 0x0)
	github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:1664 +0x245 fp=0xc19122d8e0 sp=0xc19122d7b0 pc=0x37cb145
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execWithDistSQLEngine(0xc0c9e97980, {0x7962098?, 0xc0e650ced0}, 0xc0c9e97fe8, 0xc0e650ced0?, {0x79dd888?, 0xc0e5cdf810}, 0xc0e5cdf810?, 0xc00cb1bc58)
	github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:2302 +0x456 fp=0xc19122d9f0 sp=0xc19122d8e0 pc=0x36f49d6
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).dispatchToExecutionEngine(0xc0c9e97980, {0x7961ff0, 0xc0e5cdf860}, 0xc0c9e97fe8, {0x79dd888, 0xc0e5cdf810})
	github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:1861 +0x12ff fp=0xc19122e060 sp=0xc19122d9f0 pc=0x36f101f
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmtInOpenState(0xc0c9e97980, {0x7961ff0, 0xc0e5cdf860}, {{0x798ffe0, 0xc0e5f0bd00}, {0x0, 0x0, 0x0}, {0x6559c8f, 0x13d}, ...}, ...)
	github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:1092 +0x3f85 fp=0xc19122edf0 sp=0xc19122e060 pc=0x36e9385
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmt.func1({0x7962098?, 0xc0e60e9ef0?})
	github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:145 +0xbc fp=0xc19122eea0 sp=0xc19122edf0 pc=0x36e49dc
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execWithProfiling(0x7962098?, {0x7962098?, 0xc0e60e9ef0?}, {0x798ffe0?, 0xc0e5f0bd00?}, 0x5dc7ba0?, 0xc0e5cebc40?)
	github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:3277 +0x32c fp=0xc19122efe8 sp=0xc19122eea0 pc=0x36fbcac
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmt(0xc0c9e97980, {0x7962098, 0xc0e60e9ef0}, {{0x798ffe0, 0xc0e5f0bd00}, {0x0, 0x0, 0x0}, {0x6559c8f, 0x13d}, ...}, ...)
	github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:144 +0x465 fp=0xc19122f198 sp=0xc19122efe8 pc=0x36e4365
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execPortal(0xc0c9e97980, {0x7962098, 0xc0e60e9ef0}, {{0x0, 0x0}, 0xc079c8f560, {0xc0e5cebc40, 0x2, 0x2}, {0x0, ...}, ...}, ...)
	github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:250 +0x29e fp=0xc19122f3d0 sp=0xc19122f198 pc=0x36e4d3e
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execCmd.func2({{0x0, 0x0}, 0x0, {0xc14c01f4b86ae852, 0x446f72e480fe, 0x0}, 0x1}, 0xc0c9e97980, 0xc19122f888, 0xc19122f878, ...)
	github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:2334 +0x82b fp=0xc19122f6a8 sp=0xc19122f3d0 pc=0x36d55cb
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execCmd(0xc0c9e97980)
	github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:2336 +0x607 fp=0xc19122fe68 sp=0xc19122f6a8 pc=0x36d3187
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).run(0xc0c9e97980, {0x7962098, 0xc0e60e9a70}, 0xc0e60e91d0?, 0x1211520?, 0xc01f72ac00?)
	github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:2151 +0x216 fp=0xc19122ff40 sp=0xc19122fe68 pc=0x36d27b6
github.com/cockroachdb/cockroach/pkg/sql.(*InternalExecutor).runWithEx.func1()
	github.com/cockroachdb/cockroach/pkg/sql/internal.go:222 +0xaf fp=0xc19122ffe0 sp=0xc19122ff40 pc=0x383f2ef
runtime.goexit()
	src/runtime/asm_amd64.s:1598 +0x1 fp=0xc19122ffe8 sp=0xc19122ffe0 pc=0x4d1801
created by github.com/cockroachdb/cockroach/pkg/sql.(*InternalExecutor).runWithEx
	github.com/cockroachdb/cockroach/pkg/sql/internal.go:221 +0x192

which almost matches the query plan for the removeClaimsFromDeadSessions function in jobs/registry.go.

Plan
    └ *colflow.FlowCoordinator
      └ *sql.planNodeToRowSource
        └ *colexec.Materializer
          └ *colexecbase.simpleProjectOp
            └ *colexec.projectInOpBytes
              └ *colexecutils.vectorTypeEnforcer
                └ *colexec.projectInOpBytes
                  └ *colexecutils.vectorTypeEnforcer
                    └ *colexecbase.castOpNullAny
                      └ *colexecutils.vectorTypeEnforcer
                        └ *colexecbase.constNullOp
                          └ *colexecutils.vectorTypeEnforcer
                            └ *colexecbase.simpleProjectOp
                              └ *colexec.notExprSelOp
                                └ *colexec.defaultBuiltinFuncOperator
                                  └ *colexecutils.vectorTypeEnforcer
                                    └ *colexec.selectInOpBytes
                                      └ *colexecutils.CancelChecker
                                        └ *colfetcher.ColBatchScan

The only difference is that this plan (which I got in a demo) contains colexec.selectInOpBytes, which is not present in the goroutine stack. I'm speculating that the demo had different stats, so the CCT cluster might be using a different index.

The interesting bit is that this utility function uses roachpb.MinUserPriority, which makes it seem plausible that the query will hit many retries.

Now, both of these reproductions happened during cluster-to-cluster streaming (TODO: confirm that the CCT cluster had this), and - I'm speculating a bit here - perhaps C2C needs to read the system.jobs table periodically, providing a constant stream of reads that would force the low-priority UPDATE to hit a retry.

I still don't see a fundamental problem with c09860b; however, it seems prudent to introduce a limit on the number of internal retries performed by Exec{Ex} methods - this was envisioned as a best-effort to reduce the number of retriable errors propagated to the client as part of the fixes in #101477. Having no retry limit whatsoever there seems like an oversight.
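
As a rough illustration of what a retry limit could look like, here is a sketch of a bounded, loop-based retry helper. It is an assumption-laden toy, not the Exec{Ex} implementation: `execWithRetryLimit`, `failNTimes`, `errRetriable`, and the value of `maxInternalRetries` are all made up for this example.

```go
package main

import (
	"errors"
	"fmt"
)

// errRetriable is a hypothetical stand-in for a retriable error.
var errRetriable = errors.New("retriable error")

// maxInternalRetries is an assumed value chosen only for illustration.
const maxInternalRetries = 5

// execWithRetryLimit runs op and retries it on errRetriable at most
// maxInternalRetries times. It returns the number of attempts made and the
// final error; once the limit is hit the retriable error is propagated.
func execWithRetryLimit(op func() error) (attempts int, err error) {
	for {
		attempts++
		err = op()
		if err == nil || !errors.Is(err, errRetriable) {
			return attempts, err
		}
		if attempts > maxInternalRetries {
			// Best effort exhausted: let the caller see the error.
			return attempts, err
		}
	}
}

// failNTimes returns an operation that fails with errRetriable n times
// and succeeds afterwards.
func failNTimes(n int) func() error {
	return func() error {
		if n > 0 {
			n--
			return errRetriable
		}
		return nil
	}
}

func main() {
	attempts, err := execWithRetryLimit(failNTimes(2))
	fmt.Println("few retries:", attempts, err)

	attempts, err = execWithRetryLimit(failNTimes(1000))
	fmt.Println("too many retries:", attempts, err)
}
```

Because the helper loops instead of recursing, the stack stays flat regardless of the retry count, and the limit keeps the retry behavior best-effort rather than unbounded.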

srosenberg commented

The unfortunate thing is that only in go 1.21 we get the tail of the stacks, so we currently don't know what was the query that triggered this, and 23.2 uses 1.20.

Owing to this 1.21 runtime change?

When printing very deep stacks, the runtime now prints the first 50 (innermost) frames followed by the bottom 50 (outermost) frames, rather than just printing the first 100 frames.

I still don't see a fundamental problem with c09860b; however, it seems prudent to introduce a limit on the number of internal retries performed by Exec{Ex} methods - this was envisioned as a best-effort to reduce the number of retriable errors propagated to the client as part of the fixes in #101477. Having no retry limit whatsoever there seems like an oversight.

In addition to the above enhancement, would it make sense to have an explicit upper bound on the depth of rowsIterator.Next? We'd probably need to track depth explicitly, for performance reasons. Or maybe there is a better way to check this during plan construction? The upshot is future-proofing this type of issue where the iterator has no explicit bound.

yuzefovich commented Nov 15, 2023

Owing to this 1.21 runtime change?

When printing very deep stacks, the runtime now prints the first 50 (innermost) frames followed by the bottom 50 (outermost) frames, rather than just printing the first 100 frames.

Yes, precisely.
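
A toy model of the two printing strategies shows why the tail matters here. This is not the Go runtime's code: `firstOnly`, `headAndTail`, the frame names, and the elision marker text are illustrative assumptions. Only the 50/50 split and the old 100-frame cutoff come from the release note quoted above.

```go
package main

import "fmt"

// keep is the number of frames shown at each end of a very deep stack,
// taken from the release note quoted above.
const keep = 50

// firstOnly models the older behavior: only the first 100 (innermost)
// frames of a very deep stack are shown.
func firstOnly(frames []string) []string {
	if len(frames) <= 2*keep {
		return frames
	}
	return frames[:2*keep]
}

// headAndTail models the newer behavior: the first 50 (innermost) frames,
// an illustrative elision marker, then the bottom 50 (outermost) frames.
func headAndTail(frames []string) []string {
	if len(frames) <= 2*keep {
		return frames
	}
	out := make([]string, 0, 2*keep+1)
	out = append(out, frames[:keep]...)
	out = append(out, fmt.Sprintf("...%d frames elided...", len(frames)-2*keep))
	out = append(out, frames[len(frames)-keep:]...)
	return out
}

func main() {
	// Frames are ordered innermost first, as in a Go stack dump.
	frames := make([]string, 0, 1001)
	for i := 0; i < 1000; i++ {
		frames = append(frames, fmt.Sprintf("recursiveFrame#%d", i))
	}
	// The outermost frame is the one that would identify the caller.
	frames = append(frames, "callerThatIssuedTheQuery")

	old := firstOnly(frames)
	cur := headAndTail(frames)
	fmt.Println("last frame, first-100 strategy:", old[len(old)-1])
	fmt.Println("last frame, head-and-tail strategy:", cur[len(cur)-1])
}
```

With a deep recursive stack, the first strategy shows only repeated recursive frames, while the second also keeps the outermost frames, which is where the caller that issued the query would appear.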

In addition to the above enhancement, would it make sense to have an explicit upper bound on the depth of rowsIterator.Next? We'd probably need to track depth explicitly, for performance reasons. Or maybe there is a better way to check this during plan construction? The upshot is future-proofing this type of issue where the iterator has no explicit bound.

Yeah, I like that, thanks. I'll replace the limit on the number of retries with the depth limit (the former can be expressed via the latter). EDIT: replacing seems a bit tricky since a single retry corresponds to several levels of recursion, so I'll just add a constant limit on depth.
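
A constant depth limit tracked explicitly could look roughly like the sketch below. This is a hypothetical toy, not the change that closed this issue: `rowsIter`, `maxRecursionDepth` and its value, `errDepthExceeded`, and the `failuresLeft` field are all assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// maxRecursionDepth is an assumed constant; the real limit may differ.
const maxRecursionDepth = 1000

// errDepthExceeded is a hypothetical error returned when the limit is hit.
var errDepthExceeded = errors.New("recursion depth limit exceeded")

// rowsIter is a toy iterator that retries by re-entering Next.
type rowsIter struct {
	depth        int // current nesting level of Next calls
	failuresLeft int // simulated retriable failures still to come
}

// Next tracks its own nesting depth with a counter and returns an error
// instead of recursing further once the constant limit is exceeded.
func (r *rowsIter) Next() (bool, error) {
	r.depth++
	defer func() { r.depth-- }()
	if r.depth > maxRecursionDepth {
		return false, errDepthExceeded
	}
	if r.failuresLeft > 0 {
		r.failuresLeft--
		return r.Next() // simulated retry that nests another call
	}
	return true, nil
}

func main() {
	ok, err := (&rowsIter{failuresLeft: 10}).Next()
	fmt.Println("10 retries:", ok, err)

	ok, err = (&rowsIter{failuresLeft: 5000}).Next()
	fmt.Println("5000 retries:", ok, err)
}
```

An integer counter keeps the check cheap on the hot path, and the limit applies to any source of nesting, not just retries, which matches the future-proofing goal raised above.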
