Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Panic in MarshalTo(0xc13312b400, 0xc1541c14ec, 0x13, 0x13, 0x15, 0x2, 0x11) #35803

Closed
awoods187 opened this issue Mar 15, 2019 · 9 comments
Closed
Assignees
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-2-temp-unavailability Temp crashes or other availability problems. Can be worked around or resolved by restarting.

Comments

@awoods187
Copy link
Contributor

awoods187 commented Mar 15, 2019

I hit this on a 7 node 36 cluster running tpc-c 10k on master v19.1.0-beta.20190304-458-g70e3468

* ERROR: [n4,client=172.31.31.59:43580,user=root,txn=67663fee] a panic has occurred!
*
runtime error: index out of range

goroutine 43064141 [running]:
runtime/debug.Stack(0x39db080, 0xc112e8c390, 0xc000000003)
	/usr/local/go/src/runtime/debug/stack.go:24 +0xa7
github.com/cockroachdb/cockroach/pkg/util/log.ReportPanic(0x39db080, 0xc112e8c390, 0xc0001ab300, 0x2e831e0, 0x569d210, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/crash_reporting.go:212 +0xa6
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).Recover(0xc000372090, 0x39db080, 0xc112e8c390)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:183 +0xe4
panic(0x2e831e0, 0x569d210)
	/usr/local/go/src/runtime/panic.go:513 +0x1b9
github.com/cockroachdb/cockroach/pkg/roachpb.(*ScanRequest).MarshalTo(0xc13312b400, 0xc1541c14ec, 0x13, 0x13, 0x15, 0x2, 0x11)
	/go/src/github.com/cockroachdb/cockroach/pkg/roachpb/api.pb.go:10185 +0x17d
github.com/cockroachdb/cockroach/pkg/roachpb.(*RequestUnion_Scan).MarshalTo(0xc0b74b3328, 0xc1541c14ea, 0x15, 0x15, 0xf4f31e, 0xc0b74b3328, 0x17)
	/go/src/github.com/cockroachdb/cockroach/pkg/roachpb/api.pb.go:13287 +0xdf
github.com/cockroachdb/cockroach/pkg/roachpb.(*RequestUnion).MarshalTo(0xc153344838, 0xc1541c14ea, 0x15, 0x15, 0x17, 0xca, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/roachpb/api.pb.go:13188 +0x73
github.com/cockroachdb/cockroach/pkg/roachpb.(*BatchRequest).MarshalTo(0xc027b50f80, 0xc1541c1420, 0xdf, 0xdf, 0xdf, 0xdf, 0x323a3e0)
	/go/src/github.com/cockroachdb/cockroach/pkg/roachpb/api.pb.go:14600 +0x246
github.com/cockroachdb/cockroach/pkg/roachpb.(*BatchRequest).Marshal(0xc027b50f80, 0x323a3e0, 0xc027b50f80, 0x7fb509a2d278, 0xc027b50f80, 0xc12d8fa301)
	/go/src/github.com/cockroachdb/cockroach/pkg/roachpb/api.pb.go:14575 +0x7f
github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/encoding/proto.codec.Marshal(0x323a3e0, 0xc027b50f80, 0xc0d04d02e0, 0x3, 0xc0000ab470, 0xc0000ab400, 0x7fb50ab7c7f0)
	/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/encoding/proto/proto.go:70 +0x19c
github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc.encode(0x7fb50aa1ed38, 0x5b31a68, 0x323a3e0, 0xc027b50f80, 0xc017624870, 0x3a06e60, 0x39dbdc0, 0x6, 0x0)
	/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/rpc_util.go:487 +0x5e
github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc.(*csAttempt).sendMsg(0xc183977930, 0x323a3e0, 0xc027b50f80, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/stream.go:482 +0xca
github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc.(*clientStream).SendMsg(0xc027b51000, 0x323a3e0, 0xc027b50f80, 0xc000b04600, 0x32b2f93)
	/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/stream.go:403 +0x43
github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc.invoke(0x39db080, 0xc112e8c390, 0x32b2f93, 0x21, 0x323a3e0, 0xc027b50f80, 0x318bfe0, 0xc04003c4d0, 0xc000b04600, 0xc0052d4400, ...)
	/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/call.go:75 +0xfe
github.com/cockroachdb/cockroach/vendor/github.com/grpc-ecosystem/grpc-opentracing/go/otgrpc.OpenTracingClientInterceptor.func1(0x39db080, 0xc112e8c390, 0x32b2f93, 0x21, 0x323a3e0, 0xc027b50f80, 0x318bfe0, 0xc04003c4d0, 0xc000b04600, 0x33de350, ...)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/grpc-ecosystem/grpc-opentracing/go/otgrpc/client.go:47 +0xb49
github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc.(*ClientConn).Invoke(0xc000b04600, 0x39db080, 0xc112e8c390, 0x32b2f93, 0x21, 0x323a3e0, 0xc027b50f80, 0x318bfe0, 0xc04003c4d0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/call.go:35 +0x109
github.com/cockroachdb/cockroach/pkg/roachpb.(*internalClient).Batch(0xc1062bf2e0, 0x39db080, 0xc112e8c390, 0xc027b50f80, 0x0, 0x0, 0x0, 0xc112e8c390, 0x2, 0x39db080)
	/go/src/github.com/cockroachdb/cockroach/pkg/roachpb/api.pb.go:9324 +0xd2
github.com/cockroachdb/cockroach/pkg/kv.(*grpcTransport).sendBatch(0xc017624810, 0x39db080, 0xc112e8c390, 0x2, 0x39ae740, 0xc1062bf2e0, 0x0, 0x0, 0x200000002, 0x2, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/transport.go:199 +0x126
github.com/cockroachdb/cockroach/pkg/kv.(*grpcTransport).SendNext(0xc017624810, 0x39db080, 0xc112e8c390, 0x0, 0x0, 0x200000002, 0x2, 0x7a89, 0x0, 0xc0d9a72d00, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/transport.go:168 +0x130
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendToReplicas(0xc0009667e0, 0x39db080, 0xc112e8c390, 0xc000966830, 0x7a89, 0xc0e01999f0, 0x3, 0x3, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1365 +0x2d3
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendRPC(0xc0009667e0, 0x39db080, 0xc112e8c390, 0x7a89, 0xc0e01999f0, 0x3, 0x3, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:416 +0x244
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendSingleRange(0xc0009667e0, 0x39db080, 0xc112e8c390, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc0d9a72d00, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:496 +0x221
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendPartialBatch(0xc0009667e0, 0x39db080, 0xc112e8c390, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc0d9a72d00, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1141 +0x322
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendPartialBatchAsync.func1(0x39db080, 0xc112e8c390)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1063 +0x175
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunLimitedAsyncTask.func1(0xc000372090, 0x39db080, 0xc112e8c390, 0x32bec16, 0x24, 0xc000a342a0, 0x3a10ae0, 0xc0003a7f60, 0xc0358f1960)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:385 +0x110
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunLimitedAsyncTask
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:379 +0x23a

I'm pretty sure it's different than #34241 because the portion after MarshalTo is different.

cockroach.log

@awoods187 awoods187 added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-2-temp-unavailability Temp crashes or other availability problems. Can be worked around or resolved by restarting. labels Mar 15, 2019
@petermattis
Copy link
Collaborator

@awoods187 Please always include cockroach version so we can know what SHA you were running.

I'm pretty sure it's different than #34241 because the portion after MarshalTo is different.

The portion in parentheses are the arguments to MarshalTo. It is unsurprising that these will be different than what we saw in #34241. In fact, it would be extremely surprising if you saw identical values.

@awoods187
Copy link
Contributor Author

I added the version (it's today's master post the pr merge).

I thought that'a what @tbg said here: #34241 (comment)

@petermattis
Copy link
Collaborator

I thought that'a what @tbg said here: #34241 (comment)

I believe @tbg was saying it was a different instance of a panic, not that it was a different bug. Every instance of the panic should have a different receiver.

@awoods187
Copy link
Contributor Author

@tbg
Copy link
Member

tbg commented Mar 18, 2019

@nvanbenschoten it looks like this SHA, regrettably, has your fix. Could you take a look?

@nvanbenschoten
Copy link
Member

Yeah, I'm taking a look. I'm starting by adding back in the withMarshalingDebugging assertion.

@nvanbenschoten
Copy link
Member

@awoods187 just saw this again in #35890 (comment).

I added the withMarshalingDebugging assertion back in and ran the new kv50/enc=false/nodes=4/cpu=96/batch=64 roachtest against this. No luck. I then verified again that without the change to BatchResponse_Header.combine in 05a93a6#diff-05786eae3e4792be88fc07714423ca76R386, we hit the assertion immediately. That indicates that this issue is in some way different than what we ended up solving in #34241.

I'm going to add the extra assertion back in to master and try to catch this using tpcc, Andy Woods style.

craig bot pushed a commit that referenced this issue Mar 19, 2019
35910: Revert "roachpb: clone Txn object in BatchResponse_Header.combine" r=nvanbenschoten a=nvanbenschoten

This reverts the removal of the assertion in commit 05a93a6. The actual clone of the Txn object in `BatchResponse_Header.combine` is kept because that was again verified to fix _some_ issue in #35803 (comment).

Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
@nvanbenschoten
Copy link
Member

nvanbenschoten commented Mar 22, 2019

We saw this fire again in #32135 (comment).

panic: batch size 197 -> 195 bytes
re-marshaled protobuf:
00000000  0a a5 01 0a 00 12 06 08  02 10 02 18 02 18 f5 19  |................|
00000010  2a 8e 01 0a 22 0a 10 f1  bd 50 46 46 dc 4f 7e 88  |*..."....PFF.O~.|
00000020  bd 7e 27 6c 41 3e 81 2a  0a 08 80 a0 cc ed c0 83  |.~'lA>.*........|
00000030  83 c7 15 30 c6 bb 3b 12  07 75 6e 6e 61 6d 65 64  |...0..;..unnamed|
00000040  2a 0c 08 ec 86 d0 b9 b1  84 83 c7 15 10 3e 32 0a  |*............>2.|
00000050  08 80 a0 cc ed c0 83 83  c7 15 3a 0a 08 80 a0 cc  |..........:.....|
00000060  ed c0 83 83 c7 15 42 0e  08 02 12 0a 08 87 c9 9e  |......B.........|
00000070  8b b2 84 83 c7 15 42 10  08 04 12 0c 08 e9 ca de  |......B.........|
00000080  84 b2 84 83 c7 15 10 06  42 10 08 07 12 0c 08 e9  |........B.......|
00000090  ca de 84 b2 84 83 c7 15  10 3c 72 00 7a 00 80 01  |.........<r.z...|
000000a0  01 40 90 4e 50 01 58 08  12 19 3a 17 0a 13 1a 08  |[email protected]...:.....|
000000b0  c1 89 9a f8 01 48 08 88  22 07 c1 89 9b f8 01 5a  |.....H.."......Z|
000000c0  07 20 01                                          |. .|

original panic:  <nil>
 [recovered]
	panic: batch size 197 -> 195 bytes
re-marshaled protobuf:
00000000  0a a5 01 0a 00 12 06 08  02 10 02 18 02 18 f5 19  |................|
00000010  2a 8e 01 0a 22 0a 10 f1  bd 50 46 46 dc 4f 7e 88  |*..."....PFF.O~.|
00000020  bd 7e 27 6c 41 3e 81 2a  0a 08 80 a0 cc ed c0 83  |.~'lA>.*........|
00000030  83 c7 15 30 c6 bb 3b 12  07 75 6e 6e 61 6d 65 64  |...0..;..unnamed|
00000040  2a 0c 08 ec 86 d0 b9 b1  84 83 c7 15 10 3e 32 0a  |*............>2.|
00000050  08 80 a0 cc ed c0 83 83  c7 15 3a 0a 08 80 a0 cc  |..........:.....|
00000060  ed c0 83 83 c7 15 42 0e  08 02 12 0a 08 87 c9 9e  |......B.........|
00000070  8b b2 84 83 c7 15 42 10  08 04 12 0c 08 e9 ca de  |......B.........|
00000080  84 b2 84 83 c7 15 10 06  42 10 08 07 12 0c 08 e9  |........B.......|
00000090  ca de 84 b2 84 83 c7 15  10 3c 72 00 7a 00 80 01  |.........<r.z...|
000000a0  01 40 90 4e 50 01 58 08  12 19 3a 17 0a 13 1a 08  |[email protected]...:.....|
000000b0  c1 89 9a f8 01 48 08 88  22 07 c1 89 9b f8 01 5a  |.....H.."......Z|
000000c0  07 20 01                                          |. .|

original panic:  <nil>


goroutine 155643 [running]:
panic(0x2d98760, 0xc011d20b30)
	/usr/local/go/src/runtime/panic.go:556 +0x2cb fp=0xc00c613060 sp=0xc00c612fd0 pc=0x72b49b
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).divideAndSendBatchToRanges.func1(0xc00c614398, 0xc00c614688, 0xc00c6145c8, 0xc00c614680, 0xc00c6142ff, 0xc00c614380, 0xc00c614304)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:825 +0x50e fp=0xc00c613228 sp=0xc00c613060 pc=0x1596fde
runtime.call64(0x0, 0x33f8970, 0xc006c03370, 0x3800000038)
	/usr/local/go/src/runtime/asm_amd64.s:523 +0x3b fp=0xc00c613278 sp=0xc00c613228 pc=0x75a0eb
panic(0x2d98760, 0xc011d20b30)
	/usr/local/go/src/runtime/panic.go:513 +0x1b9 fp=0xc00c613308 sp=0xc00c613278 pc=0x72b389
github.com/cockroachdb/cockroach/pkg/kv.withMarshalingDebugging.func1(0xc00914a300, 0xc5)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/transport.go:181 +0x264 fp=0xc00c6133d8 sp=0xc00c613308 pc=0x15992f4
github.com/cockroachdb/cockroach/pkg/kv.withMarshalingDebugging(0x39fcf80, 0xc0104c2420, 0x0, 0x0, 0x200000002, 0x2, 0xcf5, 0x0, 0xc00b310a00, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/transport.go:185 +0xac fp=0xc00c613410 sp=0xc00c6133d8 pc=0x158555c
github.com/cockroachdb/cockroach/pkg/kv.(*grpcTransport).SendNext(0xc0104c23f0, 0x39fcf80, 0xc00f95e3f0, 0x0, 0x0, 0x200000002, 0x2, 0xcf5, 0x0, 0xc00b310a00, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/transport.go:202 +0x211 fp=0xc00c6135f0 sp=0xc00c613410 pc=0x15857c1
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendToReplicas(0xc00068a000, 0x39fcf80, 0xc00f95e3f0, 0xc00068a050, 0xcf5, 0xc00eab4690, 0x3, 0x3, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1365 +0x2d3 fp=0xc00c613998 sp=0xc00c6135f0 pc=0x157b563
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendRPC(0xc00068a000, 0x39fcf80, 0xc00f95e3f0, 0xcf5, 0xc00eab4690, 0x3, 0x3, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:416 +0x244 fp=0xc00c613ab8 sp=0xc00c613998 pc=0x1575824
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendSingleRange(0xc00068a000, 0x39fcf80, 0xc00f95e3f0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc00b310a00, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:496 +0x221 fp=0xc00c613c38 sp=0xc00c613ab8 pc=0x1575da1
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendPartialBatch(0xc00068a000, 0x39fcf80, 0xc00f95e3f0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc00b310a00, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1141 +0x322 fp=0xc00c6141e8 sp=0xc00c613c38 pc=0x15796b2
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).divideAndSendBatchToRanges(0xc00068a000, 0x39fcf80, 0xc00f95e3f0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc00b310a00, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:962 +0x8b3 fp=0xc00c6145b0 sp=0xc00c6141e8 pc=0x15784c3
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).Send(0xc00068a000, 0x39fcf80, 0xc00f95e3f0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc00df8ce00, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:710 +0x48b fp=0xc00c614898 sp=0xc00c6145b0 pc=0x157735b
github.com/cockroachdb/cockroach/pkg/kv.(*txnLockGatekeeper).SendLocked(0xc00586bf10, 0x39fcf80, 0xc00f95e3f0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc00df8ce00, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_coord_sender.go:234 +0xe8 fp=0xc00c614950 sp=0xc00c614898 pc=0x1586838
github.com/cockroachdb/cockroach/pkg/kv.(*txnSpanRefresher).sendLockedWithRefreshAttempts(0xc00586be48, 0x39fcf80, 0xc00f95e3f0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc00df8ce00, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_interceptor_span_refresher.go:144 +0x83 fp=0xc00c614a30 sp=0xc00c614950 pc=0x1593e13
github.com/cockroachdb/cockroach/pkg/kv.(*txnSpanRefresher).SendLocked(0xc00586be48, 0x39fcf80, 0xc00f95e3f0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc00df8ce00, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_interceptor_span_refresher.go:100 +0xf9 fp=0xc00c614b08 sp=0xc00c614a30 pc=0x1593a09
github.com/cockroachdb/cockroach/pkg/kv.(*txnPipeliner).SendLocked(0xc00586bdc0, 0x39fcf80, 0xc00f95e3f0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc00df8ce00, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_interceptor_pipeliner.go:183 +0xf9 fp=0xc00c614c20 sp=0xc00c614b08 pc=0x1591a09
github.com/cockroachdb/cockroach/pkg/kv.(*txnSeqNumAllocator).SendLocked(0xc00586bd68, 0x39fcf80, 0xc00f95e3f0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc00df8ce00, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_interceptor_seq_num_allocator.go:92 +0x23b fp=0xc00c614d50 sp=0xc00c614c20 pc=0x15937cb
github.com/cockroachdb/cockroach/pkg/kv.(*TxnCoordSender).Send(0xc00586bb00, 0x39fcf80, 0xc00f95e3f0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_coord_sender.go:780 +0x591 fp=0xc00c615098 sp=0xc00c614d50 pc=0x1589de1
github.com/cockroachdb/cockroach/pkg/internal/client.(*DB).sendUsingSender(0xc0007a9200, 0x39fcf80, 0xc00f95e3c0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/internal/client/db.go:622 +0x119 fp=0xc00c615168 sp=0xc00c615098 pc=0x1175459
github.com/cockroachdb/cockroach/pkg/internal/client.(*Txn).Send(0xc009ee4ab0, 0x39fcf80, 0xc00f95e3c0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/internal/client/txn.go:791 +0x13c fp=0xc00c6152c8 sp=0xc00c615168 pc=0x117f5fc
github.com/cockroachdb/cockroach/pkg/sql/row.(*txnKVFetcher).fetch(0xc00a72c8f0, 0x39fcf80, 0xc00f95e3c0, 0xc00778ecf0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/row/kv_batch_fetcher.go:242 +0x626 fp=0xc00c6157c8 sp=0xc00c6152c8 pc=0x1c3bd96
github.com/cockroachdb/cockroach/pkg/sql/row.(*txnKVFetcher).nextBatch(0xc00a72c8f0, 0x39fcf80, 0xc00f95e3c0, 0x7fac61f486c0, 0x7a, 0xc007a4aa90, 0x70b01f, 0xc0116c0000, 0x5000, 0x4b08, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/row/kv_batch_fetcher.go:326 +0x1dd fp=0xc00c6159c8 sp=0xc00c6157c8 pc=0x1c3c86d
github.com/cockroachdb/cockroach/pkg/sql/row.(*kvFetcher).nextKV(0xc01010bcd8, 0x39fcf80, 0xc00f95e3c0, 0xc00eb60540, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/row/kv_fetcher.go:72 +0x113 fp=0xc00c615b20 sp=0xc00c6159c8 pc=0x1c3d0c3
github.com/cockroachdb/cockroach/pkg/sql/row.(*Fetcher).NextKey(0xc01010bca0, 0x39fcf80, 0xc00f95e3c0, 0x0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/row/fetcher.go:490 +0x82 fp=0xc00c615c20 sp=0xc00c615b20 pc=0x1c2dec2
github.com/cockroachdb/cockroach/pkg/sql/row.(*Fetcher).StartScanFrom(0xc01010bca0, 0x39fcf80, 0xc00f95e3c0, 0x39d7380, 0xc00a72c8f0, 0x0, 0xe39201)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/row/fetcher.go:480 +0x97 fp=0xc00c615c60 sp=0xc00c615c20 pc=0x1c2ddc7
github.com/cockroachdb/cockroach/pkg/sql/row.(*Fetcher).StartScan(0xc01010bca0, 0x39fcf80, 0xc00f95e3c0, 0xc009ee4ab0, 0xc00b927000, 0x157, 0x157, 0x1, 0x0, 0x2fc2100, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/row/fetcher.go:471 +0x1e7 fp=0xc00c615e78 sp=0xc00c615c60 pc=0x1c2dc87
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*tableReader).Start(0xc01010b800, 0x39fcec0, 0xc009289340, 0xc000030d00, 0xc000a93f70)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/tablereader.go:253 +0x271 fp=0xc00c615f20 sp=0xc00c615e78 pc=0x1e04ee1
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*samplerProcessor).Run(0xc0119d0a80, 0x39fcec0, 0xc009289340)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/sampler.go:165 +0x5b fp=0xc00c615fa0 sp=0xc00c615f20 pc=0x1dea6db
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*Flow).startInternal.func1(0xc00abeb4a0, 0xc011023840, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/flow.go:567 +0x67 fp=0xc00c615fc8 sp=0xc00c615fa0 pc=0x1e13c27
runtime.goexit()

Unfortunately, we were only logging the new proto (size 195) because the extra protobuf marshalling present in Tobi's fork was too expensive to merge into master. The protobuf we do see looked like:

roachpb.BatchRequest{
                Header: roachpb.Header{
                    Timestamp: hlc.Timestamp{},
                    Replica:   roachpb.ReplicaDescriptor{
                        NodeID:               2,
                        StoreID:              2,
                        ReplicaID:            2,
                        XXX_NoUnkeyedLiteral: struct {}{},
                        XXX_sizecache:        0,
                    },
                    RangeID:      3317,
                    UserPriority: 0,
                    Txn:          &roachpb.Transaction{
                        TxnMeta: enginepb.TxnMeta{
                            ID:        {0xf1, 0xbd, 0x50, 0x46, 0x46, 0xdc, 0x4f, 0x7e, 0x88, 0xbd, 0x7e, 0x27, 0x6c, 0x41, 0x3e, 0x81},
                            Key:       nil,
                            Epoch:     0x0,
                            Timestamp: hlc.Timestamp{
                                WallTime:             1553192236117856256,
                                Logical:              0,
                                XXX_NoUnkeyedLiteral: struct {}{},
                                XXX_sizecache:        0,
                            },
                            Priority:             974278,
                            Sequence:             0,
                            XXX_NoUnkeyedLiteral: struct {}{},
                            XXX_sizecache:        0,
                        },
                        Name:          "unnamed",
                        Status:        0,
                        LastHeartbeat: hlc.Timestamp{
                            WallTime:             1553192266342073196,
                            Logical:              62,
                            XXX_NoUnkeyedLiteral: struct {}{},
                            XXX_sizecache:        0,
                        },
                        OrigTimestamp: hlc.Timestamp{
                            WallTime:             1553192236117856256,
                            Logical:              0,
                            XXX_NoUnkeyedLiteral: struct {}{},
                            XXX_sizecache:        0,
                        },
                        MaxTimestamp: hlc.Timestamp{
                            WallTime:             1553192236117856256,
                            Logical:              0,
                            XXX_NoUnkeyedLiteral: struct {}{},
                            XXX_sizecache:        0,
                        },
                        RefreshedTimestamp: hlc.Timestamp{},
                        ObservedTimestamps: {
                            {
                                NodeID:    2,
                                Timestamp: hlc.Timestamp{
                                    WallTime:             1553192266513228935,
                                    Logical:              0,
                                    XXX_NoUnkeyedLiteral: struct {}{},
                                    XXX_sizecache:        0,
                                },
                                XXX_NoUnkeyedLiteral: struct {}{},
                                XXX_sizecache:        0,
                            },
                            {
                                NodeID:    4,
                                Timestamp: hlc.Timestamp{
                                    WallTime:             1553192266499597673,
                                    Logical:              6,
                                    XXX_NoUnkeyedLiteral: struct {}{},
                                    XXX_sizecache:        0,
                                },
                                XXX_NoUnkeyedLiteral: struct {}{},
                                XXX_sizecache:        0,
                            },
                            {
                                NodeID:    7,
                                Timestamp: hlc.Timestamp{
                                    WallTime:             1553192266499597673,
                                    Logical:              60,
                                    XXX_NoUnkeyedLiteral: struct {}{},
                                    XXX_sizecache:        0,
                                },
                                XXX_NoUnkeyedLiteral: struct {}{},
                                XXX_sizecache:        0,
                            },
                        },
                        DeprecatedWriting:        false,
                        WriteTooOld:              false,
                        Intents:                  nil,
                        InFlightWrites:           {},
                        EpochZeroTimestamp:       hlc.Timestamp{},
                        OrigTimestampWasObserved: true,
                        XXX_NoUnkeyedLiteral:     struct {}{},
                        XXX_sizecache:            0,
                    },
                    ReadConsistency:      0,
                    MaxSpanRequestKeys:   10000,
                    DistinctSpans:        false,
                    ReturnRangeInfo:      true,
                    GatewayNodeID:        8,
                    ScanOptions:          (*roachpb.ScanOptions)(nil),
                    AsyncConsensus:       false,
                    XXX_NoUnkeyedLiteral: struct {}{},
                    XXX_sizecache:        0,
                },
                Requests: {
                    {
                        Value: &roachpb.RequestUnion_Scan{
                            Scan: &roachpb.ScanRequest{
                                RequestHeader: roachpb.RequestHeader{
                                    Key:                  {0xc1, 0x89, 0x9a, 0xf8, 0x1, 0x48, 0x8, 0x88},
                                    EndKey:               {0xc1, 0x89, 0x9b, 0xf8, 0x1, 0x5a, 0x7},
                                    Sequence:             0,
                                    XXX_NoUnkeyedLiteral: struct {}{},
                                    XXX_sizecache:        0,
                                },
                                ScanFormat:           1,
                                XXX_NoUnkeyedLiteral: struct {}{},
                                XXX_sizecache:        0,
                            },
                        },
                        XXX_NoUnkeyedLiteral: struct {}{},
                        XXX_sizecache:        0,
                    },
                },
                XXX_NoUnkeyedLiteral: struct {}{},
                XXX_sizecache:        0,
            }

One interesting thing that we see from the stack trace is that this is a panic on the non-sendPartialBatchAsync path. This tells us two things:

  1. this issue is not related to 05a93a6, which removed a data race caused by aliasing between BatchRequests and BatchResponses. We know this because the sendPartialBatch is synchronous, so the BatchResponse could not have been created yet.
  2. we see further up the stack and can tell that this is a leaf TxnCoordSender and not the original TxnCoordSender. I went back to Panic runtime error: index out of range when running 6 node tpc-c without range merging #34241 and confirmed that this is the case in two two cases where we have a stacktrace as well.

I think the second point is key. We had been assuming that a BatchRequest that hit the local RPC fast-path (i.e. did not serialize its proto) could never add a brand new observed timestamp here. We made this assumption because a new transaction always starts with an observed timestamp for its local node. But this isn't true for leaf transactions.

What this means is that it doesn't seem too crazy that we could be getting all the way to here on a local RPC:

cockroach/pkg/roachpb/data.go

Lines 1767 to 1770 in 7b689f2

s = append(s, ObservedTimestamp{})
copy(s[i+1:], s[i:])
s[i] = ObservedTimestamp{NodeID: nodeID, Timestamp: timestamp}
return s

We also have a bit more evidence to support this. In #34241 (comment), we saw that

before sending, the batch has ObservedTimestamps entries for n1-n3, n5-n7. After sending, it has n1-n6.

Take a look at which node crashed - node 4! The node whose observed timestamp was added between the original marshalling and the new marshalling of the BatchRequest.

Conveniently, we actually already have the solution for this: #35762. I think we should leave the assertion in for a bit longer and also push to get that PR into 19.1. How do you feel about that @tbg? We could also restrict the backport to just the update to observedTimestampSlice.update.

@tbg
Copy link
Member

tbg commented Mar 22, 2019

How do you feel about that @tbg?

Let's do it, thanks for figuring this out!

craig bot pushed a commit that referenced this issue Mar 25, 2019
35762: roachpb: refine Transaction proto cloning r=nvanbenschoten a=nvanbenschoten

Fixes #35803.

This PR includes the final two commits from #35719.

> By making Transaction.ObservedTimestamps immutable (which it almost already was), we can prohibit all interior mutability of references within Transaction, give it value semantics, and eliminate the distinction between "shallow" and "deep" object cloning. This reduces the cost of a clone to a single straightforward allocation and makes working with the object easier to think about.

cc. @tbg 

Co-authored-by: Nathan VanBenschoten <[email protected]>
@craig craig bot closed this as completed in #35762 Mar 25, 2019
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue May 17, 2019
This debugging function was removed when we fixed cockroachdb#34241 and added back
in when cockroachdb#35803 appeared because it was clear that we hadn't fully fixed
the issue. It's been about 2 months since cockroachdb#35762 merged and we haven't
seen any issues since, so this can now be removed.

I don't think we meant to keep this in for the 19.1 release. We should
backport this commit.

Release note: None
craig bot pushed a commit that referenced this issue May 17, 2019
37559: kv: remove withMarshalingDebugging function r=nvanbenschoten a=nvanbenschoten

This debugging function was removed when we fixed #34241 and added back
in when #35803 appeared because it was clear that we hadn't fully fixed
the issue. It's been about 2 months since #35762 merged and we haven't
seen any issues since, so this can now be removed.

I don't think we meant to keep this in for the 19.1 release. We should
backport this commit.

Co-authored-by: Nathan VanBenschoten <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-2-temp-unavailability Temp crashes or other availability problems. Can be worked around or resolved by restarting.
Projects
None yet
Development

No branches or pull requests

4 participants