Node Crash during YCSB Multi Region tests #35473

Closed
drewdeally opened this issue Mar 6, 2019 · 15 comments
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting

Comments

@drewdeally

./cockroach version
Build Tag: v2.2.0-alpha.20190211
Build Time: 2019/02/07 23:44:57
Distribution: CCL
Platform: linux amd64 (x86_64-unknown-linux-gnu)
Go Version: go1.11.4
C Compiler: gcc 6.3.0
Build SHA-1: 4bdf0ad
Build Type: release

RoachProd

roachprod list -d -m 
drew-demo: [gce] 6h21m9s remaining
  drew-demo-0001	drew-demo-0001.northamerica-northeast1-a.cockroach-ephemeral	10.162.0.3	35.203.28.217
  drew-demo-0002	drew-demo-0002.northamerica-northeast1-a.cockroach-ephemeral	10.162.0.2	35.203.30.45
  drew-demo-0003	drew-demo-0003.northamerica-northeast1-a.cockroach-ephemeral	10.162.0.4	35.203.12.11
  drew-demo-0004	drew-demo-0004.southamerica-east1-a.cockroach-ephemeral	10.158.0.4	35.198.63.142
  drew-demo-0005	drew-demo-0005.southamerica-east1-a.cockroach-ephemeral	10.158.0.3	35.198.61.141
  drew-demo-0006	drew-demo-0006.southamerica-east1-a.cockroach-ephemeral	10.158.0.2	35.198.5.203
  drew-demo-0007	drew-demo-0007.us-west2-a.cockroach-ephemeral	10.168.0.28	35.236.99.56
  drew-demo-0008	drew-demo-0008.us-west2-a.cockroach-ephemeral	10.168.0.27	35.236.60.47
  drew-demo-0009	drew-demo-0009.us-west2-a.cockroach-ephemeral	10.168.0.26	35.236.25.176
  drew-demo-0010	drew-demo-0010.us-east4-a.cockroach-ephemeral	10.150.15.216	35.245.55.140
  drew-demo-0011	drew-demo-0011.us-east4-a.cockroach-ephemeral	10.150.15.214	35.236.245.67
  drew-demo-0012	drew-demo-0012.us-east4-a.cockroach-ephemeral	10.150.15.215	35.245.10.108
  drew-demo-0013	drew-demo-0013.us-central1-a.cockroach-ephemeral	10.128.0.58	35.222.37.96
  drew-demo-0014	drew-demo-0014.us-central1-a.cockroach-ephemeral	10.128.0.59	35.184.91.6
  drew-demo-0015	drew-demo-0015.us-central1-a.cockroach-ephemeral	10.128.0.61	35.192.112.19
  drew-demo-0016	drew-demo-0016.us-central1-b.cockroach-ephemeral	10.128.0.63	35.188.165.190
  drew-demo-0017	drew-demo-0017.us-central1-b.cockroach-ephemeral	10.128.0.62	35.188.116.219
  drew-demo-0018	drew-demo-0018.us-central1-b.cockroach-ephemeral	10.128.0.64	35.192.72.133

Node 2 abended (terminated abnormally) and another node went suspect. All logs are attached:

1.cockroach.log
2.cockroach.log
3.cockroach.log
4.cockroach.log
5.cockroach.log
6.cockroach.log
7.cockroach.log
8.cockroach.log
9.cockroach.log
10.cockroach.log
11.cockroach.log
12.cockroach.log

image

@drewdeally
Author

Node 3 panicked
drew-demo: status 18/18
1: cockroach-v2.2.0-alpha.20190211 23291
2: not running
3: not running
4: cockroach-v2.2.0-alpha.20190211 21784
5: cockroach-v2.2.0-alpha.20190211 21820
6: cockroach-v2.2.0-alpha.20190211 21784
7: cockroach-v2.2.0-alpha.20190211 21746
8: cockroach-v2.2.0-alpha.20190211 21771
9: cockroach-v2.2.0-alpha.20190211 21747
10: cockroach-v2.2.0-alpha.20190211 21734
11: cockroach-v2.2.0-alpha.20190211 21689
12: cockroach-v2.2.0-alpha.20190211 21761
13: not running
14: not running
15: not running
16: not running
17: not running
18: not running

3.cockroach.log

@awoods187 added the C-bug and S-1-stability labels on Mar 6, 2019

@bdarnell
Contributor

bdarnell commented Mar 6, 2019

There's no panic in 3.cockroach.log. Did you post the right file? Why are nodes 13-18 not running?

2.cockroach.log (which is really node 5) shows an OOM (caught by the Go allocator, not the kernel OOM killer). The immediate cause appears to be memory allocated while marshaling a BatchResponse:

W190306 19:27:40.745404 298146 vendor/google.golang.org/grpc/clientconn.go:1440  grpc: addrConn.transportMonitor exits due to: context canceled
I190306 19:27:42.042009 268 gossip/gossip.go:557  [n5] gossip status (ok, 12 nodes)
gossip client (2/3 cur/max conns)
  3: drew-demo-0007:26257 (4h18m0s: infos 42537/201983 sent/received, bytes 13042678B/60512449B sent/received)
  4: drew-demo-0009:26257 (2h18m0s: infos 97296/86938 sent/received, bytes 27562966B/37018116B sent/received)
gossip server (0/3 cur/max conns, infos 139839/288949 sent/received, bytes 40607010B/97543112B sent/received)
gossip connectivity
  n7 [sentinel];
  n1 -> n7; n4 -> n9; n5 -> n3; n5 -> n4; n7 -> n2; n8 -> n2; n9 -> n7; n10 -> n2; n11 -> n3; n11 -> n4; n12 -> n4; n12 -> n10;
W190306 19:27:42.058025 272 server/node.go:883  [n5,summaries] health alerts detected: {Alerts:[{StoreID:5 Category:METRICS Description:ranges.underreplicated Value:136 XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}] XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}
fatal error: runtime: out of memory

runtime stack:
runtime.throw(0x3146f8a, 0x16)
        /usr/local/go/src/runtime/panic.go:608 +0x72
runtime.sysMap(0xc0f4000000, 0x4000000, 0x59528d8)
        /usr/local/go/src/runtime/mem_linux.go:156 +0xc7
runtime.(*mheap).sysAlloc(0x590ebe0, 0x4000000, 0x27b3964, 0x7ff9c31fe9b8)
        /usr/local/go/src/runtime/malloc.go:619 +0x1c7
runtime.(*mheap).grow(0x590ebe0, 0x50e, 0x0)
        /usr/local/go/src/runtime/mheap.go:920 +0x42
runtime.(*mheap).allocSpanLocked(0x590ebe0, 0x50e, 0x59528e8, 0x2594d0e)
        /usr/local/go/src/runtime/mheap.go:848 +0x337
runtime.(*mheap).alloc_m(0x590ebe0, 0x50e, 0x3870101, 0x29f29ca)
        /usr/local/go/src/runtime/mheap.go:692 +0x119
runtime.(*mheap).alloc.func1()
        /usr/local/go/src/runtime/mheap.go:759 +0x4c
runtime.(*mheap).alloc(0x590ebe0, 0x50e, 0x2010101, 0xc0a0c74588)
        /usr/local/go/src/runtime/mheap.go:758 +0x8a
runtime.largeAlloc(0xa1bb2c, 0x200730101, 0xc00093e000)
        /usr/local/go/src/runtime/malloc.go:1019 +0x97
runtime.mallocgc.func1()
        /usr/local/go/src/runtime/malloc.go:914 +0x46
runtime.systemstack(0x7ff9cfaadcd0)
        /usr/local/go/src/runtime/asm_amd64.s:351 +0x66
runtime.mstart()
        /usr/local/go/src/runtime/proc.go:1229

goroutine 298323 [running]:
runtime.systemstack_switch()
        /usr/local/go/src/runtime/asm_amd64.s:311 fp=0xc0a0c757a8 sp=0xc0a0c757a0 pc=0x737620
runtime.mallocgc(0xa1bb2c, 0x2c42940, 0xc047986c01, 0x0)
        /usr/local/go/src/runtime/malloc.go:913 +0x896 fp=0xc0a0c75848 sp=0xc0a0c757a8 pc=0x6e90c6
runtime.makeslice(0x2c42940, 0xa1bb2c, 0xa1bb2c, 0xc0a0c758c0, 0x6e70c3, 0x2d7f0c0)
        /usr/local/go/src/runtime/slice.go:70 +0x77 fp=0xc0a0c75878 sp=0xc0a0c75848 pc=0x7201a7
github.com/cockroachdb/cockroach/pkg/roachpb.(*BatchResponse).Marshal(0xc008532a80, 0x3043100, 0xc008532a80, 0x7ff9cfc27dd0, 0xc008532a80, 0x14f5801)
        /go/src/github.com/cockroachdb/cockroach/pkg/roachpb/api.pb.go:14562 +0x49 fp=0xc0a0c758d0 sp=0xc0a0c75878 pc=0xf24179
github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/encoding/proto.codec.Marshal(0x3043100, 0xc008532a80, 0x8, 0x30eeb80, 0xc0240476c0, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/encoding/proto/proto.go:70 +0x19c fp=0xc0a0c75950 sp=0xc0a0c758d0 pc=0xcd80fc
github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/encoding/proto.(*codec).Marshal(0x594f180, 0x3043100, 0xc008532a80, 0x7ff9d0a3f3a8, 0xc0890f2f60, 0xc005416d80, 0x0, 0xc005f97a70)
        <autogenerated>:1 +0x46 fp=0xc0a0c75998 sp=0xc0a0c75950 pc=0xcd8876
github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc.encode(0x7ff9d0a3f3a8, 0x594f180, 0x3043100, 0xc008532a80, 0x594f180, 0x14f5d16, 0x38d0f80, 0xc057be4030, 0x30eeb80)
        /go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/rpc_util.go:487 +0x5e fp=0xc0a0c75a18 sp=0xc0a0c75998 pc=0xce790e
github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc.(*Server).sendResponse(0xc000a4f340, 0x38f9740, 0xc0052c9200, 0xc059225000, 0x3043100, 0xc008532a80, 0x0, 0x0, 0xc00bd0f6fc, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/server.go:830 +0x86 fp=0xc0a0c75b20 sp=0xc0a0c75a18 pc=0xcec576
github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc.(*Server).processUnaryRPC(0xc000a4f340, 0x38f9740, 0xc0052c9200, 0xc059225000, 0xc000ab11d0, 0x54c38c0, 0x0, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/server.go:1036 +0x5b0 fp=0xc0a0c75dc8 sp=0xc0a0c75b20 pc=0xced310
github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc.(*Server).handleStream(0xc000a4f340, 0x38f9740, 0xc0052c9200, 0xc059225000, 0x0)
        /go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/server.go:1249 +0x1311 fp=0xc0a0c75f80 sp=0xc0a0c75dc8 pc=0xcf0ee1
github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc.(*Server).serveStreams.func1.1(0xc005b98300, 0xc000a4f340, 0x38f9740, 0xc0052c9200, 0xc059225000)
        /go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/server.go:680 +0x9f fp=0xc0a0c75fb8 sp=0xc0a0c75f80 pc=0xcf835f
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1333 +0x1 fp=0xc0a0c75fc0 sp=0xc0a0c75fb8 pc=0x739701
created by github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc.(*Server).serveStreams.func1
        /go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/server.go:678 +0xa1
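
The hexadecimal arguments in the trace hint at the sizes involved. Reading them as byte counts is an assumption about how Go 1.11 prints call arguments (the first argument of `runtime.mallocgc` and the second of `runtime.sysMap`), and `hexToMiB` is a hypothetical helper written for this note, not part of CockroachDB:

```go
package main

import (
	"fmt"
	"strconv"
)

// hexToMiB converts a hexadecimal byte count, as it appears in a Go
// stack trace argument list, to MiB. Hypothetical helper for illustration.
func hexToMiB(h string) (float64, error) {
	// Base 0 lets ParseUint honor the "0x" prefix.
	n, err := strconv.ParseUint(h, 0, 64)
	if err != nil {
		return 0, err
	}
	return float64(n) / (1 << 20), nil
}

func main() {
	// Values copied from the trace above; their meaning is assumed.
	for _, arg := range []string{"0xa1bb2c", "0x4000000"} {
		mib, err := hexToMiB(arg)
		if err != nil {
			fmt.Println("parse error:", err)
			continue
		}
		fmt.Printf("%s = %.1f MiB\n", arg, mib)
	}
}
```

Under that reading, the allocation that failed was a buffer of roughly 10.1 MiB for the marshaled BatchResponse, requested while the runtime was trying to map another 64 MiB of heap. That would make the marshal the last straw rather than the source of the memory growth.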

Memory (go alloc) was high prior to the crash:

I190306 19:25:46.839143 270 server/status/runtime.go:464  [n5] runtime stats: 3.1 GiB RSS, 374 goroutines, 224 MiB/21 MiB/396 MiB GO alloc/idle/total, 1.7 GiB/2.3 GiB CGO alloc/total, 27489.1 CGO/sec, 192.4/53.8 %(u/s)time, 0.0 %gc (9x), 41 MiB/103 MiB (r/w)net
I190306 19:25:56.842300 270 server/status/runtime.go:464  [n5] runtime stats: 4.0 GiB RSS, 383 goroutines, 305 MiB/828 MiB/1.3 GiB GO alloc/idle/total, 1.7 GiB/2.2 GiB CGO alloc/total, 28022.2 CGO/sec, 220.8/72.2 %(u/s)time, 0.4 %gc (11x), 110 MiB/284 MiB (r/w)net
I190306 19:26:06.845300 270 server/status/runtime.go:464  [n5] runtime stats: 4.6 GiB RSS, 383 goroutines, 1.0 GiB/1.2 GiB/2.3 GiB GO alloc/idle/total, 1.8 GiB/2.2 GiB CGO alloc/total, 27281.0 CGO/sec, 196.0/60.2 %(u/s)time, 0.0 %gc (3x), 278 MiB/232 MiB (r/w)net
I190306 19:26:16.848260 270 server/status/runtime.go:464  [n5] runtime stats: 5.4 GiB RSS, 383 goroutines, 2.1 GiB/635 MiB/2.9 GiB GO alloc/idle/total, 1.7 GiB/2.2 GiB CGO alloc/total, 24753.2 CGO/sec, 190.2/53.4 %(u/s)time, 0.1 %gc (2x), 658 MiB/103 MiB (r/w)net
I190306 19:26:26.851628 270 server/status/runtime.go:464  [n5] runtime stats: 6.0 GiB RSS, 379 goroutines, 3.1 GiB/194 MiB/3.4 GiB GO alloc/idle/total, 1.7 GiB/2.2 GiB CGO alloc/total, 28524.0 CGO/sec, 193.5/58.4 %(u/s)time, 0.0 %gc (1x), 236 MiB/53 MiB (r/w)net
I190306 19:26:36.891272 270 server/status/runtime.go:464  [n5] runtime stats: 6.4 GiB RSS, 379 goroutines, 3.3 GiB/105 MiB/3.6 GiB GO alloc/idle/total, 1.8 GiB/2.3 GiB CGO alloc/total, 25194.9 CGO/sec, 168.5/52.8 %(u/s)time, 0.0 %gc (1x), 218 MiB/47 MiB (r/w)net
I190306 19:27:36.907500 270 server/status/runtime.go:464  [n5] runtime stats: 6.4 GiB RSS, 385 goroutines, 2.6 GiB/1.1 GiB/3.8 GiB GO alloc/idle/total, 1.8 GiB/2.3 GiB CGO alloc/total, 168.6 CGO/sec, 2.6/1.5 %(u/s)time, 0.0 %gc (0x), 332 KiB/218 KiB (r/w)net
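
To see the growth at a glance, the RSS and Go allocation figures can be pulled out of these lines. The following is a minimal Go sketch that assumes the `runtime stats:` line format shown above; `memSample`, `toMiB`, and `parseRuntimeStats` are hypothetical names, not CockroachDB code:

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// memSample holds the figures of interest from one "runtime stats" line.
// Hypothetical type for illustration only.
type memSample struct {
	Time       string  // wall-clock time from the log prefix
	RSSMiB     float64 // resident set size, in MiB
	GoAllocMiB float64 // first figure of "GO alloc/idle/total", in MiB
}

// statsRE captures the log time, the RSS value and unit, and the Go alloc
// value and unit. It assumes sizes are printed in MiB or GiB.
var statsRE = regexp.MustCompile(
	`^\S+ (\S+) .*runtime stats: ([\d.]+) (MiB|GiB) RSS, \d+ goroutines, ([\d.]+) (MiB|GiB)/`)

// toMiB normalizes a value printed in MiB or GiB to MiB.
func toMiB(v float64, unit string) float64 {
	if unit == "GiB" {
		return v * 1024
	}
	return v
}

// parseRuntimeStats extracts a memSample from one log line. The boolean is
// false when the line is not a runtime stats line in the assumed format.
func parseRuntimeStats(line string) (memSample, bool) {
	m := statsRE.FindStringSubmatch(line)
	if m == nil {
		return memSample{}, false
	}
	rss, err1 := strconv.ParseFloat(m[2], 64)
	alloc, err2 := strconv.ParseFloat(m[4], 64)
	if err1 != nil || err2 != nil {
		return memSample{}, false
	}
	return memSample{
		Time:       m[1],
		RSSMiB:     toMiB(rss, m[3]),
		GoAllocMiB: toMiB(alloc, m[5]),
	}, true
}

func main() {
	// Two of the lines quoted above, truncated after the Go alloc figures.
	lines := []string{
		"I190306 19:25:46.839143 270 server/status/runtime.go:464  [n5] runtime stats: 3.1 GiB RSS, 374 goroutines, 224 MiB/21 MiB/396 MiB GO alloc/idle/total",
		"I190306 19:26:36.891272 270 server/status/runtime.go:464  [n5] runtime stats: 6.4 GiB RSS, 379 goroutines, 3.3 GiB/105 MiB/3.6 GiB GO alloc/idle/total",
	}
	for _, l := range lines {
		if s, ok := parseRuntimeStats(l); ok {
			fmt.Printf("%s rss=%.0f MiB goalloc=%.0f MiB\n", s.Time, s.RSSMiB, s.GoAllocMiB)
		}
	}
}
```

In the lines quoted above, Go alloc rises from 224 MiB at 19:25:46 to 3.3 GiB at 19:26:36, while CGO alloc stays between 1.7 and 1.8 GiB, so the growth is on the Go heap.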

The goroutine dumps include some very deep recursion (about 75 layers):

goroutine 298318 [runnable]:
github.com/cockroachdb/cockroach/pkg/util/cache.(*Entry).Compare(0xc085ff7740, 0x389a9e0, 0xc0062960c0, 0xc085ffece0)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/cache/cache.go:86 +0x92
github.com/cockroachdb/cockroach/vendor/github.com/biogo/store/llrb.(*Node).ceil(0xc0098600c0, 0x389a9e0, 0xc085ff7740, 0xffffffffffffffff)
        /go/src/github.com/cockroachdb/cockroach/vendor/github.com/biogo/store/llrb/llrb.go:424 +0x54
github.com/cockroachdb/cockroach/vendor/github.com/biogo/store/llrb.(*Node).ceil(0xc006829200, 0x389a9e0, 0xc085ff7740, 0xc085ff7740)
        /go/src/github.com/cockroachdb/cockroach/vendor/github.com/biogo/store/llrb/llrb.go:430 +0x8a
github.com/cockroachdb/cockroach/vendor/github.com/biogo/store/llrb.(*Tree).Ceil(0xc000637bf0, 0x389a9e0, 0xc085ff7740, 0x1536e01, 0xc085ffece0)
        /go/src/github.com/cockroachdb/cockroach/vendor/github.com/biogo/store/llrb/llrb.go:413 +0x4b
github.com/cockroachdb/cockroach/pkg/util/cache.(*OrderedCache).CeilEntry(0xc000637b90, 0x2d7a040, 0xc085ffece0, 0xc085ffece0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/cache/cache.go:407 +0x75
github.com/cockroachdb/cockroach/pkg/kv.(*RangeDescriptorCache).getCachedRangeDescriptorLocked(0xc00029dc20, 0xc00c0b0da0, 0xc, 0x10, 0x0, 0x3, 0x7, 0xc0cad92c40, 0xfa67c1)
        /go/src/github.com/cockroachdb/cockroach/pkg/kv/range_cache.go:537 +0xd3
github.com/cockroachdb/cockroach/pkg/kv.(*RangeDescriptorCache).lookupRangeDescriptorInternal(0xc00029dc20, 0x38d0f80, 0xc0570057d0, 0xc00c0b0da0, 0xc, 0x10, 0xc08aaca480, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/kv/range_cache.go:264 +0x101
github.com/cockroachdb/cockroach/pkg/kv.(*RangeDescriptorCache).LookupRangeDescriptor(0xc00029dc20, 0x38d0f80, 0xc0570057d0, 0xc00c0b0da0, 0xc, 0x10, 0xc08aaca480, 0x3fc3333333333300, 0xc000ad23c0, 0xc089508cc0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/kv/range_cache.go:240 +0x92
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).getDescriptor(0xc0007c3320, 0x38d0f80, 0xc0570057d0, 0xc00c0b0da0, 0xc, 0x10, 0xc08aaca480, 0xc000ad2300, 0x2faf080, 0x3b9aca00, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:456 +0x8d
github.com/cockroachdb/cockroach/pkg/kv.(*RangeIterator).Seek(0xc0cad93518, 0x38d0f80, 0xc0570057d0, 0xc00c0b0da0, 0xc, 0x10, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/kv/range_iter.go:160 +0x1e4
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).divideAndSendBatchToRanges(0xc0007c3320, 0x38d0f80, 0xc0570057d0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc0d6384200, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:900 +0x419
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).Send(0xc0007c3320, 0x38d0f80, 0xc0570057d0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc0888dba00, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:710 +0x48b
github.com/cockroachdb/cockroach/pkg/kv.(*txnLockGatekeeper).SendLocked(0xc08518b3d0, 0x38d0f80, 0xc0570057d0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc0888dba00, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_coord_sender.go:234 +0xe8
github.com/cockroachdb/cockroach/pkg/kv.(*txnMetrics).SendLocked(0xc08518b398, 0x38d0f80, 0xc0570057d0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc0888dba00, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_interceptor_metrics.go:56 +0xa2
github.com/cockroachdb/cockroach/pkg/kv.(*txnSpanRefresher).sendLockedWithRefreshAttempts(0xc08518b300, 0x38d0f80, 0xc0570057d0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc0888dba00, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_interceptor_span_refresher.go:160 +0x83
github.com/cockroachdb/cockroach/pkg/kv.(*txnSpanRefresher).SendLocked(0xc08518b300, 0x38d0f80, 0xc0570057d0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc0888dba00, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_interceptor_span_refresher.go:101 +0xf9
github.com/cockroachdb/cockroach/pkg/kv.(*txnPipeliner).SendLocked(0xc08518b280, 0x38d0f80, 0xc0570057d0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc0888dba00, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_interceptor_pipeliner.go:165 +0xf9
github.com/cockroachdb/cockroach/pkg/kv.(*txnIntentCollector).SendLocked(0xc08518b240, 0x38d0f80, 0xc0570057d0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc0888dba00, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_interceptor_intent_collector.go:105 +0x474
github.com/cockroachdb/cockroach/pkg/kv.(*txnSeqNumAllocator).SendLocked(0xc08518b380, 0x38d0f80, 0xc0570057d0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc0888dba00, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_interceptor_sequence_nums.go:66 +0x23b
github.com/cockroachdb/cockroach/pkg/kv.(*txnHeartbeat).SendLocked(0xc08518b188, 0x38d0f80, 0xc0570057d0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc0888dba00, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_interceptor_heartbeat.go:248 +0x533
github.com/cockroachdb/cockroach/pkg/kv.(*TxnCoordSender).Send(0xc08518b000, 0x38d0f80, 0xc0570057d0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_coord_sender.go:652 +0x57c
github.com/cockroachdb/cockroach/pkg/internal/client.(*DB).sendUsingSender(0xc0000e6b00, 0x38d0f80, 0xc0883b5530, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/internal/client/db.go:622 +0x119
github.com/cockroachdb/cockroach/pkg/internal/client.(*Txn).Send(0xc049955e60, 0x38d0f80, 0xc0883b5530, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/internal/client/txn.go:804 +0x13c
github.com/cockroachdb/cockroach/pkg/sql/row.(*txnKVFetcher).fetch(0xc089f515f0, 0x38d0f80, 0xc0883b5530, 0x6bd6ce, 0xc0f18ca008)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/row/kv_batch_fetcher.go:242 +0x626
github.com/cockroachdb/cockroach/pkg/sql/row.(*txnKVFetcher).nextBatch(0xc089f515f0, 0x38d0f80, 0xc0883b5530, 0xc046099188, 0xc0f18ca008, 0x11, 0x6bd6c6, 0xc0f18ca023, 0x401, 0x6bd6ab, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/row/kv_batch_fetcher.go:326 +0x1dd
github.com/cockroachdb/cockroach/pkg/sql/row.(*kvFetcher).nextKV(0xc08a3b14d8, 0x38d0f80, 0xc0883b5530, 0xc046099001, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/row/kv_fetcher.go:71 +0x2ef
github.com/cockroachdb/cockroach/pkg/sql/row.(*kvFetcher).nextKV(0xc08a3b14d8, 0x38d0f80, 0xc0883b5530, 0xc046099001, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/row/kv_fetcher.go:79 +0x42b
github.com/cockroachdb/cockroach/pkg/sql/row.(*kvFetcher).nextKV(0xc08a3b14d8, 0x38d0f80, 0xc0883b5530, 0xc046098f01, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/row/kv_fetcher.go:79 +0x42b
github.com/cockroachdb/cockroach/pkg/sql/row.(*kvFetcher).nextKV(0xc08a3b14d8, 0x38d0f80, 0xc0883b5530, 0xc046098f01, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/row/kv_fetcher.go:79 +0x42b
github.com/cockroachdb/cockroach/pkg/sql/row.(*kvFetcher).nextKV(0xc08a3b14d8, 0x38d0f80, 0xc0883b5530, 0xc046098e01, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/row/kv_fetcher.go:79 +0x42b
github.com/cockroachdb/cockroach/pkg/sql/row.(*kvFetcher).nextKV(0xc08a3b14d8, 0x38d0f80, 0xc0883b5530, 0xc046098d01, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/row/kv_fetcher.go:79 +0x42b
github.com/cockroachdb/cockroach/pkg/sql/row.(*kvFetcher).nextKV(0xc08a3b14d8, 0x38d0f80, 0xc0883b5530, 0xc046098d01, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/row/kv_fetcher.go:79 +0x42b
github.com/cockroachdb/cockroach/pkg/sql/row.(*kvFetcher).nextKV(0xc08a3b14d8, 0x38d0f80, 0xc0883b5530, 0xc046098c01, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/row/kv_fetcher.go:79 +0x42b
github.com/cockroachdb/cockroach/pkg/sql/row.(*kvFetcher).nextKV(0xc08a3b14d8, 0x38d0f80, 0xc0883b5530, 0xc046098b01, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/row/kv_fetcher.go:79 +0x42b
github.com/cockroachdb/cockroach/pkg/sql/row.(*kvFetcher).nextKV(0xc08a3b14d8, 0x38d0f80, 0xc0883b5530, 0xc046098b01, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/row/kv_fetcher.go:79 +0x42b
github.com/cockroachdb/cockroach/pkg/sql/row.(*kvFetcher).nextKV(0xc08a3b14d8, 0x38d0f80, 0xc0883b5530, 0xc046098a01, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/row/kv_fetcher.go:79 +0x42b
github.com/cockroachdb/cockroach/pkg/sql/row.(*kvFetcher).nextKV(0xc08a3b14d8, 0x38d0f80, 0xc0883b5530, 0xc046098a01, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/row/kv_fetcher.go:79 +0x42b
github.com/cockroachdb/cockroach/pkg/sql/row.(*kvFetcher).nextKV(0xc08a3b14d8, 0x38d0f80, 0xc0883b5530, 0xc046098901, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/row/kv_fetcher.go:79 +0x42b
github.com/cockroachdb/cockroach/pkg/sql/row.(*kvFetcher).nextKV(0xc08a3b14d8, 0x38d0f80, 0xc0883b5530, 0xc046098801, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/row/kv_fetcher.go:79 +0x42b
github.com/cockroachdb/cockroach/pkg/sql/row.(*kvFetcher).nextKV(0xc08a3b14d8, 0x38d0f80, 0xc0883b5530, 0xc046098801, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/row/kv_fetcher.go:79 +0x42b
github.com/cockroachdb/cockroach/pkg/sql/row.(*kvFetcher).nextKV(0xc08a3b14d8, 0x38d0f80, 0xc0883b5530, 0xc046098701, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/row/kv_fetcher.go:79 +0x42b
github.com/cockroachdb/cockroach/pkg/sql/row.(*kvFetcher).nextKV(0xc08a3b14d8, 0x38d0f80, 0xc0883b5530, 0xc046098601, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/row/kv_fetcher.go:79 +0x42b
github.com/cockroachdb/cockroach/pkg/sql/row.(*kvFetcher).nextKV(0xc08a3b14d8, 0x38d0f80, 0xc0883b5530, 0xc046098601, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/row/kv_fetcher.go:79 +0x42b
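
The Go compiler does not eliminate tail calls, so a method that advances by calling itself adds one stack frame per step, which is one way a trace like the one above can reach about 75 layers. The sketch below is a hypothetical illustration of that pattern and its loop equivalent. It is not the code at kv_fetcher.go:79, and `batchSource`, `nextRecursive`, `nextIterative`, and `emptyThenOne` are invented names:

```go
package main

import "fmt"

// batchSource is a hypothetical stand-in for a fetcher that yields keys
// from a sequence of batches, some of which may be empty.
type batchSource struct {
	batches [][]string // batches not yet loaded
	cur     []string   // remaining keys of the current batch
}

// nextRecursive returns the next key. When the current batch is exhausted
// it loads the next one and calls itself, so every empty batch costs one
// extra stack frame. depth reports how deep the call chain went.
func (s *batchSource) nextRecursive(depth int) (key string, ok bool, maxDepth int) {
	if len(s.cur) > 0 {
		key, s.cur = s.cur[0], s.cur[1:]
		return key, true, depth
	}
	if len(s.batches) == 0 {
		return "", false, depth
	}
	s.cur, s.batches = s.batches[0], s.batches[1:]
	return s.nextRecursive(depth + 1)
}

// nextIterative has the same behavior but loops instead of recursing,
// so the stack depth stays constant however many batches are empty.
func (s *batchSource) nextIterative() (string, bool) {
	for {
		if len(s.cur) > 0 {
			key := s.cur[0]
			s.cur = s.cur[1:]
			return key, true
		}
		if len(s.batches) == 0 {
			return "", false
		}
		s.cur, s.batches = s.batches[0], s.batches[1:]
	}
}

// emptyThenOne builds a source with n empty batches followed by one key.
func emptyThenOne(n int) *batchSource {
	b := make([][]string, n, n+1)
	return &batchSource{batches: append(b, []string{"k"})}
}

func main() {
	key, ok, depth := emptyThenOne(75).nextRecursive(0)
	fmt.Println("recursive:", key, ok, "depth", depth)

	key, ok = emptyThenOne(75).nextIterative()
	fmt.Println("iterative:", key, ok)
}
```

Whether the real recursion is bounded, and what triggers each nested call, would need to be checked against the source; the sketch only shows why the loop form keeps the stack flat.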

@andreimatei
Contributor

andreimatei commented Mar 6, 2019

@drewdeally please help us out and look for log messages when filing crash reports. There generally is something.

Here's what I found. Note that node ids unfortunately don't correspond to roachprod indexes.

n5 (roachprod index 2):
fatal error: runtime: out of memory

Can you please upload the heap profiles collected on that node (heap_profiler under the log dir)? A debug.zip would also have them.
We should also look into the apparent recursion from the stack @bdarnell pasted.

n? (roachprod index 1):

Error: could not cleanup temporary directories from record file: could not lock temporary directory /mnt/data1/cockroach/cockroach-temp304357385, may still be in use: IO error: While lock file: /mnt/data1/cockroach/cockroach-temp304357385/TEMP_DIR.LOCK: Resource temporarily unavailable
  • Did you try to restart while the old node was still running, and then upload the wrong log?

Neither n3 nor roachprod index 3 (n6) seems dead (the 3.cockroach.log you've attached does not show a dead node).
However, all the nodes seem to have trouble connecting to both n5 (the dead one) and n6. I'm not sure why they fail to connect to n6. Either the networking is really bad, or more likely we have a bug. The circuit breaker to n6 is constantly getting tripped because of context canceled errors. We erroneously trip the breaker on these errors; I was just talking about this yesterday with @ajwerner. It's likely that the contexts get canceled because of the failure to connect to the dead node.
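
The distinction drawn here, between a caller giving up and a peer actually failing, can be sketched in Go. This is an illustrative sketch under stated assumptions, not CockroachDB's breaker and not the change made in any particular PR; `breaker`, `report`, `tripped`, `errConnRefused`, and `dialCanceled` are hypothetical names:

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// errConnRefused is a hypothetical error standing in for a real peer failure.
var errConnRefused = errors.New("connection refused")

// dialCanceled returns a hypothetical dial error that wraps context.Canceled,
// as happens when the caller abandons the request.
func dialCanceled() error {
	return fmt.Errorf("dial n6: %w", context.Canceled)
}

// breaker is a minimal consecutive-failure circuit breaker.
type breaker struct {
	failures  int // consecutive failures attributed to the peer
	threshold int // failures needed to trip
}

// report records the outcome of one call to the peer.
func (b *breaker) report(err error) {
	switch {
	case err == nil:
		b.failures = 0
	case errors.Is(err, context.Canceled):
		// The caller canceled; this says nothing about the peer's health,
		// so it is neither counted as a failure nor as a success.
	default:
		b.failures++
	}
}

// tripped reports whether the breaker should reject further calls.
func (b *breaker) tripped() bool {
	return b.failures >= b.threshold
}

func main() {
	b := &breaker{threshold: 3}
	for i := 0; i < 5; i++ {
		b.report(dialCanceled())
	}
	fmt.Println("after cancellations, tripped:", b.tripped())

	for i := 0; i < 3; i++ {
		b.report(errConnRefused)
	}
	fmt.Println("after real failures, tripped:", b.tripped())
}
```

With this shape, cancellations caused by timeouts against the dead node would not count against a healthy one.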

Nit: I've conducted a quick poll around and nobody knew what "abend" means :). Google barely knows.

@bdarnell
Contributor

bdarnell commented Mar 6, 2019

@drewdeally there should be heap profiles in the logs directory; can you grab those? (or give us a debug zip instead of these logs, but be sure to bring the downed nodes back up)

I think node n6 got oom-killed so there's nothing in the logs.

@bdarnell
Contributor

bdarnell commented Mar 6, 2019

Also, how was YCSB run? (zipfian or uniform?)

@ajwerner
Contributor

ajwerner commented Mar 6, 2019

#35433 should alleviate the tripping of circuit breakers for nodes other than the one that is down.

@drewdeally
Author

@ben this is YCSB uniform, and nodes 13-18 are used to demo expansion.

@andrei "Here's what I found. Note that node ids unfortunately don't correspond to roachprod indexes." That seems to be the case. Sorry I was not clear: node 2 is drew-demo-0002, not the CockroachDB node ID.

@drewdeally
Author

I cannot explain drew-demo-0003, but I had to restart it. I continued with testing. Here are all the logs, with a debug zip:

debug.zip

@drewdeally
Author

Sequence of events:

image

By node ID, n1 went down, then n5, and then n6. These are drew-demo-0001, 0002, and 0003.

@drewdeally
Author

@tbg may I retest this scenario?

@tbg
Member

tbg commented Aug 26, 2020

@drewdeally could you post reproduction steps? I closed this issue because the investigation had gone cold.
I also wouldn't mind if you tried to retest, but it may not be a good use of your time.

@tbg tbg reopened this Aug 26, 2020
@tbg
Member

tbg commented Aug 26, 2020

@nvanbenschoten we're not running YCSB in any multi-region configurations in roachtest. Should we?

@tbg
Member

tbg commented Aug 26, 2020

Chatted with @drewdeally directly - the workload was running fine; it's unlikely that YCSB is the problem here. The cluster was used for a large-scale recovery experiment. Drew said he would set this up again and hand off the cluster to us if a reproduction is made.

@nvanbenschoten
Member

"we're not running YCSB in any multi-region configurations in roachtest. Should we?"

I don't think it would show us anything interesting. The way it's run in these demos is to completely partition the load so that there's no interaction between traffic in different regions. So it's no more interesting than running kv in a multi-region config with a table per region.

@mwang1026

@drewdeally did you ever repro this? Closing for now; reopen if you have a repro to look at.

8 participants