sql: nightly benchmark broken #8081

Closed
jordanlewis opened this issue Jul 27, 2016 · 16 comments

Comments
@jordanlewis
Member

jordanlewis commented Jul 27, 2016

The nightly benchmark seems to have stalled during the sql.test portion. The sql.test.stderr log is growing forever with the following log lines:

I160727 10:14:34.148590 server/status/runtime.go:189  runtime stats: 1.5 GiB RSS, 92 goroutines, 20 MiB/1.1 GiB/1.2 GiB GO alloc/idle/total, 81 MiB/147 MiB CGO alloc/total, 205.60cgo/sec, 0.03/0.01 %(u/s)time, 0.00 %gc (1x)
W160727 10:14:34.158639 ts/db.go:104  node unavailable; try another peer
W160727 10:14:34.158665 server/node.go:642  error recording status summaries: node unavailable; try another peer
W160727 10:14:42.055111 gossip/gossip.go:942  not connected to cluster; use --join to specify a connected node
I160727 10:14:44.028505 server/status/runtime.go:189  runtime stats: 1.5 GiB RSS, 94 goroutines, 21 MiB/1.1 GiB/1.2 GiB GO alloc/idle/total, 81 MiB/147 MiB CGO alloc/total, 215.81cgo/sec, 0.02/0.01 %(u/s)time, 0.00 %gc (0x)

The test invocation on TeamCity is here: https://teamcity.cockroachdb.com/viewLog.html?buildId=5068&buildTypeId=Cockroach_BenchmarkTests&tab=buildLog

UPDATE: current issue is that sql.BenchmarkPgbenchExec_{Cockroach,Postgres} fail in the absence of pgbench, which we do not install anywhere.

@jordanlewis
Member Author

@cuongdo
Contributor

cuongdo commented Jul 28, 2016

Seems like an issue shutting down the server after a test. Not much else to go on.

@jordanlewis
Member Author

@WillHaack seems to have more details on this. Could you chime in?

@WillHaack
Contributor

I believe @cuongdo is correct. Some tests may time out, and when they do, I'm not sure they shut down their servers properly, so subsequent tests may have trouble binding to ports. (I'm not even sure they close their connections properly when they run successfully.) I'll try to work on fixing this today. If all else fails, it may be worth making multinode tests run only with the -multinode flag until further notice. (IMO it's not a big deal to keep multinode tests that are failing, but it is a big deal if they cause other tests to fail.)
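
For illustration, here is a minimal sketch of the cleanup pattern under discussion, using a hypothetical startTestServer helper rather than the actual cockroach test harness: defer the shutdown as soon as the server starts, so it runs even when the test fails or returns early.

// Illustrative only: Server and startTestServer are hypothetical stand-ins,
// not the real cockroach test harness.
package example

import "testing"

type Server struct{}

// Stop releases listeners and closes client connections.
func (s *Server) Stop() {}

// startTestServer would start a server bound to an OS-assigned port (":0")
// so leaked or parallel listeners don't collide on a fixed address.
func startTestServer(tb testing.TB) *Server {
	return &Server{}
}

func TestMultinodeSomething(t *testing.T) {
	s := startTestServer(t)
	// Deferred immediately after startup: the shutdown runs even if the
	// test fails or returns early, freeing the port for subsequent tests.
	defer s.Stop()

	// ... test body ...
}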

@WillHaack
Contributor

#8075 likely fixed the issue. The benchmark tests have been running for 7+ hours now; previously they were failing consistently after about 2.5 hours.

@jordanlewis
Member Author

@WillHaack I think they're broken in a different way now. Can you please investigate? They've now been running for 30 hours -- seems like they might have stalled again. https://teamcity.cockroachdb.com/viewLog.html?buildId=7571&tab=buildResultsDiv&buildTypeId=Cockroach_BenchmarkTests

@tbg
Member

tbg commented Oct 27, 2016

Do they now pass, with #10237 in?

@tamird
Contributor

tamird commented Oct 27, 2016

Looks like they are broken by the move to Azure builders? Tonight's run failed with "terraform is not in your path": https://teamcity.cockroachdb.com/viewLog.html?tab=buildLog&logTab=tree&filter=debug&expand=all&buildId=37243#_focus=724

We should probably run this orchestration from inside the builder container instead of adding terraform to the TC builders. cc @jordanlewis

@jordanlewis
Member Author

The benchmark tests still fail. A number of tests that require Postgres to be installed are failing. Did we previously ignore these somehow? The move to Azure didn't change what is installed and running on the benchmark machines.

@tamird changed the title from "benchmark: nightly benchmark stalled" to "sql: nightly benchmark broken" on Nov 2, 2016
@tamird assigned dt on Nov 2, 2016
@tamird
Contributor

tamird commented Nov 2, 2016

Current issue is that sql.BenchmarkPgbenchExec_{Cockroach,Postgres} fail in the absence of pgbench, which we do not install anywhere. Updated the description and assigned @dt.
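
For reference, one way such benchmarks can degrade gracefully is to probe for the binary and skip when it is missing. This is a sketch of that pattern, not necessarily the fix that was ultimately committed; BenchmarkPgbenchExample is an illustrative name.

// Sketch: skip the benchmark when the external pgbench binary is absent,
// instead of failing the nightly run. Not necessarily the actual fix.
package example

import (
	"os/exec"
	"testing"
)

func BenchmarkPgbenchExample(b *testing.B) {
	if _, err := exec.LookPath("pgbench"); err != nil {
		b.Skip("pgbench not found in PATH; skipping")
	}
	for i := 0; i < b.N; i++ {
		// ... drive one pgbench iteration against the server under test ...
	}
}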

dt added a commit to dt/cockroach that referenced this issue Nov 7, 2016
@jordanlewis
Member Author

The Pgbench tests no longer fail - thanks, @dt. The only remaining failures are runs of BenchmarkInsertDistinct100Multinode_Cockroach. The logs don't contain anything particularly useful, but here they are for completeness:

BenchmarkInsertDistinct100Multinode_Cockroach-16                --- FAIL: BenchmarkInsertDistinct100Multinode_Cockroach-16
    bench_test.go:1221: pq: replay txn
    bench_test.go:1221: pq: replay txn
    bench_test.go:1221: pq: replay txn
    bench_test.go:1221: pq: replay txn
    bench_test.go:1221: pq: replay txn
    bench_test.go:1221: pq: replay txn
    bench_test.go:1221: pq: replay txn
    bench_test.go:1221: pq: replay txn
    bench_test.go:1221: pq: replay txn
    bench_test.go:1221: pq: replay txn
    bench_test.go:1221: pq: replay txn
    bench_test.go:1221: pq: replay txn
    bench_test.go:1221: pq: replay txn
    bench_test.go:1221: pq: replay txn
    bench_test.go:1221: pq: replay txn
    bench_test.go:1221: pq: replay txn
    bench_test.go:1221: pq: replay txn
BenchmarkInsertDistinct100Multinode_Cockroach-16                --- FAIL: BenchmarkInsertDistinct100Multinode_Cockroach-16
    bench_test.go:1221: pq: replay txn
    bench_test.go:1221: pq: replay txn
    bench_test.go:1221: pq: replay txn
    bench_test.go:1221: pq: replay txn
    bench_test.go:1221: pq: replay txn
BenchmarkInsertDistinct100Multinode_Cockroach-16                    1000       1026240 ns/op      877898 B/op       4798 allocs/op
BenchmarkInsertDistinct100Multinode_Cockroach-16                --- FAIL: BenchmarkInsertDistinct100Multinode_Cockroach-16
    bench_test.go:1221: pq: replay txn
    bench_test.go:1221: pq: replay txn
    bench_test.go:1221: pq: replay txn
    bench_test.go:1221: pq: replay txn
    bench_test.go:1221: pq: replay txn
    bench_test.go:1221: pq: replay txn
    bench_test.go:1221: pq: replay txn
    bench_test.go:1221: pq: replay txn
    bench_test.go:1221: pq: replay txn
    bench_test.go:1221: pq: replay txn
    bench_test.go:1221: pq: replay txn
    bench_test.go:1221: pq: replay txn
BenchmarkInsertDistinct100Multinode_Cockroach-16                    2000        992051 ns/op      811930 B/op       4099 allocs/op
BenchmarkInsertDistinct100Multinode_Cockroach-16                --- FAIL: BenchmarkInsertDistinct100Multinode_Cockroach-16
    bench_test.go:1221: pq: replay txn
    bench_test.go:1221: pq: replay txn
BenchmarkInsertDistinct100Multinode_Cockroach-16                     100      46957150 ns/op     2767293 B/op      23394 allocs/op
BenchmarkInsertDistinct100Multinode_Cockroach-16                    2000       1023041 ns/op      821077 B/op       4196 allocs/op
BenchmarkInsertDistinct100Multinode_Cockroach-16                --- FAIL: BenchmarkInsertDistinct100Multinode_Cockroach-16
    bench_test.go:1221: pq: replay txn
    bench_test.go:1221: pq: replay txn
    bench_test.go:1221: pq: replay txn
    bench_test.go:1221: pq: replay txn
    bench_test.go:1221: pq: replay txn
    bench_test.go:1221: pq: replay txn
    bench_test.go:1221: pq: replay txn
    bench_test.go:1221: pq: replay txn
    bench_test.go:1221: pq: replay txn

@jordanlewis
Member Author

I spoke too soon: in the stderr log, we have the following exciting panic:

panic: nil clock is forbidden

goroutine 1619880 [running]:
panic(0x1745440, 0xc427428630)
    /usr/local/go/src/runtime/panic.go:500 +0x1a1
github.com/cockroachdb/cockroach/pkg/rpc.NewContext(0x0, 0x0, 0x0, 0x0, 0xc42dbae280, 0x0, 0xc42f988000, 0x0)
    /go/src/github.com/cockroachdb/cockroach/pkg/rpc/context.go:118 +0x962
github.com/cockroachdb/cockroach/pkg/sql_test.newKVNative(0xc42fafe280, 0x6, 0x19154b9)
    /go/src/github.com/cockroachdb/cockroach/pkg/sql/kv_test.go:69 +0x31f
github.com/cockroachdb/cockroach/pkg/sql_test.runKVBenchmark(0xc42fafe280, 0x19154b9, 0x6, 0x1915d5f, 0x6, 0x1)
    /go/src/github.com/cockroachdb/cockroach/pkg/sql/kv_test.go:289 +0x717
github.com/cockroachdb/cockroach/pkg/sql_test.BenchmarkKVInsert1_Native(0xc42fafe280)
    /go/src/github.com/cockroachdb/cockroach/pkg/sql/kv_test.go:322 +0x5e
testing.(*B).runN(0xc42fafe280, 0x1)
    /usr/local/go/src/testing/benchmark.go:139 +0xaa
testing.(*B).run1.func1(0xc42fafe280)
    /usr/local/go/src/testing/benchmark.go:208 +0x5a
created by testing.(*B).run1
    /usr/local/go/src/testing/benchmark.go:209 +0x7f

jordanlewis added a commit to jordanlewis/cockroach that referenced this issue Nov 8, 2016
The Benchmark tests panic due to `kv_test` passing a `nil` clock to
`rpc.NewContext`. Fix this. See cockroachdb#8081
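
The guard and the shape of the fix look roughly like the following generic sketch; Clock, Context, NewContext, and newKVNative here are stand-ins, not the real rpc/hlc types, whose signatures differ.

// Generic illustration of the "nil clock is forbidden" guard and its caller;
// all names below are stand-ins, not the real cockroach types or signatures.
package example

import "time"

// Clock is a stand-in for the HLC clock that the real rpc.Context requires.
type Clock interface {
	Now() time.Time
}

type wallClock struct{}

func (wallClock) Now() time.Time { return time.Now() }

type Context struct {
	clock Clock
}

// NewContext mirrors the guard seen in the panic: a nil clock is rejected
// up front rather than failing later in a harder-to-debug way.
func NewContext(clock Clock) *Context {
	if clock == nil {
		panic("nil clock is forbidden")
	}
	return &Context{clock: clock}
}

// newKVNative shows the shape of the fix: pass a real clock where the
// benchmark harness previously passed nil.
func newKVNative() *Context {
	return NewContext(wallClock{})
}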
@jordanlewis
Member Author

Okay, the nil clock panic is fixed but the above failures remain.

@jordanlewis
Member Author

Tracking the above failures in #10551.

@jordanlewis
Member Author

After ignoring the problems in #10551, a new set of problems turned up. The sql benchmarks now pass, but the storage benchmarks seem to hang indefinitely on the very first iteration of BenchmarkReplicaSnapshot-16. stderr is pretty sparse:

I161110 01:45:45.896322 1 rand.go:76  Random seed: -8566366354663296610
I161110 01:45:45.901316 44 gossip/gossip.go:237  [n?] initial resolvers: []
W161110 01:45:45.901389 44 gossip/gossip.go:1055  [n?] no resolvers found; use --join to specify a connected node
I161110 01:45:45.901466 44 base/node_id.go:62  NodeID set to 1
I161110 01:45:45.901753 44 storage/engine/rocksdb.go:340  opening in memory rocksdb instance
I161110 01:45:45.924072 51 storage/replica_proposal.go:328  [s1,r1/1:/M{in-ax}] new range lease replica {1 1 1} 1970-01-01 00:00:00 +0000 UTC 410761h45m46.820924177s following replica {0 0 0} 1970-01-01 00:00:00 +0000 UTC 0s [physicalTime=2016-11-10 01:45:45.923900995 +0000 UTC]
I161110 01:46:02.682472 44 storage/replica_raftstorage.go:445  [s1,r1/1:/M{in-ax}] generated snapshot ffacc4de at index 58289 in 209.989µs.
I161110 01:46:02.682550 44 util/stop/stopper.go:396  stop has been called, stopping or quiescing all running tasks
I161110 01:46:02.687700 101 gossip/gossip.go:237  [n?] initial resolvers: []
W161110 01:46:02.687724 101 gossip/gossip.go:1055  [n?] no resolvers found; use --join to specify a connected node
I161110 01:46:02.687763 101 base/node_id.go:62  NodeID set to 1
I161110 01:46:02.687814 101 storage/engine/rocksdb.go:340  opening in memory rocksdb instance
I161110 01:46:02.690377 110 storage/replica_proposal.go:328  [s1,r1/1:/M{in-ax}] new range lease replica {1 1 1} 1970-01-01 00:00:00 +0000 UTC 410761h46m3.589595495s following replica {0 0 0} 1970-01-01 00:00:00 +0000 UTC 0s [physicalTime=2016-11-10 01:46:02.690335401 +0000 UTC]
I161110 01:46:19.191801 101 storage/replica_raftstorage.go:445  [s1,r1/1:/M{in-ax}] generated snapshot 43717985 at index 58288 in 143.803µs.
E161110 01:56:02.691713 189 storage/queue.go:579  [replicate,s1,r1/1:/M{in-ax}] purgatory: 0 of 0 stores with an attribute matching []; likely not enough nodes in cluster
E161110 02:06:02.692100 349 storage/queue.go:646  [replicate] 1 replicas failing with "0 of 0 stores with an attribute matching []; likely not enough nodes in cluster"
E161110 02:16:02.692040 349 storage/queue.go:646  [replicate] 1 replicas failing with "0 of 0 stores with an attribute matching []; likely not enough nodes in cluster"
E161110 02:26:02.692106 349 storage/queue.go:646  [replicate] 1 replicas failing with "0 of 0 stores with an attribute matching []; likely not enough nodes in cluster"

These errors repeat every ten minutes.
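
As an aside, a generic way to make this kind of hang surface as a failure rather than stalling the whole nightly run is to bound the suspect step with a timeout. The sketch below is purely illustrative and is not how the benchmark is actually written.

// Generic timeout guard around a step that may hang; purely illustrative,
// not taken from the cockroach benchmark code.
package main

import (
	"errors"
	"fmt"
	"time"
)

// runWithTimeout runs fn in a goroutine and gives up after d, so a hung
// step turns into an explicit error instead of an indefinite stall. If fn
// never returns, its goroutine is leaked; that is the trade-off of this
// pattern.
func runWithTimeout(d time.Duration, fn func() error) error {
	done := make(chan error, 1)
	go func() { done <- fn() }()
	select {
	case err := <-done:
		return err
	case <-time.After(d):
		return errors.New("step timed out")
	}
}

func main() {
	err := runWithTimeout(5*time.Second, func() error {
		// ... the step that might hang, e.g. generating or applying a snapshot ...
		return nil
	})
	fmt.Println("result:", err)
}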

@jordanlewis
Member Author

The nightly benchmarks pass most of the time now. Closing this in favor of more targeted issues for occasional failures.
