roachtest: tpccbench/nodes=3/cpu=16 failed #61973
(roachtest).tpccbench/nodes=3/cpu=16 failed on master@bdff5338ca725bf1cfddf7e3f648bbf02ab42999:
Artifacts: /tpccbench/nodes=3/cpu=16
See this test on roachdash
(roachtest).tpccbench/nodes=3/cpu=16 failed on master@e09b93fe62541c3a94f32a723778660b528a0792:
Artifacts: /tpccbench/nodes=3/cpu=16
See this test on roachdash
Here's a refined view of this test's TeamCity history on master. I'm filtering for only the GCE suite, which is where it fails consistently. What happened between Feb 15th-18th?
Time for a bisection?
Here are all the changes that went in. Notably, #59992 is in there.
280ead9 Merge #60729

Looking at the more recent failure, I think there's still an OOM in there.
Here are the last logs from each node. n2's logs ended about 2m before the rest. We see periods of 0 QPS in the workload:
One thing that's a bit more reassuring, however, is that with #61777 there's basically no evidence of tracing memory usage in the heap profiles.
The fact that we're no longer efficient at 2200 warehouses is concerning. We used to be, as of #55721, and roachperf suggests we were still that efficient before tpccbench started failing around Feb 15th (see here).

This test stops and restarts the cluster for each TPC-C run; it runs the entire workload multiple times to find the point at which we can sustain a given warehouse count at 85% efficiency. We've done something to make ourselves a lot less efficient at 2200 warehouses, and because that's still our estimated max (#55721), that's the point we start our search from. Setting aside for a second that we want to pin down exactly what caused this inefficiency: because the point where we start our search (2200 warehouses) is so high, the test right away puts the cluster into overload territory. As seen above, the roachtest infrastructure itself does not play well with this kind of overload. I've filed #62010 to track that thread.
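For context, the 85% bar here is TPC-C efficiency: measured new-order throughput (tpmC) relative to the theoretical ceiling of 12.86 tpmC per warehouse. A minimal sketch of that check, assuming illustrative function names and example numbers (this is not the actual roachtest code):

```go
package main

import "fmt"

// maxTpmCPerWarehouse is the theoretical TPC-C ceiling of 12.86
// new-order transactions per minute per warehouse.
const maxTpmCPerWarehouse = 12.86

// passesEfficiency reports whether a run at the given warehouse count
// sustained at least 85% of the theoretical maximum throughput.
func passesEfficiency(warehouses int, tpmC float64) bool {
	efficiency := 100 * tpmC / (maxTpmCPerWarehouse * float64(warehouses))
	return efficiency >= 85.0
}

func main() {
	// Hypothetical numbers: ~23,500 tpmC at 2200 warehouses is ~83%
	// efficient, which would miss the 85% bar.
	fmt.Println(passesEfficiency(2200, 23500)) // false
}
```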
What I'm going to do now is to try running tpccbench at 2100 and step down as needed, to find the max warehouse count we can actually sustain at 85% efficiency. That should at least sidestep our lack of #62010. It'll also tell us exactly how much we've regressed. After that I'll try to isolate where the regression is coming from. I'd be a bit surprised if it was still due to #59992, especially given these tests are running with #61777 and now have minimal tracing memory overhead (as shown in the profiles above).
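A rough sketch of that step-down probe, assuming a hypothetical runTPCC helper standing in for a full tpccbench run (the step size of 50 is arbitrary; none of this is the real roachtest API):

```go
package main

import "fmt"

// runTPCC is a hypothetical stand-in for one full tpccbench run at a given
// warehouse count (restart cluster, import data, run the workload); it
// returns the measured efficiency as a percentage.
func runTPCC(warehouses int) (efficiencyPct float64) {
	// ... drive the cluster and workload here ...
	return 0
}

// probeDown starts at a candidate count (e.g. 2100) and steps down until a
// run sustains >= 85% efficiency, returning the highest count that passed.
func probeDown(start, step int) int {
	for w := start; w > 0; w -= step {
		if eff := runTPCC(w); eff >= 85.0 {
			return w
		}
	}
	return 0
}

func main() {
	fmt.Println("max sustainable warehouses:", probeDown(2100, 50))
}
```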
Probably also worth doing multiple runs of the experiment with different values of
2100 warehouses passes with flying colors:
Trying 2150 next. We should just drop our max warehouse count.
So 2150 warehouses is one too many.
Heh, I'm seeing the same failure mode as above, where the VM is just inoperable.
(roachtest).tpccbench/nodes=3/cpu=16 failed on master@e9387a6e5dfdad71c74ccd0a07c907632613fa3e:
Artifacts: /tpccbench/nodes=3/cpu=16
See this test on roachdash
[x-post] From #62039 (comment):
62015: cli: Add some more warning comments to unsafe-remove-dead-replicas r=knz a=bdarnell

The comments always said this tool was meant to be used with the supervision of a CRL engineer, but didn't otherwise make the risks and downsides clear. Add some more explicit warnings which can also serve as guidance for the supervising engineer.

Release note: None

62039: roachtest: stabilize tpccbench/nodes=3/cpu=16 r=irfansharif a=irfansharif

Fixes #61973. With tracing, our top-of-line TPC-C performance took a hit. Given that the TPC-C line searcher starts off at the estimated max, we're now starting off in "overloaded" territory; this makes for a very unhappy roachtest. Ideally we'd have something like #62010, or even admission control, to make this test less noisy. Until then we can start off at a lower max warehouse count. This "fix" is still not a panacea; the entire tpccbench suite as written tries to nudge the warehouse count until the efficiency is sub-85%. Unfortunately, with our current infrastructure that's a stand-in for "the point where nodes are overloaded and VMs no longer reachable". See #61974.

---

A longer-term approach to these tests could instead be as follows. We could start our search at whatever the max warehouse count is (making sure we've re-configured the max warehouses accordingly). These tests could then PASS/FAIL for that given warehouse count, and only if they FAIL, capture the extent of the regression by probing lower warehouse counts. This is in contrast to what we're doing today, where we capture how high we can go (by design risking going into overload territory, with no protections for it). Doing so lets us use this test suite to capture regressions from a given baseline, rather than hoping our roachperf dashboards capture unexpected perf improvements (if they're expected, we should update max warehouses accordingly). In the steady state, we should want the roachperf dashboards to be mostly flatlined, with step increases when we're re-upping the max warehouse count to incorporate various system-wide performance increases.

Release note: None

Co-authored-by: Ben Darnell <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
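A minimal sketch of the longer-term approach described in #62039 above: treat the configured max warehouse count as a PASS/FAIL baseline, and only probe lower counts when the baseline fails, to quantify the regression. The runTPCC helper, checkBaseline, and the step size are illustrative assumptions, not the actual roachtest code.

```go
package main

import "fmt"

// runTPCC is a hypothetical stand-in for one full tpccbench run; it reports
// whether the run sustained >= 85% efficiency at the given warehouse count.
func runTPCC(warehouses int) (passed bool) {
	// ... restart cluster, import data, run workload, check efficiency ...
	return false
}

// checkBaseline runs once at the configured max. Only on failure does it
// probe downward to report the extent of the regression, instead of
// searching upward into overload territory.
func checkBaseline(baselineWarehouses, step int) {
	if runTPCC(baselineWarehouses) {
		fmt.Printf("PASS at baseline of %d warehouses\n", baselineWarehouses)
		return
	}
	for w := baselineWarehouses - step; w > 0; w -= step {
		if runTPCC(w) {
			fmt.Printf("FAIL: regressed from %d to %d warehouses\n", baselineWarehouses, w)
			return
		}
	}
	fmt.Println("FAIL: no warehouse count passed")
}

func main() {
	checkBaseline(2100, 50)
}
```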
(roachtest).tpccbench/nodes=3/cpu=16 failed on master@4d44ddf24153d8ef8e0a996fdbe75ac5607f9574:
Artifacts: /tpccbench/nodes=3/cpu=16
Related:
- roachtest: tpccbench/nodes=3/cpu=16 failed #61696 [C-test-failure O-roachtest O-robot branch-release-21.1 release-blocker]
- roachtest: tpccbench/nodes=3/cpu=16 failed #55802 [C-test-failure O-roachtest O-robot T-bulkio branch-release-20.1]
See this test on roachdash
powered by pkg/cmd/internal/issues