roachtest: tpccbench/nodes=3/cpu=16 failed #61973

Closed
cockroach-teamcity opened this issue Mar 13, 2021 · 13 comments · Fixed by #62039
Assignees
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). GA-blocker O-roachtest O-robot Originated from a bot.

Comments

@cockroach-teamcity
Member

(roachtest).tpccbench/nodes=3/cpu=16 failed on master@4d44ddf24153d8ef8e0a996fdbe75ac5607f9574:

		  |   | main.main
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1852
		  |   | runtime.main
		  |   | 	/usr/local/go/src/runtime/proc.go:204
		  |   | runtime.goexit
		  |   | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		  | Wraps: (2) 2: dead
		  | Error types: (1) *withstack.withStack (2) *errutil.leafError
		Wraps: (3) secondary error attachment
		  | 1: dead
		  | (1) attached stack trace
		  |   -- stack trace:
		  |   | main.glob..func14
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1147
		  |   | main.wrap.func1
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:271
		  |   | github.com/spf13/cobra.(*Command).execute
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:830
		  |   | github.com/spf13/cobra.(*Command).ExecuteC
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:914
		  |   | github.com/spf13/cobra.(*Command).Execute
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:864
		  |   | main.main
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1852
		  |   | runtime.main
		  |   | 	/usr/local/go/src/runtime/proc.go:204
		  |   | runtime.goexit
		  |   | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		  | Wraps: (2) 1: dead
		  | Error types: (1) *withstack.withStack (2) *errutil.leafError
		Wraps: (4) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1147
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:271
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:830
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:914
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:864
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1852
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:204
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (5) 3: dead
		Error types: (1) errors.Unclassified (2) *secondary.withSecondaryError (3) *secondary.withSecondaryError (4) *withstack.withStack (5) *errutil.leafError


Artifacts: /tpccbench/nodes=3/cpu=16
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Mar 13, 2021
@cockroach-teamcity
Member Author

(roachtest).tpccbench/nodes=3/cpu=16 failed on master@bdff5338ca725bf1cfddf7e3f648bbf02ab42999:

		  |   | main.main
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1852
		  |   | runtime.main
		  |   | 	/usr/local/go/src/runtime/proc.go:204
		  |   | runtime.goexit
		  |   | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		  | Wraps: (2) 1: dead
		  | Error types: (1) *withstack.withStack (2) *errutil.leafError
		Wraps: (3) secondary error attachment
		  | 3: dead
		  | (1) attached stack trace
		  |   -- stack trace:
		  |   | main.glob..func14
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1147
		  |   | main.wrap.func1
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:271
		  |   | github.com/spf13/cobra.(*Command).execute
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:830
		  |   | github.com/spf13/cobra.(*Command).ExecuteC
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:914
		  |   | github.com/spf13/cobra.(*Command).Execute
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:864
		  |   | main.main
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1852
		  |   | runtime.main
		  |   | 	/usr/local/go/src/runtime/proc.go:204
		  |   | runtime.goexit
		  |   | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		  | Wraps: (2) 3: dead
		  | Error types: (1) *withstack.withStack (2) *errutil.leafError
		Wraps: (4) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1147
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:271
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:830
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:914
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:864
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1852
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:204
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (5) 2: dead
		Error types: (1) errors.Unclassified (2) *secondary.withSecondaryError (3) *secondary.withSecondaryError (4) *withstack.withStack (5) *errutil.leafError


Artifacts: /tpccbench/nodes=3/cpu=16
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity
Member Author

(roachtest).tpccbench/nodes=3/cpu=16 failed on master@e09b93fe62541c3a94f32a723778660b528a0792:

		  |   | main.main
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1852
		  |   | runtime.main
		  |   | 	/usr/local/go/src/runtime/proc.go:204
		  |   | runtime.goexit
		  |   | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		  | Wraps: (2) 1: dead
		  | Error types: (1) *withstack.withStack (2) *errutil.leafError
		Wraps: (3) secondary error attachment
		  | 3: dead
		  | (1) attached stack trace
		  |   -- stack trace:
		  |   | main.glob..func14
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1147
		  |   | main.wrap.func1
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:271
		  |   | github.com/spf13/cobra.(*Command).execute
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:830
		  |   | github.com/spf13/cobra.(*Command).ExecuteC
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:914
		  |   | github.com/spf13/cobra.(*Command).Execute
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:864
		  |   | main.main
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1852
		  |   | runtime.main
		  |   | 	/usr/local/go/src/runtime/proc.go:204
		  |   | runtime.goexit
		  |   | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		  | Wraps: (2) 3: dead
		  | Error types: (1) *withstack.withStack (2) *errutil.leafError
		Wraps: (4) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1147
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:271
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:830
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:914
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:864
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1852
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:204
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (5) 2: dead
		Error types: (1) errors.Unclassified (2) *secondary.withSecondaryError (3) *secondary.withSecondaryError (4) *withstack.withStack (5) *errutil.leafError


Artifacts: /tpccbench/nodes=3/cpu=16
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@irfansharif irfansharif self-assigned this Mar 15, 2021
@irfansharif
Contributor

irfansharif commented Mar 15, 2021

Looks like this is still getting tripped up after #61777 and #61965. Looking.

@irfansharif
Contributor

Here's a refined view of this test's teamcity history on master. I'm filtering for only the GCE suite, which is where it fails consistently.

image

What happened between 15th-18th Feb?

@tbg
Member

tbg commented Mar 15, 2021

Time for a bisection?

$ git log --since=2021-02-14 --before=2021-02-19 --no-merges | grep -c '^commit'
171
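For a sense of scale, a binary search over the 171 candidate commits counted here would take at most about eight build-and-test rounds, each of which is a full tpccbench run. A minimal sketch of that arithmetic in Go (the helper name `bisectSteps` is made up for illustration, and `git bisect` itself may print a slightly different step estimate):

```go
package main

import (
	"fmt"
	"math/bits"
)

// bisectSteps returns the worst-case number of build-and-test rounds a
// binary search needs to isolate one bad commit among n candidates,
// i.e. ceil(log2(n)). Hypothetical helper, not part of any tool.
func bisectSteps(n uint) int {
	if n <= 1 {
		return 0
	}
	// For n > 1, the bit length of n-1 equals ceil(log2(n)).
	return bits.Len(n - 1)
}

func main() {
	fmt.Println(bisectSteps(171)) // prints 8
}
```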

@irfansharif
Contributor

irfansharif commented Mar 15, 2021

What happened between 15th-18th Feb?

Here are all the changes that went in. Notably, #59992 is in there.

git log --author="[email protected]" --after="2021-02-15" --until="2021-02-18" --oneline

280ead9 Merge #60729
8b14771 Merge #60622
da04b55 Merge #60725
c76fe62 Merge #60704
86a44f7 Merge #60712
3c223f5 Merge #60634 #60696
ca1a1a7 Merge #58362 #59571 #60021
9171f18 Merge #60627 #60636
1a43d11 Merge #60541 #60619 #60638 #60698
d12d281 Merge #59992
c2ef500 Merge #60689 #60691
be115b9 Merge #60554 #60558 #60568
7afa90e Merge #60643
5744288 Merge #60512 #60594 #60648 #60666 #60679 #60681
889c27a Merge #60593 #60604
738e60a Merge #60654
e7bee86 Merge #60630 #60642 #60644
06486c9 Merge #60591
bededc2 Merge #56954
304f082 Merge #60150
8a92681 Merge #60629
fdfcf23 Merge #60618
516494a Merge #60631
7aa27ce Merge #60535
988e457 Merge #60452
1173f64 Merge #60160
db32968 Merge #60598
0c4fc8e Merge #60468
57220aa Merge #58908
dc90a0a Merge #60553 #60624
3c4aaab Merge #60302 #60600
959750d Merge #59865 #60546 #60561
497a9c0 Merge #60603
0c62e88 Merge #60495 #60543
e1911bc Merge #60548
aa8f949 Merge #60537 #60539 #60610
45db472 Merge #59861
14ac1cf Merge #60615
314afa3 Merge #59220 #60484 #60511
8f5dc42 Merge #60567
1e1f915 Merge #60474
a296e70 Merge #60581
4ec622e Merge #60504
68b86d5 Merge #60592
fe919cc Merge #60572
ef900a2 Merge #60497
d112e26 Merge #60556
3ca4d48 Merge #60255


Looking at the more recent failure, I think there's still an OOM in there.

W210313 09:30:32.422608 980 kv/kvserver/store_raft.go:533 ⋮ [n3,s3,r218/3:‹/Table/59/1/85{2/300…-4/703…}›] 2964  handle raft ready: 1.0s [applied=1, batches=1, state_assertions=0]
W210313 09:28:47.499767 556 kv/kvserver/store_raft.go:533 ⋮ [n2,s2,r859/2:‹/Table/59/1/14{2/543…-4/952…}›] 3254  handle raft ready: 0.6s [applied=3, batches=1, state_assertions=0]
W210313 09:30:32.202623 13604 kv/kvclient/kvcoord/dist_sender.go:1524 ⋮ [n1,client=‹10.128.0.226:55340›,hostnossl,user=root] 2645  slow RPC response: slow RPC finished after 68.81s (1 attempts)

Here are the last logs from each node. n2's logs ended about 2m before the rest. We see periods of 0 QPS in the workload:

  585.0s      248            0.0           34.2      0.0      0.0      0.0      0.0 delivery
  585.0s      248            0.0          330.7      0.0      0.0      0.0      0.0 newOrder
  585.0s      248            0.0           34.4      0.0      0.0      0.0      0.0 orderStatus
  585.0s      248            0.0          340.6      0.0      0.0      0.0      0.0 payment
  585.0s      248            0.0           33.7      0.0      0.0      0.0      0.0 stockLevel
  586.0s      248            0.0           34.1      0.0      0.0      0.0      0.0 delivery
  586.0s      248            0.0          330.2      0.0      0.0      0.0      0.0 newOrder
  586.0s      248            0.0           34.3      0.0      0.0      0.0      0.0 orderStatus
  586.0s      248            0.0          340.0      0.0      0.0      0.0      0.0 payment
  586.0s      248            0.0           33.7      0.0      0.0      0.0      0.0 stockLevel
  587.0s      248            0.0           34.1      0.0      0.0      0.0      0.0 delivery
  587.0s      248            0.0          329.6      0.0      0.0      0.0      0.0 newOrder
  587.0s      248            0.0           34.2      0.0      0.0      0.0      0.0 orderStatus
  587.0s      248            0.0          339.4      0.0      0.0      0.0      0.0 payment
  587.0s      248            0.0           33.6      0.0      0.0      0.0      0.0 stockLevel
  588.0s      248            0.0           34.0      0.0      0.0      0.0      0.0 delivery
  588.0s      248            0.0          329.0      0.0      0.0      0.0      0.0 newOrder
  588.0s      248            0.0           34.2      0.0      0.0      0.0      0.0 orderStatus
  588.0s      248            0.0          338.9      0.0      0.0      0.0      0.0 payment
  588.0s      248            0.0           33.6      0.0      0.0      0.0      0.0 stockLevel
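Stalls like this can be picked out of the per-second workload output mechanically. The sketch below assumes the row layout implied by the workload's header line (`elapsed, errors, ops/sec(inst), ops/sec(cum), p50, p95, p99, pMax, operation`) and flags every second in which all operations report zero instantaneous throughput. The names `parseSample` and `stalledSeconds` are hypothetical and not part of the workload tool.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// sample is one per-second row of the workload's progress output.
type sample struct {
	elapsed string  // e.g. "585.0s"
	op      string  // e.g. "newOrder"
	instOps float64 // ops/sec(inst) column
}

// parseSample parses a row with nine whitespace-separated fields.
// Header and summary rows fail the checks and are skipped by the caller.
func parseSample(line string) (sample, bool) {
	f := strings.Fields(line)
	if len(f) != 9 {
		return sample{}, false
	}
	inst, err := strconv.ParseFloat(f[2], 64)
	if err != nil {
		return sample{}, false
	}
	return sample{elapsed: f[0], op: f[8], instOps: inst}, true
}

// stalledSeconds returns, in input order, the elapsed timestamps at which
// every parsed operation reported zero instantaneous throughput.
func stalledSeconds(lines []string) []string {
	allZero := map[string]bool{}
	var order []string
	for _, l := range lines {
		s, ok := parseSample(l)
		if !ok {
			continue
		}
		zero, seen := allZero[s.elapsed]
		if !seen {
			order = append(order, s.elapsed)
			zero = true
		}
		allZero[s.elapsed] = zero && s.instOps == 0
	}
	var out []string
	for _, e := range order {
		if allZero[e] {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	lines := []string{
		"  585.0s      248            0.0           34.2      0.0      0.0      0.0      0.0 delivery",
		"  585.0s      248            0.0          330.7      0.0      0.0      0.0      0.0 newOrder",
		"  586.0s      248            0.0           34.1      0.0      0.0      0.0      0.0 delivery",
	}
	fmt.Println(stalledSeconds(lines)) // prints [585.0s 586.0s]
}
```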

One thing that's a bit more reassuring, however, is that with #61777 there's basically no evidence of tracing memory usage in the heap profiles.

image

@JuanLeon1 JuanLeon1 removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Mar 15, 2021
@irfansharif
Contributor

07:53:45 tpcc.go:911: --- SEARCH ITER FAIL: TPCC 2200 resulted in 15338.6 tpmC and failed due to efficiency value of 55.312023623038606 is below ppassing threshold of 85
07:53:45 tpcc.go:805: initializing cluster for 2188 warehouses (search attempt: 2)
07:53:45 test.go:196: test status: stopping cluster
07:53:45 cluster.go:387: > /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod stop teamcity-2776621-1615788126-27-n4cpu16:1-3
teamcity-2776621-1615788126-27-n4cpu16: stopping and waiting
0: exit status 255: 

The fact that we're no longer efficient at 2200 warehouses is concerning. We used to be as of #55721. Roachperf suggests that we were that efficient before tpccbench started failing ~ Feb 15th (see here).

This test stops and restarts the cluster for each tpcc run. It runs the entire workload multiple times to find the point at which we can sustain a given warehouse count at 85% efficiency. We've done something to make ourselves a lot less efficient at 2200 warehouses, and because that's still our estimated max (#55721), that's the point we start our search from.

Ignoring for a second that we want to pin down what exactly resulted in this inefficiency, because the point where we start our search (2200 warehouses) is so high, the test right away puts the cluster into overload territory. As seen above, the roachtest infrastructure itself will not play well with this kind of overload. I've filed #62010 to track that thread.

@irfansharif
Contributor

What I'm going to do now is try running tpccbench at 2100 and step down as needed to see what max efficiency we should expect. That should at least sidestep the lack of #62010. It'll also tell us exactly how much we've regressed. After that I'll try to isolate where the regression is coming from. I'd be a bit surprised if it was still due to #59992, especially given these tests are running with #61777 and now have minimal tracing memory overhead (as shown in the profiles above).

@irfansharif
Contributor

irfansharif commented Mar 15, 2021

Probably also worth running multiple runs of the experiment with different values of sql.txn_stats.sample_rate. But still, with 0.1 (10% of stmts sampled), the traces barely show up in profiles, so I'm wondering if that's what it is.

@irfansharif
Contributor

2100 warehouses passes with flying colors:

# create the cluster
roachprod create irfansharif-tpccbench -n 4 --clouds=gce --gce-machine-type=n1-highcpu-16 --lifetime=24h0m0s --local-ssd-no-ext4-barrier

# stage the crdb binary 
roachprod put irfansharif-tpccbench (which cockroach-linux-current) ./cockroach
roachprod run irfansharif-tpccbench:1 './cockroach version'

# bounce the cluster
roachprod stop irfansharif-tpccbench:1-3
roachprod start irfansharif-tpccbench:1-3

# import dataset
roachprod run irfansharif-tpccbench:1 './cockroach workload fixtures import tpcc --warehouses=2500 --checks=false'

# run the workload
roachprod run irfansharif-tpccbench:4 './cockroach workload run tpcc --warehouses=2500 --active-warehouses=2100 --tolerate-errors --ramp=5m0s --duration=10m0s --histograms=perf/warehouses=2100/stats.json {pgurl:1-3}'
_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
  600.0s    25363.4  93.9%    344.2    302.0    671.1    805.3   1073.7   2684.4

Trying 2150 next. We should just drop our max warehouse count.
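The `efc` column in the summary can be sanity-checked by hand. The sketch below assumes efficiency is tpmC divided by a per-warehouse ceiling of 12.86 tpmC; that constant is an assumption, chosen because it reproduces the 93.9% reported for the 2100-warehouse run. The roachtest's own log line may normalize by a different constant, so this is not meant to reproduce its figures exactly.

```go
package main

import "fmt"

// maxTpmCPerWarehouse is the assumed per-warehouse tpmC ceiling. It is an
// assumption of this sketch, not a value taken from the test's source.
const maxTpmCPerWarehouse = 12.86

// passThreshold is the 85% efficiency bar mentioned in this thread.
const passThreshold = 85.0

// efficiency returns the measured tpmC as a percentage of the assumed
// ceiling for the given number of active warehouses.
func efficiency(tpmC float64, warehouses int) float64 {
	return 100 * tpmC / (maxTpmCPerWarehouse * float64(warehouses))
}

func main() {
	runs := []struct {
		warehouses int
		tpmC       float64
	}{
		{2100, 25363.4}, // the run reported in this comment
		{2150, 7857.0},  // the 2150-warehouse run reported later in this thread
	}
	for _, r := range runs {
		e := efficiency(r.tpmC, r.warehouses)
		fmt.Printf("%d warehouses: %.1f%% efficient, pass=%t\n", r.warehouses, e, e >= passThreshold)
	}
	// prints:
	// 2100 warehouses: 93.9% efficient, pass=true
	// 2150 warehouses: 28.4% efficient, pass=false
}
```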

@irfansharif
Contributor

So 2150 warehouses is one too many.

  540.0s        0          110.3          132.9  45097.2 103079.2 103079.2 103079.2 newOrder
  540.0s        0           18.0           13.9  33286.0  73014.4  81604.4  81604.4 orderStatus
  540.0s        0          154.4          137.4  38654.7  81604.4 103079.2 103079.2 payment
  540.0s        0           13.0           14.0  36507.2  85899.3  98784.2  98784.2 stockLevel
_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
  541.0s        0           14.0           13.9  26843.5  81604.4  90194.3  90194.3 delivery
  541.0s        0           82.0          132.8  47244.6 103079.2 103079.2 103079.2 newOrder
  541.0s        0            6.0           13.8  10737.4  51539.6  51539.6  51539.6 orderStatus
  541.0s        0          124.1          137.3  40802.2  85899.3 103079.2 103079.2 payment
  541.0s        0           16.0           14.0  42949.7  66572.0  81604.4  81604.4 stockLevel
  542.0s        0            5.0           13.9  32212.3  55834.6  55834.6  55834.6 delivery
  542.0s        0           54.0          132.6  34359.7 103079.2 103079.2 103079.2 newOrder
  542.0s        0            3.0           13.8  10200.5  28991.0  28991.0  28991.0 orderStatus
  542.0s        0          102.0          137.3  38654.7  85899.3 103079.2 103079.2 payment
  542.0s        0            3.0           14.0  34359.7  38654.7  38654.7  38654.7 stockLevel
  543.0s        0            0.0           13.9      0.0      0.0      0.0      0.0 delivery
  543.0s        0            0.0          132.4      0.0      0.0      0.0      0.0 newOrder
  543.0s        0            0.0           13.8      0.0      0.0      0.0      0.0 orderStatus
  543.0s        0            1.0          137.0  49392.1  49392.1  49392.1  49392.1 payment
  543.0s        0            0.0           13.9      0.0      0.0      0.0      0.0 stockLevel
  544.0s        0            0.0           13.9      0.0      0.0      0.0      0.0 delivery
  544.0s        0            0.0          132.1      0.0      0.0      0.0      0.0 newOrder
  544.0s        0            0.0           13.8      0.0      0.0      0.0      0.0 orderStatus
  544.0s        0            0.0          136.8      0.0      0.0      0.0      0.0 payment
  544.0s        0            0.0           13.9      0.0      0.0      0.0      0.0 stockLevel
_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
  545.0s        0            0.0           13.8      0.0      0.0      0.0      0.0 delivery
_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
  600.0s     7857.0  28.4%  43336.2  38654.7  90194.3 103079.2 103079.2 103079.2

Heh, I'm seeing the same failure mode as above, where the VM is just inoperable. roachprod {stop,status} irfansharif-tpccbench:1-3 just spins endlessly. Ok, I'll just lower the max warehouse count; this test is unsuitable at the limit without #62010.

@cockroach-teamcity
Member Author

(roachtest).tpccbench/nodes=3/cpu=16 failed on master@e9387a6e5dfdad71c74ccd0a07c907632613fa3e:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpccbench/nodes=3/cpu=16/run_1
	cluster.go:2220,tpcc.go:807,search.go:43,search.go:173,tpcc.go:803,tpcc.go:617,test_runner.go:768: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod stop teamcity-2780695-1615874329-25-n4cpu16:1-3 returned: exit status 1
		(1) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod stop teamcity-2780695-1615874329-25-n4cpu16:1-3 returned
		  | stderr:
		  |
		  | stdout:
		  | <... some data truncated by circular buffer; go to artifacts for details ...>
		  |
		  | 2: exit status 255: 
		  | I210316 12:29:24.984505 1 (gostd) cluster_synced.go:1732  [-] 1  command failed
		Wraps: (2) exit status 1
		Error types: (1) *main.withCommandDetails (2) *exec.ExitError


Artifacts: /tpccbench/nodes=3/cpu=16
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@irfansharif
Contributor

[x-post] From #62039 (comment):

Here are a few 2200-warehouse runs with sql.txn_stats.sample_rate = 0.0.

_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
 600.0s     6871.7  25.1%  26787.5  16643.0  68719.5  98784.2 103079.2 103079.2

_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
 600.0s     5709.4  20.9%  36440.3  28991.0  81604.4 103079.2 103079.2 103079.2

craig bot pushed a commit that referenced this issue Mar 16, 2021
62015: cli: Add some more warning comments to unsafe-remove-dead-replicas r=knz a=bdarnell

The comments always said this tool was meant to be used with the
supervision of a CRL engineer, but didn't otherwise make the risks
and downsides clear. Add some more explicit warnings which can also
serve as guidance for the supervising engineer.

Release note: None

62039: roachtest: stabilize tpccbench/nodes=3/cpu=16 r=irfansharif a=irfansharif

Fixes #61973. With tracing, our top-of-line TPC-C performance took a
hit. Given that the TPC-C line searcher starts off at the estimated max,
we're now starting off at "overloaded" territory; this makes for a very
unhappy roachtest.

Ideally we'd have something like #62010, or even admission control, to
make this test less noisy. Until then we can start off at a lower
max warehouse count.

This "fix" is still not a panacea: the entire tpccbench suite as written
tries to nudge the warehouse count until the efficiency is sub-85%.
Unfortunately, with our current infrastructure that's a stand-in for
"the point where nodes are overloaded and VMs no longer reachable". 
See #61974.

---

A longer-term approach to these tests could instead be as follows.
We could start our search at whatever the max warehouse count is (making
sure we've re-configured the max warehouses accordingly). These tests
could then PASS/FAIL for that given warehouse count, and only if FAIL,
could capture the extent of the regression by probing lower warehouse
counts. This is in contrast to what we're doing today where we capture
how high we can go (and by design risking going into overload territory,
with no protections for it).

Doing so lets us use this test suite to capture regressions from a given
baseline, rather than hoping our roachperf dashboards capture
unexpected perf improvements (if they're expected, we should update max
warehouses accordingly). In the steady state, we should want the
roachperf dashboards to be mostly flatlined, with step-increases when
we're re-upping the max warehouse count to incorporate various
system-wide performance increases.

Release note: None

Co-authored-by: Ben Darnell <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
@craig craig bot closed this as completed in 72c96fa Mar 16, 2021
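The longer-term approach described in the commit message for #62039 (PASS/FAIL at the configured max, probing lower counts only on FAIL) could look roughly like the sketch below. Every name in it (`runFn`, `checkBaseline`, the step and floor parameters) is hypothetical, and the stubbed efficiencies are illustrative only; a real run would invoke the workload and would still need protection against overload.

```go
package main

import "fmt"

// runFn runs the workload at the given warehouse count and returns the
// measured efficiency in percent. Hypothetical stand-in for a full run.
type runFn func(warehouses int) float64

// result records whether the baseline held and, if not, the highest
// lower warehouse count that still met the threshold (0 if none did).
type result struct {
	pass      bool
	sustained int
}

// checkBaseline first tests the configured max. Only when that fails does
// it step downward to measure the extent of the regression.
func checkBaseline(run runFn, max, step, floor int, threshold float64) result {
	if run(max) >= threshold {
		return result{pass: true, sustained: max}
	}
	for w := max - step; w >= floor; w -= step {
		if run(w) >= threshold {
			return result{pass: false, sustained: w}
		}
	}
	return result{pass: false}
}

func main() {
	// Stubbed efficiencies, loosely modeled on the manual runs in this
	// thread: healthy at or below 2100 warehouses, overloaded above.
	stub := func(w int) float64 {
		if w <= 2100 {
			return 93.9
		}
		return 28.4
	}
	r := checkBaseline(stub, 2200, 50, 2000, 85)
	fmt.Printf("pass=%t sustained=%d\n", r.pass, r.sustained) // prints pass=false sustained=2100
}
```

With this shape, a passing run costs a single workload execution at the baseline, and the downward probe only runs when there is a regression to size.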

4 participants