
roachtest: tpccbench/nodes=9/cpu=4/chaos/partition failed #79568

Closed
cockroach-teamcity opened this issue Apr 7, 2022 · 7 comments · Fixed by #79774
Labels
branch-release-22.1 Used to mark GA and release blockers, technical advisories, and bugs for 22.1 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-kv KV Team


cockroach-teamcity commented Apr 7, 2022

roachtest.tpccbench/nodes=9/cpu=4/chaos/partition failed with artifacts on release-22.1 @ 5d1063406d48c2713e0a7a8421f1946349b3db65:

The test failed on branch=release-22.1, cloud=gce:
test artifacts and logs in: /artifacts/tpccbench/nodes=9/cpu=4/chaos/partition/run_1
	cluster.go:1868,tpcc.go:1153,tpcc.go:1163,search.go:43,search.go:173,tpcc.go:1159,tpcc.go:931,test_runner.go:875: one or more parallel execution failure
		(1) attached stack trace
		  -- stack trace:
		  | github.com/cockroachdb/cockroach/pkg/roachprod/install.(*SyncedCluster).ParallelE
		  | 	github.com/cockroachdb/cockroach/pkg/roachprod/install/cluster_synced.go:2038
		  | github.com/cockroachdb/cockroach/pkg/roachprod/install.(*SyncedCluster).Parallel
		  | 	github.com/cockroachdb/cockroach/pkg/roachprod/install/cluster_synced.go:1919
		  | github.com/cockroachdb/cockroach/pkg/roachprod/install.(*SyncedCluster).Start
		  | 	github.com/cockroachdb/cockroach/pkg/roachprod/install/cockroach.go:167
		  | github.com/cockroachdb/cockroach/pkg/roachprod.Start
		  | 	github.com/cockroachdb/cockroach/pkg/roachprod/roachprod.go:660
		  | main.(*clusterImpl).StartE
		  | 	main/pkg/cmd/roachtest/cluster.go:1826
		  | main.(*clusterImpl).Start
		  | 	main/pkg/cmd/roachtest/cluster.go:1867
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runTPCCBench.func3
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:1153
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runTPCCBench.func4
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:1163
		  | github.com/cockroachdb/cockroach/pkg/util/search.searchWithSearcher
		  | 	github.com/cockroachdb/cockroach/pkg/util/search/search.go:43
		  | github.com/cockroachdb/cockroach/pkg/util/search.(*lineSearcher).Search
		  | 	github.com/cockroachdb/cockroach/pkg/util/search/search.go:173
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runTPCCBench
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:1159
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerTPCCBenchSpec.func1
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:931
		  | main.(*testRunner).runTest.func2
		  | 	main/pkg/cmd/roachtest/test_runner.go:875
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1581
		Wraps: (2) one or more parallel execution failure
		Error types: (1) *withstack.withStack (2) *errutil.leafError
Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/kv-triage


Jira issue: CRDB-15945

@cockroach-teamcity cockroach-teamcity added branch-release-22.1 Used to mark GA and release blockers, technical advisories, and bugs for 22.1 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Apr 7, 2022
@blathers-crl blathers-crl bot added the T-kv KV Team label Apr 7, 2022
@nvanbenschoten

The test fails with a 255 error code from ./cockroach version --build-tag.

0: ~ ./cockroach version --build-tag: exit status 255
(1) attached stack trace
  -- stack trace:
  | github.com/cockroachdb/cockroach/pkg/roachprod/install.getCockroachVersion
  | 	github.com/cockroachdb/cockroach/pkg/roachprod/install/cockroach.go:88
  | github.com/cockroachdb/cockroach/pkg/roachprod/install.(*SyncedCluster).Start.func1
  | 	github.com/cockroachdb/cockroach/pkg/roachprod/install/cockroach.go:169
  | github.com/cockroachdb/cockroach/pkg/roachprod/install.(*SyncedCluster).ParallelE.func1.1
  | 	github.com/cockroachdb/cockroach/pkg/roachprod/install/cluster_synced.go:1958

On node 1 itself (0 here), we see some strange behavior in the logs, which seem to have been corrupted at the end:

I220407 08:14:30.731573 36 util/admission/granter.go:926 ⋮ [n1] 23454  CPULoad switching to period ‹1ms›
I220407 08:14:31.732093 36 util/admission/granter.go:926 ⋮ [n1] 23455  CPULoad switching to period ‹250ms›
W220407 08:14:36.797841 179 2@rpc/pkg/rpc/clock_offset.go:216 ⋮ [n1,rnode=63,raddr=‹34.148.203.219:26257›,class=system,heartbeat] 23456  latency jump (prev avg 1.01ms, current 1.69ms)
W220407 08:14:40.268568 1391 2@rpc/pkg/rpc/clock_offset.go:216 ⋮ [n1,rnode=3,raddr=‹10.142.0.67:26257›,class=default,heartbeat] 23457  latency jump (prev avg 0.66ms, current 1.11ms)
W220407 08:14:40.586356 820 2@rpc/pkg/rpc/clock_offset.go:216 ⋮ [n1,rnode=1,raddr=‹10.142.0.170:26257›,class=default,heartbeat] 23458  latency jump (prev avg 1.36ms, current 6.00ms)
W220407 08:14:52.520764 3707 2@rpc/pkg/rpc/clock_offset.go:216 ⋮ [n1,rnode=6,raddr=‹10.142.0.182:26257›,class=system,heartbeat] 23459  latency jump (prev avg 0.65ms, current 11.51ms)
[log file ends here in a long run of NUL/replacement bytes — corrupted]

There's nothing in the dmesg or journalctl output.

I'm not going to close this as a test flake because it's an interesting failure mode, but this isn't an issue in KV and it's also not a release blocker, so I'm transferring it to Test Eng to let them decide whether to dig deeper or to track it going forward.

@nvanbenschoten nvanbenschoten removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Apr 11, 2022
@srosenberg

This looks like a potential race condition inside restart (tpcc.go). The timestamps in the log suggest that the StopE loop succeeds before some of the VMs have finished resetting. Subsequently, Start fails to execute ./cockroach version --build-tag:

08:14:55 cluster.go:659: test status: resetting cluster
08:15:08 cluster.go:659: test status: stopping nodes :1-9
teamcity-4820281-1649308765-22-n10cpu4: stopping
0: exit status 255: 
1: exit status 255: 
2: exit status 255: 
3: exit status 255: 
4: exit status 255: 
5: exit status 255: 
6: exit status 255: 
7: exit status 255: 
8: exit status 255: 
08:15:13 tpcc.go:1133: unable to stop cluster; retrying to allow vm to recover: cluster.StopE: one or more parallel execution failure
08:15:43 cluster.go:659: test status: stopping nodes :1-9
teamcity-4820281-1649308765-22-n10cpu4: stopping
08:15:44 cluster.go:659: test status: starting nodes :1-9
08:15:44 cockroach.go:166: teamcity-4820281-1649308765-22-n10cpu4: starting nodes
0: ~ ./cockroach version --build-tag: exit status 255

It's unclear why ./cockroach version --build-tag would ever fail unless the remote host is unavailable (exit status 255 comes from ssh itself).

@tbg

tbg commented Apr 11, 2022

We should probably remove this reset step. This dates back to when we deployed CRDB without systemd and could reliably freeze the VM into oblivion.

tbg added a commit to tbg/cockroach that referenced this issue Apr 11, 2022
This was originally introduced when CRDB could push the VM to brown out
(sometimes forever), which lost us a lot of signal from `tpccbench` (on top of
causing flakes).

However, for a long time now we've deployed CRDB in a cgroup that leaves enough
headroom for the OS to reliably oomkill CRDB if necessary, so `c.Reset` is no
longer necessary.

Fixes cockroachdb#79568.

Release note: None
@tbg

tbg commented Apr 11, 2022

#79774

@srosenberg

We should probably remove this reset step. This dates back to when we deployed CRDB without systemd and could reliably freeze the VM into oblivion.

Without reset, tpccbench will likely show performance regressions and higher variance (across runs). We have supporting evidence from the cloud report [1]. Essentially, the reset leads to more stable tpcc performance on the subsequent iteration, because that iteration doesn't need to scan over garbage accumulated during previous iterations [2]. Incidentally, we couldn't schedule tpccbench on AWS because reset is not implemented in master, but I have a patch in my local branch.

[1] https://cockroachlabs.atlassian.net/browse/CRDB-13799
[2] #17229

@tbg

tbg commented Apr 12, 2022

Essentially, the reset leads to more stable tpcc performance on the subsequent iteration because the subsequent iteration doesn't need to scan over garbage accumulated during previous iterations [2]

[2] seems wholly unconnected to the problem of building up garbage that you linked. reset hard-reboots the VM. We retain the data directory. From the point of view of CRDB, there is no difference between restarting the process and resetting the VM.

@craig craig bot closed this as completed in fc84f79 Apr 12, 2022
@srosenberg

[2] seems wholly unconnected to the problem of building up garbage that you linked. reset hard-reboots the VM. We retain the data directory.

Ah, I see; that's because all roachtests use a persistent drive (PreferLocalSSD defaults to false) [1], whereas roachprod defaults to local SSD [2] (this is what confused me).

From the point of view of CRDB, there is no difference between restarting the process and resetting the VM.

Yep, although VM reset may alter the performance of the OS, e.g., different working set of SSTs in memory.

[1] https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/spec/cluster_spec.go#L206
[2] https://github.com/cockroachdb/cockroach/blob/master/pkg/roachprod/vm/vm.go#L200
