Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: tpccbench/nodes=9/cpu=4/chaos/partition failed #36720

Closed
cockroach-teamcity opened this issue Apr 10, 2019 · 13 comments
Closed

roachtest: tpccbench/nodes=9/cpu=4/chaos/partition failed #36720

cockroach-teamcity opened this issue Apr 10, 2019 · 13 comments
Assignees
Labels
C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.
Milestone

Comments

@cockroach-teamcity
Copy link
Member

SHA: https://github.com/cockroachdb/cockroach/commits/682c2f2f466bbf768545ca4687822206a63983ad

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpccbench/nodes=9/cpu=4/chaos/partition PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1231772&tab=buildLog

The test failed on master:
	cluster.go:1688,tpcc.go:839,tpcc.go:525,test.go:1237: unexpected node event: 1: dead
	cluster.go:953,context.go:90,cluster.go:942,asm_amd64.s:522,panic.go:397,test.go:785,test.go:771,cluster.go:1688,tpcc.go:839,tpcc.go:525,test.go:1237: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1231772-tpccbench-nodes-9-cpu-4-chaos-partition --oneshot --ignore-empty-nodes: exit status 1 10: skipped
		2: 3871
		3: 3749
		5: 3585
		9: 3915
		6: 4217
		7: 3603
		8: 3689
		4: 3633
		1: dead
		Error:  1: dead

@cockroach-teamcity cockroach-teamcity added this to the 19.1 milestone Apr 10, 2019
@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. labels Apr 10, 2019
@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/9938cb1a2cca4c0350244f76845f0c61391d44a7

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpccbench/nodes=9/cpu=4/chaos/partition PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1249130&tab=buildLog

The test failed on release-19.1:
	cluster.go:1255,tpcc.go:730,search.go:47,search.go:177,tpcc.go:725,cluster.go:1667,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod start --encrypt=false --racks=3 teamcity-1249130-tpccbench-nodes-9-cpu-4-chaos-partition:1-9 returned:
		stderr:
		
		stdout:
		CH_SKIP_ENABLING_DIAGNOSTIC_REPORTING=1 COCKROACH_ENABLE_RPC_COMPRESSION=false ./cockroach start --insecure --store=path=/mnt/data1/cockroach --log-dir=${HOME}/logs --background --cache=25% --max-sql-memory=25% --port=26257 --http-port=26258 --locality=cloud=gce,region=us-east1,zone=us-east1-b,rack=0 --join=34.73.224.171:26257 >> ${HOME}/logs/cockroach.stdout.log 2>> ${HOME}/logs/cockroach.stderr.log || (x=$?; cat ${HOME}/logs/cockroach.stderr.log; exit $x)
		Connection to 34.73.224.54 closed by remote host.
		
		github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.Cockroach.Start.func7
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cockroach.go:397
		github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.(*SyncedCluster).Parallel.func1.1
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cluster_synced.go:1420
		runtime.goexit
			/usr/local/go/src/runtime/asm_amd64.s:1333: 
		I190419 18:36:08.509076 1 cluster_synced.go:1502  command failed
		: exit status 1
	cluster.go:1688,tpcc.go:842,tpcc.go:543,test.go:1237: Goexit() was called
	cluster.go:953,context.go:89,cluster.go:942,asm_amd64.s:522,panic.go:397,test.go:785,test.go:771,cluster.go:1688,tpcc.go:842,tpcc.go:543,test.go:1237: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1249130-tpccbench-nodes-9-cpu-4-chaos-partition --oneshot --ignore-empty-nodes: exit status 1 10: skipped
		7: dead
		8: 7042
		3: 4974
		9: 6944
		5: 5497
		2: 5086
		4: 5092
		1: 5147
		6: 5040
		Error:  7: dead

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/e1b3d0fdf11783203e76e7a2b3add59e8562a58d

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpccbench/nodes=9/cpu=4/chaos/partition PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1249112&tab=buildLog

The test failed on master:
	cluster.go:1688,tpcc.go:842,tpcc.go:543,test.go:1237: could not restart node :2: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod start teamcity-1249112-tpccbench-nodes-9-cpu-4-chaos-partition:2 returned:
		stderr:
		
		stdout:
		COCKROACH_SKIP_ENABLING_DIAGNOSTIC_REPORTING=1 COCKROACH_ENABLE_RPC_COMPRESSION=false ./cockroach start --insecure --store=path=/mnt/data1/cockroach --log-dir=${HOME}/logs --background --cache=25% --max-sql-memory=25% --port=26257 --http-port=26258 --locality=cloud=gce,region=us-east1,zone=us-east1-b --join=35.196.123.70:26257 >> ${HOME}/logs/cockroach.stdout.log 2>> ${HOME}/logs/cockroach.stderr.log || (x=$?; cat ${HOME}/logs/cockroach.stderr.log; exit $x)
		Connection to 35.231.69.29 closed by remote host.
		
		github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.Cockroach.Start.func7
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cockroach.go:397
		github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.(*SyncedCluster).Parallel.func1.1
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cluster_synced.go:1420
		runtime.goexit
			/usr/local/go/src/runtime/asm_amd64.s:1333: 
		I190419 18:43:25.191179 1 cluster_synced.go:1502  command failed
		: exit status 1
	cluster.go:953,context.go:89,cluster.go:942,asm_amd64.s:522,panic.go:397,test.go:785,test.go:771,cluster.go:1688,tpcc.go:842,tpcc.go:543,test.go:1237: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1249112-tpccbench-nodes-9-cpu-4-chaos-partition --oneshot --ignore-empty-nodes: exit status 1 2: dead
		10: skipped
		1: 6473
		7: 5993
		6: 6591
		9: 6558
		3: 6039
		4: 8319
		5: 6669
		8: 6158
		Error:  2: dead

@nvanbenschoten
Copy link
Member

CHAOS: 18:43:28 chaos.go:66: chaos stopping: could not restart node :2: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod start teamcity-1249112-tpccbench-nodes-9-cpu-4-chaos-partition:2 returned:
stderr:

stdout:
COCKROACH_SKIP_ENABLING_DIAGNOSTIC_REPORTING=1 COCKROACH_ENABLE_RPC_COMPRESSION=false ./cockroach start --insecure --store=path=/mnt/data1/cockroach --log-dir=${HOME}/logs --background --cache=25% --max-sql-memory=25% --port=26257 --http-port=26258 --locality=cloud=gce,region=us-east1,zone=us-east1-b --join=35.196.123.70:26257 >> ${HOME}/logs/cockroach.stdout.log 2>> ${HOME}/logs/cockroach.stderr.log || (x=$?; cat ${HOME}/logs/cockroach.stderr.log; exit $x)
Connection to 35.231.69.29 closed by remote host.

github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.Cockroach.Start.func7
	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cockroach.go:397
github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.(*SyncedCluster).Parallel.func1.1
	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cluster_synced.go:1420
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1333: 
I190419 18:43:25.191179 1 cluster_synced.go:1502  command failed
: exit status 1

I just opened #37001 to take a stab at addressing this Connection to 35.231.69.29 closed by remote host error.

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/7109d291e3b9edfa38264361f832cec14fff66ee

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpccbench/nodes=9/cpu=4/chaos/partition PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1259219&tab=buildLog

The test failed on release-19.1:
	cluster.go:1782,tpcc.go:848,tpcc.go:549,test.go:1245: could not stop node :9: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod stop teamcity-1259219-tpccbench-nodes-9-cpu-4-chaos-partition:9 returned:
		stderr:
		
		stdout:
		teamcity-1259219-tpccbench-nodes-9-cpu-4-chaos-partition: stopping and waiting........................................................................................................................
		0: exit status 255: 
		I190425 06:27:20.735775 1 cluster_synced.go:1523  command failed
		: exit status 1

@tbg
Copy link
Member

tbg commented Apr 25, 2019

roachtest at 3ffc259 which had #37077 but not #37001, so there's hope.

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/efb45869b242137e5c178b10c646c3ed025fff36

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpccbench/nodes=9/cpu=4/chaos/partition PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1266041&tab=buildLog

The test failed on branch=master, cloud=gce:
	cluster.go:1835,tpcc.go:848,tpcc.go:549,test.go:1253: could not restart node :5: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod start teamcity-1266041-tpccbench-nodes-9-cpu-4-chaos-partition:5 returned:
		stderr:
		
		stdout:
		KROACH_SKIP_ENABLING_DIAGNOSTIC_REPORTING=1 COCKROACH_ENABLE_RPC_COMPRESSION=false ./cockroach start --insecure --store=path=/mnt/data1/cockroach --log-dir=${HOME}/logs --background --cache=25% --max-sql-memory=25% --port=26257 --http-port=26258 --locality=cloud=gce,region=us-east1,zone=us-east1-b --join=35.190.183.149:26257 >> ${HOME}/logs/cockroach.stdout.log 2>> ${HOME}/logs/cockroach.stderr.log || (x=$?; cat ${HOME}/logs/cockroach.stderr.log; exit $x)
		Connection to 35.237.238.212 closed by remote host.
		
		github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.Cockroach.Start.func7
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cockroach.go:400
		github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.(*SyncedCluster).Parallel.func1.1
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cluster_synced.go:1447
		runtime.goexit
			/usr/local/go/src/runtime/asm_amd64.s:1333: 
		I190430 16:51:30.680681 1 cluster_synced.go:1529  command failed
		: exit status 1
	cluster.go:1038,context.go:89,cluster.go:1027,asm_amd64.s:522,panic.go:397,test.go:790,test.go:776,cluster.go:1835,tpcc.go:848,tpcc.go:549,test.go:1253: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1266041-tpccbench-nodes-9-cpu-4-chaos-partition --oneshot --ignore-empty-nodes: exit status 1 10: skipped
		5: dead
		1: 6670
		4: 6078
		9: 6435
		6: 6048
		3: 6274
		8: 6042
		7: 5704
		2: 6505
		Error:  5: dead

craig bot pushed a commit that referenced this issue Apr 30, 2019
37132: storage: remove spammy log about failed RemoveTarget simulation r=nvanbenschoten a=nvanbenschoten

We repeatedly see in the logs of clusters undergoing chaos:
```
W190425 06:17:19.438001 251 storage/allocator.go:639  [n9,replicate,s9,r2946/3:/Table/53/1/96{2/4/2…-4/4/7…}] simulating RemoveTarget failed: must supply at least one candidate replica to allocator.RemoveTarget()
W190425 06:17:19.638430 289 storage/allocator.go:639  [n9,replicate,s9,r6893/5:/Table/59/1/158{6/9/-…-7/1/-…}] simulating RemoveTarget failed: must supply at least one candidate replica to allocator.RemoveTarget()
W190425 06:17:19.837870 256 storage/allocator.go:639  [n9,replicate,s9,r6472/3:/Table/59/1/129{0/2/-…-2/1/-…}] simulating RemoveTarget failed: must supply at least one candidate replica to allocator.RemoveTarget()
W190425 06:17:20.238586 276 storage/allocator.go:639  [n9,replicate,s9,r1555/3:/Table/54/1/"\xc8\x{das…-f5\…}] simulating RemoveTarget failed: must supply at least one candidate replica to allocator.RemoveTarget()
```

For instance, it dominates the logs in #36720.

This was first picked up in #34860 (comment). This seems relatively new, so it might have been impacted by 3c76d77.

Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/29bf8a1eb0b5b943127206f49333337f7f3079b4

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpccbench/nodes=9/cpu=4/chaos/partition PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1272533&tab=buildLog

The test failed on branch=master, cloud=gce:
	cluster.go:1833,tpcc.go:853,tpcc.go:554,test.go:1251: unexpected node event: 9: dead
	cluster.go:1038,context.go:89,cluster.go:1027,asm_amd64.s:522,panic.go:397,test.go:788,test.go:774,cluster.go:1833,tpcc.go:853,tpcc.go:554,test.go:1251: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1272533-tpccbench-nodes-9-cpu-4-chaos-partition --oneshot --ignore-empty-nodes: exit status 1 10: skipped
		4: dead
		9: dead
		2: dead
		6: dead
		8: 4003
		7: 4033
		1: 4183
		3: 4087
		5: 4118
		Error:  4: dead, 9: dead, 2: dead, 6: dead

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/d554884a4e474cc06213230d5ba7d757a88e9e46

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpccbench/nodes=9/cpu=4/chaos/partition PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1279548&tab=buildLog

The test failed on branch=release-2.1, cloud=gce:
	cluster.go:1833,tpcc.go:853,tpcc.go:554,test.go:1251: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1279548-tpccbench-nodes-9-cpu-4-chaos-partition:10 -- ./workload fixtures load tpcc --warehouses=2000 --split --scatter --checks=false --partitions=3 {pgurl:1} returned:
		stderr:
		
		stdout:
		0047d7c8 pc=0x466f61
		created by database/sql.OpenDB
			/usr/local/go/src/database/sql/sql.go:671 +0x15d
		
		goroutine 14 [select, 47 minutes]:
		runtime.gopark(0x15ed1e8, 0x0, 0x1809, 0x1)
			/usr/local/go/src/runtime/proc.go:302 +0xeb fp=0xc00047ddc0 sp=0xc00047dda0 pc=0x438f8b
		runtime.selectgo(0xc00047df68, 0xc00047df48, 0x2, 0x0, 0x0)
			/usr/local/go/src/runtime/select.go:313 +0xcc6 fp=0xc00047df20 sp=0xc00047ddc0 pc=0x448e66
		database/sql.(*DB).connectionResetter(0xc000612000, 0x19949e0, 0xc0006fa340)
			/usr/local/go/src/database/sql/sql.go:1014 +0xfb fp=0xc00047dfc8 sp=0xc00047df20 pc=0x4f89db
		runtime.goexit()
			/usr/local/go/src/runtime/asm_amd64.s:1333 +0x1 fp=0xc00047dfd0 sp=0xc00047dfc8 pc=0x466f61
		created by database/sql.OpenDB
			/usr/local/go/src/database/sql/sql.go:672 +0x193
		bash: line 1:  4684 Aborted                 (core dumped) bash -c "./workload fixtures load tpcc --warehouses=2000 --split --scatter --checks=false --partitions=3 'postgres://[email protected]:26257?sslmode=disable'"
		Error:  exit status 134
		: exit status 1

@danhhz
Copy link
Contributor

danhhz commented May 8, 2019

Most recent failures are fixed by #37401

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/8abb47a1c9795c1463183bc44e776b054bece682

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpccbench/nodes=9/cpu=4/chaos/partition PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1279683&tab=buildLog

The test failed on branch=master, cloud=gce:
	cluster.go:1833,tpcc.go:853,tpcc.go:554,test.go:1251: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1279683-tpccbench-nodes-9-cpu-4-chaos-partition:10 -- ./workload fixtures load tpcc --warehouses=2000 --split --scatter --checks=false --partitions=3 {pgurl:1} returned:
		stderr:
		
		stdout:
		3 +0x1 fp=0xc000474fd0 sp=0xc000474fc8 pc=0x466f61
		created by database/sql.OpenDB
			/usr/local/go/src/database/sql/sql.go:671 +0x15d
		
		goroutine 43 [timer goroutine (idle)]:
		runtime.gopark(0x15ed1a8, 0x28caec0, 0x2082c691414, 0x1)
			/usr/local/go/src/runtime/proc.go:302 +0xeb fp=0xc0007a3f28 sp=0xc0007a3f08 pc=0x438f8b
		runtime.goparkunlock(0x28caec0, 0xc001051414, 0x1)
			/usr/local/go/src/runtime/proc.go:308 +0x53 fp=0xc0007a3f58 sp=0xc0007a3f28 pc=0x439033
		runtime.timerproc(0x28caec0)
			/usr/local/go/src/runtime/time.go:280 +0x288 fp=0xc0007a3fd8 sp=0xc0007a3f58 pc=0x456168
		runtime.goexit()
			/usr/local/go/src/runtime/asm_amd64.s:1333 +0x1 fp=0xc0007a3fe0 sp=0xc0007a3fd8 pc=0x466f61
		created by runtime.(*timersBucket).addtimerLocked
			/usr/local/go/src/runtime/time.go:170 +0x114
		bash: line 1:  4744 Aborted                 (core dumped) bash -c "./workload fixtures load tpcc --warehouses=2000 --split --scatter --checks=false --partitions=3 'postgres://[email protected]:26257?sslmode=disable'"
		Error:  exit status 134
		: exit status 1

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/45e15e05abff25e099ca59f4c5cb40a6cf695e6d

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpccbench/nodes=9/cpu=4/chaos/partition PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1285294&tab=buildLog

The test failed on branch=master, cloud=gce:
	cluster.go:1400,tpcc.go:741,search.go:47,search.go:177,tpcc.go:736,cluster.go:1812,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod start --encrypt=false --racks=3 teamcity-1285294-tpccbench-nodes-9-cpu-4-chaos-partition:1-9 returned:
		stderr:
		
		stdout:
		teamcity-1285294-tpccbench-nodes-9-cpu-4-chaos-partition: starting..................................................................................................................................
		1: exit status 255
		~ ./cockroach version
		Connection to 35.243.246.62 closed by remote host.
		
		github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.getCockroachVersion
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cockroach.go:95
		github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.Cockroach.Start.func7
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cockroach.go:289
		github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.(*SyncedCluster).Parallel.func1.1
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cluster_synced.go:1449
		runtime.goexit
			/usr/local/go/src/runtime/asm_amd64.s:1333: 
		I190511 16:56:01.132749 1 cluster_synced.go:1531  command failed
		: exit status 1
	cluster.go:1833,tpcc.go:853,tpcc.go:554,test.go:1251: Goexit() was called
	cluster.go:1038,context.go:89,cluster.go:1027,asm_amd64.s:522,panic.go:397,test.go:788,test.go:774,cluster.go:1833,tpcc.go:853,tpcc.go:554,test.go:1251: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1285294-tpccbench-nodes-9-cpu-4-chaos-partition --oneshot --ignore-empty-nodes: exit status 1 10: skipped
		2: dead
		4: 12105
		1: 10992
		8: 10519
		3: 11724
		6: 11415
		7: 11954
		5: 12169
		9: 11519
		Error:  2: dead

@tbg
Copy link
Member

tbg commented May 12, 2019

The first ssh invocation in the closure here returned the error, it's calling ./cockroach version (on n2/35.243.246.62).gi

c.Parallel(display, len(nodes), p, func(i int) ([]byte, error) {
vers, err := getCockroachVersion(c, nodes[i])

I think this is the server side of journalctl for the failed attempt. The log goes silent after that for almost a minute.

May 11 16:53:51 teamcity-1285294-tpccbench-nodes-9-cpu-4-chaos-partition-0002 sshd[3350]: debug1: Forked child 10926.
May 11 16:53:51 teamcity-1285294-tpccbench-nodes-9-cpu-4-chaos-partition-0002 sshd[10926]: debug1: Set /proc/self/oom_score_adj to 0
May 11 16:53:51 teamcity-1285294-tpccbench-nodes-9-cpu-4-chaos-partition-0002 sshd[10926]: debug1: rexec start in 5 out 5 newsock 5 pipe 7 sock 8
May 11 16:53:51 teamcity-1285294-tpccbench-nodes-9-cpu-4-chaos-partition-0002 sshd[10926]: debug1: inetd sockets after dupping: 3, 3
May 11 16:53:51 teamcity-1285294-tpccbench-nodes-9-cpu-4-chaos-partition-0002 sshd[10926]: Connection from 35.237.253.190 port 60742 on 10.142.0.19 port 22
May 11 16:53:51 teamcity-1285294-tpccbench-nodes-9-cpu-4-chaos-partition-0002 sshd[10926]: debug1: Client protocol version 2.0; client software version OpenSSH_7.2p2 Ubuntu-4ubuntu2.6
May 11 16:53:51 teamcity-1285294-tpccbench-nodes-9-cpu-4-chaos-partition-0002 sshd[10926]: debug1: match: OpenSSH_7.2p2 Ubuntu-4ubuntu2.6 pat OpenSSH* compat 0x04000000
May 11 16:53:51 teamcity-1285294-tpccbench-nodes-9-cpu-4-chaos-partition-0002 sshd[10926]: debug1: Enabling compatibility mode for protocol 2.0
May 11 16:53:51 teamcity-1285294-tpccbench-nodes-9-cpu-4-chaos-partition-0002 sshd[10926]: debug1: Local version string SSH-2.0-OpenSSH_7.2p2 Ubuntu-4ubuntu2.6
May 11 16:53:51 teamcity-1285294-tpccbench-nodes-9-cpu-4-chaos-partition-0002 sshd[10926]: debug1: permanently_set_uid: 110/65534 [preauth]
May 11 16:53:51 teamcity-1285294-tpccbench-nodes-9-cpu-4-chaos-partition-0002 sshd[10926]: debug1: list_hostkey_types: ssh-rsa,rsa-sha2-512,rsa-sha2-256,ecdsa-sha2-nistp256,ssh-ed25519 [preauth]
May 11 16:53:51 teamcity-1285294-tpccbench-nodes-9-cpu-4-chaos-partition-0002 sshd[10926]: debug1: SSH2_MSG_KEXINIT sent [preauth]
May 11 16:53:51 teamcity-1285294-tpccbench-nodes-9-cpu-4-chaos-partition-0002 sshd[10926]: debug1: SSH2_MSG_KEXINIT received [preauth]
May 11 16:53:51 teamcity-1285294-tpccbench-nodes-9-cpu-4-chaos-partition-0002 sshd[10926]: debug1: kex: algorithm: [email protected] [preauth]
May 11 16:53:51 teamcity-1285294-tpccbench-nodes-9-cpu-4-chaos-partition-0002 sshd[10926]: debug1: kex: host key algorithm: ecdsa-sha2-nistp256 [preauth]
May 11 16:53:51 teamcity-1285294-tpccbench-nodes-9-cpu-4-chaos-partition-0002 sshd[10926]: debug1: kex: client->server cipher: [email protected] MAC: <implicit> compression: none [preauth]
May 11 16:53:51 teamcity-1285294-tpccbench-nodes-9-cpu-4-chaos-partition-0002 sshd[10926]: debug1: kex: server->client cipher: [email protected] MAC: <implicit> compression: none [preauth]
May 11 16:53:51 teamcity-1285294-tpccbench-nodes-9-cpu-4-chaos-partition-0002 sshd[10926]: debug1: expecting SSH2_MSG_KEX_ECDH_INIT [preauth]
May 11 16:53:51 teamcity-1285294-tpccbench-nodes-9-cpu-4-chaos-partition-0002 sshd[10926]: debug1: rekey after 134217728 blocks [preauth]
May 11 16:53:51 teamcity-1285294-tpccbench-nodes-9-cpu-4-chaos-partition-0002 sshd[10926]: debug1: SSH2_MSG_NEWKEYS sent [preauth]
May 11 16:53:51 teamcity-1285294-tpccbench-nodes-9-cpu-4-chaos-partition-0002 sshd[10926]: debug1: expecting SSH2_MSG_NEWKEYS [preauth]
May 11 16:53:51 teamcity-1285294-tpccbench-nodes-9-cpu-4-chaos-partition-0002 sshd[10926]: debug1: SSH2_MSG_NEWKEYS received [preauth]
May 11 16:53:51 teamcity-1285294-tpccbench-nodes-9-cpu-4-chaos-partition-0002 sshd[10926]: debug1: rekey after 134217728 blocks [preauth]
May 11 16:53:51 teamcity-1285294-tpccbench-nodes-9-cpu-4-chaos-partition-0002 sshd[10926]: debug1: KEX done [preauth]
May 11 16:53:51 teamcity-1285294-tpccbench-nodes-9-cpu-4-chaos-partition-0002 sshd[10926]: debug1: userauth-request for user root service ssh-connection method none [preauth]
May 11 16:53:51 teamcity-1285294-tpccbench-nodes-9-cpu-4-chaos-partition-0002 sshd[10926]: debug1: attempt 0 failures 0 [preauth]
May 11 16:53:51 teamcity-1285294-tpccbench-nodes-9-cpu-4-chaos-partition-0002 sshd[10926]: debug1: PAM: initializing for "root"
May 11 16:53:51 teamcity-1285294-tpccbench-nodes-9-cpu-4-chaos-partition-0002 sshd[10926]: debug1: PAM: setting PAM_RHOST to "35.237.253.190"
May 11 16:53:51 teamcity-1285294-tpccbench-nodes-9-cpu-4-chaos-partition-0002 sshd[10926]: debug1: PAM: setting PAM_TTY to "ssh"
[silence for almost 1 minute]
[unrelated log entries]

@tbg
Copy link
Member

tbg commented May 12, 2019

Going to close this out, didn't seem like any past posts here are useful. I'm landing #37483 to get more info in the future.

@tbg tbg closed this as completed May 12, 2019
tbg added a commit to tbg/cockroach that referenced this issue May 12, 2019
Stash the verbose debug logs in a file for each remote ssh session,
and surface it with the error should one occur.

For the latest in many past incidents where this would've been useful,
see:

cockroachdb#36720 (comment)

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue May 13, 2019
Stash the verbose debug logs in a file for each remote ssh session,
and surface it with the error should one occur.

For the latest in many past incidents where this would've been useful,
see:

cockroachdb#36720 (comment)

Release note: None
craig bot pushed a commit that referenced this issue May 13, 2019
37483: roachprod: more ssh{,d} logging r=ajwerner a=tbg

Stash the verbose debug logs in a file for each remote ssh session, and
surface it with the error should one occur. Crank up the server verbosity.

For the latest in many past incidents where this would've been useful, see:

#36720 (comment)

Release note: None

Co-authored-by: Tobias Schottdorf <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.
Projects
None yet
Development

No branches or pull requests

4 participants