Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: tpcc/mixed-headroom/n5cpu16 failed #36983

Closed
cockroach-teamcity opened this issue Apr 21, 2019 · 60 comments · Fixed by #38465 or #38632
Closed

roachtest: tpcc/mixed-headroom/n5cpu16 failed #36983

cockroach-teamcity opened this issue Apr 21, 2019 · 60 comments · Fixed by #38465 or #38632
Assignees
Labels
C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.
Milestone

Comments

@cockroach-teamcity
Copy link
Member

SHA: https://github.com/cockroachdb/cockroach/commits/df200cbf3f407dbf349aa601ff9036b4dff88e83

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpcc/mixed-headroom/n5cpu16 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1252822&tab=buildLog

The test failed on release-2.1:
	cluster.go:1329,tpcc.go:168,cluster.go:1667,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1252822-tpcc-mixed-headroom-n5cpu16:5 -- ./workload run tpcc --warehouses=909 --histograms=logs/stats.json  --ramp=5m0s --duration=2h0m0s {pgurl:1-4} returned:
		stderr:
		
		stdout:
		 60.8     79.7     96.5    117.4 newOrder
		     33s        0           19.0           17.0     11.5     17.8     19.9     19.9 orderStatus
		     33s        0          201.8          170.0     37.7     71.3     83.9    104.9 payment
		     33s        0           16.0           17.6     83.9    570.4    570.4    570.4 stockLevel
		     34s        0           22.0           16.1     75.5    121.6    142.6    142.6 delivery
		     34s        0          201.8           92.8     67.1    109.1    113.2    117.4 newOrder
		     34s        0           24.0           17.2     12.1     26.2     31.5     31.5 orderStatus
		     34s        0          205.8          171.0     39.8     67.1     79.7     92.3 payment
		     34s        0           19.0           17.6     92.3    520.1    570.4    570.4 stockLevel
		Error: error in payment: select by customer idfail: ERROR: Expected "ROLLBACK TO SAVEPOINT COCKROACH_RESTART": current transaction is aborted, commands ignored until end of transaction block (SQLSTATE 25P02)
		Error:  exit status 1
		: exit status 1
	cluster.go:1688,tpcc.go:178,tpcc.go:267,test.go:1237: Goexit() was called

@cockroach-teamcity cockroach-teamcity added this to the 19.1 milestone Apr 21, 2019
@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. labels Apr 21, 2019
@tbg
Copy link
Member

tbg commented Apr 23, 2019

@jordanlewis mind taking a look?

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/99306ec3e9fcbba01c05431cbf496e8b5b8954b4

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpcc/mixed-headroom/n5cpu16 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1260033&tab=buildLog

The test failed on master:
	tpcc.go:123,tpcc.go:267,test.go:1245: while fetching vno-pred: unexpected HTTP response from https://binaries.cockroachdb.com/cockroach-vno-pred.linux-amd64.tgz?ci=true: 404
		<html>
		<head><title>404 Not Found</title></head>
		<body>
		<h1>404 Not Found</h1>
		<ul>
		<li>Code: NoSuchKey</li>
		<li>Message: The specified key does not exist.</li>
		<li>Key: cockroach-vno-pred.linux-amd64.tgz</li>
		<li>RequestId: A9E91C0C130F8392</li>
		<li>HostId: R3/pgD6niu0QQTb+E80EJYIm7Un2C/lSNiCEiI0SslrfGkytC1WPdXuzjJjnt5MZyfbkvx33sYY=</li>
		</ul>
		<hr/>
		</body>
		</html>

@jordanlewis
Copy link
Member

Hmm, the original failure is an odd one. I think it means there was an error in a statement in a transaction but the client didn't do the right thing and roll back - instead it tried to do another statement as normal. @andreimatei does that match your guess?

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/dff4132a80e62c6c5ad603ff6c608b09419d4e3e

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpcc/mixed-headroom/n5cpu16 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1264632&tab=buildLog

The test failed on branch=master, cloud=gce:
	tpcc.go:123,tpcc.go:267,test.go:1253: while fetching vno-pred: unexpected HTTP response from https://binaries.cockroachdb.com/cockroach-vno-pred.linux-amd64.tgz?ci=true: 404
		<html>
		<head><title>404 Not Found</title></head>
		<body>
		<h1>404 Not Found</h1>
		<ul>
		<li>Code: NoSuchKey</li>
		<li>Message: The specified key does not exist.</li>
		<li>Key: cockroach-vno-pred.linux-amd64.tgz</li>
		<li>RequestId: F223A4DE9F314A95</li>
		<li>HostId: HHf2tR76HnOY317Mh7nW3N6quUIfLkWrkptnO+McZYQKp0Pvwd5o68awtyeC19r18z3D3yNm4qU=</li>
		</ul>
		<hr/>
		</body>
		</html>

@andreimatei
Copy link
Contributor

I think it means there was an error in a statement in a transaction but the client didn't do the right thing and roll back - instead it tried to do another statement as normal. @andreimatei does that match your guess?

Yup

The other failures are about us not having updated some infrastructure for 19.2 I think. I'll deal with it.

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/efb45869b242137e5c178b10c646c3ed025fff36

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpcc/mixed-headroom/n5cpu16 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1266041&tab=buildLog

The test failed on branch=master, cloud=gce:
	tpcc.go:123,tpcc.go:267,test.go:1253: while fetching vno-pred: unexpected HTTP response from https://binaries.cockroachdb.com/cockroach-vno-pred.linux-amd64.tgz?ci=true: 404
		<html>
		<head><title>404 Not Found</title></head>
		<body>
		<h1>404 Not Found</h1>
		<ul>
		<li>Code: NoSuchKey</li>
		<li>Message: The specified key does not exist.</li>
		<li>Key: cockroach-vno-pred.linux-amd64.tgz</li>
		<li>RequestId: AABE61EA080F07BD</li>
		<li>HostId: o9EaVdetXcU5Ln+9PVItfIIZwLhJrH1vqeknSKysN0g5mw5t+kb2dtAcb2bP+A3xmNgaOXLR6Do=</li>
		</ul>
		<hr/>
		</body>
		</html>

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/dab475d78289721714b9c3b5b649dbb4794fc01f

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpcc/mixed-headroom/n5cpu16 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1268212&tab=buildLog

The test failed on branch=release-2.1, cloud=gce:
	cluster.go:1476,tpcc.go:168,cluster.go:1814,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1268212-tpcc-mixed-headroom-n5cpu16:5 -- ./workload run tpcc --warehouses=909 --histograms=logs/stats.json  --ramp=5m0s --duration=2h0m0s {pgurl:1-4} returned:
		stderr:
		
		stdout:
		 0           17.0            9.5     13.1     26.2     26.2     26.2 orderStatus
		   4m28s        0          172.0           92.6     39.8     60.8     71.3     75.5 payment
		   4m28s        0           10.0            9.4    176.2    771.8    771.8    771.8 stockLevel
		_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
		   4m29s        0           17.0            9.4     79.7    100.7    109.1    109.1 delivery
		   4m29s        0          174.0           87.0     62.9     83.9     96.5    113.2 newOrder
		   4m29s        0           18.0            9.5     10.5     17.8     28.3     28.3 orderStatus
		   4m29s        0          190.0           93.0     37.7     58.7     67.1     79.7 payment
		   4m29s        0           20.0            9.5    192.9    771.8    771.8    771.8 stockLevel
		Error: error in payment: ERROR: Expected "ROLLBACK TO SAVEPOINT COCKROACH_RESTART": current transaction is aborted, commands ignored until end of transaction block (SQLSTATE 25P02)
		Error:  exit status 1
		: exit status 1
	cluster.go:1835,tpcc.go:178,tpcc.go:272,test.go:1251: Goexit() was called

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/d554884a4e474cc06213230d5ba7d757a88e9e46

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpcc/mixed-headroom/n5cpu16 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1286052&tab=buildLog

The test failed on branch=release-2.1, cloud=gce:
	cluster.go:1474,tpcc.go:168,cluster.go:1812,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1286052-tpcc-mixed-headroom-n5cpu16:5 -- ./workload run tpcc --warehouses=909 --histograms=logs/stats.json  --ramp=5m0s --duration=2h0m0s {pgurl:1-4} returned:
		stderr:
		
		stdout:
		   123.9           56.5     62.9     92.3    109.1    113.2 newOrder
		   2m59s        0            9.0            6.4      9.4     15.7     15.7     15.7 orderStatus
		   2m59s        0          124.9           65.3     35.7     50.3     71.3     75.5 payment
		   2m59s        0            7.0            6.4    142.6    704.6    704.6    704.6 stockLevel
		    3m0s        0           19.0            6.3     71.3     92.3     92.3     92.3 delivery
		    3m0s        0          121.1           56.8     52.4     75.5     88.1     96.5 newOrder
		    3m0s        0           17.0            6.4     14.7     21.0     30.4     30.4 orderStatus
		    3m0s        0          120.1           65.6     28.3     46.1     62.9     79.7 payment
		    3m0s        0           13.0            6.5    201.3    771.8    771.8    771.8 stockLevel
		Error: error in payment: ERROR: Expected "ROLLBACK TO SAVEPOINT COCKROACH_RESTART": current transaction is aborted, commands ignored until end of transaction block (SQLSTATE 25P02)
		Error:  exit status 1
		: exit status 1
	cluster.go:1833,tpcc.go:178,tpcc.go:272,test.go:1251: Goexit() was called

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/46b82eb637a3f6a1ae563fcad7448fa5458f8298

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpcc/mixed-headroom/n5cpu16 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1290161&tab=buildLog

The test failed on branch=release-2.1, cloud=gce:
	cluster.go:1474,tpcc.go:168,cluster.go:1812,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1290161-tpcc-mixed-headroom-n5cpu16:5 -- ./workload run tpcc --warehouses=909 --histograms=logs/stats.json  --ramp=5m0s --duration=2h0m0s {pgurl:1-4} returned:
		stderr:
		
		stdout:
		tatus reply 0
		debug3: receive packet: type 96
		debug2: channel 0: rcvd eof
		debug2: channel 0: output open -> drain
		debug2: channel 0: obuf empty
		debug2: channel 0: close_write
		debug2: channel 0: output drain -> closed
		debug3: receive packet: type 97
		debug2: channel 0: rcvd close
		debug3: channel 0: will not send data after close
		debug2: channel 0: almost dead
		debug2: channel 0: gc: notify user
		debug2: channel 0: gc: user detached
		debug2: channel 0: send close
		debug3: send packet: type 97
		debug2: channel 0: is dead
		debug2: channel 0: garbage collecting
		debug1: channel 0: free: client-session, nchannels 1
		debug3: channel 0: status: The following connections are open:
		  #0 client-session (t4 r0 i3/0 o3/0 fd -1/-1 cc -1)
		
		debug3: send packet: type 1
		debug1: fd 0 clearing O_NONBLOCK
		debug1: fd 1 clearing O_NONBLOCK
		debug3: fd 2 is not O_NONBLOCK
		Transferred: sent 3404, received 256784 bytes, in 443.1 seconds
		Bytes per second: sent 7.7, received 579.5
		debug1: Exit status 1
		: exit status 1
		: exit status 1
	cluster.go:1833,tpcc.go:178,tpcc.go:272,test.go:1251: Goexit() was called

@jordanlewis
Copy link
Member

I think this is a bug in the TPCC load generator, not Cockroach, but I'm not sure.

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/a65ec243adaa11ea951f02319dc2e02463461ab2

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpcc/mixed-headroom/n5cpu16 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1290143&tab=buildLog

The test failed on branch=master, cloud=gce:
	cluster.go:1474,tpcc.go:168,cluster.go:1812,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1290143-tpcc-mixed-headroom-n5cpu16:5 -- ./workload run tpcc --warehouses=909 --histograms=logs/stats.json  --ramp=5m0s --duration=2h0m0s {pgurl:1-4} returned:
		stderr:
		
		stdout:
		atus reply 0
		debug3: receive packet: type 96
		debug2: channel 0: rcvd eof
		debug2: channel 0: output open -> drain
		debug2: channel 0: obuf empty
		debug2: channel 0: close_write
		debug2: channel 0: output drain -> closed
		debug3: receive packet: type 97
		debug2: channel 0: rcvd close
		debug3: channel 0: will not send data after close
		debug2: channel 0: almost dead
		debug2: channel 0: gc: notify user
		debug2: channel 0: gc: user detached
		debug2: channel 0: send close
		debug3: send packet: type 97
		debug2: channel 0: is dead
		debug2: channel 0: garbage collecting
		debug1: channel 0: free: client-session, nchannels 1
		debug3: channel 0: status: The following connections are open:
		  #0 client-session (t4 r0 i3/0 o3/0 fd -1/-1 cc -1)
		
		debug3: send packet: type 1
		debug1: fd 0 clearing O_NONBLOCK
		debug1: fd 1 clearing O_NONBLOCK
		debug3: fd 2 is not O_NONBLOCK
		Transferred: sent 3392, received 127788 bytes, in 220.2 seconds
		Bytes per second: sent 15.4, received 580.4
		debug1: Exit status 1
		: exit status 1
		: exit status 1
	cluster.go:1833,tpcc.go:178,tpcc.go:272,test.go:1251: Goexit() was called

@tbg
Copy link
Member

tbg commented May 17, 2019

Error: error in stockLevel: ERROR: communication error: rpc error: code = Canceled desc = context canceled (SQLSTATE 08006)

I need to move the ssh debug output above the error or it'll be annoying to debug these going forward.

@jordanlewis
Copy link
Member

@tbg do we expect this test to pass? I'm still not really sure what it does - that error means the cluster was overloaded enough for a time that a query failed.

@tbg
Copy link
Member

tbg commented May 17, 2019

We expect this to pass, the headroom test intentionally runs with a much lower warehouse count than we think we support (1300+).

14:14:16 tpcc.go:271: computed headroom warehouses of 909; running mixed with v19.1.0-rc.4

@jordanlewis
Copy link
Member

Oh, mixed is about versions, not headroom. OK.

@cockroach-teamcity

This comment has been minimized.

@cockroach-teamcity

This comment has been minimized.

@nvanbenschoten
Copy link
Member

Previous two issues addressed by #37701.

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/1810a4eaa07b412b2d0899d25bb16a28a2746d48

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpcc/mixed-headroom/n5cpu16 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1300948&tab=buildLog

The test failed on branch=release-19.1, cloud=gce:
	cluster.go:1510,tpcc.go:168,cluster.go:1848,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1300948-tpcc-mixed-headroom-n5cpu16:5 -- ./workload run tpcc --warehouses=909 --histograms=logs/stats.json  --ramp=5m0s --duration=2h0m0s {pgurl:1-4} returned:
		stderr:
		
		stdout:
		1.6    121.6 delivery
		1h35m33s        0          209.9          194.8     50.3     88.1    104.9    109.1 newOrder
		1h35m33s        0           18.0           19.5      8.4     15.2     15.7     15.7 orderStatus
		1h35m33s        0          196.9          195.2     23.1     52.4     58.7     92.3 payment
		1h35m33s        0           29.0           19.5     46.1     88.1     92.3     92.3 stockLevel
		1h35m34s        0           21.0           19.5     65.0    109.1    109.1    109.1 delivery
		1h35m34s        0          204.0          194.8     48.2     75.5     83.9    104.9 newOrder
		1h35m34s        0           19.0           19.5      9.4     16.3     41.9     41.9 orderStatus
		1h35m34s        0          207.0          195.2     22.0     37.7     48.2     67.1 payment
		1h35m34s        0           19.0           19.5     48.2     92.3    100.7    100.7 stockLevel
		Error: error in newOrder: missing stock row
		Error:  ssh verbose log retained in /root/.roachprod/debug/ssh_34.74.202.225_2019-05-22T16:23:53Z: exit status 1
		: exit status 1
	cluster.go:1869,tpcc.go:178,tpcc.go:272,test.go:1251: Goexit() was called

@nvanbenschoten
Copy link
Member

error in newOrder: missing stock row

@ajwerner this looks like the issue you just fixed. Has that change been backported to 19.1 yet?

@ajwerner
Copy link
Contributor

s/@ajwerner /@ajkr /

@cockroach-teamcity

This comment has been minimized.

@jeffrey-xiao
Copy link
Contributor

The most recent failures are due to the non-nullable fields in RangeDescriptor. The old nodes would fail the CPut because we don't preserve unknown fields.

@jeffrey-xiao jeffrey-xiao self-assigned this Jun 26, 2019
craig bot pushed a commit that referenced this issue Jun 26, 2019
38465: roachpb: Change RangeDescriptor.StickyBit and ReplicaDescriptor.Type to nullable r=jeffrey-xiao a=jeffrey-xiao

On a mixed version cluster, only having #38302 does not guarantee backwards compatibility for non-nullable fields in RangeDescriptor. Suppose the replicas of a range are of mixed version and the current
leaseholder of the range is on newest version. If the leaseholder splits that range and writes to the non-nullable fields in RangeDescriptor and the leaseholder transfers to a node on an older version without #38302, then all CPuts would fail on that node.

We must guarantee that #38302 is on all nodes before making the fields in RangeDescriptor non-nullable.

Fixes #36983.
Fixes #38471.

Release note: None

Co-authored-by: Jeffrey Xiao <[email protected]>
@cockroach-teamcity

This comment has been minimized.

craig bot pushed a commit that referenced this issue Jun 27, 2019
38465: roachpb: Change RangeDescriptor.StickyBit and ReplicaDescriptor.Type to nullable r=jeffrey-xiao a=jeffrey-xiao

On a mixed version cluster, only having #38302 does not guarantee backwards compatibility for non-nullable fields in RangeDescriptor. Suppose the replicas of a range are of mixed version and the current
leaseholder of the range is on newest version. If the leaseholder splits that range and writes to the non-nullable fields in RangeDescriptor and the leaseholder transfers to a node on an older version without #38302, then all CPuts would fail on that node.

We must guarantee that #38302 is on all nodes before making the fields in RangeDescriptor non-nullable.

Fixes #36983.
Fixes #38471.

Release note: None

Co-authored-by: Jeffrey Xiao <[email protected]>
@craig craig bot closed this as completed in #38465 Jun 27, 2019
@jeffrey-xiao jeffrey-xiao reopened this Jun 27, 2019
@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/537767ac9daa52b0026bb957d7010e3b88b61071

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpcc/mixed-headroom/n5cpu16 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1364821&tab=buildLog

The test failed on branch=master, cloud=gce:
	cluster.go:1511,tpcc.go:156,tpcc.go:158,tpcc.go:271,test.go:1249: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1364821-tpcc-mixed-headroom-n5cpu16:5 -- ./workload fixtures load tpcc --warehouses=1000  {pgurl:1} returned:
		stderr:
		
		stdout:
		I190628 20:39:25.890763 1 ccl/workloadccl/cliccl/fixtures.go:293  starting load of 9 tables
		I190628 20:39:58.457979 100 ccl/workloadccl/fixture.go:476  loaded 53 KiB table warehouse in 32.566814041s (1000 rows, 0 index entries, 1.6 KiB)
		: signal: interrupt
	cluster.go:1587,cluster.go:1606,cluster.go:1710,cluster.go:1093,context.go:122,cluster.go:1090,panic.go:406,test.go:783,test.go:769,cluster.go:1511,tpcc.go:156,tpcc.go:158,tpcc.go:271,test.go:1249: context canceled

@cockroach-teamcity

This comment has been minimized.

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/86154ae6ae36e286883d8a6c9a4111966198201d

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpcc/mixed-headroom/n5cpu16 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1367379&tab=buildLog

The test failed on branch=master, cloud=gce:
	cluster.go:1511,tpcc.go:156,tpcc.go:158,tpcc.go:271,test.go:1249: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1367379-tpcc-mixed-headroom-n5cpu16:5 -- ./workload fixtures load tpcc --warehouses=1000  {pgurl:1} returned:
		stderr:
		
		stdout:
		I190630 21:25:37.588769 1 ccl/workloadccl/cliccl/fixtures.go:293  starting load of 9 tables
		I190630 21:25:38.663084 122 ccl/workloadccl/fixture.go:476  loaded 7.8 MiB table item in 1.073816944s (100000 rows, 0 index entries, 7.2 MiB)
		I190630 21:25:48.892337 116 ccl/workloadccl/fixture.go:476  loaded 53 KiB table warehouse in 11.303133494s (1000 rows, 0 index entries, 4.7 KiB)
		I190630 21:27:52.817663 117 ccl/workloadccl/fixture.go:476  loaded 1006 KiB table district in 2m15.228537539s (10000 rows, 0 index entries, 7.4 KiB)
		I190630 21:36:49.056856 119 ccl/workloadccl/fixture.go:476  loaded 4.3 GiB table history in 11m11.467897932s (30000000 rows, 60000000 index entries, 6.5 MiB)
		I190630 21:39:49.067750 120 ccl/workloadccl/fixture.go:476  loaded 1.3 GiB table order in 14m11.478710462s (30000000 rows, 30000000 index entries, 1.6 MiB)
		I190630 21:40:13.594779 121 ccl/workloadccl/fixture.go:476  loaded 126 MiB table new_order in 14m36.0054959s (9000000 rows, 0 index entries, 147 KiB)
		: signal: killed
	cluster.go:1587,cluster.go:1606,cluster.go:1710,cluster.go:1093,context.go:122,cluster.go:1090,panic.go:406,test.go:783,test.go:769,cluster.go:1511,tpcc.go:156,tpcc.go:158,tpcc.go:271,test.go:1249: context canceled

@cockroach-teamcity

This comment has been minimized.

craig bot pushed a commit that referenced this issue Jul 2, 2019
38593: storage: sort InternalReplicas before comparing r=jeffrey-xiao a=jeffrey-xiao

Fixes an issue where the in-memory copy of RangeDescriptor is out-of-sync with the copy in kv.

Should resolve the failure in this #36983 (comment)

The import step of `tpcc/mixed-headroom/n5cpu16` passes on `2811e0f` + #38465 + this PR.

cc @asubiotto 

Co-authored-by: Jeffrey Xiao <[email protected]>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jul 3, 2019
Fixes cockroachdb#34180.
Fixes cockroachdb#35493.
Fixes cockroachdb#36983.
Fixes cockroachdb#37108.
Fixes cockroachdb#37371.
Fixes cockroachdb#37384.
Fixes cockroachdb#37551.
Fixes cockroachdb#37879.
Fixes cockroachdb#38095.
Fixes cockroachdb#38131.
Fixes cockroachdb#38136.
Fixes cockroachdb#38549.
Fixes cockroachdb#38552.
Fixes cockroachdb#38555.
Fixes cockroachdb#38560.
Fixes cockroachdb#38562.
Fixes cockroachdb#38563.
Fixes cockroachdb#38569.
Fixes cockroachdb#38578.
Fixes cockroachdb#38600.

_A for of the early issues fixed by this had previous failures, but nothing
very recent or actionable. I think it's worth closing them now that they
should be fixed in the short term._

This fixes a bug introduced in 1ff3556 where Raft proposal quota is
no longer released when Replica.propose fails. This used to happen
[here](cockroachdb@1ff3556#diff-4315c7ebf8b8bf7bda469e1e7be82690L316),
but that code was accidentally lost in the rewrite.

I tracked this down by running a series of `import/tpch/nodes=4` and
`scrub/all-checks/tpcc/w=100` roachtests. About half the time, the
import would stall after a few hours and the roachtest health reports
would start logging lines like: `n1/s1  2.00  metrics  requests.slow.latch`.
I tracked the stalled latch acquisition to a stalled proposal quota acquisition
by a conflicting command. The range debug page showed the following:

<image>

We see that the leaseholder of the Range has no pending commands
but also no available proposal quota. This indicates a proposal
quota leak, which led to me finding the lost release in this
error case.

The (now confirmed) theory for what went wrong in these roachtests is that
they are performing imports, which generate a large number of AddSSTRequests.
These requests are typically larger than the available proposal quota
for a range, meaning that they request all of its available quota. The
effect of this is that if even a single byte of quota is leaked, the entire
range will seize up and stall when an AddSSTRequests is issued.
Instrumentation revealed that a ChangeReplicas request with a quota size
equal to the leaked amount was failing due to the error:
```
received invalid ChangeReplicasTrigger REMOVE_REPLICA((n3,s3):3): updated=[(n1,s1):1 (n4,s4):2 (n2,s2):4] next=5 to remove self (leaseholder)
```
Because of the missing error handling, this quota was not being released back
into the pool, causing future requests to get stuck indefinitely waiting for
leaked quota, stalling the entire import.

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jul 3, 2019
Fixes cockroachdb#34180.
Fixes cockroachdb#35493.
Fixes cockroachdb#36983.
Fixes cockroachdb#37108.
Fixes cockroachdb#37371.
Fixes cockroachdb#37384.
Fixes cockroachdb#37551.
Fixes cockroachdb#37879.
Fixes cockroachdb#38095.
Fixes cockroachdb#38131.
Fixes cockroachdb#38136.
Fixes cockroachdb#38549.
Fixes cockroachdb#38552.
Fixes cockroachdb#38555.
Fixes cockroachdb#38560.
Fixes cockroachdb#38562.
Fixes cockroachdb#38563.
Fixes cockroachdb#38569.
Fixes cockroachdb#38578.
Fixes cockroachdb#38600.

_A lot of the early issues fixed by this had previous failures, but nothing
very recent or actionable. I think it's worth closing them now that they
should be fixed in the short term._

This fixes a bug introduced in 1ff3556 where Raft proposal quota is
no longer released when Replica.propose fails. This used to happen
[here](cockroachdb@1ff3556#diff-4315c7ebf8b8bf7bda469e1e7be82690L316),
but that code was accidentally lost in the rewrite.

I tracked this down by running a series of `import/tpch/nodes=4` and
`scrub/all-checks/tpcc/w=100` roachtests. About half the time, the
import would stall after a few hours and the roachtest health reports
would start logging lines like: `n1/s1  2.00  metrics  requests.slow.latch`.
I tracked the stalled latch acquisition to a stalled proposal quota acquisition
by a conflicting command. The range debug page showed the following:

<image>

We see that the leaseholder of the Range has no pending commands
but also no available proposal quota. This indicates a proposal
quota leak, which led to me finding the lost release in this
error case.

The (now confirmed) theory for what went wrong in these roachtests is that
they are performing imports, which generate a large number of AddSSTRequests.
These requests are typically larger than the available proposal quota
for a range, meaning that they request all of its available quota. The
effect of this is that if even a single byte of quota is leaked, the entire
range will seize up and stall when an AddSSTRequests is issued.
Instrumentation revealed that a ChangeReplicas request with a quota size
equal to the leaked amount was failing due to the error:
```
received invalid ChangeReplicasTrigger REMOVE_REPLICA((n3,s3):3): updated=[(n1,s1):1 (n4,s4):2 (n2,s2):4] next=5 to remove self (leaseholder)
```
Because of the missing error handling, this quota was not being released back
into the pool, causing future requests to get stuck indefinitely waiting for
leaked quota, stalling the entire import.

Release note: None
craig bot pushed a commit that referenced this issue Jul 3, 2019
38632: storage: release quota on failed Raft proposals r=tbg a=nvanbenschoten

Fixes #34180.
Fixes #35493.
Fixes #36983.
Fixes #37108.
Fixes #37371.
Fixes #37384.
Fixes #37551.
Fixes #37879.
Fixes #38095.
Fixes #38131.
Fixes #38136.
Fixes #38549.
Fixes #38552.
Fixes #38555.
Fixes #38560.
Fixes #38562.
Fixes #38563.
Fixes #38569.
Fixes #38578.
Fixes #38600.

_A lot of the early issues fixed by this had previous failures, but nothing very recent or actionable. I think it's worth closing them now that they should be fixed in the short term._

This fixes a bug introduced in 1ff3556 where Raft proposal quota is no longer released when `Replica.propose` fails. This used to happen [here](1ff3556#diff-4315c7ebf8b8bf7bda469e1e7be82690L316), but that code was accidentally lost in the rewrite.

I tracked this down by running a series of `import/tpch/nodes=4` and `scrub/all-checks/tpcc/w=100` roachtests. About half the time, the import would stall after a few hours and the roachtest health reports would start logging lines like: `n1/s1  2.00  metrics  requests.slow.latch`. I tracked the stalled latch acquisition to a stalled proposal quota acquisition by a conflicting command. The range debug page showed the following:

![Screenshot_2019-07-01 r56 Range Debug Cockroach Console](https://user-images.githubusercontent.com/5438456/60554197-8519c780-9d04-11e9-8cf5-6c46ffbcf820.png)

We see that the Leaseholder of the Range has no pending commands but also no available proposal quota. This indicates a proposal quota leak, which led to me finding the lost release in this error case.

The (now confirmed) theory for what went wrong in these roachtests is that they are performing imports, which generate a large number of AddSSTRequests. These requests are typically larger than the available proposal quota for a range, meaning that they request all of its available quota. The effect of this is that if even a single byte of quota is leaked, the entire range will seize up and stall when an AddSSTRequests is issued. Instrumentation revealed that a ChangeReplicas request with a quota size equal to the leaked amount was failing due to the error:
```
received invalid ChangeReplicasTrigger REMOVE_REPLICA((n3,s3):3): updated=[(n1,s1):1 (n4,s4):2 (n2,s2):4] next=5 to remove self (leaseholder)
```
Because of the missing error handling, this quota was not being released back into the pool, causing future requests to get stuck indefinitely waiting for leaked quota, stalling the entire import.

Co-authored-by: Nathan VanBenschoten <[email protected]>
@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/2c865eeb3e3b244468ffc509a62778bd1f46740f

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpcc/mixed-headroom/n5cpu16 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1370685&tab=buildLog

The test failed on branch=provisional_201907021644_v19.2.0-alpha.20190701, cloud=gce:
	cluster.go:1511,tpcc.go:167,cluster.go:1849,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1370685-tpcc-mixed-headroom-n5cpu16:5 -- ./workload run tpcc --warehouses=909 --histograms=logs/stats.json  --ramp=5m0s --duration=2h0m0s {pgurl:1-4} returned:
		stderr:
		
		stdout:
		      0.0      0.0      0.0 delivery
		   4m25s        0            0.0           17.4      0.0      0.0      0.0      0.0 newOrder
		   4m25s        0            0.0            2.1      0.0      0.0      0.0      0.0 orderStatus
		   4m25s        0            0.0           18.6      0.0      0.0      0.0      0.0 payment
		   4m25s        0            0.0            0.2      0.0      0.0      0.0      0.0 stockLevel
		   4m26s        0            0.0            1.9      0.0      0.0      0.0      0.0 delivery
		   4m26s        0            0.0           17.4      0.0      0.0      0.0      0.0 newOrder
		   4m26s        0            0.0            2.1      0.0      0.0      0.0      0.0 orderStatus
		   4m26s        0            0.0           18.6      0.0      0.0      0.0      0.0 payment
		   4m26s        0            0.0            0.2      0.0      0.0      0.0      0.0 stockLevel
		Error: error in payment: EOF
		Error:  ssh verbose log retained in /root/.roachprod/debug/ssh_35.202.76.107_2019-07-03T05:49:56Z: exit status 1
		: exit status 1
	cluster.go:1870,tpcc.go:177,tpcc.go:271,test.go:1249: unexpected node event: 3: dead
	cluster.go:1033,context.go:122,cluster.go:1022,panic.go:406,test.go:783,test.go:769,cluster.go:1870,tpcc.go:177,tpcc.go:271,test.go:1249: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1370685-tpcc-mixed-headroom-n5cpu16 --oneshot --ignore-empty-nodes: exit status 1 5: skipped
		2: 3748
		4: 3825
		3: dead
		1: 3953
		Error:  3: dead

@nvanbenschoten
Copy link
Member

nvanbenschoten commented Jul 3, 2019

n3 died but we failed to grab any logs from the nodes. We just see context canceled all the way down:

13:52:24 cluster.go:992: fetching debug zip
13:52:24 cluster.go:344: > /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod ssh teamcity-1370685-tpcc-mixed-headroom-n5cpu16:1 -- ./cockroach debug zip --url {pgurl:1} debug.zip
13:52:24 cluster.go:1006: ./cockroach debug zip failed: 
13:52:24 test.go:1193: failed to download debug zip: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod ssh teamcity-1370685-tpcc-mixed-headroom-n5cpu16:1 -- ./cockroach debug zip --url {pgurl:1} debug.zip: context canceled

It looks like that's because there was an 8-hour gap between detecting the crash and trying to grab any artifacts:

05:54:36 test.go:808: test failure: 	cluster.go:1033,context.go:122,cluster.go:1022,panic.go:406,test.go:783,test.go:769,cluster.go:1870,tpcc.go:177,tpcc.go:271,test.go:1249: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1370685-tpcc-mixed-headroom-n5cpu16 --oneshot --ignore-empty-nodes: exit status 1 5: skipped
		2: 3748
		4: 3825
		3: dead
		1: 3953
		Error:  3: dead
05:54:48 cluster.go:1102: running (fast) consistency checks on node 2
13:52:08 cluster.go:270: > /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod destroy teamcity-1370685-tpcc-mixed-headroom-n5cpu16

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.
Projects
None yet
8 participants