roachtest: clearrange/checks=false failed #38772

Closed
cockroach-teamcity opened this issue Jul 9, 2019 · 66 comments

@cockroach-teamcity
Member

SHA: https://github.com/cockroachdb/cockroach/commits/8c6fdc64908a13291e4ddc5d233bbbaa379e71a2

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=false PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1378458&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190709-1378458/clearrange/checks=false/run_1
	test_runner.go:685: test timed out (6h30m0s)
	cluster.go:1724,clearrange.go:56,clearrange.go:35,test_runner.go:670: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1562652995-24-n10cpu4:1 -- ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned:
		stderr:
		
		stdout:
		I190709 11:35:27.249688 1 ccl/workloadccl/cliccl/fixtures.go:324  starting import of 1 tables
		: signal: killed

@cockroach-teamcity cockroach-teamcity added this to the 19.2 milestone Jul 9, 2019
@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. labels Jul 9, 2019
@nvanbenschoten
Member

Tons of RocksDB stalls like:

W190709 11:38:41.369612 17 storage/engine/rocksdb.go:116  [rocksdb] [/go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/column_family.cc:779] [default] Stalling writes because we have 3 immutable memtables (waiting for flush), max_write_buffer_number is set to 4 rate 16777216

@ajkr noticed this in #38095 (comment). I closed that issue because part of it was fixed, but this still needs to be tracked.
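
For reference, the condition behind that warning can be summarized with a small, hypothetical Go sketch (this is not RocksDB's actual code; the function name and modeling are illustrative only): RocksDB begins slowing writes once the number of immutable memtables waiting for flush reaches max_write_buffer_number - 1, i.e. flushes are no longer keeping up with ingest.

```go
package main

import "fmt"

// shouldStallWrites is an illustrative approximation of the write-slowdown
// trigger referenced in the warning above: with max_write_buffer_number
// memtables allowed, writes are throttled once the immutable (unflushed)
// memtable count reaches max_write_buffer_number - 1.
func shouldStallWrites(immutableMemtables, maxWriteBufferNumber int) bool {
	return immutableMemtables >= maxWriteBufferNumber-1
}

func main() {
	// Values from the log line above: 3 immutable memtables, max_write_buffer_number=4.
	fmt.Println(shouldStallWrites(3, 4)) // true: flushes are not keeping up
}
```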

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/1ca35fc4a0e2665e7f6efd945e65a0db97984fa7

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=false PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1396096&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190719-1396096/clearrange/checks=false/run_1
	cluster.go:1726,clearrange.go:56,clearrange.go:35,test_runner.go:691: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1563517204-16-n10cpu4:1 -- ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned:
		stderr:
		
		stdout:
		I190719 09:08:25.176389 1 ccl/workloadccl/cliccl/fixtures.go:324  starting import of 1 tables
		Error: importing fixture: importing table bank: pq: communication error: rpc error: code = Canceled desc = context canceled
		Error:  exit status 1
		: exit status 1

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/22d48caaa7d39efdcef7b3c87a99fc421e1473af

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=false PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1397412&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190720-1397412/clearrange/checks=false/run_1
	test_runner.go:706: test timed out (6h30m0s)
	cluster.go:1788,cluster.go:1807,cluster.go:1911,clearrange.go:110,clearrange.go:159,cluster.go:2069,errgroup.go:57: context canceled
	cluster.go:2090,clearrange.go:187,clearrange.go:35,test_runner.go:691: Goexit() was called

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/1ad0ecc8cbddf82c9fedb5a5c5e533e72a657ff7

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=false PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1399000&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190722-1399000/clearrange/checks=false/run_1
	cluster.go:1726,clearrange.go:56,clearrange.go:35,test_runner.go:691: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1563776264-15-n10cpu4:1 -- ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned:
		stderr:
		
		stdout:
		I190722 09:39:37.537357 1 ccl/workloadccl/cliccl/fixtures.go:324  starting import of 1 tables
		Error: importing fixture: importing table bank: pq: communication error: rpc error: code = Canceled desc = context canceled
		Error:  exit status 1
		: exit status 1

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/7111a67b2ea3a19c2f312f8d214b8823f431cac0

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=false PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1400942&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190723-1400942/clearrange/checks=false/run_1
	test_runner.go:706: test timed out (6h30m0s)
	cluster.go:2090,clearrange.go:187,clearrange.go:35,test_runner.go:691: context canceled

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/86eab2ff0a1a4c2d9b5f7e7a45deda74c98c6c37

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=false PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1402541&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190724-1402541/clearrange/checks=false/run_1
	test_runner.go:706: test timed out (6h30m0s)
	cluster.go:1788,cluster.go:1807,cluster.go:1911,clearrange.go:110,clearrange.go:159,cluster.go:2069,errgroup.go:57: context canceled
	cluster.go:2090,clearrange.go:187,clearrange.go:35,test_runner.go:691: Goexit() was called

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/26edea51118a0e16b61748c08068bfa6f76543ca

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=false PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1404886&tab=buildLog

The test failed on branch=provisional_201907241708_v19.2.0-alpha.20190729, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190725-1404886/clearrange/checks=false/run_1
	cluster.go:1726,clearrange.go:56,clearrange.go:35,test_runner.go:691: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1564034590-17-n10cpu4:1 -- ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned:
		stderr:
		
		stdout:
		I190725 09:16:41.993259 1 ccl/workloadccl/cliccl/fixtures.go:324  starting import of 1 tables
		Error: importing fixture: importing table bank: pq: communication error: rpc error: code = Canceled desc = context canceled
		Error:  exit status 1
		: exit status 1

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/9078c4e63c1bff1c3d220ee216000b0903dd4d65

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=false PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1406479&tab=buildLog

The test failed on branch=provisional_201907252112_v19.2.0-alpha.20190729, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190726-1406479/clearrange/checks=false/run_1
	test_runner.go:706: test timed out (6h30m0s)
	cluster.go:1726,clearrange.go:56,clearrange.go:35,test_runner.go:691: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1564100376-15-n10cpu4:1 -- ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned:
		stderr:
		
		stdout:
		I190726 03:59:30.358940 1 ccl/workloadccl/cliccl/fixtures.go:324  starting import of 1 tables
		: signal: killed

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/cfdaadc3514e7e8660f6c009ba159fdfd604f0a8

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=false PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1409070&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190727-1409070/clearrange/checks=false/run_1
	cluster.go:1726,clearrange.go:56,clearrange.go:35,test_runner.go:691: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1564208378-15-n10cpu4:1 -- ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned:
		stderr:
		
		stdout:
		I190727 10:21:58.287597 1 ccl/workloadccl/fixture.go:316  starting import of 1 tables
		Error: importing fixture: importing table bank: pq: communication error: rpc error: code = Canceled desc = context canceled
		Error:  exit status 1
		: exit status 1

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/65055d6c16bf9386d8c4f4f9cd23e0a848814dc9

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=false PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1411157&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190730-1411157/clearrange/checks=false/run_1
	test_runner.go:706: test timed out (6h30m0s)
	cluster.go:1788,cluster.go:1807,cluster.go:1911,clearrange.go:110,clearrange.go:159,cluster.go:2069,errgroup.go:57: context canceled
	cluster.go:2090,clearrange.go:187,clearrange.go:35,test_runner.go:691: Goexit() was called

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/92fef12128c997233d985d1c19e11faac005073f

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=false PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1413388&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190731-1413388/clearrange/checks=false/run_1
	test_runner.go:706: test timed out (6h30m0s)
	cluster.go:2090,clearrange.go:187,clearrange.go:35,test_runner.go:691: context canceled

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/da56c792e968574b8f1d9ef3fdb45d56a530221a

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=false PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1415578&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190801-1415578/clearrange/checks=false/run_1
	cluster.go:1726,clearrange.go:56,clearrange.go:35,test_runner.go:691: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1564640260-17-n10cpu4:1 -- ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned:
		stderr:
		
		stdout:
		I190801 10:32:20.045149 1 ccl/workloadccl/fixture.go:316  starting import of 1 tables
		Error: importing fixture: importing table bank: pq: communication error: rpc error: code = Canceled desc = context canceled
		Error:  exit status 1
		: exit status 1

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/5bd37e8eb58ca66b9293c234bc572411057fec3a

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=false PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1417287&tab=buildLog

The test failed on branch=provisional_201908012151_v19.2.0-alpha.20190729, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190802-1417287/clearrange/checks=false/run_1
	cluster.go:2090,clearrange.go:187,clearrange.go:35,test_runner.go:691: dial tcp 104.154.157.162:26257: connect: connection refused

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/175c5ada040fd0cbbf178636b1c551d5c2229ec4

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=false PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1417597&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190802-1417597/clearrange/checks=false/run_1
	test_runner.go:706: test timed out (6h30m0s)
	cluster.go:1726,clearrange.go:56,clearrange.go:35,test_runner.go:691: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1564726582-16-n10cpu4:1 -- ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned:
		stderr:
		
		stdout:
		I190802 09:31:09.129042 1 ccl/workloadccl/fixture.go:316  starting import of 1 tables
		: signal: killed

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/3b9a95bd7eb2cfa6d544fe7217852a85ec3b76f4

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=false PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1422703&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190805-1422703/clearrange/checks=false/run_1
	cluster.go:1726,clearrange.go:56,clearrange.go:35,test_runner.go:691: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1564984076-17-n10cpu4:1 -- ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned:
		stderr:
		
		stdout:
		I190805 09:30:39.198110 1 ccl/workloadccl/fixture.go:316  starting import of 1 tables
		Error: importing fixture: importing table bank: pq: communication error: rpc error: code = Canceled desc = context canceled
		Error:  exit status 1
		: exit status 1

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/3db89b230b0c41e399354cbeb78c1e82c8e30004

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=false PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1424320&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190806-1424320/clearrange/checks=false/run_1
	test_runner.go:706: test timed out (6h30m0s)

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/51a6fdedf0ce1d1329d40d801a7deaf8206b6b07

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=false PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1428934&tab=buildLog

The test failed on branch=provisional_201908060405_v19.1.4, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190807-1428934/clearrange/checks=false/run_1
	cluster.go:1735,clearrange.go:56,clearrange.go:35,test_runner.go:691: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1565218672-16-n10cpu4:1 -- ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned:
		stderr:
		
		stdout:
		I190808 03:09:28.999008 1 ccl/workloadccl/cliccl/fixtures.go:324  starting import of 1 tables
		I190808 06:32:00.807565 14 ccl/workloadccl/fixture.go:516  imported bank (3h22m32s, 0 rows, 0 index entries, 0 B)
		Error: importing fixture: importing table bank: pq: communication error: rpc error: code = Canceled desc = context canceled
		Error:  exit status 1
		: exit status 1

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/51a6fdedf0ce1d1329d40d801a7deaf8206b6b07

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=false PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1436116&tab=buildLog

The test failed on branch=provisional_201908060405_v19.1.4, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190812-1436116/clearrange/checks=false/run_1
	test_runner.go:706: test timed out (6h30m0s)
	cluster.go:2099,clearrange.go:187,clearrange.go:35,test_runner.go:691: context canceled

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/01ee0704865391599abef3bbc89f462117f8007a

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=false PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1445527&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190820-1445527/clearrange/checks=false/run_1
	test_runner.go:688: test timed out (6h30m0s)

@petermattis
Collaborator

I backported cockroachdb/rocksdb#43 and cockroachdb/rocksdb#42 to RocksDB 5.17.2 and the import is chugging along now. #43 in particular could explain the consistency checker issue on 5.17.2. I'll need to run this a bunch of times to be sure, though.

@nvanbenschoten
Member

I backported cockroachdb/rocksdb#43 and cockroachdb/rocksdb#42 to RocksDB 5.17.2 and the import is chugging along now.

It's possible that there was some surprising interaction between ingested ssts and ssts that came originally from the memtable, but note that neither here nor in #40213 did the dropped range deletion tombstone come from an ingested sst.

@petermattis
Collaborator

It's possible that there was some surprising interaction between ingested ssts and ssts that came originally from the memtable, but note that neither here nor in #40213 did the dropped range deletion tombstone come from an ingested sst.

I think I might have confused the consistencyChecker complaining about stats needing to be refreshed with an actual consistency failure.

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/62b1678f652461bbc1aaf6bc2c0dd03105ce0ebe

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=false PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1488785&tab=buildLog

The test failed on branch=40765, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190914-1488785/clearrange/checks=false/run_1
	cluster.go:2114,clearrange.go:187,clearrange.go:35,test_runner.go:688: pq: batch timestamp 1568507626.955369258,0 must be after replica GC threshold 1568508464.174751679,0
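
For context, that error indicates the request's timestamp was not strictly after the replica's GC threshold, so the MVCC history it would need may already have been garbage collected. A minimal, hypothetical sketch of the check (not CockroachDB's actual types or code path; timestamps are modeled as plain seconds for illustration):

```go
package main

import "fmt"

// checkGCThreshold is a simplified stand-in for the check implied by the
// error above: a batch whose timestamp is at or below the replica GC
// threshold must be rejected.
func checkGCThreshold(batchTS, gcThreshold float64) error {
	if batchTS <= gcThreshold {
		return fmt.Errorf("batch timestamp %.9f must be after replica GC threshold %.9f",
			batchTS, gcThreshold)
	}
	return nil
}

func main() {
	// Values lifted from the failure message above (truncated to float64 precision).
	fmt.Println(checkGCThreshold(1568507626.955369258, 1568508464.174751679))
}
```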

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/62b1678f652461bbc1aaf6bc2c0dd03105ce0ebe

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=clearrange/checks=false PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1489712&tab=buildLog

The test failed on branch=40765, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190915-1489712/clearrange/checks=false/run_1
	test_runner.go:703: test timed out (6h30m0s)

ajkr added a commit to ajkr/cockroach that referenced this issue Sep 18, 2019
Picks up cockroachdb/rocksdb#56.

Release justification: This feature can cause a corruption where keys
deleted by range tombstones reappear (see cockroachdb#38772 and cockroachdb#40213), so it's
important we revert it.

Release note: None
craig bot pushed a commit that referenced this issue Sep 19, 2019
40899: c-deps: bump rocksdb to revert compaction snapshot refresh r=ajkr a=ajkr

Picks up cockroachdb/rocksdb#56.

Release justification: This feature can cause a corruption where keys
deleted by range tombstones reappear (see #38772 and #40213), so it's
important we revert it.

Release note: None

Co-authored-by: Andrew Kryczka <[email protected]>
@ajkr
Contributor

ajkr commented Sep 19, 2019

Thanks for narrowing down the consistency check failure to a range deletion bug, @nvanbenschoten. I never would have figured that out.

@nvanbenschoten
Member

Thank you for taking it the rest of the way! It feels good to knock down two release blockers with one stone.

ajwerner added a commit to ajwerner/cockroach that referenced this issue Sep 23, 2019
We've seen instability recently due to invariants being violated as
replicas catch up across periods of being removed and re-added to a range.
Due to learner replicas and their rollback behavior this is now a relatively
common case. Rather than handle all of these various scenarios this PR prevents
them from occurring by actively removing replicas when we determine that they
must have been removed.

Here's a high level overview of the change:

 * Once a Replica object has a non-zero Replica.mu.replicaID it will not
   change.
   * In this commit however, if a node crashes it may forget that it learned
     about a replica ID.
 * If a raft message or snapshot addressed to a higher replica ID is received
   the current replica will be removed completely.
 * If a replica sees a ChangeReplicasTrigger which removes it then it
   completely removes itself while applying that command.
 * Replica.mu.destroyStatus is used to meaningfully signify the removal state
   of a Replica. Replicas about to be synchronously removed are in
   destroyReasonRemovalPending.

This hopefully gives us some new invariants:

 * There is only ever at most 1 Replica which IsAlive() for a range on a Store
   at a time.
 * Once a Replica has a non-zero ReplicaID, it never changes.
   * This applies only to the in-memory object, not the store itself.
 * Once a Replica applies a command as a part of the range descriptor it will
   never apply another command as a different Replica ID or outside of the
   Range.
   * Corollary: a Replica created as a learner will only ever apply commands
     while that replica is in the range.

The change also introduces some new complexity. Namely we now allow removal of
uninitialized replicas, including their hard state. This allows us to catch up
across a split even when we know the RHS must have been removed.

Fixes cockroachdb#40367.

Issue cockroachdb#38772 (comment)
manifests itself as the RHS not being found for a merge. This happens because
the Replica is processing commands to catch itself up while it is not in the
range. This is no longer possible.

Fixes cockroachdb#40257.

Issue cockroachdb#40257 is another case of a replica processing commands while it is not
in the range.

Fixes cockroachdb#40470.

Issue cockroachdb#40470 is caused by a RHS learning about its existence and removal
prior to a LHS processing a split. This case is now handled properly and is
tested.

Release justification: This commit is safe for 19.2 because it fixes release
blockers.

Release note (bug fix): Fix crashes by preventing replica ID change.
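
To make the first invariant above concrete, here is a heavily simplified, hypothetical sketch (these are not CockroachDB's actual types or methods): once a Replica object has a replica ID, traffic addressed to a higher replica ID means this incarnation was removed from the range, so the stale object is destroyed rather than having its ID changed.

```go
package main

import "fmt"

// replica is a toy stand-in for the in-memory Replica object discussed in the
// commit message above; only the fields needed for the illustration exist.
type replica struct {
	replicaID int
	destroyed bool
}

// maybeDestroyOnHigherID illustrates the invariant: a raft message or snapshot
// addressed to a higher replica ID implies this replica was removed and
// re-added, so the old object must stop applying commands entirely.
func (r *replica) maybeDestroyOnHigherID(incomingID int) {
	if incomingID > r.replicaID {
		r.destroyed = true
	}
}

func main() {
	r := &replica{replicaID: 3}
	r.maybeDestroyOnHigherID(5) // e.g. a snapshot addressed to replica ID 5 arrives
	fmt.Println("destroyed:", r.destroyed)
}
```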
ajwerner added a commit to ajwerner/cockroach that referenced this issue Sep 23, 2019
ajwerner added a commit to ajwerner/cockroach that referenced this issue Sep 23, 2019
ajwerner added a commit to ajwerner/cockroach that referenced this issue Sep 23, 2019
We've seen instability recently due to invariants being violated as
replicas catch up across periods of being removed and re-added to a range.
Due to learner replicas and their rollback behavior this is now a relatively
common case. Rather than handle all of these various scenarios this PR prevents
them from occurring by actively removing replicas when we determine that they
must have been removed.

Here's a high level overview of the change:

 * Once a Replica object has a non-zero Replica.mu.replicaID it will not
   change.
   * In this commit however, if a node crashes it may forget that it learned
     about a replica ID.
 * If a raft message or snapshot addressed to a higher replica ID is received
   the current replica will be removed completely.
 * If a replica sees a ChangeReplicasTrigger which removes it then it
   completely removes itself while applying that command.
 * Replica.mu.destroyStatus is used to meaningfully signify the removal state
   of a Replica. Replicas about to be synchronously removed are in
   destroyReasonRemovalPending.

This hopefully gives us some new invariants:

 * There is only ever at most 1 Replica which IsAlive() for a range on a Store
   at a time.
 * Once a Replica has a non-zero ReplicaID, it never changes.
   * This applies only to the in-memory object, not the store itself.
 * Once a Replica applies a command as a part of the range descriptor it will
   never apply another command as a different Replica ID or outside of the
   Range.
   * Corollary: a Replica created as a learner will only ever apply commands
     while that replica is in the range.

The change also introduces some new complexity. Namely we now allow removal of
uninitialized replicas, including their hard state. This allows us to catch up
across a split even when we know the RHS must have been removed.

Fixes cockroachdb#40367.

Issue cockroachdb#38772 (comment)
manifests itself as the RHS not being found for a merge. This happens because
the Replica is processing commands to catch itself up while it is not in the
range. This is no longer possible.

Fixes cockroachdb#40257.

Issue cockroachdb#40257 is another case of a replica processing commands while it is not
in the range.

Fixes cockroachdb#40470.

Issue cockroachdb#40470 is caused by a RHS learning about its existence and removal
prior to a LHS processing a split. This case is now handled properly and is
tested.

Release justification: This commit is safe for 19.2 because it fixes release
blockers.

Release note (bug fix): Avoid internal re-use of Replica objects to fix the following crashes:

    cockroachdb#38772 "found rXXX:{-} [, next=0, gen=0?] in place of the RHS"
    cockroachdb#39796 "replica descriptor of local store not found in right hand side of split"
    cockroachdb#40470 "split trigger found right-hand side with tombstone"
    cockroachdb#40257 "snapshot widens existing replica, but no replica exists for subsumed key"
ajwerner added a commit to ajwerner/cockroach that referenced this issue Sep 23, 2019
We've seen instability recently due to invariants being violated as
replicas catch up across periods of being removed and re-added to a range.
Due to learner replicas and their rollback behavior this is now a relatively
common case. Rather than handle all of these various scenarios this PR prevents
them from occuring by actively removing replicas when we determine that they
must have been removed.

Here's a high level overview of the change:

 * Once a Replica object has a non-zero Replica.mu.replicaID it will not
   change.
   * In this commit however, if a node crashes it may forget that it learned
     about a replica ID.
 * If a raft message or snapshot addressed to a higher replica ID is received
   the current replica will be removed completely.
 * If a replica sees a ChangeReplicasTrigger which removes it then it
   completely removes itself while applying that command.
 * Replica.mu.destroyStatus is used to meaningfully signify the removal state
   of a Replica. Replicas about to be synchronously removed are in
   destroyReasonRemovalPending.

This hopefully gives us some new invariants:

 * There is only ever at most 1 Replica which IsAlive() for a range on a Store
   at a time.
 * Once a Replica has a non-zero ReplicaID is never changes.
   * This applies only to the in-memory object, not the store itself.
 * Once a Replica applies a command as a part of the range descriptor it will
   never apply another command as a different Replica ID or outside of the
   Range.
   * Corrolary: a Replica created as a learner will only ever apply commands
     while that replica is in the range.

The change also introduces some new complexity. Namely we now allow removal of
uninitialized replicas, including their hard state. This allows us to catch up
across a split even when we know the RHS must have been removed.

Fixes cockroachdb#40367.

Issue cockroachdb#38772 (comment)
manifests itself as the RHS not being found for a merge. This happens because
the Replica is processing commands to catch itself up while it is not in the
range. This is no longer possible.

Fixes cockroachdb#40257.

Issue cockroachdb#40257 is another case of a replica processing commands while it is not
in the range.

Fixes cockroachdb#40470.

Issue cockroachdb#40470 is caused by a RHS learning about its existence and removal
prior to a LHS processing a split. This case is now handled properly and is
tested.

Release justification: This commit is safe for 19.2 because it fixes release
blockers.

Release note (bug fix): Avoid internal re-use of Replica objects to fix the following crashes:

    cockroachdb#38772 "found rXXX:{-} [, next=0, gen=0?] in place of the RHS"
    cockroachdb#39796 "replica descriptor of local store not found in right hand side of split"
    cockroachdb#40470 "split trigger found right-hand side with tombstone"
    cockroachdb#40257 "snapshot widens existing replica, but no replica exists for subsumed key"
ajwerner added a commit to ajwerner/cockroach that referenced this issue Sep 23, 2019
We've seen instability recently due to invariants being violated as
replicas catch up across periods of being removed and re-added to a range.
Due to learner replicas and their rollback behavior this is now a relatively
common case. Rather than handle all of these various scenarios this PR prevents
them from occuring by actively removing replicas when we determine that they
must have been removed.

Here's a high level overview of the change:

 * Once a Replica object has a non-zero Replica.mu.replicaID it will not
   change.
   * In this commit however, if a node crashes it may forget that it learned
     about a replica ID.
 * If a raft message or snapshot addressed to a higher replica ID is received
   the current replica will be removed completely.
 * If a replica sees a ChangeReplicasTrigger which removes it then it
   completely removes itself while applying that command.
 * Replica.mu.destroyStatus is used to meaningfully signify the removal state
   of a Replica. Replicas about to be synchronously removed are in
   destroyReasonRemovalPending.

This hopefully gives us some new invariants:

 * There is only ever at most 1 Replica which IsAlive() for a range on a Store
   at a time.
 * Once a Replica has a non-zero ReplicaID is never changes.
   * This applies only to the in-memory object, not the store itself.
 * Once a Replica applies a command as a part of the range descriptor it will
   never apply another command as a different Replica ID or outside of the
   Range.
   * Corrolary: a Replica created as a learner will only ever apply commands
     while that replica is in the range.

The change also introduces some new complexity. Namely we now allow removal of
uninitialized replicas, including their hard state. This allows us to catch up
across a split even when we know the RHS must have been removed.

Fixes cockroachdb#40367.

Issue cockroachdb#38772 (comment)
manifests itself as the RHS not being found for a merge. This happens because
the Replica is processing commands to catch itself up while it is not in the
range. This is no longer possible.

Fixes cockroachdb#40257.

Issue cockroachdb#40257 is another case of a replica processing commands while it is not
in the range.

Fixes cockroachdb#40470.

Issue cockroachdb#40470 is caused by a RHS learning about its existence and removal
prior to a LHS processing a split. This case is now handled properly and is
tested.

Release justification: This commit is safe for 19.2 because it fixes release
blockers.

Release note (bug fix): Avoid internal re-use of Replica objects to fix the following crashes:

    cockroachdb#38772 "found rXXX:{-} [, next=0, gen=0?] in place of the RHS"
    cockroachdb#39796 "replica descriptor of local store not found in right hand side of split"
    cockroachdb#40470 "split trigger found right-hand side with tombstone"
    cockroachdb#40257 "snapshot widens existing replica, but no replica exists for subsumed key"
ajwerner added a commit to ajwerner/cockroach that referenced this issue Sep 23, 2019
We've seen instability recently due to invariants being violated as
replicas catch up across periods of being removed and re-added to a range.
Due to learner replicas and their rollback behavior this is now a relatively
common case. Rather than handle all of these various scenarios this PR prevents
them from occuring by actively removing replicas when we determine that they
must have been removed.

Here's a high level overview of the change:

 * Once a Replica object has a non-zero Replica.mu.replicaID it will not
   change.
   * In this commit however, if a node crashes it may forget that it learned
     about a replica ID.
 * If a raft message or snapshot addressed to a higher replica ID is received
   the current replica will be removed completely.
 * If a replica sees a ChangeReplicasTrigger which removes it then it
   completely removes itself while applying that command.
 * Replica.mu.destroyStatus is used to meaningfully signify the removal state
   of a Replica. Replicas about to be synchronously removed are in
   destroyReasonRemovalPending.

This hopefully gives us some new invariants:

 * There is only ever at most 1 Replica which IsAlive() for a range on a Store
   at a time.
 * Once a Replica has a non-zero ReplicaID is never changes.
   * This applies only to the in-memory object, not the store itself.
 * Once a Replica applies a command as a part of the range descriptor it will
   never apply another command as a different Replica ID or outside of the
   Range.
   * Corrolary: a Replica created as a learner will only ever apply commands
     while that replica is in the range.

The change also introduces some new complexity. Namely we now allow removal of
uninitialized replicas, including their hard state. This allows us to catch up
across a split even when we know the RHS must have been removed.

Fixes cockroachdb#40367.

Issue cockroachdb#38772 (comment)
manifests itself as the RHS not being found for a merge. This happens because
the Replica is processing commands to catch itself up while it is not in the
range. This is no longer possible.

Fixes cockroachdb#40257.

Issue cockroachdb#40257 is another case of a replica processing commands while it is not
in the range.

Fixes cockroachdb#40470.

Issue cockroachdb#40470 is caused by a RHS learning about its existence and removal
prior to a LHS processing a split. This case is now handled properly and is
tested.

Release justification: This commit is safe for 19.2 because it fixes release
blockers.

Release note (bug fix): Avoid internal re-use of Replica objects to fix the following crashes:

    cockroachdb#38772 "found rXXX:{-} [, next=0, gen=0?] in place of the RHS"
    cockroachdb#39796 "replica descriptor of local store not found in right hand side of split"
    cockroachdb#40470 "split trigger found right-hand side with tombstone"
    cockroachdb#40257 "snapshot widens existing replica, but no replica exists for subsumed key"
ajwerner added a commit to ajwerner/cockroach that referenced this issue Sep 24, 2019
We've seen instability recently due to invariants being violated as
replicas catch up across periods of being removed and re-added to a range.
Due to learner replicas and their rollback behavior this is now a relatively
common case. Rather than handle all of these various scenarios this PR prevents
them from occuring by actively removing replicas when we determine that they
must have been removed.

Here's a high level overview of the change:

 * Once a Replica object has a non-zero Replica.mu.replicaID it will not
   change.
   * In this commit however, if a node crashes it may forget that it learned
     about a replica ID.
 * If a raft message or snapshot addressed to a higher replica ID is received
   the current replica will be removed completely.
 * If a replica sees a ChangeReplicasTrigger which removes it then it
   completely removes itself while applying that command.
 * Replica.mu.destroyStatus is used to meaningfully signify the removal state
   of a Replica. Replicas about to be synchronously removed are in
   destroyReasonRemovalPending.

This hopefully gives us some new invariants:

 * There is only ever at most 1 Replica which IsAlive() for a range on a Store
   at a time.
 * Once a Replica has a non-zero ReplicaID is never changes.
   * This applies only to the in-memory object, not the store itself.
 * Once a Replica applies a command as a part of the range descriptor it will
   never apply another command as a different Replica ID or outside of the
   Range.
   * Corrolary: a Replica created as a learner will only ever apply commands
     while that replica is in the range.

The change also introduces some new complexity. Namely we now allow removal of
uninitialized replicas, including their hard state. This allows us to catch up
across a split even when we know the RHS must have been removed.

Fixes cockroachdb#40367.

Issue cockroachdb#38772 (comment)
manifests itself as the RHS not being found for a merge. This happens because
the Replica is processing commands to catch itself up while it is not in the
range. This is no longer possible.

Fixes cockroachdb#40257.

Issue cockroachdb#40257 is another case of a replica processing commands while it is not
in the range.

Fixes cockroachdb#40470.

Issue cockroachdb#40470 is caused by a RHS learning about its existence and removal
prior to a LHS processing a split. This case is now handled properly and is
tested.

Release justification: This commit is safe for 19.2 because it fixes release
blockers.

Release note (bug fix): Avoid internal re-use of Replica objects to fix the following crashes:

    cockroachdb#38772 "found rXXX:{-} [, next=0, gen=0?] in place of the RHS"
    cockroachdb#39796 "replica descriptor of local store not found in right hand side of split"
    cockroachdb#40470 "split trigger found right-hand side with tombstone"
    cockroachdb#40257 "snapshot widens existing replica, but no replica exists for subsumed key"
ajwerner added a commit to ajwerner/cockroach that referenced this issue Sep 24, 2019