Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: import/tpch/nodes=4 failed [release-2.1] #32900

Closed
cockroach-teamcity opened this issue Dec 6, 2018 · 12 comments
Closed

roachtest: import/tpch/nodes=4 failed [release-2.1] #32900

cockroach-teamcity opened this issue Dec 6, 2018 · 12 comments
Assignees
Labels
C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot.
Milestone

Comments

@cockroach-teamcity
Copy link
Member

SHA: https://github.com/cockroachdb/cockroach/commits/1146a03cc217cb57bdddd795e2d2fe2806c64985

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpch/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1042719&tab=buildLog

The test failed on release-2.1:
	test.go:1025: test timed out (6h0m0s)
	test.go:630,cluster.go:1488,import.go:132: context canceled

@cockroach-teamcity cockroach-teamcity added this to the 2.2 milestone Dec 6, 2018
@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Dec 6, 2018
@tbg tbg assigned tbg and unassigned andreimatei Dec 6, 2018
@tbg
Copy link
Member

tbg commented Dec 6, 2018

release-2.1.

@tbg tbg changed the title roachtest: import/tpch/nodes=4 failed roachtest: import/tpch/nodes=4 failed [release-2.1] Dec 11, 2018
tbg added a commit to tbg/cockroach that referenced this issue Dec 11, 2018
…snapshots

Backports cockroachdb#32594. This didn't apply cleanly, but only because I never
cherry-picked cockroachdb#32594 and cockroachdb#32594 refers a variable introduced within.

Fixes cockroachdb#32898.
Fixes cockroachdb#32900.
Fixes cockroachdb#32895.

/cc @cockroachdb/release

Release note (bug fix): resolve a cluster degradation scenario that
could occur during IMPORT/RESTORE operations, manifested through a
high number of pending Raft snapshots.
@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/138369ec8091afb0ccabdc1710928b8a47f3797f

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpch/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1050805&tab=buildLog

The test failed on release-2.1:
	test.go:1025: test timed out (6h0m0s)
	test.go:630,cluster.go:1486,import.go:132: context canceled

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/153fcf347cd6a53517f607ec1e6c7c14193640dd

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpch/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1052961&tab=buildLog

The test failed on release-2.1:
	test.go:630,test.go:642: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod create teamcity-1052961-import-tpch-nodes-4 -n 4 --gce-machine-type=n1-standard-4 --gce-zones=us-central1-b,us-west1-b,europe-west2-b returned:
		stderr:
		
		stdout:
		e: Command: gcloud [compute instances create --subnet default --maintenance-policy MIGRATE --scopes default,storage-rw --image ubuntu-1604-xenial-v20181030 --image-project ubuntu-os-cloud --boot-disk-size 10 --boot-disk-type pd-ssd --service-account [email protected] --local-ssd interface=SCSI --machine-type n1-standard-4 --labels lifetime=12h0m0s --metadata-from-file startup-script=/home/agent/temp/buildTmp/gce-startup-script215292777 --project cockroach-ephemeral]
		Output: WARNING: Some requests generated warnings:
		 - The resource 'projects/ubuntu-os-cloud/global/images/ubuntu-1604-xenial-v20181030' is deprecated. A suggested replacement is 'projects/ubuntu-os-cloud/global/images/ubuntu-1604-xenial-v20181204'.
		
		ERROR: (gcloud.compute.instances.create) Could not fetch resource:
		 - The zone 'projects/cockroach-ephemeral/zones/us-central1-b' does not have enough resources available to fulfill the request.  Try a different zone, or try again later.
		
		: exit status 1
		Cleaning up...
		: exit status 1

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/0c87b11cb99ba5c677c95ded55dcba385928474e

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpch/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1054703&tab=buildLog

The test failed on release-2.1:
	test.go:1023: test timed out (6h0m0s)
	test.go:628,cluster.go:1486,import.go:132: context canceled

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/7efc92a4dec689efc855ecd382a6f6b6065b98ec

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpch/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1055192&tab=buildLog

The test failed on master:
	test.go:628,cluster.go:1486,import.go:132: pq: gs://cockroach-fixtures/tpch-csv/sf-100/lineitem.tbl.3: row 71049066: reading CSV record: read tcp 10.128.0.13:33442->74.125.124.128:443: read: connection reset by peer

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/334cce5d61b32b0bb4a300668522c38fb9d6b96d

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpch/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1056651&tab=buildLog

The test failed on master:
	test.go:628,cluster.go:1486,import.go:132: pq: gs://cockroach-fixtures/tpch-csv/sf-100/lineitem.tbl.4: row 8016572: reading CSV record: read tcp 10.128.0.33:43298->64.233.182.128:443: read: connection reset by peer

@tbg
Copy link
Member

tbg commented Dec 18, 2018

For the last real test failure (https://teamcity.cockroachdb.com/viewLog.html?buildId=1054703&tab=buildResultsDiv&buildTypeId=Cockroach_Nightlies_WorkloadNightly#testNameId7244833067432458304), it doesn't seem that this is the usual kind of problem where something happens with snapshots.

The disk usage when the 6h timeout hits indicates that there hasn't been import activity yet:

diskusage: 05:08:44 restore.go:213: n#2: 55 GiB, n#1: 52 GiB, n#3: 58 MiB, n#4: 30 MiB

and looking through the logs they look like an idle cluster, which is the telltale sign of the restore phase not having started yet.

@mjibson can I toss this your way?

@tbg
Copy link
Member

tbg commented Dec 18, 2018

(There's no doubt also a different failure mode here that does involve core)

@petermattis
Copy link
Collaborator

Fixed by #33228.

@andreimatei
Copy link
Contributor

@petermattis what made you think this is fixed by that PR? The failure seems different from the other tests @tbg closed with that PR. No?

@tbg
Copy link
Member

tbg commented Dec 19, 2018

No, they look the same: imports timing out on 2.1. Not sure why I missed closing this one.

@petermattis
Copy link
Collaborator

@andreimatei The Fixes lines in #33228 explicitly mention this issue. I was just following @tbg's wishes. I'm not sure why it wasn't closed automatically when that PR merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot.
Projects
None yet
Development

No branches or pull requests

4 participants