
roachtest: tpccbench/nodes=6/cpu=16/multi-az failed #58049

Closed
cockroach-teamcity opened this issue Dec 17, 2020 · 12 comments · Fixed by #58161
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.

Comments

@cockroach-teamcity

(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on master@eda9189cecbbc279f1857f6e6b992bdfd363397e:

		  | runtime.futexsleep(0xc000507148, 0x0, 0xffffffffffffffff)
		  | 	/usr/local/go/src/runtime/os_linux.go:45 +0x46 fp=0x7fb7ecff8cb8 sp=0x7fb7ecff8c68 pc=0x44c9a6
		  | runtime.notesleep(0xc000507148)
		  | 	/usr/local/go/src/runtime/lock_futex.go:159 +0x9f fp=0x7fb7ecff8cf0 sp=0x7fb7ecff8cb8 pc=0x42517f
		  | runtime.stopm()
		  | 	/usr/local/go/src/runtime/proc.go:1924 +0xc5 fp=0x7fb7ecff8d18 sp=0x7fb7ecff8cf0 pc=0x4571c5
		  | runtime.findrunnable(0xc00005c000, 0x0)
		  | 	/usr/local/go/src/runtime/proc.go:2485 +0xa7f fp=0x7fb7ecff8e00 sp=0x7fb7ecff8d18 pc=0x45899f
		  | runtime.schedule()
		  | 	/usr/local/go/src/runtime/proc.go:2683 +0x2d7 fp=0x7fb7ecff8e68 sp=0x7fb7ecff8e00 pc=0x459617
		  | runtime.park_m(0xc000103c80)
		  | 	/usr/local/go/src/runtime/proc.go:2851 +0x9d fp=0x7fb7ecff8e98 sp=0x7fb7ecff8e68 pc=0x459bbd
		  | runtime.mcall(0x800000)
		  | 	/usr/local/go/src/runtime/asm_amd64.s:318 +0x5b fp=0x7fb7ecff8ea8 sp=0x7fb7ecff8e98 pc=0x4872db
		  | rax    0xca
		  | rbx    0xc000507000
		  | rcx    0x48af63
		  | rdx    0x0
		  | rdi    0xc000507148
		  | rsi    0x80
		  | rbp    0x7fb7ecff8ca8
		  | rsp    0x7fb7ecff8c60
		  | r8     0x0
		  | r9     0x0
		  | r10    0x0
		  | r11    0x286
		  | r12    0x3
		  | r13    0xc000682600
		  | r14    0x0
		  | r15    0x0
		  | rip    0x48af61
		  | rflags 0x286
		  | cs     0x33
		  | fs     0x0
		  | gs     0x0
		  | bash: line 1:  6969 Aborted                 (core dumped) bash -c "./workload run tpcc --warehouses=5000 --active-warehouses=2575 --tolerate-errors --scatter --ramp=5m0s --duration=10m0s 'postgres://[email protected]:26257?sslmode=disable' 'postgres://[email protected]:26257?sslmode=disable' 'postgres://[email protected]:26257?sslmode=disable' 'postgres://[email protected]:26257?sslmode=disable' 'postgres://[email protected]:26257?sslmode=disable' 'postgres://[email protected]:26257?sslmode=disable' --histograms=perf/warehouses=2575/stats.json"
		  | Error: COMMAND_PROBLEM: exit status 134
		  | (1) COMMAND_PROBLEM
		  | Wraps: (2) Node 7. Command with error:
		  |   | ```
		  |   | ./workload run tpcc --warehouses=5000 --active-warehouses=2575 --tolerate-errors --scatter --ramp=5m0s --duration=10m0s {pgurl:1-6} --histograms=perf/warehouses=2575/stats.json
		  |   | ```
		  | Wraps: (3) exit status 134
		  | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		Wraps: (10) exit status 20
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.withPrefix (7) *withstack.withStack (8) *errutil.withPrefix (9) *main.withCommandDetails (10) *exec.ExitError

	test_runner.go:819: test timed out (10h0m0s)


Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Dec 17, 2020
@cockroach-teamcity

(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on master@46919380225dba7122130c338744b561d7eb6c56:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpccbench/nodes=6/cpu=16/multi-az/run_1
	cluster.go:2180,tpcc.go:747,search.go:43,search.go:173,tpcc.go:743,tpcc.go:566,test_runner.go:760: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod stop teamcity-2528188-1608274662-60-n7cpu16-geo:1-6 returned: exit status 1
		(1) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod stop teamcity-2528188-1608274662-60-n7cpu16-geo:1-6 returned
		  | stderr:
		  |
		  | stdout:
		  | <... some data truncated by circular buffer; go to artifacts for details ...>
		  | ..........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
.......
		  | 4: exit status 255: 
		  | I201218 14:22:18.690662 1 (gostd) cluster_synced.go:1732  command failed
		Wraps: (2) exit status 1
		Error types: (1) *main.withCommandDetails (2) *exec.ExitError


@cockroach-teamcity

(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on master@a782ee8d93a23fc53eedada3c51893c19f7bb41e:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpccbench/nodes=6/cpu=16/multi-az/run_1
	cluster.go:2648,tpcc.go:725,tpcc.go:566,test_runner.go:760: monitor failure: unexpected node event: 1: dead
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitor).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2636
		  | main.(*monitor).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2644
		  | main.runTPCCBench
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:725
		  | main.registerTPCCBenchSpec.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:566
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:760
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (2) monitor failure
		Wraps: (3) unexpected node event: 1: dead
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *errors.errorString


@cockroach-teamcity

(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on master@75efa1fe6f2096adc9db474026a2b7235e53c388:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpccbench/nodes=6/cpu=16/multi-az/run_1
	cluster.go:2648,tpcc.go:725,tpcc.go:566,test_runner.go:760: monitor failure: unexpected node event: 1: dead
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitor).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2636
		  | main.(*monitor).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2644
		  | main.runTPCCBench
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:725
		  | main.registerTPCCBenchSpec.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:566
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:760
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (2) monitor failure
		Wraps: (3) unexpected node event: 1: dead
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *errors.errorString


@nvanbenschoten

The last few crashes have looked like:

I201220 12:24:18.630526 6288300 jobs/registry.go:1117 ⋮ [n1] AUTO CREATE STATS job 617338016760102913: stepping through state succeeded with error: <nil>
E201220 12:24:18.665806 6288300 1@util/log/logcrash/crash_reporting.go:140 ⋮ [intExec=‹create-stats›] a panic has occurred!
interface conversion: tree.Datum is tree.dNull, not *tree.DBytes
(1) attached stack trace
  -- stack trace:
  | runtime.gopanic
  | 	/usr/local/go/src/runtime/panic.go:969
  | runtime.panicdottypeE
  | 	/usr/local/go/src/runtime/iface.go:261
  | runtime.panicdottypeI
  | 	/usr/local/go/src/runtime/iface.go:271
  | github.com/cockroachdb/cockroach/pkg/jobs.(*Job).Update.func1
  | 	/go/src/github.com/cockroachdb/cockroach/pkg/jobs/update.go:133
  | github.com/cockroachdb/cockroach/pkg/kv.(*DB).Txn.func1
  | 	/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:749
  | github.com/cockroachdb/cockroach/pkg/kv.(*Txn).exec
  | 	/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn.go:811
  | github.com/cockroachdb/cockroach/pkg/kv.(*DB).Txn
  | 	/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:748
  | github.com/cockroachdb/cockroach/pkg/jobs.(*Job).runInTxn
  | 	/go/src/github.com/cockroachdb/cockroach/pkg/jobs/jobs.go:693
  | github.com/cockroachdb/cockroach/pkg/jobs.(*Job).Update
  | 	/go/src/github.com/cockroachdb/cockroach/pkg/jobs/update.go:108
  | github.com/cockroachdb/cockroach/pkg/jobs.(*Job).succeeded
  | 	/go/src/github.com/cockroachdb/cockroach/pkg/jobs/jobs.go:594
  | github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).stepThroughStateMachine
  | 	/go/src/github.com/cockroachdb/cockroach/pkg/jobs/registry.go:1185
  | github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).stepThroughStateMachine
  | 	/go/src/github.com/cockroachdb/cockroach/pkg/jobs/registry.go:1139
  | github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).runJob
  | 	/go/src/github.com/cockroachdb/cockroach/pkg/jobs/adopt.go:258
  | github.com/cockroachdb/cockroach/pkg/jobs.(*StartableJob).Start.func2
  | 	/go/src/github.com/cockroachdb/cockroach/pkg/jobs/jobs.go:853
  | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask.func1
  | 	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:347
  | runtime.goexit
  | 	/usr/local/go/src/runtime/asm_amd64.s:1374

cc. @lucy-zhang
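
The panic above is Go's unchecked interface type assertion: the update code asserts the `claim_session_id` datum to `*tree.DBytes`, but a SQL NULL arrives as `tree.dNull`. A minimal standalone sketch of the failing pattern and the comma-ok form that avoids it (the `Datum`, `dNull`, and `DBytes` types here are simplified stand-ins for the real `tree` types, not the actual CockroachDB code):

```go
package main

import "fmt"

// Datum mimics a SQL value; dNull and DBytes are two concrete kinds,
// loosely modeled on tree.dNull and tree.DBytes.
type Datum interface{ String() string }

type dNull struct{}

func (dNull) String() string { return "NULL" }

type DBytes []byte

func (d DBytes) String() string { return string(d) }

func main() {
	var d Datum = dNull{}
	// An unchecked assertion like d.(DBytes) would panic here with
	// "interface conversion: main.Datum is main.dNull, not main.DBytes",
	// mirroring the crash in the stack trace above.
	// The comma-ok form lets the caller handle NULL gracefully instead.
	if b, ok := d.(DBytes); ok {
		fmt.Println("claim session:", b)
	} else {
		fmt.Println("claim_session_id is NULL; treating claim as lost")
	}
}
```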

@thoszhang

We last saw something like this in #55481. I think it means that the claim_session_id we got from this query is null:

```go
stmt = "SELECT status, payload, progress, claim_session_id FROM system." +
	"jobs WHERE id = $1 FOR UPDATE"
```

@ajwerner @spaskob any idea what's happening here?

@spaskob

spaskob commented Dec 21, 2020

I have to investigate further, but if this is indeed NULL, then we have a broken invariant in the jobs system, namely that all running jobs must have a non-NULL claim_session_id.

@spaskob

spaskob commented Dec 21, 2020

It is an AUTO CREATE STATS job. Let me look into how these jobs are created.

@spaskob

spaskob commented Dec 21, 2020

I wonder if this recent change to jobs/update.go is relevant: #57742

@ajwerner

> namely all jobs that are running must have a non NULL claim_session_id.

This invariant confuses me. The way we detect a job as available to be claimed is by setting the claim_session_id to NULL. Adoptable jobs start with a NULL value for claim_session_id as well. Lastly, we have a loop which is solely for the purpose of setting the claim_session_id to NULL when the session expires. We probably need to be more defensive when updating to note when the session claim is lost.

That being said, if that were the case, a session would have had to expire. Let me go through the logs.

@ajwerner

Yeah, that's exactly what happened. This cluster was pretty toast, everything was timing out and the node failed to heartbeat its sqlliveness session.

@ajwerner

At least it's an easy fix, will need to get backported.

@ajwerner

Actually, ignore the backport, this logic only exists on master.

craig bot pushed a commit that referenced this issue Dec 22, 2020
58161: jobs: fix panic when a job has lost its claim during update r=spaskob a=ajwerner

Jobs can lose their claims. We detect when the claim doesn't match a new
claim but we don't detect when the claim has been set to NULL (which is
effectively equivalent). This is important because the way we clear claims
is to set them to `NULL` and then we adopt jobs which have a `NULL` claim.
This bug was introduced in #55120 and has been released only in the 21.1
alpha.

Fixes #58049.

Release note (bug fix): Fixed a bug from the previous alpha where the binary
could crash if a running node lost its claim to a job while updating.

Co-authored-by: Andrew Werner <[email protected]>
@craig craig bot closed this as completed in 21c3ad1 Dec 22, 2020