
roachtest: tpccbench/nodes=6/cpu=16/multi-az failed #58049

Closed
cockroach-teamcity opened this issue Dec 17, 2020 · 12 comments · Fixed by #58161
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.

Comments

@cockroach-teamcity

(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on master@eda9189cecbbc279f1857f6e6b992bdfd363397e:

		  | runtime.futexsleep(0xc000507148, 0x0, 0xffffffffffffffff)
		  | 	/usr/local/go/src/runtime/os_linux.go:45 +0x46 fp=0x7fb7ecff8cb8 sp=0x7fb7ecff8c68 pc=0x44c9a6
		  | runtime.notesleep(0xc000507148)
		  | 	/usr/local/go/src/runtime/lock_futex.go:159 +0x9f fp=0x7fb7ecff8cf0 sp=0x7fb7ecff8cb8 pc=0x42517f
		  | runtime.stopm()
		  | 	/usr/local/go/src/runtime/proc.go:1924 +0xc5 fp=0x7fb7ecff8d18 sp=0x7fb7ecff8cf0 pc=0x4571c5
		  | runtime.findrunnable(0xc00005c000, 0x0)
		  | 	/usr/local/go/src/runtime/proc.go:2485 +0xa7f fp=0x7fb7ecff8e00 sp=0x7fb7ecff8d18 pc=0x45899f
		  | runtime.schedule()
		  | 	/usr/local/go/src/runtime/proc.go:2683 +0x2d7 fp=0x7fb7ecff8e68 sp=0x7fb7ecff8e00 pc=0x459617
		  | runtime.park_m(0xc000103c80)
		  | 	/usr/local/go/src/runtime/proc.go:2851 +0x9d fp=0x7fb7ecff8e98 sp=0x7fb7ecff8e68 pc=0x459bbd
		  | runtime.mcall(0x800000)
		  | 	/usr/local/go/src/runtime/asm_amd64.s:318 +0x5b fp=0x7fb7ecff8ea8 sp=0x7fb7ecff8e98 pc=0x4872db
		  | rax    0xca
		  | rbx    0xc000507000
		  | rcx    0x48af63
		  | rdx    0x0
		  | rdi    0xc000507148
		  | rsi    0x80
		  | rbp    0x7fb7ecff8ca8
		  | rsp    0x7fb7ecff8c60
		  | r8     0x0
		  | r9     0x0
		  | r10    0x0
		  | r11    0x286
		  | r12    0x3
		  | r13    0xc000682600
		  | r14    0x0
		  | r15    0x0
		  | rip    0x48af61
		  | rflags 0x286
		  | cs     0x33
		  | fs     0x0
		  | gs     0x0
		  | bash: line 1:  6969 Aborted                 (core dumped) bash -c "./workload run tpcc --warehouses=5000 --active-warehouses=2575 --tolerate-errors --scatter --ramp=5m0s --duration=10m0s 'postgres://[email protected]:26257?sslmode=disable' 'postgres://[email protected]:26257?sslmode=disable' 'postgres://[email protected]:26257?sslmode=disable' 'postgres://[email protected]:26257?sslmode=disable' 'postgres://[email protected]:26257?sslmode=disable' 'postgres://[email protected]:26257?sslmode=disable' --histograms=perf/warehouses=2575/stats.json"
		  | Error: COMMAND_PROBLEM: exit status 134
		  | (1) COMMAND_PROBLEM
		  | Wraps: (2) Node 7. Command with error:
		  |   | ```
		  |   | ./workload run tpcc --warehouses=5000 --active-warehouses=2575 --tolerate-errors --scatter --ramp=5m0s --duration=10m0s {pgurl:1-6} --histograms=perf/warehouses=2575/stats.json
		  |   | ```
		  | Wraps: (3) exit status 134
		  | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		Wraps: (10) exit status 20
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.withPrefix (7) *withstack.withStack (8) *errutil.withPrefix (9) *main.withCommandDetails (10) *exec.ExitError

	test_runner.go:819: test timed out (10h0m0s)


Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Dec 17, 2020
@cockroach-teamcity

(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on master@46919380225dba7122130c338744b561d7eb6c56:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpccbench/nodes=6/cpu=16/multi-az/run_1
	cluster.go:2180,tpcc.go:747,search.go:43,search.go:173,tpcc.go:743,tpcc.go:566,test_runner.go:760: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod stop teamcity-2528188-1608274662-60-n7cpu16-geo:1-6 returned: exit status 1
		(1) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod stop teamcity-2528188-1608274662-60-n7cpu16-geo:1-6 returned
		  | stderr:
		  |
		  | stdout:
		  | <... some data truncated by circular buffer; go to artifacts for details ...>
		  | ..........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
.......
		  | 4: exit status 255: 
		  | I201218 14:22:18.690662 1 (gostd) cluster_synced.go:1732  command failed
		Wraps: (2) exit status 1
		Error types: (1) *main.withCommandDetails (2) *exec.ExitError


@cockroach-teamcity

(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on master@a782ee8d93a23fc53eedada3c51893c19f7bb41e:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpccbench/nodes=6/cpu=16/multi-az/run_1
	cluster.go:2648,tpcc.go:725,tpcc.go:566,test_runner.go:760: monitor failure: unexpected node event: 1: dead
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitor).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2636
		  | main.(*monitor).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2644
		  | main.runTPCCBench
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:725
		  | main.registerTPCCBenchSpec.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:566
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:760
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (2) monitor failure
		Wraps: (3) unexpected node event: 1: dead
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *errors.errorString


@cockroach-teamcity

(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on master@75efa1fe6f2096adc9db474026a2b7235e53c388:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpccbench/nodes=6/cpu=16/multi-az/run_1
	cluster.go:2648,tpcc.go:725,tpcc.go:566,test_runner.go:760: monitor failure: unexpected node event: 1: dead
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitor).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2636
		  | main.(*monitor).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2644
		  | main.runTPCCBench
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:725
		  | main.registerTPCCBenchSpec.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:566
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:760
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (2) monitor failure
		Wraps: (3) unexpected node event: 1: dead
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *errors.errorString


@nvanbenschoten

The last few crashes have looked like:

I201220 12:24:18.630526 6288300 jobs/registry.go:1117 ⋮ [n1] AUTO CREATE STATS job 617338016760102913: stepping through state succeeded with error: <nil>
E201220 12:24:18.665806 6288300 1@util/log/logcrash/crash_reporting.go:140 ⋮ [intExec=‹create-stats›] a panic has occurred!
interface conversion: tree.Datum is tree.dNull, not *tree.DBytes
(1) attached stack trace
  -- stack trace:
  | runtime.gopanic
  | 	/usr/local/go/src/runtime/panic.go:969
  | runtime.panicdottypeE
  | 	/usr/local/go/src/runtime/iface.go:261
  | runtime.panicdottypeI
  | 	/usr/local/go/src/runtime/iface.go:271
  | github.com/cockroachdb/cockroach/pkg/jobs.(*Job).Update.func1
  | 	/go/src/github.com/cockroachdb/cockroach/pkg/jobs/update.go:133
  | github.com/cockroachdb/cockroach/pkg/kv.(*DB).Txn.func1
  | 	/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:749
  | github.com/cockroachdb/cockroach/pkg/kv.(*Txn).exec
  | 	/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn.go:811
  | github.com/cockroachdb/cockroach/pkg/kv.(*DB).Txn
  | 	/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:748
  | github.com/cockroachdb/cockroach/pkg/jobs.(*Job).runInTxn
  | 	/go/src/github.com/cockroachdb/cockroach/pkg/jobs/jobs.go:693
  | github.com/cockroachdb/cockroach/pkg/jobs.(*Job).Update
  | 	/go/src/github.com/cockroachdb/cockroach/pkg/jobs/update.go:108
  | github.com/cockroachdb/cockroach/pkg/jobs.(*Job).succeeded
  | 	/go/src/github.com/cockroachdb/cockroach/pkg/jobs/jobs.go:594
  | github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).stepThroughStateMachine
  | 	/go/src/github.com/cockroachdb/cockroach/pkg/jobs/registry.go:1185
  | github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).stepThroughStateMachine
  | 	/go/src/github.com/cockroachdb/cockroach/pkg/jobs/registry.go:1139
  | github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).runJob
  | 	/go/src/github.com/cockroachdb/cockroach/pkg/jobs/adopt.go:258
  | github.com/cockroachdb/cockroach/pkg/jobs.(*StartableJob).Start.func2
  | 	/go/src/github.com/cockroachdb/cockroach/pkg/jobs/jobs.go:853
  | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask.func1
  | 	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:347
  | runtime.goexit
  | 	/usr/local/go/src/runtime/asm_amd64.s:1374

cc. @lucy-zhang
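
The panic above is Go's unchecked interface type assertion: the update code asserts the `claim_session_id` datum to `*tree.DBytes`, but a SQL NULL arrives as `tree.dNull`. A minimal standalone sketch of the failing pattern and the comma-ok form that avoids it (the `Datum`, `dNull`, and `DBytes` types here are simplified stand-ins for the real `tree` types, not the actual CockroachDB code):

```go
package main

import "fmt"

// Datum mimics a SQL value; dNull and DBytes are two concrete kinds,
// loosely modeled on tree.dNull and tree.DBytes.
type Datum interface{ String() string }

type dNull struct{}

func (dNull) String() string { return "NULL" }

type DBytes []byte

func (d DBytes) String() string { return string(d) }

func main() {
	var d Datum = dNull{}
	// An unchecked assertion like d.(DBytes) would panic here with
	// "interface conversion: main.Datum is main.dNull, not main.DBytes",
	// mirroring the crash in the stack trace above.
	// The comma-ok form lets the caller handle NULL gracefully instead.
	if b, ok := d.(DBytes); ok {
		fmt.Println("claim session:", b)
	} else {
		fmt.Println("claim_session_id is NULL; treating claim as lost")
	}
}
```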

@thoszhang

We last saw something like this in #55481. I think it means that the claim_session_id we got from this query is null:

```go
stmt = "SELECT status, payload, progress, claim_session_id FROM system." +
	"jobs WHERE id = $1 FOR UPDATE"
```

@ajwerner @spaskob any idea what's happening here?

@spaskob

spaskob commented Dec 21, 2020

I have to investigate further, but if this is indeed NULL, then we have a broken invariant in the jobs system, namely that all running jobs must have a non-NULL claim_session_id.

@spaskob

spaskob commented Dec 21, 2020

It is an AUTO CREATE STATS job. Let me look into how these jobs are created.

@spaskob

spaskob commented Dec 21, 2020

I wonder if this recent change to jobs/update.go is relevant: #57742

@ajwerner

> namely all jobs that are running must have a non NULL claim_session_id.

This invariant confuses me. The way we detect a job as available to be claimed is by setting the claim_session_id to NULL. Adoptable jobs start with a NULL value for claim_session_id as well. Lastly, we have a loop which is solely for the purpose of setting the claim_session_id to NULL when the session expires. We probably need to be more defensive when updating to note when the session claim is lost.

That being said, if that were the case, a session would have had to expire. Let me go through the logs.

@ajwerner

Yeah, that's exactly what happened. This cluster was pretty toast, everything was timing out and the node failed to heartbeat its sqlliveness session.

@ajwerner

At least it's an easy fix, will need to get backported.

@ajwerner

Actually, ignore the backport, this logic only exists on master.

craig bot pushed a commit that referenced this issue Dec 22, 2020
58161: jobs: fix panic when a job has lost its claim during update r=spaskob a=ajwerner

Jobs can lose their claims. We detect when the claim doesn't match a new
claim but we don't detect when the claim has been set to NULL (which is
effectively equivalent). This is important because the way we clear claims
is to set them to `NULL` and then we adopt jobs which have a `NULL` claim.
This bug was introduced in #55120 and has been released only in the 21.1
alpha.

Fixes #58049.

Release note (bug fix): Fixed a bug from the previous alpha where the binary
could crash if a running node lost its claim to a job while updating.

Co-authored-by: Andrew Werner <[email protected]>
@craig craig bot closed this as completed in 21c3ad1 Dec 22, 2020