roachtest: alterpk-tpcc failed #45812
Comments
(roachtest).alterpk-tpcc failed on master@752dea867f3aeb142e98c22f8d320ce19041aa8d:
|
(roachtest).alterpk-tpcc failed on master@dfa5bd527ae7d7373dd03c62118df87a87a77130:
|
cc @lucy-zhang is this failure related to the new sc + jobs PR? Looking at the logs, it seems like there is some deadlock while trying to run the schema change. Each node has these repeated logs:
and no progress is made for 10 hours |
954fe69 doesn't have those changes (and I rolled them back on master anyway, for the time being, to deal with a problem with migrations). That log message is coming from here: cockroach/pkg/sql/schema_changer.go, line 1026, at 954fe69
So it's a problem with trying to acquire a schema change lease, which hopefully should be solved once I do merge that PR. (I don't know what caused the deadlock, though.) Let's stress this test again once that PR is merged. |
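For context on what "trying to acquire a schema change lease" involves, here is a minimal sketch of the shape of the retry loop the schema changer can get stuck in; the `acquireLease` helper, sentinel error, and backoff constants are hypothetical stand-ins for illustration, not the actual code at schema_changer.go line 1026.

```go
package schemachange

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errExistingLease is a hypothetical sentinel meaning another node already
// holds the schema change lease for this table.
var errExistingLease = errors.New("an outstanding schema change lease exists")

// acquireLease stands in for the real lease acquisition, which records the
// lease on the table descriptor; here it always fails so the loop is visible.
func acquireLease(ctx context.Context, tableID int64) error {
	return errExistingLease
}

// runWithLease shows the retry shape: back off and retry while another lease
// is outstanding. If the holder never releases it (e.g. its schema change is
// stuck), every node loops here indefinitely, which would match the repeated
// log lines and the 10 hours of no progress.
func runWithLease(ctx context.Context, tableID int64) error {
	backoff := time.Second
	for {
		err := acquireLease(ctx, tableID)
		if err == nil {
			return nil // lease acquired; proceed with the schema change
		}
		if !errors.Is(err, errExistingLease) {
			return err // unexpected error, give up
		}
		fmt.Printf("another schema change lease exists for table %d, retrying in %s\n", tableID, backoff)
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
		}
		if backoff < time.Minute {
			backoff *= 2
		}
	}
}
```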
(roachtest).alterpk-tpcc failed on master@c473f40078994551cebcbe00fdbf1fa388957658:
|
I'm investigating this issue a little more and don't understand what's happening in the logs. We see here
that a backfill has started for our index. However, after that we see that all the nodes start trying to grab the lease: Then, we see that validation for the index backfills has completed:
However, it then doesn't look like we make it past
But it also doesn't look like we started to roll back the schema change either. |
@rohany Now that the new schema change job is merged, I think we should wait a few days and see if we get any more failures. The thing that's happening here (being unable to acquire a schema change lease) shouldn't be possible anymore, since schema change leases won't exist, though maybe there's some deeper underlying problem that would also cause problems with the new schema change job. |
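With the job-based schema changes, the progress (or lack of it) of an ALTER PRIMARY KEY should be visible through the jobs table rather than through lease messages. As a rough sketch of how to poll for that from Go, assuming a standard `database/sql` connection over pgwire and the `crdb_internal.jobs` virtual table (whose column set varies between versions):

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // CockroachDB speaks the Postgres wire protocol
)

func main() {
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// List schema change jobs with their status and progress.
	rows, err := db.Query(`
		SELECT job_id, status, fraction_completed, description
		FROM crdb_internal.jobs
		WHERE job_type = 'SCHEMA CHANGE'
		ORDER BY created DESC`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var id int64
		var status, desc string
		var frac sql.NullFloat64 // may be NULL for jobs that never started
		if err := rows.Scan(&id, &status, &frac, &desc); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%d %-10s %5.1f%% %s\n", id, status, frac.Float64*100, desc)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```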
(roachtest).alterpk-tpcc failed on master@72c4a1bd411f2f82bf9aaa22883821a946614148:
|
I think the failure above still doesn't have the new sc jobs commit yet. |
(roachtest).alterpk-tpcc failed on master@5570c01402796edb7cd06eb8ce7f615371f22d42:
|
This PR adds some extra logging to the backfiller and alterpk roachtest to have more insight on where cockroachdb#45812 is failing. Release justification: non production code change Release note: None
It's unclear from these logs what exactly is going wrong, though the situation still seems similar to the logs from before the schema change jobs PR. One thing is that 500 warehouses is probably too big to run on a 3 node 4 CPU cluster, but that doesn't explain the hang. |
46001: roachtest: add extra logging to alterpk-tpcc roachtest r=yuzefovich a=rohany This PR adds some extra logging to the backfiller and alterpk roachtest to have more insight on where #45812 is failing. Release justification: non production code change Release note: None Co-authored-by: Rohan Yadav <[email protected]>
(roachtest).alterpk-tpcc failed on master@793a9200c16693aff32aa6a4dd9d8bbcbddb30aa:
|
(roachtest).alterpk-tpcc failed on master@69dc87d68addedf2fabfb2b14c098cfb35b5f3d0:
|
Hmm, latest failure just looks like a random node failure. Haven't seen the deadlock in the past 2 runs... |
Eh, the OOM can be reliably reproduced on a 3 node roachprod cluster with default
Here is an explain analyze (on an occasion when it didn't crash) of the query, for context on the amount of data flying around:
We seem to be estimating the memory usage of the aggregators relatively well now. For example, here is one of the heap profiles taken when the first stage aggregator finishes and the second one is about to:
In all of the crashes I've observed, we successfully get to the point where the hash join is being performed and actually get pretty far into the join's execution. However, RSS jumps quite significantly (say from
I'm tired of looking at this issue and am wondering whether someone else should take a stab at it. Here are a few suggestions:
|
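The gap described above, where the accounted memory looks reasonable in the heap profiles but RSS still jumps during the hash join, is the kind of thing that can be narrowed down by logging Go runtime memory stats next to the accounted totals. A rough sketch follows; `accountedBytes` is a hypothetical hook for whatever the SQL memory monitors report, not a real API.

```go
package memwatch

import (
	"log"
	"runtime"
	"time"
)

// accountedBytes is a hypothetical hook returning the total currently
// registered with the SQL memory monitors; in a real investigation this
// would be wired up to the root BytesMonitor.
var accountedBytes = func() int64 { return 0 }

// watch periodically logs Go heap usage next to the accounted total. A large,
// growing gap between HeapInuse (or Sys) and the accounted number points at
// allocations the monitors never see, e.g. temporary copies or fragmentation.
func watch(stop <-chan struct{}) {
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			var m runtime.MemStats
			runtime.ReadMemStats(&m)
			log.Printf("heap_inuse=%dMiB sys=%dMiB accounted=%dMiB",
				m.HeapInuse>>20, m.Sys>>20, accountedBytes()>>20)
		}
	}
}
```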
Hi @yuzefovich -- sorry for my slow reply to you earlier -- since you've disabled auto stats, does this mean you've ruled out histograms as the cause? (You can also explicitly disable histograms using |
I still think there is memory misaccounting while performing automatic stats collection, but the crashes occur (seemingly with lower likelihood) with auto stats disabled, so I don't think they are the root cause of these OOMs. |
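For reference, the experiment above (running with auto stats off, and optionally histogram collection off) is driven by cluster settings. The setting names below are my best guess and should be verified with SHOW CLUSTER SETTINGS; this is only a sketch of how one might toggle them from Go.

```go
package stats

import (
	"context"
	"database/sql"
)

// disableAutoStats turns off automatic stats collection and histogram
// collection for the whole cluster. The setting names are assumptions,
// not confirmed anywhere in this thread.
func disableAutoStats(ctx context.Context, db *sql.DB) error {
	for _, stmt := range []string{
		`SET CLUSTER SETTING sql.stats.automatic_collection.enabled = false`,
		`SET CLUSTER SETTING sql.stats.histogram_collection.enabled = false`,
	} {
		if _, err := db.ExecContext(ctx, stmt); err != nil {
			return err
		}
	}
	return nil
}
```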
Ok thanks! I'll investigate the memory accounting of stats collection on Monday (unless there is some urgency to do it sooner...). |
Thanks! I don't think it's urgent since it seems to me that generally we're accounting for "permanently" used memory while automatically collecting stats, but I'm concerned about "temporary" allocations that could spike - I just don't know how big the spikes can be. |
(roachtest).alterpk-tpcc failed on master@82fec00c83d4bfe35b906264ccb568568cec15b7:
Artifacts: /alterpk-tpcc
See this test on roachdash |
(roachtest).alterpk-tpcc failed on master@b1a0b989bbfef500075a485edc762fe42ca7b32a:
Artifacts: /alterpk-tpcc
See this test on roachdash |
(roachtest).alterpk-tpcc failed on master@beac4a53e0e2e2236eb5957f67abc1bf476ad1b6:
Artifacts: /alterpk-tpcc
See this test on roachdash |
Prior to this commit, we did not account for the memory used in the sampleAggregator when we copy all samples into a new slice before generating a histogram. This commit adds some additional memory accounting for this overhead. Informs cockroachdb#45812 Release note: None
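The fix described in the commit message above amounts to growing a memory account by the size of the extra slice before copying the samples out for histogram generation. Below is a minimal sketch of that pattern, using a hypothetical `memAccount` interface in the style of a BoundAccount rather than the real pkg/util/mon types or the actual sampleAggregator code.

```go
package stats

import (
	"context"
	"unsafe"
)

// sample stands in for the row samples kept by the sample aggregator.
type sample struct {
	rank uint64
	row  []byte
}

// memAccount is a hypothetical memory account: Grow reserves bytes (and
// errors out if the budget is exceeded), Shrink returns them.
type memAccount interface {
	Grow(ctx context.Context, n int64) error
	Shrink(ctx context.Context, n int64)
}

// copySamplesForHistogram shows the accounting pattern: reserve memory for
// the new slice before allocating it, and release it once the histogram has
// been built. Without the Grow call, this temporary copy is invisible to the
// memory monitor, which is exactly the kind of unaccounted spike discussed
// earlier in this thread.
func copySamplesForHistogram(ctx context.Context, acc memAccount, samples []sample) ([]sample, func(), error) {
	extra := int64(len(samples)) * int64(unsafe.Sizeof(sample{}))
	if err := acc.Grow(ctx, extra); err != nil {
		return nil, nil, err // budget exceeded: fail instead of OOMing
	}
	out := make([]sample, len(samples))
	copy(out, samples)
	release := func() { acc.Shrink(ctx, extra) }
	return out, release, nil
}
```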
Unfortunately, this crash seems like a regression. Using 19.2.5, I created a 3 node cluster with default node sizes, imported 500 warehouses, waited for auto stats to be collected, and then ran the first query from the 3.3.2.6 check manually twice; both times it succeeded in about 4 minutes. Then I stopped the cluster and restarted it using the current master, but when I executed the query, it crashed on the first try. This issue deserves more investigation. cc @jordanlewis |
(roachtest).alterpk-tpcc failed on master@2032dafccfa311c7538960e974953cb9dc1d4e50:
Artifacts: /alterpk-tpcc
See this test on roachdash |
47106: rowexec: account for some additional memory used by stats collection r=rytaft a=rytaft Prior to this commit, we did not account for the memory used in the `sampleAggregator` when we copy all samples into a new slice before generating a histogram. This commit adds some additional memory accounting for this overhead. Informs #45812 Release note: None Co-authored-by: Rebecca Taft <[email protected]>
48342: release-19.2: rowexec: account for some additional memory used by stats collection r=rytaft a=rytaft Backport 1/1 commits from #47106. /cc @cockroachdb/release --- Prior to this commit, we did not account for the memory used in the `sampleAggregator` when we copy all samples into a new slice before generating a histogram. This commit adds some additional memory accounting for this overhead. Informs #45812 Release note: None Co-authored-by: Rebecca Taft <[email protected]>
(roachtest).alterpk-tpcc failed on master@954fe69d554162aec0fbc001aad1fe5103d8df13:
Artifacts: /alterpk-tpcc
See this test on roachdash
powered by pkg/cmd/internal/issues