roachtest: alterpk-tpcc failed #45812

Closed · cockroach-teamcity opened this issue Mar 6, 2020 · 64 comments · Fixed by #47209
Assignees: rohany
Labels: branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.
Milestone: 20.1

Comments

@cockroach-teamcity
Member

(roachtest).alterpk-tpcc failed on master@954fe69d554162aec0fbc001aad1fe5103d8df13:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20200306-1790595/alterpk-tpcc/run_1
	test_runner.go:756: test timed out (10h0m0s)

More

Artifacts: /alterpk-tpcc

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Mar 6, 2020
@cockroach-teamcity cockroach-teamcity added this to the 20.1 milestone Mar 6, 2020
@cockroach-teamcity
Member Author

(roachtest).alterpk-tpcc failed on master@752dea867f3aeb142e98c22f8d320ce19041aa8d:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20200307-1793930/alterpk-tpcc/run_1
	test_runner.go:756: test timed out (10h0m0s)

More

Artifacts: /alterpk-tpcc

See this test on roachdash
powered by pkg/cmd/internal/issues

@rohany rohany assigned rohany and unassigned andreimatei Mar 7, 2020
@cockroach-teamcity
Member Author

(roachtest).alterpk-tpcc failed on master@dfa5bd527ae7d7373dd03c62118df87a87a77130:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20200308-1795062/alterpk-tpcc/run_1
	test_runner.go:756: test timed out (10h0m0s)

More

Artifacts: /alterpk-tpcc

See this test on roachdash
powered by pkg/cmd/internal/issues

@rohany
Contributor

rohany commented Mar 9, 2020

cc @lucy-zhang: is this failure related to the new schema change + jobs PR? Looking at the logs, it seems like there is some deadlock while trying to run the schema change. Each node has these repeated logs:

I200308 08:44:55.878309 181 sql/schema_changer.go:973  [n1,scExec] schema change on customer (60 v11) mutation 2 starting execution...
I200308 08:44:55.878579 181 sql/schema_changer.go:1025  [n1,scExec] schema change on customer (60 v11) mutation 2: another node is currently operating on this table

and no progress is made for 10 hours.

@thoszhang
Contributor

954fe69 doesn't have those changes (and I rolled them back on master anyway, for the time being, to deal with a problem with migrations).

That log message is coming from here:

"schema change on %s (%d v%d) mutation %d: another node is currently operating on this table",

So it's a problem with trying to acquire a schema change lease, which should hopefully be resolved once I do merge that PR. (I don't know what caused the deadlock, though.) Let's stress this test again once that PR is merged.
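For readers unfamiliar with the old lease scheme, here is a minimal, self-contained sketch of how a table-scoped schema change lease can wedge every node if the holder goes away without releasing it and nothing expires the record; the types and function names below are hypothetical and do not correspond to pkg/sql/schema_changer.go.

```go
// Illustrative sketch only: a table-level lease record that is never released
// or expired blocks every other node's schema change attempt forever, which
// produces exactly the repeated "another node is currently operating on this
// table" log pattern seen above.
package main

import (
	"errors"
	"fmt"
	"time"
)

type tableLease struct {
	holderNodeID int
	expiration   time.Time
}

// leases stands in for the lease record stored on the table descriptor.
var leases = map[string]*tableLease{}

var errExistingLease = errors.New("another node is currently operating on this table")

func acquireLease(table string, nodeID int, now time.Time) error {
	if l, ok := leases[table]; ok && now.Before(l.expiration) {
		// A live (or never-expired) lease blocks everyone else.
		return errExistingLease
	}
	leases[table] = &tableLease{holderNodeID: nodeID, expiration: now.Add(5 * time.Minute)}
	return nil
}

func main() {
	now := time.Now()
	// Node 2 takes the lease and then "dies" without releasing it; every
	// retry on every other node then logs the same message.
	_ = acquireLease("customer", 2, now)
	for i := 0; i < 3; i++ {
		if err := acquireLease("customer", 1, now.Add(time.Duration(i)*time.Second)); err != nil {
			fmt.Printf("n1: schema change on customer: %v\n", err)
		}
	}
}
```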

@cockroach-teamcity
Member Author

(roachtest).alterpk-tpcc failed on master@c473f40078994551cebcbe00fdbf1fa388957658:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20200309-1796240/alterpk-tpcc/run_1
	test_runner.go:756: test timed out (10h0m0s)

More

Artifacts: /alterpk-tpcc

See this test on roachdash
powered by pkg/cmd/internal/issues

@rohany
Contributor

rohany commented Mar 9, 2020

I'm investigating this issue a little more and don't understand what's happening in the logs.

Here we see that a backfill has started for our index:

I200309 07:40:00.803524 188887 sql/backfill.go:176  [n1,client=35.237.131.38:45106,hostnossl,user=root,scExec] Running backfill for "customer", v=11, m=2

However, after that, all of the nodes start trying to grab the lease.

Then, we see that validation for the index backfills has completed:

I200309 08:35:16.147475 1091744 sql/backfill.go:1332  [n1,client=35.237.131.38:45106,hostnossl,user=root,scExec] validation: index customer/customer_c_w_id_c_d_id_c_id_key row count = 15000000, time so far 20.41641794s
I200309 08:35:17.720733 1091745 sql/backfill.go:1332  [n1,client=35.237.131.38:45106,hostnossl,user=root,scExec] validation: index customer/customer_idx_rewrite_for_primary_key_change row count = 15000000, time so far 21.988191096s
I200309 08:35:18.303016 28 server/status/runtime.go:498  [n1] runtime stats: 8.7 GiB RSS, 485 goroutines, 224 MiB/5.1 GiB/1.6 GiB GO alloc/idle/total, 4.3 GiB/5.5 GiB CGO alloc/total, 2577.0 CGO/sec, 326.6/11.7 %(u/s)time, 0.0 %gc (19x), 781 MiB/3.5 MiB (r/w)net
I200309 08:35:21.004309 1091746 sql/backfill.go:1376  [n1,client=35.237.131.38:45106,hostnossl,user=root,scExec] validation: table customer row count = 15000000, took 25.276541782s

However, it doesn't look like we make it past SchemaChanger.done, because we then see these log messages:

I200309 07:41:06.716590 30 sql/schema_changer.go:973  [n1,scExec] schema change on customer (59 v11) mutation 2 starting execution...
I200309 07:41:06.728168 30 sql/schema_changer.go:1025  [n1,scExec] schema change on customer (59 v11) mutation 2: another node is currently operating on this table

But it doesn't look like we started to roll back the schema change either.
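One way to watch whether the mutation ever clears is to poll the schema change job state from outside the test; a small diagnostic sketch, assuming a reachable cluster, the lib/pq driver, and a placeholder connection string (this is not part of the roachtest itself):

```go
// Diagnostic sketch: periodically print unfinished schema-change jobs so it's
// obvious whether the mutation is stuck, progressing, or rolling back.
package main

import (
	"database/sql"
	"fmt"
	"log"
	"time"

	_ "github.com/lib/pq"
)

func main() {
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/tpcc?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	for range time.Tick(30 * time.Second) {
		rows, err := db.Query(
			`SELECT job_id, status, COALESCE(fraction_completed, 0)
			   FROM [SHOW JOBS]
			  WHERE job_type = 'SCHEMA CHANGE' AND status NOT IN ('succeeded', 'failed', 'canceled')`)
		if err != nil {
			log.Fatal(err)
		}
		for rows.Next() {
			var id int64
			var status string
			var frac float64
			if err := rows.Scan(&id, &status, &frac); err != nil {
				log.Fatal(err)
			}
			fmt.Printf("job %d: %s (%.2f%%)\n", id, status, frac*100)
		}
		rows.Close()
	}
}
```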

@thoszhang
Contributor

@rohany Now that the new schema change job is merged, I think we should wait a few days and see if we get any more failures. The thing that's happening here (being unable to acquire a schema change lease) shouldn't be possible anymore, since schema change leases won't exist. That said, there may be some deeper underlying problem that would also cause trouble for the new schema change job.

@cockroach-teamcity
Member Author

(roachtest).alterpk-tpcc failed on master@72c4a1bd411f2f82bf9aaa22883821a946614148:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20200310-1799071/alterpk-tpcc/run_1
	test_runner.go:779: test timed out (10h0m0s)

More

Artifacts: /alterpk-tpcc

See this test on roachdash
powered by pkg/cmd/internal/issues

@rohany
Contributor

rohany commented Mar 10, 2020

I think the failure above still doesn't include the new schema change jobs commit.

@cockroach-teamcity
Member Author

(roachtest).alterpk-tpcc failed on master@5570c01402796edb7cd06eb8ce7f615371f22d42:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20200311-1801614/alterpk-tpcc/run_1
	test_runner.go:779: test timed out (10h0m0s)

More

Artifacts: /alterpk-tpcc

See this test on roachdash
powered by pkg/cmd/internal/issues

rohany added a commit to rohany/cockroach that referenced this issue Mar 11, 2020
This PR adds some extra logging to the backfiller
and alterpk roachtest to have more insight on where cockroachdb#45812
is failing.

Release justification: non production code change
Release note: None
@rohany
Contributor

rohany commented Mar 11, 2020

It's unclear from these logs what exactly is going wrong, though the situation still looks similar to the logs from before the schema change jobs PR. One thing to note is that 500 warehouses is probably too large to run on a 3-node, 4-CPU cluster, but that doesn't explain the hang.

craig bot pushed a commit that referenced this issue Mar 11, 2020
46001: roachtest: add extra logging to alterpk-tpcc roachtest r=yuzefovich a=rohany

This PR adds some extra logging to the backfiller
and alterpk roachtest to have more insight on where #45812
is failing.

Release justification: non production code change
Release note: None

Co-authored-by: Rohan Yadav <[email protected]>
@cockroach-teamcity
Member Author

(roachtest).alterpk-tpcc failed on master@793a9200c16693aff32aa6a4dd9d8bbcbddb30aa:

		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/alterpk.go:162
		    main.(*testRunner).runTest.func2
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:747
		    runtime.goexit
		    	/usr/local/go/src/runtime/asm_amd64.s:1357
		  - error with embedded safe details: output in %s
		    -- arg 1: <string>
		  - output in run_082527.690_n4_workload_check_tpcc:
		  - error with attached stack trace:
		    main.execCmd
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:406
		    main.(*cluster).RunL
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2019
		    main.(*cluster).RunE
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2000
		    main.(*cluster).Run
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:1933
		    main.registerAlterPK.func3
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/alterpk.go:162
		    main.(*testRunner).runTest.func2
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:747
		    runtime.goexit
		    	/usr/local/go/src/runtime/asm_amd64.s:1357
		  - error with embedded safe details: %s returned:
		    stderr:
		    %s
		    stdout:
		    %s
		    -- arg 1: <string>
		    -- arg 2: <string>
		    -- arg 3: <string>
		  - /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1804460-1583998462-17-n4cpu16:4 -- ./workload check tpcc --warehouses 500 --expensive-checks {pgurl:1} returned:
		    stderr:
		    I200312 08:25:29.248938 1 workload/tpcc/tpcc.go:372  check 3.3.2.1 took 27.383279ms
		    I200312 08:25:37.613422 1 workload/tpcc/tpcc.go:372  check 3.3.2.2 took 8.364418946s
		    I200312 08:25:39.721125 1 workload/tpcc/tpcc.go:372  check 3.3.2.3 took 2.10765782s
		    I200312 08:26:23.891371 1 workload/tpcc/tpcc.go:372  check 3.3.2.4 took 44.170123384s
		    I200312 08:28:16.731792 1 workload/tpcc/tpcc.go:372  check 3.3.2.5 took 1m52.840272696s
		    I200312 08:35:42.761535 1 workload/tpcc/tpcc.go:372  check 3.3.2.6 took 7m26.029484569s
		    Error: check failed: 3.3.2.6: driver: bad connection
		    Error:  exit status 1
		    
		    stdout::
		  - exit status 1

	cluster.go:1410,context.go:135,cluster.go:1399,test_runner.go:801: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1804460-1583998462-17-n4cpu16 --oneshot --ignore-empty-nodes: exit status 1 4: skipped
		3: 4164
		2: 4753
		1: dead
		Error:  1: dead

More

Artifacts: /alterpk-tpcc

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity
Member Author

(roachtest).alterpk-tpcc failed on master@69dc87d68addedf2fabfb2b14c098cfb35b5f3d0:

		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/alterpk.go:162
		    main.(*testRunner).runTest.func2
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:753
		    runtime.goexit
		    	/usr/local/go/src/runtime/asm_amd64.s:1357
		  - error with embedded safe details: output in %s
		    -- arg 1: <string>
		  - output in run_074140.622_n4_workload_check_tpcc:
		  - error with attached stack trace:
		    main.execCmd
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:406
		    main.(*cluster).RunL
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2019
		    main.(*cluster).RunE
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2000
		    main.(*cluster).Run
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:1933
		    main.registerAlterPK.func3
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/alterpk.go:162
		    main.(*testRunner).runTest.func2
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:753
		    runtime.goexit
		    	/usr/local/go/src/runtime/asm_amd64.s:1357
		  - error with embedded safe details: %s returned:
		    stderr:
		    %s
		    stdout:
		    %s
		    -- arg 1: <string>
		    -- arg 2: <string>
		    -- arg 3: <string>
		  - /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1806460-1584083876-17-n4cpu16:4 -- ./workload check tpcc --warehouses 500 --expensive-checks {pgurl:1} returned:
		    stderr:
		    I200313 07:41:42.211690 1 workload/tpcc/tpcc.go:372  check 3.3.2.1 took 34.938153ms
		    I200313 07:41:49.842967 1 workload/tpcc/tpcc.go:372  check 3.3.2.2 took 7.631204223s
		    I200313 07:41:51.237037 1 workload/tpcc/tpcc.go:372  check 3.3.2.3 took 1.393985549s
		    I200313 07:43:56.072945 1 workload/tpcc/tpcc.go:372  check 3.3.2.4 took 2m4.835669018s
		    I200313 07:45:41.145017 1 workload/tpcc/tpcc.go:372  check 3.3.2.5 took 1m45.07188352s
		    I200313 09:04:01.195185 1 workload/tpcc/tpcc.go:372  check 3.3.2.6 took 1h18m20.049886518s
		    Error: check failed: 3.3.2.6: driver: bad connection
		    Error:  exit status 1
		    
		    stdout::
		  - exit status 1

	cluster.go:1410,context.go:135,cluster.go:1399,test_runner.go:807: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1806460-1584083876-17-n4cpu16 --oneshot --ignore-empty-nodes: exit status 1 4: skipped
		2: 5069
		3: 4474
		1: dead
		Error:  1: dead

More

Artifacts: /alterpk-tpcc

See this test on roachdash
powered by pkg/cmd/internal/issues

@rohany
Contributor

rohany commented Mar 14, 2020

Hmm, the latest failure just looks like a random node failure. We haven't seen the deadlock in the past two runs...

@yuzefovich
Member

Eh, the OOM can be reliably reproduced on a 3-node roachprod cluster with default n1-standard-4 nodes after importing 500 warehouses and running the 3.3.2.6 check query manually (so the primary key alteration is not needed). I added more logging to try to understand where we're not accounting for memory properly, but I still don't see it.

Here is an EXPLAIN ANALYZE of the query (from an occasion when it didn't crash), for context on the amount of data flying around:
[Screenshot: EXPLAIN ANALYZE output, 2020-04-03]

We seem to be estimating the memory usage of the aggregators relatively well now. For example, here is one of the heap profiles taken when the first-stage aggregator finishes and the second one is about to:
[Screenshot: heap profile, 2020-04-03]
And at about this time in the logs we have:

[n2,client=127.0.0.1:56166,hostnossl,user=root] aggregator-mem-5: bytes usage increases to 1.0 GiB
[n2,client=127.0.0.1:56166,hostnossl,user=root] aggregator-funcs-mem-3, bytes usage max 170 MiB
[n2,client=127.0.0.1:56166,hostnossl,user=root] aggregator-mem-3, bytes usage max 1.9 GiB
[n2,client=127.0.0.1:56166,hostnossl,user=root] aggregator-funcs-mem-5: bytes usage increases to 128 MiB (+10240)

In all of the crashes I've observed, we successfully get to the point where the hash join is being performed and actually get pretty far into the join's execution. However, RSS jumps quite significantly (say from 13 GiB RSS, 203 goroutines, 7.3 GiB/2.5 GiB/10 GiB GO alloc/idle/total to 14 GiB RSS, 201 goroutines, 9.2 GiB/720 MiB/10 GiB GO alloc/idle/total in 10 seconds), but this jump is not reflected in the heap profiles. It is possible that the OS reports RSS with a delay, but I'm very puzzled by this.
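One way to narrow down where the extra memory lives is to log the Go runtime's own view of the heap side by side with the OS-reported RSS; a gap that grows between the two points at non-Go-heap memory (e.g. CGo/RocksDB) or at memory the runtime has released but the OS hasn't reclaimed yet. A rough Linux-only sketch (the 3-second interval and the use of /proc/self/status are choices of this sketch, not anything the test does):

```go
// Rough diagnostic sketch: compare Go runtime heap stats against process RSS.
package main

import (
	"bufio"
	"fmt"
	"os"
	"runtime"
	"strings"
	"time"
)

// rssKB reads the resident set size, in kilobytes, from /proc/self/status.
func rssKB() (int64, error) {
	f, err := os.Open("/proc/self/status")
	if err != nil {
		return 0, err
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		if strings.HasPrefix(sc.Text(), "VmRSS:") {
			var kb int64
			fmt.Sscanf(sc.Text(), "VmRSS: %d kB", &kb)
			return kb, nil
		}
	}
	return 0, sc.Err()
}

func main() {
	for range time.Tick(3 * time.Second) {
		var ms runtime.MemStats
		runtime.ReadMemStats(&ms)
		rss, err := rssKB()
		if err != nil {
			fmt.Println("rss:", err)
			continue
		}
		fmt.Printf("go heap alloc=%d MiB, heap idle=%d MiB, sys=%d MiB, os rss=%d MiB\n",
			ms.HeapAlloc>>20, ms.HeapIdle>>20, ms.Sys>>20, rss>>10)
	}
}
```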

I'm tired of looking at this issue and am wondering whether someone else should take a stab at it. Here are a few suggestions (a small Go sketch of step 2 follows this list):

  1. I added some logging and pushed the changes to https://github.com/yuzefovich/cockroach/tree/debug-alterpk-tpcc
  2. Before running the query I disable auto stats and set some verbosity (on each node):
     set cluster setting sql.stats.automatic_collection.enabled=false;
     select crdb_internal.set_vmodule('hashjoiner=1,bytes_usage=1,hash_row_container=2,row_based_flow=1');
  3. I start the Cockroach cluster with a custom memory profiling interval:
     roachprod start yahor-test:1-3 --env="COCKROACH_MEMPROF_INTERVAL=3s"
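Strung together, step 2 looks roughly like the following driver. This is a sketch: the connection strings are placeholders, and the 3.3.2.6 check query itself (which lives in workload/tpcc) is not reproduced here.

```go
// Sketch of the per-node setup described above: disable auto stats and raise
// vmodule verbosity on every node before kicking off the check query manually.
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq"
)

func main() {
	nodes := []string{
		"postgresql://root@node1:26257/tpcc?sslmode=disable",
		"postgresql://root@node2:26257/tpcc?sslmode=disable",
		"postgresql://root@node3:26257/tpcc?sslmode=disable",
	}
	stmts := []string{
		`SET CLUSTER SETTING sql.stats.automatic_collection.enabled = false`,
		`SELECT crdb_internal.set_vmodule('hashjoiner=1,bytes_usage=1,hash_row_container=2,row_based_flow=1')`,
	}
	for _, url := range nodes {
		db, err := sql.Open("postgres", url)
		if err != nil {
			log.Fatal(err)
		}
		for _, stmt := range stmts {
			if _, err := db.Exec(stmt); err != nil {
				log.Fatalf("%s: %v", url, err)
			}
		}
		db.Close()
	}
	// The check itself is then run from one node, e.g. via
	// `./workload check tpcc --warehouses 500 --expensive-checks`.
}
```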

@rytaft
Collaborator

rytaft commented Apr 4, 2020

Hi @yuzefovich -- sorry for my slow reply to you earlier -- since you've disabled auto stats, does this mean you've ruled out histograms as the cause? (You can also explicitly disable histograms using set cluster setting sql.stats.histogram_collection.enabled = false;.)

@yuzefovich
Member

I still think there is memory misaccounting during automatic stats collection, but the crashes still occur (seemingly with lower likelihood) with auto stats disabled, so I don't think that's the root cause of these OOMs.

@rytaft
Collaborator

rytaft commented Apr 4, 2020

Ok thanks! I'll investigate the memory accounting of stats collection on Monday (unless there is some urgency to do it sooner...).

@yuzefovich
Member

Thanks! I don't think it's urgent since it seems to me that generally we're accounting for "permanently" used memory while automatically collecting stats, but I'm concerned about "temporary" allocations that could spike - I just don't know how big the spikes can be.

@cockroach-teamcity
Member Author

(roachtest).alterpk-tpcc failed on master@82fec00c83d4bfe35b906264ccb568568cec15b7:

		    main.registerAlterPK.func3
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/alterpk.go:160
		    main.(*testRunner).runTest.func2
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:753
		    runtime.goexit
		    	/usr/local/go/src/runtime/asm_amd64.s:1357
		  - error with embedded safe details: output in %s
		    -- arg 1: <string>
		  - output in run_082033.550_n4_workload_check_tpcc:
		  - error with attached stack trace:
		    main.execCmd
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:406
		    main.(*cluster).RunL
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2019
		    main.(*cluster).RunE
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2000
		    main.(*cluster).Run
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:1933
		    main.registerAlterPK.func3
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/alterpk.go:160
		    main.(*testRunner).runTest.func2
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:753
		    runtime.goexit
		    	/usr/local/go/src/runtime/asm_amd64.s:1357
		  - error with embedded safe details: %s returned:
		    stderr:
		    %s
		    stdout:
		    %s
		    -- arg 1: <string>
		    -- arg 2: <string>
		    -- arg 3: <string>
		  - /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1851668-1585985862-17-n4cpu16:4 -- ./workload check tpcc --warehouses 500 --expensive-checks {pgurl:1} returned:
		    stderr:
		    I200404 08:20:35.396562 1 workload/tpcc/tpcc.go:372  check 3.3.2.1 took 94.007016ms
		    I200404 08:20:43.585080 1 workload/tpcc/tpcc.go:372  check 3.3.2.2 took 8.188439543s
		    I200404 08:20:46.075089 1 workload/tpcc/tpcc.go:372  check 3.3.2.3 took 2.489950798s
		    I200404 08:22:28.782320 1 workload/tpcc/tpcc.go:372  check 3.3.2.4 took 1m42.707109563s
		    I200404 08:24:41.019144 1 workload/tpcc/tpcc.go:372  check 3.3.2.5 took 2m12.236664801s
		    I200404 08:32:19.429694 1 workload/tpcc/tpcc.go:372  check 3.3.2.6 took 7m38.410327593s
		    Error: check failed: 3.3.2.6: driver: bad connection
		    Error: DEAD_ROACH_PROBLEM:
		      - error with user detail: Node 4. Command with error:
		        ```
		        ./workload check tpcc --warehouses 500 --expensive-checks {pgurl:1}
		        ```
		      - exit status 1
		    
		    stdout::
		  - exit status 30

More

Artifacts: /alterpk-tpcc
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity
Member Author

(roachtest).alterpk-tpcc failed on master@b1a0b989bbfef500075a485edc762fe42ca7b32a:

		    main.registerAlterPK.func3
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/alterpk.go:160
		    main.(*testRunner).runTest.func2
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:753
		    runtime.goexit
		    	/usr/local/go/src/runtime/asm_amd64.s:1357
		  - error with embedded safe details: output in %s
		    -- arg 1: <string>
		  - output in run_080522.569_n4_workload_check_tpcc:
		  - error with attached stack trace:
		    main.execCmd
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:406
		    main.(*cluster).RunL
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2019
		    main.(*cluster).RunE
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2000
		    main.(*cluster).Run
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:1933
		    main.registerAlterPK.func3
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/alterpk.go:160
		    main.(*testRunner).runTest.func2
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:753
		    runtime.goexit
		    	/usr/local/go/src/runtime/asm_amd64.s:1357
		  - error with embedded safe details: %s returned:
		    stderr:
		    %s
		    stdout:
		    %s
		    -- arg 1: <string>
		    -- arg 2: <string>
		    -- arg 3: <string>
		  - /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1853455-1586071414-17-n4cpu16:4 -- ./workload check tpcc --warehouses 500 --expensive-checks {pgurl:1} returned:
		    stderr:
		    I200405 08:05:24.183469 1 workload/tpcc/tpcc.go:372  check 3.3.2.1 took 53.342763ms
		    I200405 08:05:31.785387 1 workload/tpcc/tpcc.go:372  check 3.3.2.2 took 7.601842022s
		    I200405 08:05:34.392096 1 workload/tpcc/tpcc.go:372  check 3.3.2.3 took 2.606617715s
		    I200405 08:06:40.709953 1 workload/tpcc/tpcc.go:372  check 3.3.2.4 took 1m6.317813813s
		    I200405 08:08:40.770217 1 workload/tpcc/tpcc.go:372  check 3.3.2.5 took 2m0.060106663s
		    I200405 08:16:52.169000 1 workload/tpcc/tpcc.go:372  check 3.3.2.6 took 8m11.398593236s
		    Error: check failed: 3.3.2.6: pq: communication error: rpc error: code = Canceled desc = context canceled
		    Error: DEAD_ROACH_PROBLEM:
		      - error with user detail: Node 4. Command with error:
		        ```
		        ./workload check tpcc --warehouses 500 --expensive-checks {pgurl:1}
		        ```
		      - exit status 1
		    
		    stdout::
		  - exit status 30

More

Artifacts: /alterpk-tpcc
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity
Member Author

(roachtest).alterpk-tpcc failed on master@beac4a53e0e2e2236eb5957f67abc1bf476ad1b6:

		    main.registerAlterPK.func3
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/alterpk.go:160
		    main.(*testRunner).runTest.func2
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:753
		    runtime.goexit
		    	/usr/local/go/src/runtime/asm_amd64.s:1357
		  - error with embedded safe details: output in %s
		    -- arg 1: <string>
		  - output in run_075641.938_n4_workload_check_tpcc:
		  - error with attached stack trace:
		    main.execCmd
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:406
		    main.(*cluster).RunL
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2019
		    main.(*cluster).RunE
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2000
		    main.(*cluster).Run
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:1933
		    main.registerAlterPK.func3
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/alterpk.go:160
		    main.(*testRunner).runTest.func2
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:753
		    runtime.goexit
		    	/usr/local/go/src/runtime/asm_amd64.s:1357
		  - error with embedded safe details: %s returned:
		    stderr:
		    %s
		    stdout:
		    %s
		    -- arg 1: <string>
		    -- arg 2: <string>
		    -- arg 3: <string>
		  - /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1854541-1586157738-17-n4cpu16:4 -- ./workload check tpcc --warehouses 500 --expensive-checks {pgurl:1} returned:
		    stderr:
		    I200406 07:56:43.703421 1 workload/tpcc/tpcc.go:372  check 3.3.2.1 took 66.908502ms
		    I200406 07:56:49.005046 1 workload/tpcc/tpcc.go:372  check 3.3.2.2 took 5.30152617s
		    I200406 07:56:51.492729 1 workload/tpcc/tpcc.go:372  check 3.3.2.3 took 2.487643063s
		    I200406 07:58:07.460154 1 workload/tpcc/tpcc.go:372  check 3.3.2.4 took 1m15.967279962s
		    I200406 08:00:05.841061 1 workload/tpcc/tpcc.go:372  check 3.3.2.5 took 1m58.380748316s
		    I200406 11:12:44.203253 1 workload/tpcc/tpcc.go:372  check 3.3.2.6 took 3h12m38.361899276s
		    Error: check failed: 3.3.2.6: driver: bad connection
		    Error: DEAD_ROACH_PROBLEM:
		      - error with user detail: Node 4. Command with error:
		        ```
		        ./workload check tpcc --warehouses 500 --expensive-checks {pgurl:1}
		        ```
		      - exit status 1
		    
		    stdout::
		  - exit status 30

More

Artifacts: /alterpk-tpcc
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

rytaft added a commit to rytaft/cockroach that referenced this issue Apr 7, 2020
Prior to this commit, we did not account for the memory used in the
sampleAggregator when we copy all samples into a new slice before
generating a histogram. This commit adds some additional memory
accounting for this overhead.

Informs cockroachdb#45812

Release note: None
rytaft added a commit to rytaft/cockroach that referenced this issue Apr 7, 2020
Prior to this commit, we did not account for the memory used in the
sampleAggregator when we copy all samples into a new slice before
generating a histogram. This commit adds some additional memory
accounting for this overhead.

Informs cockroachdb#45812

Release note: None
rytaft added a commit to rytaft/cockroach that referenced this issue Apr 7, 2020
Prior to this commit, we did not account for the memory used in the
sampleAggregator when we copy all samples into a new slice before
generating a histogram. This commit adds some additional memory
accounting for this overhead.

Informs cockroachdb#45812

Release note: None
rytaft added a commit to rytaft/cockroach that referenced this issue Apr 7, 2020
Prior to this commit, we did not account for the memory used in the
sampleAggregator when we copy all samples into a new slice before
generating a histogram. This commit adds some additional memory
accounting for this overhead.

Informs cockroachdb#45812

Release note: None
@yuzefovich
Member

Unfortunately, this crash looks like a regression. Using 19.2.5, I created a 3-node cluster with default node sizes, imported 500 warehouses, waited for auto stats to be collected, and then ran the first query from the 3.3.2.6 check manually twice; both times it succeeded in about 4 minutes. Then I stopped the cluster and restarted it on the current master, but when I executed the query, it crashed on the first try. This deserves more investigation. cc @jordanlewis

@cockroach-teamcity
Member Author

(roachtest).alterpk-tpcc failed on master@2032dafccfa311c7538960e974953cb9dc1d4e50:

		    main.registerAlterPK.func3
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/alterpk.go:160
		    main.(*testRunner).runTest.func2
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:753
		    runtime.goexit
		    	/usr/local/go/src/runtime/asm_amd64.s:1357
		  - error with embedded safe details: output in %s
		    -- arg 1: <string>
		  - output in run_081029.241_n4_workload_check_tpcc:
		  - error with attached stack trace:
		    main.execCmd
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:406
		    main.(*cluster).RunL
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2019
		    main.(*cluster).RunE
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2000
		    main.(*cluster).Run
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:1933
		    main.registerAlterPK.func3
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/alterpk.go:160
		    main.(*testRunner).runTest.func2
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:753
		    runtime.goexit
		    	/usr/local/go/src/runtime/asm_amd64.s:1357
		  - error with embedded safe details: %s returned:
		    stderr:
		    %s
		    stdout:
		    %s
		    -- arg 1: <string>
		    -- arg 2: <string>
		    -- arg 3: <string>
		  - /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1857109-1586244496-17-n4cpu16:4 -- ./workload check tpcc --warehouses 500 --expensive-checks {pgurl:1} returned:
		    stderr:
		    I200407 08:10:30.874469 1 workload/tpcc/tpcc.go:372  check 3.3.2.1 took 45.088823ms
		    I200407 08:10:41.353822 1 workload/tpcc/tpcc.go:372  check 3.3.2.2 took 10.479249493s
		    I200407 08:10:43.551308 1 workload/tpcc/tpcc.go:372  check 3.3.2.3 took 2.19743782s
		    I200407 08:11:54.442385 1 workload/tpcc/tpcc.go:372  check 3.3.2.4 took 1m10.891018778s
		    I200407 08:13:56.825404 1 workload/tpcc/tpcc.go:372  check 3.3.2.5 took 2m2.382769241s
		    I200407 08:17:27.545161 1 workload/tpcc/tpcc.go:372  check 3.3.2.6 took 3m30.719600577s
		    Error: check failed: 3.3.2.6: driver: bad connection
		    Error: DEAD_ROACH_PROBLEM:
		      - error with user detail: Node 4. Command with error:
		        ```
		        ./workload check tpcc --warehouses 500 --expensive-checks {pgurl:1}
		        ```
		      - exit status 1
		    
		    stdout::
		  - exit status 30

More

Artifacts: /alterpk-tpcc
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@yuzefovich
Member

Here is the EXPLAIN ANALYZE of the query on version 19.2.5:
https://cockroachdb.github.io/distsqlplan/decode.html#eJzEmG1P20oWx9_vp7DmFaxMPOfM44m0EtyWvYuWJl2g2tvdi6IU-0LUJGZto1tU9buv7ISS2GMzeSh904LtmfnPf86c3zl8Zfn_pqzPTn97f35yNghOBifnH_9zGhy8Pbu8uvzX-WFwcHl6fvrmKkhHf44mcRiko3j5_-LfdDq6mRfB3y-G74LfWZrFSfY7C4YXb08vgl8-OocFb08v3xwGp7-9OX1_FZycnz8vMn36fPr0_XS0WOkmfZgXB389XKxUrTOaTuZJ8OvF8MP7cqmD1tGHK3ravlmIYiGbp3EyGM-SnPX_y4CFDFnIBLsO2X2W3iR5nmblq6_Vh2fxF9bnIZvM7x-K8vF1yG7SLGH9r6yYFNOE9dnV-NM0uUjGcZJFnIUsTorxZFpNf59NZuPs8bjaDgvZ5f14nveDo4hHEAnOeTCex4EM0uIuyXIWsuFD0Q-ORXiM4TGEx4aFLEv_zIMsGcf9QEnJSZR682I8nQbFZJb0A4CeRSm1Kif49FgkT5-jNMG7yS_s-lvI0odiuYNn4Z8eg7txfrcuuVwWw1KCZNffrqulbhPWh2_hdpaI9fmfT3aUF-nN59Efn0eT-Mvx8_NVnyhCbgRGFJlIII90ZRjoumNL0et2SZLKKC5rfknbI6sMmLpf0LPBrw2_ni3ATSw4ub3NkttxkWaRdjvMQnYy-DgaDK9Ggw_n5wfHcNh4hM1Honz0ZvhhcDW6GP778uDQa9Mww54lZQ2Wu56NvwSzZJZmj8FDnsT9QAvaPlTW4kS0mvQ828O8Ou0kXpuvmqXDRtq_jZcf3o3OBlcHx_Kweflk7fIJSYrbmq84A-ihAEBwGyv5_u6g3MHbcCv7_zHO7y6TYngfAa5rvHq8T_orOZ6FbJr8URysaD_8Wza5vVt_VL-xC5vLkcGq15yDBf70Zi2QRU9IFFyWdlfzuwcuXq0flbA9gwplNTYv0iyJg3wSJ_2g-tp9ftXxLd7Ek_zz8rmyjmN9Pim131swSI_S-whU_Uyda-u1tcE_U4MXvL6z6yhCE0EkBVZJWfhSTKPWyhoXxZTV2KCYlT-KYht4o7anWGQoIjDEqx8i5EgRHkUAIpLcalP9RJGyOpILvCkvvCluFYKqZ6Qy05NBoWWTb7QB314wZyUx29fkW-uuwfbQGLDGnYeV_iGAgz0CDuCnE05achJOkLGvT7jNzN2ZcOIVCSeJDPI2wikLKMhJuOeBLYRDkBrglQmH_pkU_SgDWpc1_5IzOpIRKKAX4SKQUFqoxbDqkdYSdSMjyh3K3m62bOCI3IEtjQbpKBI6skIQL38wkSxps8BKg9AtXZMmDZbqFmIPBdfU6DI365pesGUlG5rX7Zqcm4aZ7FmllBQtXZP6IVDBfUKF_3SoKCnqxuIMVI9rkuq1mbKZtzszRb5u18SJ3EyBnhBKKd3WNS0HtjAFlDCLo_pJXZPD5Iskv0_neeLVE_HymJL4Nlkce54-ZDfJ-yy9qZZZ_DqsxlVVR5zkxeKtXPxyNn96lRfjau5ljKcPRbkxsIBAxumF5portE03gEurjOasijN_TcD9RSkDLlFCWuCm-itRTRT_nt7zZF70y2ZkeWQbSVTeEjVo6U6unEDh9gqhrpCvKsQ1gbw-GDu3Jzx3ZzRp49wdakkkqzqusb3GTrrFgG-MGm01d6pRigsufLwWbq-7FZKvQMNBkUug5Fbuok9scMmxVZ_WFix3BqspEyg67hNyMqDQcardmmqXvEuUUmBdoozVVmufO47buFa74522SXKmIdAkjN1Boey847UkVB-sOgeD7s4Qeh_cQE7ccmdIgSBDFrSHOdZtTrdCX4ogt0bplqgHro10Rb0EMCjtxqI8uYHcWkQnb0EbK610Ztaab8btm-mMC9sdFnYv4LBgOHeT2yihPTaH4N5ctz5vllhQVgpn1gGllU_UtqTqboG-KLEgkJxZUZCw1o-7tCdqaCOd90cKxRX4YK0l_XUL3AQhpNGpkARKzR3F68Z2bYALLdBZBWiyxnqVAS1-QaNq7eLFisLVyWXPBP90zt6oOLdqJThZC-6rZbkhn8tPPd1iQLdE786Ck-HaeUZgNJKUzoipi-TbifTtLThpRS3lpDKk3IVHU06juF2vE0Q3EKC7ivJvJdAKZ1QglxJbWokaE9oMf6HQ828wOJEzjwDXioT2yyMvyPHuJjRI6WQogSZNPoxqyyONunNbMlhykkEJEruA4QWBm5DBoLs8RctReP0FYUuNG_ACjLs-FdYK4brmQIRlmnKo6u4KfBFBS0Rcf_vL_wMAAP__8os0VQ==

craig bot pushed a commit that referenced this issue Apr 7, 2020
47106: rowexec: account for some additional memory used by stats collection r=rytaft a=rytaft

Prior to this commit, we did not account for the memory used in the
`sampleAggregator` when we copy all samples into a new slice before
generating a histogram. This commit adds some additional memory
accounting for this overhead.

Informs #45812

Release note: None

Co-authored-by: Rebecca Taft <[email protected]>
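The shape of the fix described in this PR is "reserve the memory before making the copy." Below is a generic illustration of that pattern; the boundAccount type and sampleSize constant are stand-ins invented for this sketch, not CockroachDB's mon.BoundAccount API or the actual sampleAggregator code.

```go
// Illustrative sketch of "account before you copy": before duplicating the
// samples into a new slice for histogram generation, reserve that memory from
// a budget so a small node fails the operation cleanly instead of OOMing.
package main

import (
	"errors"
	"fmt"
)

type sample struct{ value int64 }

const sampleSize = 8 // rough per-sample footprint used for accounting

type boundAccount struct{ used, limit int64 }

func (a *boundAccount) grow(n int64) error {
	if a.used+n > a.limit {
		return errors.New("memory budget exceeded")
	}
	a.used += n
	return nil
}

func copySamplesForHistogram(acc *boundAccount, samples []sample) ([]sample, error) {
	// Reserve the memory for the copy up front; when this copy is
	// unaccounted, the spike is invisible to the memory monitor.
	if err := acc.grow(int64(len(samples)) * sampleSize); err != nil {
		return nil, err
	}
	out := make([]sample, len(samples))
	copy(out, samples)
	return out, nil
}

func main() {
	acc := &boundAccount{limit: 1 << 20}
	samples := make([]sample, 10000)
	if _, err := copySamplesForHistogram(acc, samples); err != nil {
		fmt.Println("histogram generation aborted:", err)
		return
	}
	fmt.Println("copied", len(samples), "samples under budget")
}
```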
@craig craig bot closed this as completed in 27debe8 Apr 8, 2020
rytaft added a commit to rytaft/cockroach that referenced this issue May 4, 2020
Prior to this commit, we did not account for the memory used in the
sampleAggregator when we copy all samples into a new slice before
generating a histogram. This commit adds some additional memory
accounting for this overhead.

Informs cockroachdb#45812

Release note: None
rytaft added a commit to rytaft/cockroach that referenced this issue May 4, 2020
Prior to this commit, we did not account for the memory used in the
sampleAggregator when we copy all samples into a new slice before
generating a histogram. This commit adds some additional memory
accounting for this overhead.

Informs cockroachdb#45812

Release note: None
craig bot pushed a commit that referenced this issue May 4, 2020
48342: release-19.2: rowexec: account for some additional memory used by stats collection r=rytaft a=rytaft

Backport 1/1 commits from #47106.

/cc @cockroachdb/release

---

Prior to this commit, we did not account for the memory used in the
`sampleAggregator` when we copy all samples into a new slice before
generating a histogram. This commit adds some additional memory
accounting for this overhead.

Informs #45812

Release note: None


Co-authored-by: Rebecca Taft <[email protected]>