Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: tpchvec/disk failed #87510

Closed
cockroach-teamcity opened this issue Sep 7, 2022 · 5 comments
Closed

roachtest: tpchvec/disk failed #87510

cockroach-teamcity opened this issue Sep 7, 2022 · 5 comments
Assignees
Labels
branch-master Failures and bugs on the master branch. branch-release-22.2 Used to mark GA and release blockers, technical advisories, and bugs for 22.2 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-sql-queries SQL Queries Team
Milestone

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Sep 7, 2022

roachtest.tpchvec/disk failed with artifacts on master @ 2372698da1dfacb90f60c6a63f2c1298d1db16b8:

test artifacts and logs in: /artifacts/tpchvec/disk/run_1
	tpchvec.go:462,tpchvec.go:549,tpchvec.go:573,test_runner.go:897: Non-zero exit code: 1

	test_runner.go:1028,test_runner.go:927: test timed out (0s)

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/sql-queries

This test on roachdash | Improve this report!

Jira issue: CRDB-19386

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Sep 7, 2022
@cockroach-teamcity cockroach-teamcity added this to the 22.2 milestone Sep 7, 2022
@blathers-crl blathers-crl bot added the T-sql-queries SQL Queries Team label Sep 7, 2022
@DrewKimball
Copy link
Collaborator

From teardown.log in the artifacts:

Error: [q1]: dial tcp 10.142.1.11:26257: connect: connection refused
teardown: 16:57:01 test_impl.go:323: test failure: 	tpchvec.go:462,tpchvec.go:549,tpchvec.go:573,test_runner.go:897: Non-zero exit code: 1
teardown: 16:57:02 test_impl.go:323: [not the first failure] test failure: 
	test_runner.go:1028,test_runner.go:927: test timed out (0s)
teardown: 16:57:02 test_runner.go:796: [w6] ##teamcity[testFailed name='tpchvec/disk' details='test artifacts and logs in: /artifacts/tpchvec/disk/run_1|n	tpchvec.go:462,tpchvec.go:549,tpchvec.go:573,test_runner.go:897: Non-zero exit code: 1|n|n	test_runner.go:1028,test_runner.go:927: test timed out (0s)|n' flowId='tpchvec/disk']
teardown: 16:57:02 test_runner.go:800: [w6] --- FAIL: tpchvec/disk (36027.88s)
test artifacts and logs in: /artifacts/tpchvec/disk/run_1
	tpchvec.go:462,tpchvec.go:549,tpchvec.go:573,test_runner.go:897: Non-zero exit code: 1

	test_runner.go:1028,test_runner.go:927: test timed out (0s)

teardown: 16:57:06 test_runner.go:810: [w6] ##teamcity[testFinished name='tpchvec/disk' flowId='tpchvec/disk']

From test.log in the artifacts:

06:59:04 cluster.go:2025: > ./workload run tpch --concurrency=1 --db=tpch --default-vectorize --max-ops=1 --queries=1 {pgurl:1} --enable-checks=true
16:56:34 test_runner.go:913: test timed out after 10h0m0s; check __stacks.log and CRDB logs for goroutine dumps

Example from node 1 stack trace:

* goroutine 9061 [select, 597 minutes]:
* google.golang.org/grpc/internal/transport.(*recvBufferReader).read(0xc001bbdd10, {0xc008064eb0, 0x5, 0x5})
* 	google.golang.org/grpc/internal/transport/external/org_golang_google_grpc/internal/transport/transport.go:179 +0x90
* google.golang.org/grpc/internal/transport.(*recvBufferReader).Read(0xc001bbdd10, {0xc008064eb0?, 0xc0056c41c8?, 0xc004673160?})
* 	google.golang.org/grpc/internal/transport/external/org_golang_google_grpc/internal/transport/transport.go:173 +0x16f
* google.golang.org/grpc/internal/transport.(*transportReader).Read(0xc00716bce0, {0xc008064eb0?, 0xc0046731d8?, 0xbd15e7?})
* 	google.golang.org/grpc/internal/transport/external/org_golang_google_grpc/internal/transport/transport.go:485 +0x32
* io.ReadAtLeast({0x5ee7de0, 0xc00716bce0}, {0xc008064eb0, 0x5, 0x5}, 0x5)
* 	GOROOT/src/io/io.go:331 +0x9a

@yuzefovich do you think it could be related to any of the refactoring you've done recently?

@yuzefovich yuzefovich self-assigned this Sep 7, 2022
@yuzefovich yuzefovich added branch-release-22.1 Used to mark GA and release blockers, technical advisories, and bugs for 22.1 branch-release-22.2 Used to mark GA and release blockers, technical advisories, and bugs for 22.2 and removed branch-release-21.1 labels Sep 8, 2022
craig bot pushed a commit that referenced this issue Sep 9, 2022
87563: colmem: fix some issues with the memory limiting r=yuzefovich a=yuzefovich

**colexec: some fixes of the external sort**

This commit fixes some of the relatively benign issues in the external
sort:
- previously we forgot to unset the partition info for all partitions
that are being merged as part of the repeated merging process (we would
only reset the first one because it is overwritten by the newly created
"merged" partition)
- we incorrectly estimated "min output batch size" for the repeated
merging (the calculation was as if all current partitions were being
merged rather than `n`)
- we incorrectly computed the memory size of the enqueued batch. It is
possible that the batch is a "window" or doesn't use the whole capacity,
and previously we were using the total memory footprint.  However, we
need to only include the "proportional" size according to the length of
the batch.

The issues are relatively benign since they would mostly make the verbose
logging incorrect as well as over-estimate the "max batch mem size"
(which would mean that we'd merge the partitions sooner or with
a smaller output batch size).

Release justification: low-risk bug fix.

Release note: None

**colmem: fix some issues with the memory limiting**

This commit fixes a couple of issues with how we do memory-limiting of
batches by the footprint. In particular, the allocator will now estimate
the memory footprint of a batch before allocating a new one and will
clamp the capacity so that the batch stays under the limit. Previously,
we could allocate a batch that would exceed the limit even when all
types are fixed length. This behavior has been present since long time
ago.

Additionally, this commit fixes a recent regression in how
`SetAccountingHelper` uses the capacity of the batch. Previously, if
a new batch is allocated (when variable-width types are present) and
exceeds the memory limit, then the first call to `AccountForSet` would
artificially clamp the used capacity at 1, so the batch might have a lot
of unused capacity. Now the helper will memorize the full capacity right
after the batch is allocated. This regression was introduced in a recent
refactor of the `SetAccountingHelper`. In particular, it could lead to
the external sort (which uses the ordered synchronizer internally which
uses the `SetAccountingHelper`) becoming excruciatingly slow with
"unlucky" low memory limits. The limit would be "unlucky" if it is such
that a batch with capacity `c` doesn't exceed it, but the batch with
capacity `2 * c` would exceed the limit by less than a factor of two. In
such a scenario previously we would allocate the batch of capacity
`2 * c` yet would always use only a single row (because in
`AccountForSet` we would set the max capacity at 1).

Addresses: #87510.

Release justification: bug fix.

Release note: None

Co-authored-by: Yahor Yuzefovich <[email protected]>
@DrewKimball
Copy link
Collaborator

@yuzefovich should this have been closed by #87563?

@yuzefovich
Copy link
Member

I think only after the backports are merged because the regression from #85440 has already been backported.

@yuzefovich yuzefovich removed the branch-release-22.1 Used to mark GA and release blockers, technical advisories, and bugs for 22.1 label Sep 9, 2022
@cockroach-teamcity
Copy link
Member Author

roachtest.tpchvec/disk failed with artifacts on release-22.2 @ 1bfe9bcda653f55ed3b4216610433b51b2ef0d8f:

test artifacts and logs in: /artifacts/tpchvec/disk/run_1
	tpchvec.go:462,tpchvec.go:549,tpchvec.go:573,test_runner.go:908: Non-zero exit code: 1

	test_runner.go:1039,test_runner.go:938: test timed out (0s)

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

@yuzefovich
Copy link
Member

The backports have been merged, and no new issues came up, closing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures and bugs on the master branch. branch-release-22.2 Used to mark GA and release blockers, technical advisories, and bugs for 22.2 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-sql-queries SQL Queries Team
Projects
Archived in project
Development

No branches or pull requests

3 participants