
roachtest: tpch_concurrency failed #79469

Closed
cockroach-teamcity opened this issue Apr 5, 2022 · 6 comments · Fixed by #79716
Assignees
yuzefovich
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-sql-queries SQL Queries Team

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Apr 5, 2022

roachtest.tpch_concurrency failed with artifacts on master @ 63ea9139e2ca996e38b5fe7c7b43a97e625242f5:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /artifacts/tpch_concurrency/run_1
	test_runner.go:1006,test_runner.go:905: test timed out (18h0m0s)
Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/sql-queries

This test on roachdash | Improve this report!

Jira issue: CRDB-14877

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Apr 5, 2022
@blathers-crl blathers-crl bot added the T-sql-queries SQL Queries Team label Apr 5, 2022
@yuzefovich
Member

I think we're stuck on query 13 on node 1 trying to acquire the file descriptors from the semaphore:

* goroutine 772856 [select, 992 minutes]:
* github.com/marusama/semaphore.(*semaphore).Acquire(0xc0023a4420, {0x6282bf8, 0xc01ed08ae0}, 0x10)
* 	github.com/marusama/semaphore/external/com_github_marusama_semaphore/semaphore.go:120 +0x1b7
* github.com/cockroachdb/cockroach/pkg/sql/colflow.(*countingSemaphore).Acquire(0xc02dd615a0, {0x6282bf8, 0xc01ed08ae0}, 0x10)
* 	github.com/cockroachdb/cockroach/pkg/sql/colflow/vectorized_flow.go:79 +0x38
* github.com/cockroachdb/cockroach/pkg/sql/colexec.(*hashBasedPartitioner).Next(0xc0158bd1e0)
* 	github.com/cockroachdb/cockroach/pkg/sql/colexec/hash_based_partitioner.go:440 +0xedc
* github.com/cockroachdb/cockroach/pkg/sql/colexec.(*diskSpillerBase).Next(0xc00219f830)
* 	github.com/cockroachdb/cockroach/pkg/sql/colexec/disk_spiller.go:211 +0x142
* github.com/cockroachdb/cockroach/pkg/sql/colexec/colexecbase.(*simpleProjectOp).Next(0xc0040df860)
* 	github.com/cockroachdb/cockroach/pkg/sql/colexec/colexecbase/simple_project.go:125 +0x3f
* github.com/cockroachdb/cockroach/pkg/sql/colflow.(*batchInfoCollector).next(...)
* 	github.com/cockroachdb/cockroach/pkg/sql/colflow/stats.go:99
* github.com/cockroachdb/cockroach/pkg/sql/colexecerror.CatchVectorizedRuntimeError(0xc00e45c480)
* 	github.com/cockroachdb/cockroach/pkg/sql/colexecerror/error.go:91 +0x62
* github.com/cockroachdb/cockroach/pkg/sql/colflow.(*batchInfoCollector).Next(0xc01cf09c00)
* 	github.com/cockroachdb/cockroach/pkg/sql/colflow/stats.go:107 +0x4f
* github.com/cockroachdb/cockroach/pkg/sql/colexec.getNext_false(0xc004916a00)
* 	github.com/cockroachdb/cockroach/bazel-out/k8-opt/bin/pkg/sql/colexec/hash_aggregator.eg.go:544 +0xef
* github.com/cockroachdb/cockroach/pkg/sql/colexec.(*hashAggregator).Next(0xc0038ee900)
* 	github.com/cockroachdb/cockroach/bazel-out/k8-opt/bin/pkg/sql/colexec/hash_aggregator.eg.go:82 +0x36
* github.com/cockroachdb/cockroach/pkg/sql/colexec.(*diskSpillerBase).Next.func1()
* 	github.com/cockroachdb/cockroach/pkg/sql/colexec/disk_spiller.go:198 +0x2f
* github.com/cockroachdb/cockroach/pkg/sql/colexecerror.CatchVectorizedRuntimeError(0x0)
...

We have seen this in a customer setting, but the issue was fixed with the cancellation fixes several months ago. I wonder whether we're in a legitimate livelock or whether we again have cancellation issues (#79084 is a somewhat plausible culprit). My guess is it's the former, so we might want to just bump the default from 256 FDs to, say, 1024.
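For reference, a minimal sketch (not the real colflow code; the limit of 16 and all names here are assumptions) of how a fixed-capacity FD semaphore blocks forever once earlier acquisitions are never released:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/marusama/semaphore"
)

func main() {
	const maxFDs = 16 // hypothetical stand-in for the vectorized FD budget
	sem := semaphore.New(maxFDs)

	// Simulate flows that acquired all descriptors but whose cleanup never
	// ran, so nothing is ever released back to the semaphore.
	if err := sem.Acquire(context.Background(), maxFDs); err != nil {
		panic(err)
	}

	// A later operator asking for more descriptors now blocks until its
	// context expires, mirroring the "[select, 992 minutes]" goroutine in
	// the stack trace above.
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	if err := sem.Acquire(ctx, 1); err != nil {
		fmt.Println("acquire blocked:", err)
	}
}
```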

I don't think it's a release blocker, so I'm removing the corresponding label.

@yuzefovich yuzefovich removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Apr 6, 2022
@yuzefovich yuzefovich self-assigned this Apr 6, 2022
@yuzefovich
Member

Indeed, I think #79084 introduced a regression with cancellation. I have a fix. Adding a release-blocker label to 22.1 and 21.2 branches.

@yuzefovich yuzefovich added release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. branch-release-21.2 branch-release-22.1 Used to mark GA and release blockers, technical advisories, and bugs for 22.1 labels Apr 9, 2022
@rytaft
Collaborator

rytaft commented Apr 9, 2022

Could you say a bit more about how you determined that #79084 caused this?

@yuzefovich
Member

yuzefovich commented Apr 9, 2022

I think I have a decent explanation in the commit message in #79716, copy-pasting (with some edits) for convenience.

#79084 fixed the way inboxes handle regular query errors so that the gRPC
streams are no longer broken whenever a query error is encountered. However,
that change introduced a regression: it is now possible that the inbox handler
goroutine (the one instantiated to handle the FlowStream gRPC call) never exits
when the inbox is an input to the parallel unordered synchronizer.

In particular, the following sequence of events can happen:

  1. the reader goroutine of the inbox receives an error from the
    corresponding outbox
  2. this error is propagated to one of the input goroutines of the
    unordered synchronizer via a panic. Notably, this is considered
    a "graceful" termination from the perspective of the gRPC stream
    handling, so the handler goroutine is not notified of this error, and
    the inbox is not closed. It is expected that the inbox will be drained,
    which will close the handler goroutine.
  3. however, the synchronizer input goroutine currently simply receives
    the error, propagates it to the coordinator goroutine, and exits,
    without performing the draining.

Thus, we get into a state where the inbox is never drained, so the
handler goroutine will stay alive forever. In particular, this will
block Flow.Wait calls, and Flow.Cleanup will never be called. This
could lead, for example, to leaking file descriptors used by the
temporary disk storage in the vectorized engine.

What this test encountered was exactly the leak of file descriptors.
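To make the sequence concrete, here is a rough sketch with hypothetical names (inbox, handler, drain), not the actual colrpc/colexec code: the handler goroutine stays alive until the inbox is drained, so an input goroutine that propagates the error and returns without draining leaks the handler, blocks Flow.Wait, and keeps resources such as FDs pinned.

```go
package main

import (
	"fmt"
	"time"
)

type inbox struct {
	drained chan struct{} // closed once the consumer has drained the inbox
}

// handler stands in for the FlowStream gRPC handler goroutine: on a
// "graceful" query error it is not notified directly and only exits once
// the inbox has been drained.
func (ib *inbox) handler() {
	<-ib.drained
	fmt.Println("handler exited; Flow.Wait can return and cleanup can run")
}

// drain signals the handler that the inbox has been fully consumed.
func (ib *inbox) drain() { close(ib.drained) }

func main() {
	ib := &inbox{drained: make(chan struct{})}
	go ib.handler()

	// Synchronizer input goroutine: it receives a query error from the inbox.
	err := fmt.Errorf("query error propagated from the outbox")

	// Buggy behavior after #79084: propagate err to the coordinator and
	// return here without draining; handler() above would then never exit,
	// Flow.Wait would block, and the flow's FDs would never be released.
	//
	// Fixed behavior: drain the inbox before exiting.
	fmt.Println("propagating to coordinator:", err)
	ib.drain()

	time.Sleep(100 * time.Millisecond) // give the handler time to print
}
```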

@rytaft
Collaborator

rytaft commented Apr 11, 2022

Thanks for the explanation. Should we revert the corresponding PR on 22.1 as well? I think any more comprehensive fix will likely need to wait until 22.1.1.

@yuzefovich
Member

Yes, I think so. It didn't seem as urgent on Friday, so I delayed the revert till today.

@yuzefovich yuzefovich removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. branch-release-21.2 branch-release-22.1 Used to mark GA and release blockers, technical advisories, and bugs for 22.1 labels Apr 11, 2022
@craig craig bot closed this as completed in e4a9f66 Apr 11, 2022