Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

testutils: TestRestart flaked on master #104740

Closed
msbutler opened this issue Jun 12, 2023 · 4 comments · Fixed by #105587
Closed

testutils: TestRestart flaked on master #104740

msbutler opened this issue Jun 12, 2023 · 4 comments · Fixed by #105587
Assignees
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). skipped-test T-sql-queries SQL Queries Team

Comments

@msbutler
Copy link
Collaborator

msbutler commented Jun 12, 2023

See flake here

Jira issue: CRDB-28702

@msbutler msbutler added C-test-failure Broken test (automatically or manually discovered). T-testeng TestEng Team labels Jun 12, 2023
@blathers-crl
Copy link

blathers-crl bot commented Jun 12, 2023

cc @cockroachdb/test-eng

@msbutler
Copy link
Collaborator Author

msbutler added a commit to msbutler/cockroach that referenced this issue Jun 13, 2023
Informs cockroachdb#104740

Release note: None

Epic: none
msbutler added a commit to msbutler/cockroach that referenced this issue Jun 13, 2023
Informs cockroachdb#104740

Release note: None

Epic: none
craig bot pushed a commit that referenced this issue Jun 13, 2023
104768: testutils: skip TestRestart r=msbutler a=msbutler

Informs #104740

Release note: None

Epic: none

Co-authored-by: Michael Butler <[email protected]>
@rafiss rafiss added the branch-master Failures and bugs on the master branch. label Jun 13, 2023
@renatolabs
Copy link
Contributor

In both occurrences, we have this goroutine in the logs:

goroutine 59323 [select, 4 minutes]:
github.com/cockroachdb/cockroach/pkg/sql/flowinfra.(*FlowBase).Wait(0xc0021ac000)
        github.com/cockroachdb/cockroach/pkg/sql/flowinfra/flow.go:602 +0x125
github.com/cockroachdb/cockroach/pkg/sql/flowinfra.(*RemoteFlowRunner).RunFlow.func1.3({0xc0031ba800?, 0x1?})
        github.com/cockroachdb/cockroach/pkg/sql/flowinfra/remote_flow_runner.go:123 +0x69
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2()
        github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:484 +0x146
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx
        github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:475 +0x43b

Seems like we're stuck here:

if err := r.stopper.RunAsyncTask(ctx, "flowinfra.RemoteFlowRunner: waiting for flow to finish", func(ctx context.Context) {
if shouldExit := <-waiterShouldExit; shouldExit {
return
}
f.Wait()
cleanup()
}); err != nil {

Maybe the race is that:

Not familiar with this code, so the above might not make sense. Tagging @yuzefovich as the original author.

@knz knz added T-sql-queries SQL Queries Team and removed T-testeng TestEng Team labels Jun 20, 2023
@yuzefovich yuzefovich self-assigned this Jun 20, 2023
@yuzefovich
Copy link
Member

The root cause for the hang is different - I think this goroutine is the culprit:

goroutine 59324 [select, 4 minutes]:
github.com/cockroachdb/cockroach/pkg/rpc.(*Connection).waitOrDefault(0xc003f92c40, {0x60a67e0, 0xc00b5a6db0}, {0x0, 0x0}, {0x607e018, 0x8fc8500})
	github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/connection.go:79 +0x305
github.com/cockroachdb/cockroach/pkg/rpc.(*Connection).ConnectNoBreaker(0xc003f92c40, {0x60a67e0, 0xc00b5a6db0})
	github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/connection.go:145 +0x5e
github.com/cockroachdb/cockroach/pkg/rpc/nodedialer.(*Dialer).dial(0xc0078a18a8, {0x60a67e0, 0xc00b5a6db0}, 0x33333333?, {0x60802a0?, 0xc00b5d20f8}, 0x0, 0x0?)
	github.com/cockroachdb/cockroach/pkg/rpc/nodedialer/nodedialer.go:175 +0x10f
github.com/cockroachdb/cockroach/pkg/rpc/nodedialer.(*Dialer).DialNoBreaker(0xc0078a18a8, {0x60a67e0, 0xc00b5a6db0}, 0x3b9aca00?, 0x0?)
	github.com/cockroachdb/cockroach/pkg/rpc/nodedialer/nodedialer.go:119 +0x8d
github.com/cockroachdb/cockroach/pkg/sql/execinfra.GetConnForOutbox({0x60a67e0, 0xc00b5a6db0}, {0x6069ee0, 0xc0078a18a8}, 0x4a82520?, 0x2540be400)
	github.com/cockroachdb/cockroach/pkg/sql/execinfra/outboxbase.go:47 +0x215
github.com/cockroachdb/cockroach/pkg/sql/colflow/colrpc.(*Outbox).Run.func1({0x60a67e0, 0xc00b5a6db0}, {0x6069ee0?, 0xc0078a18a8?}, 0x1?, 0x0?, {0x60a67e0, 0xc00b5a6bd0}, 0xc00e0d7ea0, 0xc003da6750, ...)
	github.com/cockroachdb/cockroach/pkg/sql/colflow/colrpc/outbox.go:187 +0x65
github.com/cockroachdb/cockroach/pkg/sql/colflow/colrpc.(*Outbox).Run(0xc003da6750, {0x60a67e0, 0xc00b5a6bd0}, {0x6069ee0, 0xc0078a18a8}, 0x4d14f60?, 0x1, 0xc00652e170, 0xc004d14fb0?)
	github.com/cockroachdb/cockroach/pkg/sql/colflow/colrpc/outbox.go:228 +0x336
github.com/cockroachdb/cockroach/pkg/sql/colflow.(*vectorizedFlowCreator).setupRemoteOutputStream.func1({0x60a67e0, 0xc00b5a6bd0}, 0xc004d14f98?)
	github.com/cockroachdb/cockroach/pkg/sql/colflow/vectorized_flow.go:768 +0xac
github.com/cockroachdb/cockroach/pkg/sql/colflow.(*vectorizedFlowCreator).accumulateAsyncComponent.func1.1()
	github.com/cockroachdb/cockroach/pkg/sql/colflow/vectorized_flow.go:730 +0x6b
created by github.com/cockroachdb/cockroach/pkg/sql/colflow.(*vectorizedFlowCreator).accumulateAsyncComponent.func1
	github.com/cockroachdb/cockroach/pkg/sql/colflow/vectorized_flow.go:728 +0xdd

I believe the regression was introduced in #99191 and should have been fixed in #104945. I'll stress the test to make sure it's not flaky anymore (so far 5 minutes with no failures).

yuzefovich added a commit to yuzefovich/cockroach that referenced this issue Jun 26, 2023
`TestRestart` appears to not be flaky anymore. See cockroachdb#104740 (comment)
for more details.

Release note: None
@craig craig bot closed this as completed in 961a2e6 Jun 26, 2023
@mgartner mgartner moved this to Done in SQL Queries Jul 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). skipped-test T-sql-queries SQL Queries Team
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

5 participants