Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

spanconfig/spanconfigkvsubscriber: TestBlockedKVSubscriberDisablesQueues failed #128453

Closed
cockroach-teamcity opened this issue Aug 7, 2024 · 11 comments · Fixed by #128713
Closed
Assignees
Labels
branch-release-24.2 Used to mark GA and release blockers, technical advisories, and bugs for 24.2 branch-release-24.2.0-rc C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-observability

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Aug 7, 2024

spanconfig/spanconfigkvsubscriber.TestBlockedKVSubscriberDisablesQueues failed on release-24.2.0-rc @ 3d1bcc1df630c10c4065a22fc846daa911b436d6:

Fatal error:

panic: test timed out after 41m37s
running tests:
	TestBlockedKVSubscriberDisablesQueues (40m41s)
	TestBlockedKVSubscriberDisablesQueues/queue=replicate (40m24s)

Stack:

goroutine 26335 [running]:
testing.(*M).startAlarm.func1()
	GOROOT/src/testing/testing.go:2366 +0x265
created by time.goFunc
	GOROOT/src/time/sleep.go:177 +0x45
Log preceding fatal error

* github.com/cockroachdb/cockroach/pkg/sql/contention/contentionutils.(*ConcurrentBufferGuard).syncLocked(0xc00a3231f0)
* 	github.com/cockroachdb/cockroach/pkg/sql/contention/contentionutils/concurrent_buffer_guard.go:158 +0x52
* github.com/cockroachdb/cockroach/pkg/sql/contention/contentionutils.(*ConcurrentBufferGuard).ForceSync(0xc00a3231f0)
* 	github.com/cockroachdb/cockroach/pkg/sql/contention/contentionutils/concurrent_buffer_guard.go:123 +0x87
* github.com/cockroachdb/cockroach/pkg/sql/sqlstats/insights.(*concurrentBufferIngester).Start.func2({0xc004077c08?, 0x0?})
* 	github.com/cockroachdb/cockroach/pkg/sql/sqlstats/insights/pkg/sql/sqlstats/insights/ingester.go:120 +0x65
* github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2()
* 	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:485 +0x263
* created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx in goroutine 18997
* 	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:476 +0x69d
* 
* goroutine 24360 [select, 2 minutes]:
* database/sql.(*DB).connectionOpener(0xc00b67a750, {0xb416ac0, 0xc00b67eeb0})
* 	GOROOT/src/database/sql/sql.go:1246 +0xed
* created by database/sql.OpenDB in goroutine 18997
* 	GOROOT/src/database/sql/sql.go:824 +0x2d4
* 
* goroutine 26217 [sleep]:
* time.Sleep(0x4c4b40)
* 	GOROOT/src/runtime/time.go:195 +0x115
* github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).Quiesce(0xc000e2f4a0, {0xb416af8, 0xc00d607230})
* 	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:610 +0x176
* github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).stopServers.func1(0x1, 0xc000e2f4a0)
* 	github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:156 +0x17f
* created by github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).stopServers in goroutine 18997
* 	github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:152 +0x1cf
* 
* goroutine 26104 [select, 2 minutes]:
* github.com/cockroachdb/pebble/vfs.(*diskHealthCheckingFile).startTicker.func1()
* 	github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/disk_health.go:253 +0x19b
* created by github.com/cockroachdb/pebble/vfs.(*diskHealthCheckingFile).startTicker in goroutine 26031
* 	github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/disk_health.go:248 +0xad
* 
* goroutine 26105 [sync.Cond.Wait, 2 minutes]:
* sync.runtime_notifyListWait(0xc00e2f40a8, 0x1)
* 	GOROOT/src/runtime/sema.go:569 +0x159
* sync.(*Cond).Wait(0xc00e2f4098)
* 	GOROOT/src/sync/cond.go:70 +0x75
* github.com/cockroachdb/pebble/record.(*flusherCond).Wait(...)
* 	github.com/cockroachdb/pebble/record/external/com_github_cockroachdb_pebble/record/log_writer.go:369
* github.com/cockroachdb/pebble/record.(*LogWriter).flushLoop(0xc00e2f4000, {0xc006b84140?, 0x7cef3a0?})
* 	github.com/cockroachdb/pebble/record/external/com_github_cockroachdb_pebble/record/log_writer.go:643 +0xb79
* runtime/pprof.Do({0xb416998, 0xfffae00}, {{0xc000cecfc0?, 0x20367a0?, 0xc005c4b4b0?}}, 0xc00c3aafc0)
* 	GOROOT/src/runtime/pprof/runtime.go:51 +0x118
* github.com/cockroachdb/pebble/record.NewLogWriter.func2()
* 	github.com/cockroachdb/pebble/record/external/com_github_cockroachdb_pebble/record/log_writer.go:562 +0x92
* created by github.com/cockroachdb/pebble/record.NewLogWriter in goroutine 26031
* 	github.com/cockroachdb/pebble/record/external/com_github_cockroachdb_pebble/record/log_writer.go:561 +0x8e5
* 
*

Parameters:

  • attempt=1
  • race=true
  • run=1
  • shard=1
Help

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/kv

This test on roachdash | Improve this report!

Jira issue: CRDB-41025

@cockroach-teamcity cockroach-teamcity added branch-release-24.2.0-rc C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv KV Team labels Aug 7, 2024
@pav-kv
Copy link
Collaborator

pav-kv commented Aug 7, 2024

Looks similar to #122608

@pav-kv
Copy link
Collaborator

pav-kv commented Aug 7, 2024

Seems stuck on stopping the test cluster, when the defer is executed in last line of the test:

I240807 09:44:02.045530 23670 util/log/clog.go:446 ⋮ [-] 1074

goroutine 18997 [semacquire, 2 minutes]:
sync.runtime_Semacquire(0xc00417fdc8?)
        GOROOT/src/runtime/sema.go:62 +0x25
sync.(*WaitGroup).Wait(0xc00417fdc0)
        GOROOT/src/sync/waitgroup.go:116 +0xa5
github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).stopServers(0xc005eb8008, {0xb416ba0, 0xfffae00})
        github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:160 +0x425
github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).Start.func3()
        github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:453 +0x3d
github.com/cockroachdb/cockroach/pkg/util/stop.CloserFn.Close(0xc0068b23f0)
        github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:108 +0x27
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).Stop(0xc005ed8000, {0xb416998, 0xfffae00})
        github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:550 +0x25a
github.com/cockroachdb/cockroach/pkg/spanconfig/spanconfigkvsubscriber_test.TestBlockedKVSubscriberDisablesQueues.func1(0xc003c2e000)
        github.com/cockroachdb/cockroach/pkg/spanconfig/spanconfigkvsubscriber_test/pkg/spanconfig/spanconfigkvsubscriber/kvsubscriber_client_test.go:118 +0xa99
testing.tRunner(0xc003c2e000, 0xc005046000)
        GOROOT/src/testing/testing.go:1689 +0x21f
created by testing.(*T).Run in goroutine 13785
        GOROOT/src/testing/testing.go:1742 +0x826

@pav-kv
Copy link
Collaborator

pav-kv commented Aug 7, 2024

Unclear why the test reports timing out at 41 minutes. From logs, the test cluster has been running between 09:41:19.292875 and 09:44:01.843982, i.e. 2m 42s. The 2 minutes has been spent waiting for the cluster shutdown.

@pav-kv pav-kv added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-testing Testing tools and infrastructure and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Aug 7, 2024
@pav-kv
Copy link
Collaborator

pav-kv commented Aug 7, 2024

2 goroutines in stopServers trying to quiesce:

‹goroutine 26217 [running]:›
‹github.com/cockroachdb/cockroach/pkg/util/allstacks.GetWithBuf({0x0, 0xc00fe01d90?, 0x0})›
‹       github.com/cockroachdb/cockroach/pkg/util/allstacks/allstacks.go:32 +0x8d›
‹github.com/cockroachdb/cockroach/pkg/util/allstacks.Get(...)›
‹       github.com/cockroachdb/cockroach/pkg/util/allstacks/allstacks.go:19›
‹github.com/cockroachdb/cockroach/pkg/util/log.DumpStacks({0xb416af8, 0xc00d607230}, {0x84e8490, 0xc})›
‹       github.com/cockroachdb/cockroach/pkg/util/log/clog.go:444 +0x54›
‹github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).Quiesce(0xc000e2f4a0, {0xb416af8, 0xc00d607230})›
‹       github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:607 +0x26a›
‹github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).stopServers.func1(0x1, 0xc000e2f4a0)›
‹       github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:156 +0x17f›
‹created by github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).stopServers in goroutine 18997›
‹       github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:152 +0x1cf›

goroutine 26217 [sleep]:›
time.Sleep(0x4c4b40)›
        GOROOT/src/runtime/time.go:195 +0x115›
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).Quiesce(0xc000e2f4a0, {0xb416af8, 0xc00d607230})›
        github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:610 +0x176›
github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).stopServers.func1(0x1, 0xc000e2f4a0)›
        github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:156 +0x17f›
created by github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).stopServers in goroutine 18997›
        github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:152 +0x1cf›

@pav-kv
Copy link
Collaborator

pav-kv commented Aug 7, 2024

Presumably, NumTasks doesn't reach 0.

@pav-kv
Copy link
Collaborator

pav-kv commented Aug 7, 2024

The goroutine dump is the last thing printed in the logs, but it's not a fatal message. So I think the test has really been stuck in this state for 40-ish minutes, but everything was nearly shut down, except there is some deadlock preventing all the tasks being shut down.

@pav-kv pav-kv added release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. and removed A-testing Testing tools and infrastructure labels Aug 7, 2024
@pav-kv
Copy link
Collaborator

pav-kv commented Aug 7, 2024

Removing the A-testing and reinstating release-blocker label, since this can be a bug in shutting down some of the tasks. Need to look at stack traces to understand which tasks are running, or maybe add some logging to print them.

@pav-kv
Copy link
Collaborator

pav-kv commented Aug 7, 2024

The only task in the stacks dump is the following, and it appears to be stuck:

goroutine 24086 [chan send, 2 minutes]:›
github.com/cockroachdb/cockroach/pkg/sql/sqlstats/insights.newConcurrentBufferIngester.func2(0xc00a3231f0?)›
        github.com/cockroachdb/cockroach/pkg/sql/sqlstats/insights/pkg/sql/sqlstats/insights/ingester.go:208 +0x118›
github.com/cockroachdb/cockroach/pkg/sql/contention/contentionutils.(*ConcurrentBufferGuard).syncLocked(0xc00a3231f0)›
        github.com/cockroachdb/cockroach/pkg/sql/contention/contentionutils/concurrent_buffer_guard.go:158 +0x52›
github.com/cockroachdb/cockroach/pkg/sql/contention/contentionutils.(*ConcurrentBufferGuard).ForceSync(0xc00a3231f0)›
        github.com/cockroachdb/cockroach/pkg/sql/contention/contentionutils/concurrent_buffer_guard.go:123 +0x87›
github.com/cockroachdb/cockroach/pkg/sql/sqlstats/insights.(*concurrentBufferIngester).Start.func2({0xc004077c08?, 0x0?})›
        github.com/cockroachdb/cockroach/pkg/sql/sqlstats/insights/pkg/sql/sqlstats/insights/ingester.go:120 +0x65›
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2()›
        github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:485 +0x263›
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx in goroutine 18997›
        github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:476 +0x69d›

@pav-kv pav-kv added the T-sql-queries SQL Queries Team label Aug 7, 2024
@github-project-automation github-project-automation bot moved this to Triage in SQL Queries Aug 7, 2024
@pav-kv
Copy link
Collaborator

pav-kv commented Aug 7, 2024

@michae2 @rytaft Could you please help triage this to the right SQL team? There seems to be a deadlock in pkg/sql/sqlstats.

@rytaft
Copy link
Collaborator

rytaft commented Aug 7, 2024

This would be the observability team cc @dhartunian

@rytaft rytaft added T-observability and removed T-sql-queries SQL Queries Team labels Aug 7, 2024
@exalate-issue-sync exalate-issue-sync bot removed the T-kv KV Team label Aug 7, 2024
@rytaft rytaft removed this from SQL Queries Aug 7, 2024
dhartunian added a commit to dhartunian/cockroach that referenced this issue Aug 9, 2024
The buffer ingester could get stuck during closing because it would
independently evaluate a `running` bool and then try to write to
a channel with a buffer of 1. This channel could have accumulated
a payload prior to the ingester shutting down and would then block
forever.

Instead of using the boolean, the flush operation now selects on a
channel that we close on shutdown. This ensure that no matter how many
extra flushes we have blocked, they will all get cancelled.

Resolves: cockroachdb#128453
Epic: None

Release note: None
craig bot pushed a commit that referenced this issue Aug 13, 2024
128713: sqlstats: add close chan to buffer ingester r=xinhaoz a=dhartunian

The buffer ingester could get stuck during closing because it would independently evaluate a `running` bool and then try to write to a channel with a buffer of 1. This channel could have accumulated a payload prior to the ingester shutting down and would then block forever.

Instead of using the boolean, the flush operation now selects on a channel that we close on shutdown. This ensure that no matter how many extra flushes we have blocked, they will all get cancelled.

Resolves: #128453
Epic: None

Release note: None

Co-authored-by: David Hartunian <[email protected]>
@craig craig bot closed this as completed in 6e1ead5 Aug 13, 2024
Copy link

blathers-crl bot commented Aug 13, 2024

Based on the specified backports for linked PR #128713, I applied the following new label(s) to this issue: branch-release-24.2. Please adjust the labels as needed to match the branches actually affected by this issue, including adding any known older branches.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@blathers-crl blathers-crl bot added the branch-release-24.2 Used to mark GA and release blockers, technical advisories, and bugs for 24.2 label Aug 13, 2024
blathers-crl bot pushed a commit that referenced this issue Aug 13, 2024
The buffer ingester could get stuck during closing because it would
independently evaluate a `running` bool and then try to write to
a channel with a buffer of 1. This channel could have accumulated
a payload prior to the ingester shutting down and would then block
forever.

Instead of using the boolean, the flush operation now selects on a
channel that we close on shutdown. This ensure that no matter how many
extra flushes we have blocked, they will all get cancelled.

Resolves: #128453
Epic: None

Release note: None
@github-project-automation github-project-automation bot moved this to roachtest/unit test backlog in KV Aug 28, 2024
dhartunian added a commit to dhartunian/cockroach that referenced this issue Oct 15, 2024
The buffer ingester could get stuck during closing because it would
independently evaluate a `running` bool and then try to write to
a channel with a buffer of 1. This channel could have accumulated
a payload prior to the ingester shutting down and would then block
forever.

Instead of using the boolean, the flush operation now selects on a
channel that we close on shutdown. This ensure that no matter how many
extra flushes we have blocked, they will all get cancelled.

Resolves: cockroachdb#128453
Epic: None

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-release-24.2 Used to mark GA and release blockers, technical advisories, and bugs for 24.2 branch-release-24.2.0-rc C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-observability
Projects
No open projects
Status: roachtest/unit test backlog
Development

Successfully merging a pull request may close this issue.

4 participants