c2c: slow quiesce failure TestDataDriven/initial_scan_spanconfigs : rangefeed client hangs on close #110350

Closed
msbutler opened this issue Sep 11, 2023 · 20 comments · Fixed by #110942
Labels
A-kv-replication Relating to Raft, consensus, and coordination. branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.

Comments

@msbutler
Collaborator

msbutler commented Sep 11, 2023

See CI failure here. I think the relevant goroutine points to a slow rangefeed.Close() in the c2c eventStream.

goroutine 40449 [chan receive, 14 minutes]:
github.com/cockroachdb/cockroach/pkg/kv/kvclient/rangefeed.(*RangeFeed).Close.func1()
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/rangefeed/rangefeed.go:258 +0x3c
sync.(*Once).doSlow(0x0?, 0x0?)
	GOROOT/src/sync/once.go:74 +0xc2
sync.(*Once).Do(...)
	GOROOT/src/sync/once.go:65
github.com/cockroachdb/cockroach/pkg/kv/kvclient/rangefeed.(*RangeFeed).Close(0xc00b68c220?)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/rangefeed/rangefeed.go:256 +0x47
github.com/cockroachdb/cockroach/pkg/ccl/streamingccl/streamproducer.(*eventStream).Close(0xc00b401c00, {0x7a393b0, 0xc0099c34a0})
	github.com/cockroachdb/cockroach/pkg/ccl/streamingccl/streamproducer/event_stream.go:234 +0x4d
github.com/cockroachdb/cockroach/pkg/sql/rowexec.(*projectSetProcessor).close(0xc0060c5200)
	github.com/cockroachdb/cockroach/pkg/sql/rowexec/project_set.go:345 +0x9a
github.com/cockroachdb/cockroach/pkg/sql/rowexec.newProjectSetProcessor.func1()
	github.com/cockroachdb/cockroach/pkg/sql/rowexec/project_set.go:105 +0x1d
github.com/cockroachdb/cockroach/pkg/sql/execinfra.(*ProcessorBaseNoHelper).moveToTrailingMeta(0xc0060c5200)
	github.com/cockroachdb/cockroach/pkg/sql/execinfra/processorsbase.go:689 +0x3b8
github.com/cockroachdb/cockroach/pkg/sql/execinfra.(*ProcessorBaseNoHelper).DrainHelper(0xc0060c5200)
	github.com/cockroachdb/cockroach/pkg/sql/execinfra/processorsbase.go:570 +0x196
github.com/cockroachdb/cockroach/pkg/sql/rowexec.(*projectSetProcessor).Next(0xc0060c5200)
	github.com/cockroachdb/cockroach/pkg/sql/rowexec/project_set.go:328 +0x1f4
github.com/cockroachdb/cockroach/pkg/sql/execinfra.(*ProcessorBaseNoHelper).DrainHelper(0xc007076000)
	github.com/cockroachdb/cockroach/pkg/sql/execinfra/processorsbase.go:566 +0x12e
github.com/cockroachdb/cockroach/pkg/sql/colexec.(*Columnarizer).DrainMeta(0xc007076000)
	github.com/cockroachdb/cockroach/pkg/sql/colexec/columnarizer.go:285 +0x211
github.com/cockroachdb/cockroach/pkg/sql/colexec.(*invariantsChecker).DrainMeta(0xc00c3ea800)
	github.com/cockroachdb/cockroach/pkg/sql/colexec/invariants_checker.go:119 +0x3c
github.com/cockroachdb/cockroach/pkg/sql/colexecop.MetadataSources.DrainMeta.func1()
	github.com/cockroachdb/cockroach/pkg/sql/colexecop/operator.go:464 +0x6b
github.com/cockroachdb/cockroach/pkg/sql/colexecerror.CatchVectorizedRuntimeError(0xc011feada0?)
	github.com/cockroachdb/cockroach/pkg/sql/colexecerror/error.go:92 +0x62
github.com/cockroachdb/cockroach/pkg/sql/colexecop.MetadataSources.DrainMeta({0xc00ee7c140?, 0x7a78800?, 0xc00d59fd40?})
	github.com/cockroachdb/cockroach/pkg/sql/colexecop/operator.go:462 +0x8a
github.com/cockroachdb/cockroach/pkg/sql/colflow.(*BatchFlowCoordinator).Run(0xc00eee0fa0, {0x7a39308?, 0xc009f94780?})
...

Jira issue: CRDB-31392

@msbutler msbutler added C-test-failure Broken test (automatically or manually discovered). branch-master Failures and bugs on the master branch. T-disaster-recovery labels Sep 11, 2023
@blathers-crl

blathers-crl bot commented Sep 11, 2023

cc @cockroachdb/disaster-recovery

@msbutler
Collaborator Author

@miretskiy do you know of any recent rangefeed work that could have caused this?

@msbutler
Collaborator Author

Aha, I thought of one theory (without evidence) for why this could be occurring: this unit test doesn't manually shut down the replication stream, so maybe some context-cancellation wires are getting crossed somewhere.

@msbutler msbutler self-assigned this Sep 12, 2023
msbutler added a commit to msbutler/cockroach that referenced this issue Sep 12, 2023
…_configs

This small patch prevents a slow quiesce failure during test shutdown by
manually closing the replication stream before test shutdown. cockroachdb#110350 observed
this failure because the eventStream's underlying rangefeed would not close,
implying there may still be some sort of deadlock within Rangefeed.Close().
This patch merely deflakes the TestDataDriven/initial_scan_span_configs test.

Informs cockroachdb#110350

Release note: None
@msbutler
Collaborator Author

I can repro this failure via:

git checkout 746388795af76a43cc21d072d06bc646b8179374
./dev test pkg/ccl/streamingccl/streamingest -f TestDataDriven/initial_scan_span_configs --stress --timeout 120s

@msbutler
Collaborator Author

Opened #110504 to deflake this test, but there may still be some deadlock problem in rangefeed.Close().

@miretskiy
Contributor

@miretskiy do you know of any recent rangefeed work that could have caused this?

There is plenty, but not where you're seeing this. It's the rangefeed client library (not to be confused with the "dist sender rangefeed") that seems stuck.

@miretskiy
Contributor

Did this happen after mux was turned on?

@msbutler
Collaborator Author

I see a mux rangefeed goroutine in the dump... so, yes?

goroutine 29278 [semacquire, 2 minutes]:
sync.runtime_Semacquire(0xc00335f860?)
	GOROOT/src/runtime/sema.go:62 +0x27
sync.(*WaitGroup).Wait(0xb470301?)
	GOROOT/src/sync/waitgroup.go:116 +0x4b
golang.org/x/sync/errgroup.(*Group).Wait(0xc00bf42700)
	golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:53 +0x27
github.com/cockroachdb/cockroach/pkg/util/ctxgroup.Group.Wait({0xc00bf42700?, {0x7a39308?, 0xc00fa9dae0?}})
	github.com/cockroachdb/cockroach/pkg/util/ctxgroup/ctxgroup.go:144 +0x4a
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.muxRangeFeed({0x7a393b0, 0xc00103d7d0}, {0x1, 0x1, 0x1, 0x0, {0x0, 0x0, 0x0}}, {0xc00bf42680, ...}, ...)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender_mux_rangefeed.go:102 +0x490
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).RangeFeedSpans(0xc00f33d600, {0x7a39308, 0xc00fa9d810}, {0xc00bf42680?, 0x1, 0x1}, 0xc0099b7bc0, {0xc00bf42300, 0x3, 0x4})
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go:234 +0x490
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).RangeFeed(0x582ea80?, {0x7a39308, 0xc00fa9d810}, {0xc00b737080, 0x1, 0x2?}, {0xd04b5c?, 0x0?, 0x0?}, 0xc0099b7bc0, ...)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go:183 +0x29a
github.com/cockroachdb/cockroach/pkg/kv/kvclient/rangefeed.(*dbAdapter).RangeFeed(0x1e3b040?, {0x7a39308?, 0xc00fa9d810?}, {0xc00b737080?, 0xc009a47fd0?, 0x1269c26?}, {0x7a393b0?, 0xb7371a0?, 0xc0?}, 0xc0099b7bc0, ...)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/rangefeed/db_adapter.go:79 +0x65
github.com/cockroachdb/cockroach/pkg/kv/kvclient/rangefeed.(*RangeFeed).run.func1({0x7a39308?, 0xc00fa9d810?})
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/rangefeed/rangefeed.go:314 +0xb2
github.com/cockroachdb/cockroach/pkg/util/ctxgroup.Group.GoCtx.func1()
	github.com/cockroachdb/cockroach/pkg/util/ctxgroup/ctxgroup.go:168 +0x25
golang.org/x/sync/errgroup.(*Group).Go.func1()
	golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:75 +0x64
created by golang.org/x/sync/errgroup.(*Group).Go
	golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:72 +0xa5

@miretskiy
Contributor

Just guessing here:

	if err := f.stopper.RunAsyncTask(ctx, "rangefeed", runWithFrontier); err != nil {
		f.cancel()
		return err
	}

(That's in rangefeed.go... Shouldn't we use WithCancelOnQuiesce here?)

@dt dt added the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Sep 13, 2023
@dt
Member

dt commented Sep 13, 2023

@miretskiy we do, don't we? One line above:

ctx, f.cancel = f.stopper.WithCancelOnQuiesce(ctx)
if err := f.stopper.RunAsyncTask(ctx, "rangefeed", runWithFrontier); err != nil {
	f.cancel()
	return err
}

@msbutler
Collaborator Author

@dt writing something up now and plan to hand off to kv and the rangefeed scalability team

@msbutler
Collaborator Author

msbutler commented Sep 13, 2023

Handing this off to kv and posting in the rangefeed-scalability channel about this.

To summarize: when the test ends, there's an active c2c job running. When the test runner shuts down a test server, the test occasionally fails because of a slow quiesce message. To repro the slowness (without waiting for the quiesce timeout), run this.

The goroutine dump points to rangefeed.Close hanging here:

Looking a little further into the rangefeed client library, the goroutine dump indicates we're waiting in rangefeed.Close() for this stopper channel:

That stopper channel closes when rangefeed.run() returns, but the dump indicates that run() has not returned, as it's waiting on the processEvents goroutine to finish:

err := ctxgroup.GoAndWait(ctx, rangeFeedTask, processEventsTask)
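
For context, here is the shape of that hang as a minimal sketch (runDone and processEventsDone are stand-in names, not the actual rangefeed internals):

package main

import (
	"fmt"
	"time"
)

func main() {
	runDone := make(chan struct{})
	processEventsDone := make(chan struct{}) // never closed, like the stuck processEvents goroutine

	// run(): closes runDone only after processEvents finishes.
	go func() {
		defer close(runDone)
		<-processEventsDone
	}()

	// Close(): blocks on runDone, mirroring goroutine 40449 above.
	select {
	case <-runDone:
		fmt.Println("closed cleanly")
	case <-time.After(time.Second):
		fmt.Println("Close hangs: run() never returned")
	}
}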

An example stack in the dump:

goroutine 428 [sync.Cond.Wait, 1 minutes]:
sync.runtime_notifyListWait(0xc006065390, 0x1)
  GOROOT/src/runtime/sema.go:527 +0x14c
sync.(*Cond).Wait(0xc000c6c780?)
  GOROOT/src/sync/cond.go:70 +0x8c
github.com/cockroachdb/cockroach/pkg/kv/kvserver/rangefeed.(*Scheduler).processEvents(0xc0063e20c0, {0x7a4e820, 0xc0010f98c0})
  github.com/cockroachdb/cockroach/pkg/kv/kvserver/rangefeed/scheduler.go:295 +0x4a6
github.com/cockroachdb/cockroach/pkg/kv/kvserver/rangefeed.(*Scheduler).Start.func1({0x7a4e820, 0xc0010f98c0})
  github.com/cockroachdb/cockroach/pkg/kv/kvserver/rangefeed/scheduler.go:162 +0x11f
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2()
  github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:484 +0x146
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx
  github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:475 +0x43b

Figuring out why processEvents is hanging is beyond my pay grade, so I'm handing over to kv.

Some things to note:

  • I don't think this is a mux rangefeed problem. If I set changefeed.mux_rangefeed.enabled to false during test cluster startup, the test still hangs. See test modification here.
  • If the c2c job shuts down cleanly via job completion, I don't see this hanging behavior. I may deflake this test by cleanly shutting down the job; obviously, that patch would not fix the underlying bug here.

@msbutler msbutler removed their assignment Sep 13, 2023
@msbutler msbutler changed the title c2c: slow quiesce failure TestDataDriven/initial_scan_spanconfigs c2c: slow quiesce failure TestDataDriven/initial_scan_spanconfigs : rangefeed client hangs on close Sep 13, 2023
@dt
Member

dt commented Sep 13, 2023

sync.Cond.Wait

Is this just the usual "condition variables don't compose with context cancellation which is why we don't use them" issue?

@miretskiy
Contributor

sync.Cond.Wait

Is this just the usual "condition variables don't compose with context cancellation which is why we don't use them" issue?

Worth checking -- esp. since this is new code; but I'm not sure that's the case. It's just a standard producer/consumer
pattern where the consumer (processEvents) is getting signaled when the processor stops.
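
To illustrate dt's point concretely, a minimal sketch (not the actual scheduler code): a goroutine parked in sync.Cond.Wait is invisible to context cancellation, so shutdown has to set a flag and Broadcast explicitly.

package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu sync.Mutex
	cond := sync.NewCond(&mu)
	stopped := false

	woke := make(chan struct{})
	go func() {
		mu.Lock()
		for !stopped {
			cond.Wait() // a cancelled ctx is invisible here
		}
		mu.Unlock()
		close(woke)
	}()

	_, cancel := context.WithCancel(context.Background())
	cancel() // cancellation alone never wakes the waiter

	select {
	case <-woke:
	case <-time.After(100 * time.Millisecond):
		fmt.Println("still parked in cond.Wait despite cancellation")
	}

	// Shutdown must set the flag and broadcast explicitly.
	mu.Lock()
	stopped = true
	mu.Unlock()
	cond.Broadcast()
	<-woke
	fmt.Println("woke after Broadcast")
}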

@kvoli kvoli added A-kv-replication Relating to Raft, consensus, and coordination. T-kv-replication and removed T-kv KV Team labels Sep 14, 2023
@blathers-crl

blathers-crl bot commented Sep 14, 2023

cc @cockroachdb/replication

@erikgrinaker
Contributor

Why is this using the rangefeed scheduler at all? It's disabled by default, and I don't believe we've enabled any metamorphism. Did y'all explicitly enable it or something?

@msbutler
Collaborator Author

I don't believe we explicitly enabled this. Here's where we set all of our cluster settings.

@erikgrinaker
Contributor

Ah, I guess we start the scheduler regardless, but don't actually use it for any rangefeeds unless the setting is enabled. So this is just failing to shut down an idle scheduler. I guess this is all the motivation we need for #110634 @aliher1911.

@msbutler
Collaborator Author

A heads up that I've bors'd a PR that will deflake this test. You can still repro the bug on a checked-out branch.

#110350 (comment)

craig bot pushed a commit that referenced this issue Sep 14, 2023
107788: sql: use 0 for gateway node in statistics table to reduce row count r=j82w a=j82w

The SQL Activity pages already use a query that does a group-by to remove any node-specific information. So storing the gateway node in the system table causes the table to grow by a factor of the number of nodes; the SQL Activity query then has to merge all the data back down to a single row, and the cleanup job has to delete all of those rows.

This PR changes the default value of the `sql.metrics.statement_details.gateway_node.enabled` cluster setting to false. The setting was originally added to help customers with large workloads, because too many rows were being generated. Switching it to false by default lets all customers take advantage of the reduced row generation. It is still a cluster setting, so customers who want per-node statistics can turn it back on.

Closes: #107787

Release note (sql change): Change the cluster setting `sql.metrics.statement_details.gateway_node.enabled` to default to false. This will reduce the number of rows generated in SQL Statistics.

110217: kvadmission: introduce byte-limit for flow control dispatch r=sumeerbhola a=aadityasondhi

This change introduces a byte-limit for dispatch messages that piggyback onto RaftTransport messages to avoid sending a message that is too large.

We do this by introducing a new cluster setting that can set the maximum number of bytes of dispatch messages we annotate onto a single RaftTransport message.

Informs #104154.

Release note: None

110504: c2c: prevent slow quiesce failure in TestDataDriven/initial_scan_span_configs r=stevendanna a=msbutler

This small patch prevents a slow quiesce failure during test shutdown by manually closing the replication stream before test shutdown. #110350 observed this failure because the eventStream's underlying rangefeed would not close, implying there may still be some sort of deadlock within Rangefeed.Close(). This patch merely deflakes the TestDataDriven/initial_scan_span_configs test.

Informs #110350

Release note: None

Co-authored-by: j82w <[email protected]>
Co-authored-by: Aaditya Sondhi <[email protected]>
Co-authored-by: Michael Butler <[email protected]>
@aliher1911
Contributor

I don't think scheduler shutdown has anything to do with this. Scheduler workers are all idle, but there's no indication that we are waiting for them.
Line github.com/cockroachdb/cockroach/pkg/kv/kvserver/rangefeed/scheduler.go:172 indicates that the stopper didn't signal quiesce and we didn't try to stop anything there. The shutdown logic for the scheduler itself is pretty dumb: once the quiesce channel is signalled, it sets a bool and wakes up all workers. Workers terminate when woken up, or when they finish their current unit of work before going to sleep.
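
To make that concrete, a sketch of the described worker loop under assumed names (quiescing, worker; not the actual scheduler implementation):

package schedsketch

import "sync"

type scheduler struct {
	mu        sync.Mutex
	cond      *sync.Cond
	work      []func()
	quiescing bool
}

func newScheduler() *scheduler {
	s := &scheduler{}
	s.cond = sync.NewCond(&s.mu)
	return s
}

func (s *scheduler) worker() {
	s.mu.Lock()
	defer s.mu.Unlock()
	for {
		// Sleep only while there is no work and we are not quiescing.
		for len(s.work) == 0 && !s.quiescing {
			s.cond.Wait()
		}
		if s.quiescing {
			return // woken by quiesce: terminate
		}
		job := s.work[0]
		s.work = s.work[1:]
		s.mu.Unlock()
		job() // finish the current unit of work outside the lock
		s.mu.Lock()
	}
}

// quiesce sets the bool and wakes all workers, per the description above.
func (s *scheduler) quiesce() {
	s.mu.Lock()
	s.quiescing = true
	s.mu.Unlock()
	s.cond.Broadcast()
}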

miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Sep 19, 2023
Rangefeed Start may fail if the attempt to start the async
task (the rangefeed) fails due to server shutdown.
If that happens, a Close call would block indefinitely,
waiting for a rangefeed task that was never started
to terminate.

Fixes cockroachdb#110350

Release note: None
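
The general shape of that fix, as a minimal self-contained sketch (field names like started and done are hypothetical; see the PR for the actual change):

package rangefeedsketch

import (
	"context"
	"sync"
)

type rangeFeed struct {
	cancel    context.CancelFunc
	done      chan struct{}
	started   bool
	closeOnce sync.Once
}

// start records whether the async task actually launched.
func (f *rangeFeed) start(ctx context.Context, runAsync func(task func()) error) error {
	ctx, f.cancel = context.WithCancel(ctx)
	f.done = make(chan struct{})
	err := runAsync(func() {
		defer close(f.done)
		<-ctx.Done() // stand-in for the real rangefeed work
	})
	if err != nil {
		f.cancel() // task never started; close must not wait for it
		return err
	}
	f.started = true
	return nil
}

// close is safe even when start failed: it waits on done only if the
// task was actually started.
func (f *rangeFeed) close() {
	f.closeOnce.Do(func() {
		if f.cancel != nil {
			f.cancel()
		}
		if f.started {
			<-f.done
		}
	})
}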
@erikgrinaker erikgrinaker assigned miretskiy and unassigned aliher1911 Sep 20, 2023
craig bot pushed a commit that referenced this issue Sep 21, 2023
110942: rangefeed: Ensure Close is safe even if Start failed r=miretskiy a=miretskiy

Rangefeed Start may fail if the attempt to start the async task (the rangefeed) fails due to server shutdown. If that happens, a Close call would block indefinitely, waiting for a rangefeed task that was never started to terminate.

Fixes #110350

Release note: None

Co-authored-by: Yevgeniy Miretskiy <[email protected]>
@craig craig bot closed this as completed in ae8106e Sep 21, 2023
@nvanbenschoten nvanbenschoten added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Sep 21, 2023
@exalate-issue-sync exalate-issue-sync bot removed the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Sep 21, 2023
@nvanbenschoten nvanbenschoten added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Sep 25, 2023