c2c: slow quiesce failure TestDataDriven/initial_scan_spanconfigs : rangefeed client hangs on close #110350

Closed
msbutler opened this issue Sep 11, 2023 · 20 comments · Fixed by #110942
Labels
A-kv-replication Relating to Raft, consensus, and coordination. branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.

Comments

@msbutler
Collaborator

msbutler commented Sep 11, 2023

See CI failure here. I think the relevant goroutine points to a slow rangefeed.Close() in the c2c eventStream.

goroutine 40449 [chan receive, 14 minutes]:
github.com/cockroachdb/cockroach/pkg/kv/kvclient/rangefeed.(*RangeFeed).Close.func1()
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/rangefeed/rangefeed.go:258 +0x3c
sync.(*Once).doSlow(0x0?, 0x0?)
	GOROOT/src/sync/once.go:74 +0xc2
sync.(*Once).Do(...)
	GOROOT/src/sync/once.go:65
github.com/cockroachdb/cockroach/pkg/kv/kvclient/rangefeed.(*RangeFeed).Close(0xc00b68c220?)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/rangefeed/rangefeed.go:256 +0x47
github.com/cockroachdb/cockroach/pkg/ccl/streamingccl/streamproducer.(*eventStream).Close(0xc00b401c00, {0x7a393b0, 0xc0099c34a0})
	github.com/cockroachdb/cockroach/pkg/ccl/streamingccl/streamproducer/event_stream.go:234 +0x4d
github.com/cockroachdb/cockroach/pkg/sql/rowexec.(*projectSetProcessor).close(0xc0060c5200)
	github.com/cockroachdb/cockroach/pkg/sql/rowexec/project_set.go:345 +0x9a
github.com/cockroachdb/cockroach/pkg/sql/rowexec.newProjectSetProcessor.func1()
	github.com/cockroachdb/cockroach/pkg/sql/rowexec/project_set.go:105 +0x1d
github.com/cockroachdb/cockroach/pkg/sql/execinfra.(*ProcessorBaseNoHelper).moveToTrailingMeta(0xc0060c5200)
	github.com/cockroachdb/cockroach/pkg/sql/execinfra/processorsbase.go:689 +0x3b8
github.com/cockroachdb/cockroach/pkg/sql/execinfra.(*ProcessorBaseNoHelper).DrainHelper(0xc0060c5200)
	github.com/cockroachdb/cockroach/pkg/sql/execinfra/processorsbase.go:570 +0x196
github.com/cockroachdb/cockroach/pkg/sql/rowexec.(*projectSetProcessor).Next(0xc0060c5200)
	github.com/cockroachdb/cockroach/pkg/sql/rowexec/project_set.go:328 +0x1f4
github.com/cockroachdb/cockroach/pkg/sql/execinfra.(*ProcessorBaseNoHelper).DrainHelper(0xc007076000)
	github.com/cockroachdb/cockroach/pkg/sql/execinfra/processorsbase.go:566 +0x12e
github.com/cockroachdb/cockroach/pkg/sql/colexec.(*Columnarizer).DrainMeta(0xc007076000)
	github.com/cockroachdb/cockroach/pkg/sql/colexec/columnarizer.go:285 +0x211
github.com/cockroachdb/cockroach/pkg/sql/colexec.(*invariantsChecker).DrainMeta(0xc00c3ea800)
	github.com/cockroachdb/cockroach/pkg/sql/colexec/invariants_checker.go:119 +0x3c
github.com/cockroachdb/cockroach/pkg/sql/colexecop.MetadataSources.DrainMeta.func1()
	github.com/cockroachdb/cockroach/pkg/sql/colexecop/operator.go:464 +0x6b
github.com/cockroachdb/cockroach/pkg/sql/colexecerror.CatchVectorizedRuntimeError(0xc011feada0?)
	github.com/cockroachdb/cockroach/pkg/sql/colexecerror/error.go:92 +0x62
github.com/cockroachdb/cockroach/pkg/sql/colexecop.MetadataSources.DrainMeta({0xc00ee7c140?, 0x7a78800?, 0xc00d59fd40?})
	github.com/cockroachdb/cockroach/pkg/sql/colexecop/operator.go:462 +0x8a
github.com/cockroachdb/cockroach/pkg/sql/colflow.(*BatchFlowCoordinator).Run(0xc00eee0fa0, {0x7a39308?, 0xc009f94780?})
...

Jira issue: CRDB-31392

@msbutler msbutler added C-test-failure Broken test (automatically or manually discovered). branch-master Failures and bugs on the master branch. T-disaster-recovery labels Sep 11, 2023
@blathers-crl

blathers-crl bot commented Sep 11, 2023

cc @cockroachdb/disaster-recovery

@msbutler
Collaborator Author

@miretskiy do you know of any recent rangefeed work that could have caused this?

@msbutler
Collaborator Author

Aha, I thought of one theory (without evidence) for why this could be occurring: this unit test doesn't manually shut down the replication stream, so maybe some context-cancellation wires are getting crossed somewhere.

@msbutler msbutler self-assigned this Sep 12, 2023
msbutler added a commit to msbutler/cockroach that referenced this issue Sep 12, 2023
…_configs

This small patch prevents a slow quiesce failure during test shutdown by
manually closing the replication stream before test shutdown. cockroachdb#110350 observed
this failure because the eventStream's underlying rangefeed would not close,
implying there may still be some sort of deadlock within Rangefeed.Close().
This patch merely deflakes the TestDataDriven/initial_scan_span_configs test.

Informs cockroachdb#110350

Release note: None
@msbutler
Collaborator Author

I can repro this failure via:

git checkout 746388795af76a43cc21d072d06bc646b8179374
./dev test pkg/ccl/streamingccl/streamingest -f TestDataDriven/initial_scan_span_configs --stress --timeout 120s

@msbutler
Collaborator Author

Opened #110504 to deflake this test, but there may still be some deadlock problem in rangefeed.Close().

@miretskiy
Contributor

@miretskiy do you know of any recent rangefeed work that could have caused this?

There is plenty, but not where you're seeing this. It's the rangefeed client library (not to be confused with the "dist sender rangefeed") that seems stuck.

@miretskiy
Contributor

Did this happen after mux was turned on?

@msbutler
Collaborator Author

I see a mux rangefeed goroutine in the dump... so, yes?

goroutine 29278 [semacquire, 2 minutes]:
sync.runtime_Semacquire(0xc00335f860?)
	GOROOT/src/runtime/sema.go:62 +0x27
sync.(*WaitGroup).Wait(0xb470301?)
	GOROOT/src/sync/waitgroup.go:116 +0x4b
golang.org/x/sync/errgroup.(*Group).Wait(0xc00bf42700)
	golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:53 +0x27
github.com/cockroachdb/cockroach/pkg/util/ctxgroup.Group.Wait({0xc00bf42700?, {0x7a39308?, 0xc00fa9dae0?}})
	github.com/cockroachdb/cockroach/pkg/util/ctxgroup/ctxgroup.go:144 +0x4a
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.muxRangeFeed({0x7a393b0, 0xc00103d7d0}, {0x1, 0x1, 0x1, 0x0, {0x0, 0x0, 0x0}}, {0xc00bf42680, ...}, ...)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender_mux_rangefeed.go:102 +0x490
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).RangeFeedSpans(0xc00f33d600, {0x7a39308, 0xc00fa9d810}, {0xc00bf42680?, 0x1, 0x1}, 0xc0099b7bc0, {0xc00bf42300, 0x3, 0x4})
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go:234 +0x490
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).RangeFeed(0x582ea80?, {0x7a39308, 0xc00fa9d810}, {0xc00b737080, 0x1, 0x2?}, {0xd04b5c?, 0x0?, 0x0?}, 0xc0099b7bc0, ...)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go:183 +0x29a
github.com/cockroachdb/cockroach/pkg/kv/kvclient/rangefeed.(*dbAdapter).RangeFeed(0x1e3b040?, {0x7a39308?, 0xc00fa9d810?}, {0xc00b737080?, 0xc009a47fd0?, 0x1269c26?}, {0x7a393b0?, 0xb7371a0?, 0xc0?}, 0xc0099b7bc0, ...)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/rangefeed/db_adapter.go:79 +0x65
github.com/cockroachdb/cockroach/pkg/kv/kvclient/rangefeed.(*RangeFeed).run.func1({0x7a39308?, 0xc00fa9d810?})
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/rangefeed/rangefeed.go:314 +0xb2
github.com/cockroachdb/cockroach/pkg/util/ctxgroup.Group.GoCtx.func1()
	github.com/cockroachdb/cockroach/pkg/util/ctxgroup/ctxgroup.go:168 +0x25
golang.org/x/sync/errgroup.(*Group).Go.func1()
	golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:75 +0x64
created by golang.org/x/sync/errgroup.(*Group).Go
	golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:72 +0xa5

@miretskiy
Contributor

Just guessing here:

	if err := f.stopper.RunAsyncTask(ctx, "rangefeed", runWithFrontier); err != nil {
		f.cancel()
		return err
	}

(That's in rangefeed.go... Shouldn't we use WithCancelOnQuiesce here?)

@dt dt added the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Sep 13, 2023
@dt
Member

dt commented Sep 13, 2023

@miretskiy we do, don't we? One line above:

ctx, f.cancel = f.stopper.WithCancelOnQuiesce(ctx)
if err := f.stopper.RunAsyncTask(ctx, "rangefeed", runWithFrontier); err != nil {
	f.cancel()
	return err
}

@msbutler
Collaborator Author

@dt writing something up now and plan to hand off to kv and the rangefeed scalability team

@msbutler
Collaborator Author

msbutler commented Sep 13, 2023

Handing this off to kv and posting in the rangefeed-scalability channel about this.

To summarize: when the test ends, there's an active c2c job running. When the test runner shuts down a test server, the test occasionally fails because of a slow quiesce message. To repro the slowness (without waiting for the quiesce timeout), run this.

The goroutine dump points to rangefeed.Close hanging here:

Looking a little further into the rangefeed client library, the goroutine dump indicates we're waiting in rangefeed.Close() for this stopper channel:

That stopper channel closes when rangefeed.run() returns, but the dump indicates that run() has not returned, as it's waiting on the processEvents goroutine to finish:

err := ctxgroup.GoAndWait(ctx, rangeFeedTask, processEventsTask)
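
For context, here is the shape of that hang as a minimal sketch (runDone and processEventsDone are stand-in names, not the actual rangefeed internals):

package main

import (
	"fmt"
	"time"
)

func main() {
	runDone := make(chan struct{})
	processEventsDone := make(chan struct{}) // never closed, like the stuck processEvents goroutine

	// run(): closes runDone only after processEvents finishes.
	go func() {
		defer close(runDone)
		<-processEventsDone
	}()

	// Close(): blocks on runDone, mirroring goroutine 40449 above.
	select {
	case <-runDone:
		fmt.Println("closed cleanly")
	case <-time.After(time.Second):
		fmt.Println("Close hangs: run() never returned")
	}
}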

An example stack in the dump:

goroutine 428 [sync.Cond.Wait, 1 minutes]:
sync.runtime_notifyListWait(0xc006065390, 0x1)
  GOROOT/src/runtime/sema.go:527 +0x14c
sync.(*Cond).Wait(0xc000c6c780?)
  GOROOT/src/sync/cond.go:70 +0x8c
github.com/cockroachdb/cockroach/pkg/kv/kvserver/rangefeed.(*Scheduler).processEvents(0xc0063e20c0, {0x7a4e820, 0xc0010f98c0})
  github.com/cockroachdb/cockroach/pkg/kv/kvserver/rangefeed/scheduler.go:295 +0x4a6
github.com/cockroachdb/cockroach/pkg/kv/kvserver/rangefeed.(*Scheduler).Start.func1({0x7a4e820, 0xc0010f98c0})
  github.com/cockroachdb/cockroach/pkg/kv/kvserver/rangefeed/scheduler.go:162 +0x11f
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2()
  github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:484 +0x146
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx
  github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:475 +0x43b

Figuring out why processEvents is hanging is beyond my pay grade, so I'm handing over to kv.

Some things to note:

  • I don't think this is a mux rangefeed problem. If I set changefeed.mux_rangefeed.enabled to false during test cluster startup, the test still hangs. See test modification here.
  • If the c2c job shuts down cleanly via job completion, I don't see this hanging behavior. I may deflake this test by cleanly shutting down the job; obviously, that patch would not fix the underlying bug here.

@msbutler msbutler removed their assignment Sep 13, 2023
@msbutler msbutler changed the title c2c: slow quiesce failure TestDataDriven/initial_scan_spanconfigs c2c: slow quiesce failure TestDataDriven/initial_scan_spanconfigs : rangefeed client hangs on close Sep 13, 2023
@dt
Member

dt commented Sep 13, 2023

sync.Cond.Wait

Is this just the usual "condition variables don't compose with context cancellation which is why we don't use them" issue?

@miretskiy
Contributor

sync.Cond.Wait

Is this just the usual "condition variables don't compose with context cancellation which is why we don't use them" issue?

Worth checking -- esp. since this is new code; but I'm not sure that's the case. It's just a standard producer/consumer
pattern where the consumer (processEvents) is getting signaled when the processor stops.
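
To illustrate dt's point concretely, a minimal sketch (not the actual scheduler code): a goroutine parked in sync.Cond.Wait is invisible to context cancellation, so shutdown has to set a flag and Broadcast explicitly.

package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu sync.Mutex
	cond := sync.NewCond(&mu)
	stopped := false

	woke := make(chan struct{})
	go func() {
		mu.Lock()
		for !stopped {
			cond.Wait() // a cancelled ctx is invisible here
		}
		mu.Unlock()
		close(woke)
	}()

	_, cancel := context.WithCancel(context.Background())
	cancel() // cancellation alone never wakes the waiter

	select {
	case <-woke:
	case <-time.After(100 * time.Millisecond):
		fmt.Println("still parked in cond.Wait despite cancellation")
	}

	// Shutdown must set the flag and broadcast explicitly.
	mu.Lock()
	stopped = true
	mu.Unlock()
	cond.Broadcast()
	<-woke
	fmt.Println("woke after Broadcast")
}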

@kvoli kvoli added A-kv-replication Relating to Raft, consensus, and coordination. T-kv-replication and removed T-kv KV Team labels Sep 14, 2023
@blathers-crl

blathers-crl bot commented Sep 14, 2023

cc @cockroachdb/replication

@erikgrinaker
Contributor

Why is this using the rangefeed scheduler at all? It's disabled by default, and I don't believe we've enabled any metamorphism. Did y'all explicitly enable it or something?

@msbutler
Collaborator Author

I don't believe we explicitly enabled this. Here's where we set all of our cluster settings.

@erikgrinaker
Contributor

Ah, I guess we start the scheduler regardless, but don't actually use it for any rangefeeds unless the setting is enabled. So this is just failing to shut down an idle scheduler. I guess this is all the motivation we need for #110634 @aliher1911.

@msbutler
Collaborator Author

A heads up that I've bors'd a PR that will deflake this test. You can still repro the bug on a checked-out branch.

#110350 (comment)

craig bot pushed a commit that referenced this issue Sep 14, 2023
107788: sql: use 0 for gateway node in statistics table to reduce row count r=j82w a=j82w

The SQL Activity pages already use a query that does a group-by to remove any node-specific information. So storing the gateway node in the system table causes the table to grow by a factor of the number of nodes; the SQL Activity query then has to merge all the data back down to a single row, and the cleanup job has to delete all of those rows.

This PR changes the default value of the `sql.metrics.statement_details.gateway_node.enabled` cluster setting to false. The setting was originally added to help customers with large workloads, because too many rows were being generated. Switching it to false by default lets all customers take advantage of the reduced row generation. It is still a cluster setting, so customers who want per-node statistics can turn it back on.

Closes: #107787

Release note (sql change): Change the cluster setting `sql.metrics.statement_details.gateway_node.enabled` to default to false. This will reduce the number of rows generated in SQL Statistics.

110217: kvadmission: introduce byte-limit for flow control dispatch r=sumeerbhola a=aadityasondhi

This change introduces a byte-limit for dispatch messages that piggyback onto RaftTransport messages to avoid sending a message that is too large.

We do this by introducing a new cluster setting that can set the maximum number of bytes of dispatch messages we annotate onto a single RaftTransport message.

Informs #104154.

Release note: None

110504: c2c: prevent slow quiesce failure in TestDataDriven/initial_scan_span_configs r=stevendanna a=msbutler

This small patch prevents a slow quiesce failure during test shutdown by manually closing the replication stream before test shutdown. #110350 observed this failure because the eventStream's underlying rangefeed would not close, implying there may still be some sort of deadlock within Rangefeed.Close(). This patch merely deflakes the TestDataDriven/initial_scan_span_configs test.

Informs #110350

Release note: None

Co-authored-by: j82w <[email protected]>
Co-authored-by: Aaditya Sondhi <[email protected]>
Co-authored-by: Michael Butler <[email protected]>
@aliher1911
Contributor

I don't think scheduler shutdown has anything to do with this. Scheduler workers are all idle, but there's no indication that we are waiting for them.
Line github.com/cockroachdb/cockroach/pkg/kv/kvserver/rangefeed/scheduler.go:172 indicates that the stopper didn't signal quiesce and we didn't try to stop anything there. The shutdown logic for the scheduler itself is pretty dumb: once the quiesce channel is signalled, it sets a bool and wakes up all workers. Workers terminate when woken up, or when they finish their current unit of work before going to sleep.
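
To make that concrete, a sketch of the described worker loop under assumed names (quiescing, worker; not the actual scheduler implementation):

package schedsketch

import "sync"

type scheduler struct {
	mu        sync.Mutex
	cond      *sync.Cond
	work      []func()
	quiescing bool
}

func newScheduler() *scheduler {
	s := &scheduler{}
	s.cond = sync.NewCond(&s.mu)
	return s
}

func (s *scheduler) worker() {
	s.mu.Lock()
	defer s.mu.Unlock()
	for {
		// Sleep only while there is no work and we are not quiescing.
		for len(s.work) == 0 && !s.quiescing {
			s.cond.Wait()
		}
		if s.quiescing {
			return // woken by quiesce: terminate
		}
		job := s.work[0]
		s.work = s.work[1:]
		s.mu.Unlock()
		job() // finish the current unit of work outside the lock
		s.mu.Lock()
	}
}

// quiesce sets the bool and wakes all workers, per the description above.
func (s *scheduler) quiesce() {
	s.mu.Lock()
	s.quiescing = true
	s.mu.Unlock()
	s.cond.Broadcast()
}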

miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Sep 19, 2023
Rangefeed Start may fail if the attempt to start the async
task (the rangefeed) fails due to server shutdown.
If that happens, a Close call would block indefinitely,
waiting for a rangefeed task that was never started
to terminate.

Fixes cockroachdb#110350

Release note: None
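
The general shape of that fix, as a minimal self-contained sketch (field names like started and done are hypothetical; see the PR for the actual change):

package rangefeedsketch

import (
	"context"
	"sync"
)

type rangeFeed struct {
	cancel    context.CancelFunc
	done      chan struct{}
	started   bool
	closeOnce sync.Once
}

// start records whether the async task actually launched.
func (f *rangeFeed) start(ctx context.Context, runAsync func(task func()) error) error {
	ctx, f.cancel = context.WithCancel(ctx)
	f.done = make(chan struct{})
	err := runAsync(func() {
		defer close(f.done)
		<-ctx.Done() // stand-in for the real rangefeed work
	})
	if err != nil {
		f.cancel() // task never started; close must not wait for it
		return err
	}
	f.started = true
	return nil
}

// close is safe even when start failed: it waits on done only if the
// task was actually started.
func (f *rangeFeed) close() {
	f.closeOnce.Do(func() {
		if f.cancel != nil {
			f.cancel()
		}
		if f.started {
			<-f.done
		}
	})
}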
@erikgrinaker erikgrinaker assigned miretskiy and unassigned aliher1911 Sep 20, 2023
craig bot pushed a commit that referenced this issue Sep 21, 2023
110942: rangefeed: Ensure Close is safe even if Start failed r=miretskiy a=miretskiy

Rangefeed Start may fail if the attempt to start the async task (the rangefeed) fails due to server shutdown. If that happens, a Close call would block indefinitely, waiting for a rangefeed task that was never started to terminate.

Fixes #110350

Release note: None

Co-authored-by: Yevgeniy Miretskiy <[email protected]>
@craig craig bot closed this as completed in ae8106e Sep 21, 2023
@nvanbenschoten nvanbenschoten added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Sep 21, 2023
@exalate-issue-sync exalate-issue-sync bot removed the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Sep 21, 2023
@nvanbenschoten nvanbenschoten added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Sep 25, 2023