c2c: slow quiesce failure TestDataDriven/initial_scan_spanconfigs: rangefeed client hangs on close #110350
cc @cockroachdb/disaster-recovery
@miretskiy do you know of any recent rangefeed work that could have caused this?
aha, I thought of one theory (without evidence) for why this could be occurring: this unit test doesn't manually shut down the replication stream, so maybe some context cancellation wires are getting crossed somewhere.
…_configs This small patch prevents a slow quiesce failure during test shutdown by manually closing the replication stream before test shutdown. cockroachdb#110350 observed this failure because the eventStream's underlying rangefeed would not close, implying there may still be some sort of deadlock within Rangefeed.Close(). This patch merely deflakes the TestDataDriven/initial_scan_span_configs test. Informs cockroachdb#110350 Release note: None
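For readers following along, here is a rough sketch of the deflake approach that patch describes: close the replication stream explicitly before the test server's deferred shutdown runs. The names below are hypothetical stand-ins, not the actual c2c data-driven test harness.

```go
package c2ctest

import "testing"

// replicationStream stands in for the c2c stream handle the data-driven test
// drives; the real type lives in the stream ingestion test harness.
type replicationStream interface {
	Close() error
}

// closeStreamBeforeShutdown registers an explicit Close that runs before the
// test server's deferred shutdown. Tearing the stream down deliberately means
// the eventStream's rangefeed doesn't have to be reaped by stopper quiesce,
// which is what was timing out.
func closeStreamBeforeShutdown(t *testing.T, s replicationStream) {
	t.Cleanup(func() {
		if err := s.Close(); err != nil {
			t.Errorf("closing replication stream: %v", err)
		}
	})
}
```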
I can repro this failure via
Opened #110504 to deflake this test, but there still may be some deadlock problem in rangefeed.Close().
There is plenty, but not where you're seeing this. It's the rangefeed library (not to be confused with the "dist sender rangefeed") that seems stuck.
Did this happen after mux was turned on?
I see a mux rangefeed goroutine in the dump... so, yes?
Just guessing here:
(that's in rangefeed.go... shouldn't we use WithCancelOnQuiesce here?)
@miretskiy we do, don't we? One line above? cockroach/pkg/kv/kvclient/rangefeed/rangefeed.go, lines 244 to 248 in 00cbb30
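For readers without the code open, a sketch of the pattern being referenced (approximate, not the actual rangefeed.go lines): derive a quiesce-aware context from the stopper before launching the rangefeed task, and release the cancel func if the task never starts.

```go
package rangefeedsketch

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/util/stop"
)

// startWithCancelOnQuiesce sketches the pattern under discussion: the context
// handed to the long-running rangefeed loop is cancelled when the stopper
// quiesces, so server shutdown unblocks the loop. runLoop is a placeholder
// for the actual event-processing code.
func startWithCancelOnQuiesce(
	ctx context.Context, stopper *stop.Stopper, runLoop func(context.Context),
) error {
	ctx, cancel := stopper.WithCancelOnQuiesce(ctx)
	if err := stopper.RunAsyncTask(ctx, "rangefeed-client", func(ctx context.Context) {
		defer cancel()
		runLoop(ctx)
	}); err != nil {
		// The task never started (e.g. the server is already shutting down);
		// release the cancel func so nothing is left waiting on it.
		cancel()
		return err
	}
	return nil
}
```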
@dt writing something up now and plan to hand off to kv and the rangefeed scalability team.
Handing this off to kv and posting in the rangefeed-scalability channel about this. To summarize: when the test ends, there's an active c2c job running. When the test runner shuts down a test server, the test occasionally fails because of a slow quiesce message. To repro the slowness (without waiting for the quiesce timeout), run this. The goroutine dump points to rangefeed.Close hanging here:
Looking a little further in the rangefeed client library, the goroutine dump indicates we're waiting in
That stopper channel closes when
example stack in the dump
Figuring out why processEvents is hanging is beyond my pay grade, so I'm handing over to kv. Some things to note:
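To make the hang concrete, here is a rough consumer-side sketch (not the actual processEvents code): a loop shaped like this should return as soon as its context is cancelled or the stopper quiesces, yet the dump shows the real goroutine parked without observing either signal.

```go
package rangefeedsketch

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/util/stop"
)

// processEventsSketch stands in for the consumer half of the rangefeed
// client's producer/consumer pair. It unblocks on context cancellation or
// stopper quiesce; the goroutine dump suggests the real loop is stuck
// without seeing either.
func processEventsSketch(
	ctx context.Context, stopper *stop.Stopper, events <-chan interface{},
) {
	for {
		select {
		case ev, ok := <-events:
			if !ok {
				return // the producer closed the channel
			}
			_ = ev // handle the event
		case <-ctx.Done():
			return
		case <-stopper.ShouldQuiesce():
			return
		}
	}
}
```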
Is this just the usual "condition variables don't compose with context cancellation, which is why we don't use them" issue?
Worth checking, especially since this is new code, but I'm not sure that's the case; it's just a standard producer/consumer.
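For context on the concern: sync.Cond.Wait cannot be interrupted by context cancellation, while a channel-based handoff can be. A minimal illustration of the two shapes (generic Go, not the rangefeed code):

```go
package rangefeedsketch

import (
	"context"
	"sync"
)

// waitOnCond blocks until ready becomes true. Nothing here observes ctx: if
// the signalling side exits during shutdown without broadcasting, this
// goroutine stays parked forever, which is the failure mode asked about.
// cond is assumed to have been created with sync.NewCond(mu).
func waitOnCond(ctx context.Context, mu *sync.Mutex, cond *sync.Cond, ready *bool) {
	_ = ctx // a condition variable cannot watch the context
	mu.Lock()
	defer mu.Unlock()
	for !*ready {
		cond.Wait()
	}
}

// waitOnChannel composes with cancellation: both the ready signal and
// shutdown are channels, so a stuck producer can't wedge the waiter.
func waitOnChannel(ctx context.Context, ready <-chan struct{}) error {
	select {
	case <-ready:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}
```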
cc @cockroachdb/replication
Why is this using the rangefeed scheduler at all? It's disabled by default, and I don't believe we've enabled any metamorphism. Did y'all explicitly enable it or something?
I don't believe we explicitly enabled this. Here's where we set all of our cluster settings. |
Ah, I guess we start the scheduler regardless, but don't actually use it for any rangefeeds unless the setting is enabled. So this is just failing to shut down an idle scheduler. I guess this is all the motivation we need for #110634 @aliher1911.
A heads up that I've bors'd a PR that will deflake this test. You can still repro the bug on a checked-out branch.
107788: sql: use 0 for gateway node in statistics table to reduce row count r=j82w a=j82w

The SQL Activity pages already use a query that groups results to remove any node-specific information, so storing the gateway node in the system table causes the table to grow by the number of nodes. The SQL Activity query then has to merge all the data down to a single row, and the cleanup job has to delete all of those rows. This PR changes the default value of `sql.metrics.statement_details.gateway_node.enabled` to false. This cluster setting was originally added to help customers with large workloads because too many rows were being generated. By switching it to false by default, all customers can take advantage of the reduced row generation. This is still a cluster setting, so customers who want per-node statistics can change it. Closes: #107787 Release note (sql change): Change the cluster setting `sql.metrics.statement_details.gateway_node.enabled` to default to false. This will reduce the number of rows generated in SQL Statistics.

110217: kvadmission: introduce byte-limit for flow control dispatch r=sumeerbhola a=aadityasondhi

This change introduces a byte limit for dispatch messages that piggyback onto RaftTransport messages, to avoid sending a message that is too large. We do this by introducing a new cluster setting that caps the number of bytes of dispatch messages annotated onto a single RaftTransport message. Informs #104154. Release note: None

110504: c2c: prevent slow quiesce failure in TestDataDriven/initial_scan_span_configs r=stevendanna a=msbutler

This small patch prevents a slow quiesce failure during test shutdown by manually closing the replication stream before test shutdown. #110350 observed this failure because the eventStream's underlying rangefeed would not close, implying there may still be some sort of deadlock within Rangefeed.Close(). This patch merely deflakes the TestDataDriven/initial_scan_span_configs test. Informs #110350 Release note: None

Co-authored-by: j82w <[email protected]>
Co-authored-by: Aaditya Sondhi <[email protected]>
Co-authored-by: Michael Butler <[email protected]>
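Tangential to this issue, but since the bors batch above changes the default of `sql.metrics.statement_details.gateway_node.enabled`: operators who still want per-node statistics can flip it back with a `SET CLUSTER SETTING` statement. A small sketch follows; the connection string and driver choice are assumptions, only the setting name comes from the PR text.

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // any Postgres-wire driver works with CockroachDB
)

func main() {
	// Connection string is a placeholder; adjust for your cluster.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Re-enable per-gateway-node statement statistics if the extra rows
	// are acceptable for your workload (the default is now false).
	if _, err := db.Exec(
		`SET CLUSTER SETTING sql.metrics.statement_details.gateway_node.enabled = true`,
	); err != nil {
		log.Fatal(err)
	}
}
```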
I don't think scheduler shutdown has anything to do with this. Scheduler workers are all idle, but there's no indication that we are waiting for them.
Rangefeed Start may fail if the attempt to start the async task (the rangefeed) fails due to server shutdown. If that happens, the Close call would block indefinitely, waiting for a rangefeed task that was never started to terminate. Fixes cockroachdb#110350 Release note: None
110942: rangefeed: Ensure Close is safe even if Start failed r=miretskiy a=miretskiy

Rangefeed Start may fail if the attempt to start the async task (the rangefeed) fails due to server shutdown. If that happens, the Close call would block indefinitely, waiting for a rangefeed task that was never started to terminate. Fixes #110350 Release note: None

Co-authored-by: Yevgeniy Miretskiy <[email protected]>
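A rough sketch of the failure mode and the shape of the fix described above (stand-in types, not the actual rangefeed client code): if Start fails to launch the async task, Close must not wait on a completion signal that only the task would have delivered.

```go
package rangefeedsketch

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/util/stop"
)

// feedSketch stands in for the rangefeed client handle. started records
// whether the async task actually launched, so close only waits for work
// that exists; waiting unconditionally is what hung when RunAsyncTask failed
// during server shutdown.
type feedSketch struct {
	started bool
	cancel  context.CancelFunc
	done    chan struct{}
}

func (f *feedSketch) start(ctx context.Context, stopper *stop.Stopper) error {
	ctx, f.cancel = context.WithCancel(ctx)
	f.done = make(chan struct{})
	err := stopper.RunAsyncTask(ctx, "rangefeed", func(ctx context.Context) {
		defer close(f.done)
		<-ctx.Done() // placeholder for the event-processing loop
	})
	if err != nil {
		f.cancel()
		return err // the task never ran; f.started stays false
	}
	f.started = true
	return nil
}

func (f *feedSketch) close() {
	if f.cancel != nil {
		f.cancel()
	}
	if !f.started {
		return // nothing to wait for: Start failed or was never called
	}
	<-f.done
}
```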
See CI failure here. I think the relevant goroutine points to a slow rangefeed.Close() in the c2c eventStream.
Jira issue: CRDB-31392