sql: TestQueryCache deadlocks under stress #105174
Comments
Fixes cockroachdb#104660. Fixes cockroachdb#105174. This change fixes a race/deadlock condition that can occur when multiple SQLServers are created simultaneously, causing simultaneous writes to an unprotected global variable used to configure a CCL audit logging feature.

Release note (bug fix): This change fixes a race/deadlock condition that can occur when multiple SQLServers are created simultaneously, causing simultaneous writes to an unprotected global variable used to configure a CCL audit logging feature.
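The cockroach source isn't quoted in this thread, but the bug class is easy to illustrate. Below is a minimal self-contained sketch (all names hypothetical, not the actual audit-logging code): several servers starting concurrently write to a package-level variable, and guarding the write with `sync.Once` makes the initialization race-free.

```go
package main

import "sync"

// auditConfig stands in for the unprotected global that each new
// SQLServer wrote to during startup (hypothetical, simplified name).
var auditConfig map[string]bool

var auditConfigOnce sync.Once

// configureAuditLogging sketches the guarded version: sync.Once lets
// concurrent server startups initialize the global exactly once
// instead of racing on it.
func configureAuditLogging(settings map[string]bool) {
	auditConfigOnce.Do(func() {
		auditConfig = settings
	})
}

func main() {
	var wg sync.WaitGroup
	// Simulate several SQLServers starting at the same time.
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			configureAuditLogging(map[string]bool{"audit": true})
		}()
	}
	wg.Wait()
}
```

Run with `go run -race` and the unguarded variant reports a data race, while this version does not.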
I'm having a lot of trouble trying to determine the cause of this. It seems difficult to reproduce, even when running for over 30 minutes. I've tried running the above on commits prior to the introduction of any audit logging logic (I'm not certain the CCL commit tagged in the issue caused the deadlock; I tried to bisect, without much success) and seem to get the same result: sometimes runs complete, sometimes not.
@THardy98 this may not be related to your change/work. Is the QueryCache something that the cluster obs team maintains? If not, maybe this is a problem for the SQL Queries team to pick up.
Not something cluster-obs maintains. Looking at the …
Trying to repro now that the race issue was fixed. |
I couldn't repro this. Will run the test over an extended period on a gceworker with a commit from a couple weeks ago. |
No deadlock after 5 1/2 hours of `./dev test pkg/sql -f TestQueryCache --stress -- --define gotags=bazel,gss,deadlock`
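The `deadlock` gotag in that command swaps the tree's mutexes for a deadlock-detecting implementation. As a sketch of how such a build-tag-gated swap works (this is an illustration, not the actual `pkg/util/syncutil` source; `github.com/sasha-s/go-deadlock` is the library cockroach vendors for this purpose):

```go
//go:build deadlock

// Under `gotags=deadlock`, alias the package's mutex types to
// go-deadlock's, which panic with a full lock-order report when they
// detect a cycle or a lock held past a configurable timeout.
package syncutil

import deadlock "github.com/sasha-s/go-deadlock"

type Mutex = deadlock.Mutex
type RWMutex = deadlock.RWMutex
```

A sibling file with `//go:build !deadlock` would alias the same names to plain `sync.Mutex`/`sync.RWMutex`, so callers are oblivious to which build they got.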
This must have been related to #105227. Closing. |
This test is still skipped: `pkg/sql/plan_opt_test.go`, line 116 (as of commit 60fa6c0).
109378: backupccl: avoid splitting if the split point might be unsafe r=lidorcarmel a=lidorcarmel

Restore may use unsafe keys as split points, which may cause unsafe splits between column families, which may cause SQL to fail when reading the row or, worse, return wrong results. This commit avoids splitting on keys that might be unsafe. See the issue for more info.

Epic: none
Informs: #109483
Release note: None.

109631: sql: serialize under stress and unskip TestQueryCache r=cucaroach a=cucaroach

This test would time out under stress+race because it exploits t.Parallel to run 15 tests concurrently, each of which fires up a TestServer. Fix the timeout by running the tests serially under stress.

Fixes: #105174
Epic: none
Release note: none

109683: spanconfig: skip protected timestamps on non-table data r=arulajmani a=aadityasondhi

Previously, we were placing a protected timestamp using the `EverythingSpan`, which covered the entire keyspace, when targeting a cluster backup. This was non-ideal because not all of the keyspace is used for backup. It is especially problematic for high-churn ranges, such as node liveness and timeseries, which can accumulate lots of MVCC garbage very quickly. Placing a protected timestamp on these ranges, thus preventing MVCC GC from running, can cause badness. This patch introduces a new span that covers the keyspace excluded from backup. When we encounter a span within those bounds, we skip placing a protected timestamp on it.

Fixes: #102338
Release note: None

109720: cluster-ui: break circular import of `rootActions` r=xinhaoz a=xinhaoz

Importing `rootActions` from `reducers.ts` in cluster-ui was causing a circular import, preventing one of the redux fields experiencing the cyclic dependency from having its reducer populated, which omitted that field from the store altogether. Currently, this field is `uiConfig`, which affects features that rely on checking the SQL role of a user, such as displaying the `Reset SQL Stats` button. This commit extracts `rootActions` into its own file to remove the cyclic dependencies.

Fixes: #97996
Release note (bug fix): On CC, the `Reset SQL Stats` button is now visible if the user has the admin role.

Screenshot (production build, 23.1 cluster-ui version): https://github.com/cockroachdb/cockroach/assets/20136951/348a941a-3be5-42ad-8b16-47cbf48f3f19

109734: bulk,ccl: refactor tracing aggregator integration r=stevendanna a=adityamaru

The motivation for this change was to fix the leaked goroutine in #109658, but it led to a larger cleanup, as explained below.

In #108458 and other PRs we taught the processors of backup, restore, and c2c to periodically send metas containing the TracingAggregatorEvents, for the coordinator to then process and persist for improved observability. In that logic we were defensive against a scenario in which the processor received a root context with no tracing span, in which case we never initialized the tracing aggregator on the processor. This should never be possible in production code, but we want to avoid failing the replication job if it were to happen. As part of this defense we also returned `nil` instead of the meta in `Next()`. What we didn't realize was that returning a nil row and nil meta indicates that the consumer has been completely drained, and causes the processor to pre-emptively drain and shut down.

With this change, if the root context does not have a tracing span and we are unable to init a tracing aggregator, we simply do not allow the timer controlling the return of the TracingAggregatorEvents to fire. By doing this, we avoid a situation where a nil row and nil meta are returned too early.

As part of this cleanup, this change also simplifies the tracing aggregator. The aggregator is no longer responsible for creating a child span and registering a listener. Instead, it simply returns an object that can be registered as an event listener by the caller. The former approach was necessary when we wanted to decorate the span with LazyTags, but nobody uses LazyTags and we have a much better workflow for consuming aggregator events with the `Advanced Debugging` job details page. This simplification was motivated by the fact that `processorbase.StartInternal` now takes event listeners that can be registered with the tracing span of the context that hangs off the processor and is accessible via the `processor.Ctx()` method.

Fixes: #109658
Release note: None

Co-authored-by: Lidor Carmel <[email protected]>
Co-authored-by: Tommy Reilly <[email protected]>
Co-authored-by: Aaditya Sondhi <[email protected]>
Co-authored-by: Xin Hao Zhang <[email protected]>
Co-authored-by: adityamaru <[email protected]>
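The TestQueryCache fix in #109631 boils down to gating `t.Parallel()` on whether the run is a stress run. The real test uses cockroach's own test utilities; as a self-contained sketch of the pattern (the `COCKROACH_STRESS` environment check is an assumption about how stress runs are flagged, and the subtest names are invented):

```go
package querycache_test

import (
	"os"
	"testing"
)

// underStress reports whether this invocation is a stress run.
// Assumption: stress runs export COCKROACH_STRESS; the actual code
// consults a helper in cockroach's testutils instead of the env var.
func underStress() bool {
	return os.Getenv("COCKROACH_STRESS") != ""
}

// maybeParallel opts a subtest into parallel execution only when not
// under stress, so the ~15 subtests (each starting a test server) run
// one at a time there and the suite no longer times out.
func maybeParallel(t *testing.T) {
	if !underStress() {
		t.Parallel()
	}
}

func TestQueryCacheSketch(t *testing.T) {
	for _, name := range []string{"hits", "misses", "invalidation"} {
		name := name // capture loop variable for the parallel closure
		t.Run(name, func(t *testing.T) {
			maybeParallel(t)
			_ = name // each subtest would start a server and exercise the cache
		})
	}
}
```

The design tradeoff: parallel subtests keep normal CI fast, while serializing under stress trades wall-clock time for a bounded number of concurrently running test servers.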
Describe the problem

Running

dev test //pkg/sql -f TestQueryCache --stress

deadlocks. This may be related to #104660.

To Reproduce

Run the command above.

Expected behavior

The test should not deadlock.
Jira issue: CRDB-28910
Epic: CRDB-28893