sql: TestQueryCache deadlocks under stress #105174

knz · 2023-06-20T10:35:17Z

Describe the problem

dev test //pkg/sql -f TestQueryCache --stress deadlocks.

This may be related to #104660.

To Reproduce

The test should not deadlock.

The text was updated successfully, but these errors were encountered:

105175: sql: skip TestQueryCache r=herkolategan a=knz Informs #105174. Informs #104660. Informs #105100. Co-authored-by: Raphael 'kena' Poss <[email protected]>

Fixes cockroachdb#104660. Fixes cockroachdb#105174. This change fixes a race/deadlock condition that can occur when multiple SQLServers are created simulatenously, causing simultaneous write to an unprotected global variable used to configure a CCL audit logging feature. Release note (bug fix): This change fixes a race/deadlock condition that can occur when multiple SQLServers are created simulatenously, causing simultaneous writes to an unprotected global variable used to configure a CCL audit logging feature.

THardy98 · 2023-06-23T20:42:11Z

I'm having a lot of trouble trying to determine the cause of this.

It seems difficult to reproduce, running ./dev test pkg/sql -f TestQueryCache --stress -- --define gotags=bazel,gss,deadlock locally and on gceworker doesn't reproduce reliably (I sometimes add --stress-args='-p=4' to not blow up my/gceworker machine, not sure if that has a huge impact on reproducability).

It often produces:
0 runs so far, 0 failures, over <'x' amount of time>
for the entire run, but very rarely an actual deadlock detection:
POTENTIAL DEADLOCK: ...

even when running for over 30mins.

I've tried running the above on commits prior to the introduction of any audit logging logic (I'm not certain if the CCL commit tagged in the issue is the cause of the deadlock, I wanted to bisect without much success) and seem to get the same result (sometimes runs will complete, sometimes not).

knz · 2023-06-28T12:25:37Z

@THardy98 this may not be something related to your change/work.

Is the QueryCache something that the cluster obs team maintains? If not maybe this is a problem for the SQL queries team to pick up.

THardy98 · 2023-06-28T14:54:58Z

Not something cluster-obs maintains.

Looking at the git blame it does look like SQL queries.

cucaroach · 2023-06-28T17:46:53Z

Trying to repro now that the race issue was fixed.

msirek · 2023-07-07T19:42:12Z

I couldn't repro this. Will run the test over an extended period on a gceworker with a commit from a couple weeks ago.

msirek · 2023-07-08T01:11:27Z

No deadlock after 5 1/2 hours

./dev test pkg/sql -f TestQueryCache --stress -- --define gotags=bazel,gss,deadlock

...
454165 runs so far, 0 failures, over 5h29m10s
454280 runs so far, 0 failures, over 5h29m15s
454397 runs so far, 0 failures, over 5h29m20s
454512 runs so far, 0 failures, over 5h29m25s
454630 runs so far, 0 failures, over 5h29m30s
454747 runs so far, 0 failures, over 5h29m35s
454860 runs so far, 0 failures, over 5h29m40s
[2,757 / 2,758] Testing //pkg/sql:sql_test; 19780s linux-sandbox, remote-cache

msirek@gceworker-msirek:~/go/src/github.com/cockroachdb/cockroach$ git log -1
commit ade0dfab8882043a4178966f5fb9b7b94b3e86c3 (HEAD)
Author: Ricky Stewart <[email protected]>
Date:   Thu Jun 22 11:53:22 2023 -0500

    bazel: fix builds under `--force_build_cdeps`
    
    This codepath isn't exercised in CI, so we didn't notice it was broken
    with the Bazel 6.2.1 upgrade.
    
    Epic: none
    Release note: None

cucaroach · 2023-07-10T16:18:11Z

This must have been related to #105227. Closing.

rafiss · 2023-08-18T16:59:50Z

This test is still skipped:

cockroach/pkg/sql/plan_opt_test.go

Line 116 in 60fa6c0

skip.WithIssue(t, 105174, "flaky test")

109378: backupccl: avoid splitting if the split point might be unsafe r=lidorcarmel a=lidorcarmel Restore may use unsafe keys as split points, which may cause unsafe splits between column families, which may cause SQL to fail when reading the row, or worse, return wrong resutls. This commit avoids splitting on keys that might be unsafe. See the issue for more info. Epic: none Informs: #109483 Release note: None. 109631: sql: serialize under stress and unskip TestQueryCache r=cucaroach a=cucaroach This test would timeout under stress+race, this is because it exploits t.Parallel to run 15 tests concurrently that each fire up a TestServer. Fix the timeout by running the tests serially under stress. Fixes: #105174 Epic: none Release note: none 109683: spanconfig: skip protected timestamps on non-table data r=arulajmani a=aadityasondhi Previously, we were placing a protected timestamp using the `EverythingSpan` which covered the entire keyspace when targeting a cluster backup. This was non-ideal because not all are used for backup. This is especially problematic for high churn ranges, such as node liveness and timeseries, that can accumulate lots of MVCC garbage very quickly. Placing a protected timestamp on these ranges, thus preventing the MVCC GC to run, can cause badness. This patch introduces a new span that covers the keyspace excluded from backup. When we encounter a span that is within those bounds, we skip placing a protected timestamp on it. Fixes: #102338 Release note: None 109720: cluster-ui: break circular import of `rootActions` r=xinhaoz a=xinhaoz Importing `rootActions` from `reducers.ts` in cluster-ui was causing a circular import, preventing one of the redux fields experiencing the cyclic dependency from having their reducer populated, which omitted that field from the store altogether. Currently, this field is `uiConfig` which is affecting features that rely on checking the sql role of a user, such as displaying the `Reset SQL Stats` button. This commit extracts `rootActions` into its own file to remove the cyclic dependencies created. Fixes: #97996 Release note (bug fix): On CC, `Reset Sql Stats` button is now visible if the user has admin role. Using a production buiild 23.1 cluster-ui version: <img width="1909" alt="image" src="https://github.com/cockroachdb/cockroach/assets/20136951/348a941a-3be5-42ad-8b16-47cbf48f3f19"> 109734: bulk,ccl: refactor tracing aggregator integration r=stevendanna a=adityamaru The motivation for this change was to fix the leaked goroutine in #109658 but it led to a larger cleanup as explained below. In #108458 and other PRs we taught the processors of backup, restore and c2c to periodically send metas containing the TracingAggregatorEvents, for the coordinator to then process and persist for improved observability. In that logic we were defensive against a scenario in which the processor received a root context with no tracing span. In which case we never initialized the tracing aggregator on the processor. This should never be possible in production code, but we want to prevent failing the replication job if this were to happen. As part of this defense we also returned `nil` instead of the meta in `Next()`. What We didn't realize was that return a nil row and nil meta indicates that the consumer has been completely drained and causes the processor to pre-emptively drain and shutdown. With this change, if the root context does not have a tracing span and we are unable to init a tracing aggregator, we simply do not allow the timer controlling the return of the TracingAggergatorEvents to fire. By doing this, we avoid a situation where a nil row and nil meta are returned too early. As part of this cleanup this change also simplifies the tracing aggregator. The aggregator is no longer responsible for creating a child span and registering a listener. Instead it simply returns an object that can be registered as an event listener by the caller. The former was an approach that was necessary when we wanted to decorate the span with LazyTags, but nobody uses LazyTags and we have a much better workflow to consume aggregator events with the `Advanced Debugging` job details page. This simplification was motivated by the fact that `processorbase.StartInternal` now takes event listeners that can be registered with the tracing span of the context that hangs of the processor, and is accessible via the `processor.Ctx()` method. Fixes: #109658 Release note: None Co-authored-by: Lidor Carmel <[email protected]> Co-authored-by: Tommy Reilly <[email protected]> Co-authored-by: Aaditya Sondhi <[email protected]> Co-authored-by: Xin Hao Zhang <[email protected]> Co-authored-by: adityamaru <[email protected]>

knz added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). labels Jun 20, 2023

blathers-crl bot added the T-cluster-observability label Jun 20, 2023

knz mentioned this issue Jun 20, 2023

sql: skip TestQueryCache #105175

Merged

craig bot pushed a commit that referenced this issue Jun 20, 2023

Merge #105175

0fef4be

105175: sql: skip TestQueryCache r=herkolategan a=knz Informs #105174. Informs #104660. Informs #105100. Co-authored-by: Raphael 'kena' Poss <[email protected]>

THardy98 self-assigned this Jun 20, 2023

knz mentioned this issue Jun 21, 2023

sql: fix race condition on audit logging CCL hook #105227

Merged

rafiss added the skipped-test label Jun 25, 2023

knz added T-sql-queries SQL Queries Team and removed T-cluster-observability labels Jun 28, 2023

knz unassigned THardy98 Jun 28, 2023

cucaroach self-assigned this Jun 28, 2023

cucaroach closed this as completed Jul 10, 2023

mgartner added this to SQL Queries Jul 24, 2023

mgartner moved this to Done in SQL Queries Jul 24, 2023

rafiss reopened this Aug 18, 2023

github-project-automation bot moved this from Done to Triage in SQL Queries Aug 18, 2023

cucaroach mentioned this issue Aug 28, 2023

sql: serialize under stress and unskip TestQueryCache #109631

Merged

craig bot closed this as completed in 1cf1edc Aug 31, 2023

github-project-automation bot moved this from Triage to Done in SQL Queries Aug 31, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sql: TestQueryCache deadlocks under stress #105174

sql: TestQueryCache deadlocks under stress #105174

knz commented Jun 20, 2023 •

edited

Loading

THardy98 commented Jun 23, 2023 •

edited

Loading

knz commented Jun 28, 2023

THardy98 commented Jun 28, 2023

cucaroach commented Jun 28, 2023

msirek commented Jul 7, 2023

msirek commented Jul 8, 2023

cucaroach commented Jul 10, 2023

rafiss commented Aug 18, 2023

sql: TestQueryCache deadlocks under stress #105174

sql: TestQueryCache deadlocks under stress #105174

Comments

knz commented Jun 20, 2023 • edited Loading

THardy98 commented Jun 23, 2023 • edited Loading

knz commented Jun 28, 2023

THardy98 commented Jun 28, 2023

cucaroach commented Jun 28, 2023

msirek commented Jul 7, 2023

msirek commented Jul 8, 2023

cucaroach commented Jul 10, 2023

rafiss commented Aug 18, 2023

knz commented Jun 20, 2023 •

edited

Loading

THardy98 commented Jun 23, 2023 •

edited

Loading