Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: TestQueryCache deadlocks under stress #105174

Closed
knz opened this issue Jun 20, 2023 · 8 comments · Fixed by #109631
Closed

sql: TestQueryCache deadlocks under stress #105174

knz opened this issue Jun 20, 2023 · 8 comments · Fixed by #109631
Assignees
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). skipped-test T-sql-queries SQL Queries Team

Comments

@knz
Copy link
Contributor

knz commented Jun 20, 2023

Describe the problem

dev test //pkg/sql -f TestQueryCache --stress deadlocks.

This may be related to #104660.

To Reproduce

The test should not deadlock.

Jira issue: CRDB-28910
Epic: CRDB-28893

@knz knz added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). labels Jun 20, 2023
craig bot pushed a commit that referenced this issue Jun 20, 2023
105175: sql: skip TestQueryCache r=herkolategan a=knz

Informs #105174.
Informs #104660.
Informs #105100.

Co-authored-by: Raphael 'kena' Poss <[email protected]>
@THardy98 THardy98 self-assigned this Jun 20, 2023
THardy98 pushed a commit to THardy98/cockroach that referenced this issue Jun 21, 2023
Fixes cockroachdb#104660.
Fixes cockroachdb#105174.

This change fixes a race/deadlock condition that can occur when multiple
SQLServers are created simulatenously, causing simultaneous write to an
unprotected global variable used to configure a CCL audit logging
feature.

Release note (bug fix): This change fixes a race/deadlock condition that
can occur when multiple SQLServers are created simulatenously, causing
simultaneous writes to an unprotected global variable used to configure
a CCL audit logging feature.
THardy98 pushed a commit to THardy98/cockroach that referenced this issue Jun 21, 2023
Fixes cockroachdb#104660.
Fixes cockroachdb#105174.

This change fixes a race/deadlock condition that can occur when multiple
SQLServers are created simulatenously, causing simultaneous write to an
unprotected global variable used to configure a CCL audit logging
feature.

Release note (bug fix): This change fixes a race/deadlock condition that
can occur when multiple SQLServers are created simulatenously, causing
simultaneous writes to an unprotected global variable used to configure
a CCL audit logging feature.
@THardy98
Copy link

THardy98 commented Jun 23, 2023

I'm having a lot of trouble trying to determine the cause of this.

It seems difficult to reproduce, running ./dev test pkg/sql -f TestQueryCache --stress -- --define gotags=bazel,gss,deadlock locally and on gceworker doesn't reproduce reliably (I sometimes add --stress-args='-p=4' to not blow up my/gceworker machine, not sure if that has a huge impact on reproducability).

It often produces:
0 runs so far, 0 failures, over <'x' amount of time>
for the entire run, but very rarely an actual deadlock detection:
POTENTIAL DEADLOCK: ...

even when running for over 30mins.

I've tried running the above on commits prior to the introduction of any audit logging logic (I'm not certain if the CCL commit tagged in the issue is the cause of the deadlock, I wanted to bisect without much success) and seem to get the same result (sometimes runs will complete, sometimes not).

@knz
Copy link
Contributor Author

knz commented Jun 28, 2023

@THardy98 this may not be something related to your change/work.

Is the QueryCache something that the cluster obs team maintains? If not maybe this is a problem for the SQL queries team to pick up.

@THardy98
Copy link

Not something cluster-obs maintains.

Looking at the git blame it does look like SQL queries.

@knz knz added T-sql-queries SQL Queries Team and removed T-cluster-observability labels Jun 28, 2023
@cucaroach cucaroach self-assigned this Jun 28, 2023
@cucaroach
Copy link
Contributor

Trying to repro now that the race issue was fixed.

@msirek
Copy link
Contributor

msirek commented Jul 7, 2023

I couldn't repro this. Will run the test over an extended period on a gceworker with a commit from a couple weeks ago.

@msirek
Copy link
Contributor

msirek commented Jul 8, 2023

No deadlock after 5 1/2 hours

./dev test pkg/sql -f TestQueryCache --stress -- --define gotags=bazel,gss,deadlock

...
454165 runs so far, 0 failures, over 5h29m10s
454280 runs so far, 0 failures, over 5h29m15s
454397 runs so far, 0 failures, over 5h29m20s
454512 runs so far, 0 failures, over 5h29m25s
454630 runs so far, 0 failures, over 5h29m30s
454747 runs so far, 0 failures, over 5h29m35s
454860 runs so far, 0 failures, over 5h29m40s
[2,757 / 2,758] Testing //pkg/sql:sql_test; 19780s linux-sandbox, remote-cache

msirek@gceworker-msirek:~/go/src/github.com/cockroachdb/cockroach$ git log -1
commit ade0dfab8882043a4178966f5fb9b7b94b3e86c3 (HEAD)
Author: Ricky Stewart <[email protected]>
Date:   Thu Jun 22 11:53:22 2023 -0500

    bazel: fix builds under `--force_build_cdeps`
    
    This codepath isn't exercised in CI, so we didn't notice it was broken
    with the Bazel 6.2.1 upgrade.
    
    Epic: none
    Release note: None

@cucaroach
Copy link
Contributor

This must have been related to #105227. Closing.

@rafiss
Copy link
Collaborator

rafiss commented Aug 18, 2023

This test is still skipped:

skip.WithIssue(t, 105174, "flaky test")

@rafiss rafiss reopened this Aug 18, 2023
@github-project-automation github-project-automation bot moved this from Done to Triage in SQL Queries Aug 18, 2023
craig bot pushed a commit that referenced this issue Aug 31, 2023
109378: backupccl: avoid splitting if the split point might be unsafe r=lidorcarmel a=lidorcarmel

Restore may use unsafe keys as split points, which may cause unsafe splits
between column families, which may cause SQL to fail when reading the row, or
worse, return wrong resutls.

This commit avoids splitting on keys that might be unsafe.

See the issue for more info.

Epic: none
Informs: #109483

Release note: None.

109631: sql: serialize under stress and unskip TestQueryCache r=cucaroach a=cucaroach

This test would timeout under stress+race, this is because it exploits
t.Parallel to run 15 tests concurrently that each fire up a TestServer.
Fix the timeout by running the tests serially under stress.

Fixes: #105174
Epic: none
Release note: none


109683: spanconfig: skip protected timestamps on non-table data  r=arulajmani a=aadityasondhi

Previously, we were placing a protected timestamp using the
`EverythingSpan` which covered the entire keyspace when targeting a
cluster backup. This was non-ideal because not all are used for backup.
This is especially problematic for high churn ranges, such as node
liveness and timeseries, that can accumulate lots of MVCC garbage very
quickly. Placing a protected timestamp on these ranges, thus preventing
the MVCC GC to run, can cause badness.

This patch introduces a new span that covers the keyspace excluded from
backup. When we encounter a span that is within those bounds, we skip
placing a protected timestamp on it.

Fixes: #102338

Release note: None

109720: cluster-ui: break circular import of `rootActions` r=xinhaoz a=xinhaoz

Importing  `rootActions` from `reducers.ts` in cluster-ui was causing a circular import, preventing one of the redux fields experiencing the cyclic dependency from having their reducer populated, which omitted that field from the store altogether. Currently, this field is `uiConfig` which is affecting features that rely on checking the sql role of a user, such as displaying the `Reset SQL Stats` button.

This commit extracts `rootActions` into its own file to remove the cyclic dependencies created.

Fixes: #97996

Release note (bug fix): On CC, `Reset Sql Stats` button is now visible if the user has admin role.

Using a production buiild 23.1 cluster-ui version:
<img width="1909" alt="image" src="https://github.com/cockroachdb/cockroach/assets/20136951/348a941a-3be5-42ad-8b16-47cbf48f3f19">

109734: bulk,ccl: refactor tracing aggregator integration r=stevendanna a=adityamaru

The motivation for this change was to fix the leaked goroutine in #109658 but it led to a larger cleanup as explained below.

In #108458 and other PRs we taught the processors of backup, restore and c2c to periodically send metas containing the TracingAggregatorEvents, for the coordinator to then process and persist for improved observability. In that logic we were defensive against a scenario in which the processor received a root context with no tracing span. In which case we never initialized the tracing aggregator on the processor. This should never be possible in production code, but we want to prevent failing the replication job if this were to happen. As part of this defense we also returned `nil` instead of the meta in `Next()`. What We didn't realize was that return a nil row and nil meta indicates that the consumer has been completely drained and causes the processor to pre-emptively drain and shutdown.

With this change, if the root context does not have a tracing span and we are unable to init a tracing aggregator, we simply do not allow the timer controlling the return of the TracingAggergatorEvents to fire. By doing this, we avoid a situation where a nil row and nil meta are returned too early.

As part of this cleanup this change also simplifies the tracing aggregator. The aggregator is no longer responsible for creating a child span and registering a listener. Instead it simply returns an object that can be registered as an event listener by the caller. The former was an approach that was necessary when we wanted to decorate the span with LazyTags, but nobody uses LazyTags and we have a much better workflow to consume aggregator events with the `Advanced Debugging` job details page.

This simplification was motivated by the fact that `processorbase.StartInternal` now takes event listeners that can be registered with the tracing span of the context that hangs of the processor, and is accessible via the `processor.Ctx()` method.

Fixes: #109658
Release note: None

Co-authored-by: Lidor Carmel <[email protected]>
Co-authored-by: Tommy Reilly <[email protected]>
Co-authored-by: Aaditya Sondhi <[email protected]>
Co-authored-by: Xin Hao Zhang <[email protected]>
Co-authored-by: adityamaru <[email protected]>
@craig craig bot closed this as completed in 1cf1edc Aug 31, 2023
@github-project-automation github-project-automation bot moved this from Triage to Done in SQL Queries Aug 31, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). skipped-test T-sql-queries SQL Queries Team
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

5 participants