TestSQLStatsDataDriven is flaky #89861

fantapop · 2022-10-12T21:01:17Z

It looks like TestSQLStatsDataDriven occasionally flakes. 2 out of the last 1000 runs.

Failed
=== RUN TestSQLStatsDataDriven
test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/83b49f90eb84fbc5d893648fc9fd7294/logTestSQLStatsDataDriven2913649257
test_log_scope.go:79: use -show-logs to present logs inline
=== CONT TestSQLStatsDataDriven
datadriven_test.go:223: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/83b49f90eb84fbc5d893648fc9fd7294/logTestSQLStatsDataDriven2913649257
--- FAIL: TestSQLStatsDataDriven (8.34s)

Its hard to tell whats going on here. But its flaked again recently. I see this at the end of the log:

W221012 19:26:22.304027 2704 kv/kvserver/closedts/sidetransport/receiver.go:139 â‹® [n2] 264 closed timestamps side-transport connection dropped from node: 1
W221012 19:26:22.304076 1929 jobs/registry.go:824 â‹® [n2] 265 canceling all adopted jobs due to stopper quiescing
I221012 19:26:22.305595 2435 sql/stats/automatic_stats.go:509 â‹® [n3] 266 quiescing auto stats refresher
W221012 19:26:22.305700 2441 jobs/registry.go:824 â‹® [n3] 267 canceling all adopted jobs due to stopper quiescing
W221012 19:26:22.307135 2423 sql/sqlliveness/slinstance/slinstance.go:250 â‹® [n3] 268 exiting heartbeat loop
W221012 19:26:22.307801 1049 jobs/registry.go:824 â‹® [n1] 269 canceling all adopted jobs due to stopper quiescing
W221012 19:26:22.307981 1031 sql/sqlliveness/slinstance/slinstance.go:250 â‹® [n1] 270 exiting heartbeat loop
I221012 19:26:22.308208 1226 jobs/registry.go:1217 â‹® [n1] 271 AUTO SPAN CONFIG RECONCILIATION job 804560481633009665: stepping through state succeeded with error:
W221012 19:26:22.308299 1911 sql/sqlliveness/slinstance/slinstance.go:250 â‹® [n2] 272 exiting heartbeat loop
W221012 19:26:22.308306 1226 kv/txn.go:705 â‹® [n1] 273 failure aborting transaction: â€¹node unavailable; try another peerâ€º; abort caused by: context canceled
W221012 19:26:22.308609 2819 kv/kvserver/closedts/sidetransport/receiver.go:139 â‹® [n3] 274 closed timestamps side-transport connection dropped from node: 1

https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_BazelEssentialCi/6911488?showRootCauses=false&expandBuildChangesSection=true&expandBuildProblemsSection=true&expandBuildDeploymentsSection=true&expandBuildTestsSection=true

Jira issue: CRDB-20469

ajwerner · 2022-10-25T15:36:44Z

https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_UnitTests_BazelUnitTests/7113886?showRootCauses=false&expandBuildChangesSection=true&expandBuildProblemsSection=true&expandBuildDeploymentsSection=true&expandBuildTestsSection=true

knz · 2022-10-26T16:01:05Z

https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_BazelExtendedCi/7151663?showRootCauses=false&expandBuildChangesSection=true&expandBuildProblemsSection=true&expandBuildTestsSection=true

Skipping the test since its flaky. Informs: cockroachdb#89861 Release note: None

adityamaru · 2022-10-27T00:31:42Z

flaked again https://teamcity.cockroachdb.com/viewLog.html?buildId=7149837&tab=buildResultsDiv&buildTypeId=Cockroach_UnitTests_Test - skip is here while we investigate - #90744.

90744: persistedsqlstats: skip logical_plan_sampling_for_explicit_txn r=matthewtodd a=adityamaru Skipping the test since its flaky. Informs: #89861 Release note: None Co-authored-by: adityamaru <[email protected]>

Skipping the test since its flaky. Informs: #89861 Release note: None

98337: sql: enable TestSQLStatsDataDriven r=j82w a=j82w Adding retry logic to the execute command, and enabling the test. All previous builds with failures are no longer available. No failures on testing locally with stress. Epic: none Closes: #89861 Release note: none Co-authored-by: j82w <[email protected]>

andrewbaptist · 2023-03-18T14:40:07Z

I see this was just unskipped - but it just failed again for me:

Looks like several others had the same failure: https://teamcity.cockroachdb.com/test/6015532732023785721?currentProjectId=Cockroach_Ci_TestsGcpLinuxX8664BigVm&expandTestHistoryChartSection=true

This commit skips TestSQLStatsDataDriven. This test is flaky. Part of cockroachdb#89861. Epic: none Release note: None

…99123 98712: storage: expose separate WriteBatch interface r=sumeerbhola a=jbowens Adapt NewUnindexedBatch to no longer take a writeOnly parameter. Instead, a new NewWriteBatch method is exposed that returns a WriteBatch type that does not provide Reader facilities. Future work may remove UnindexedBatch altogether, updating callers to explicitly maintain separate Readers and WriteBatches. Epic: None Release note: None 98807: spanconfig: add TestBlockedKVSubscriberDisablesQueues r=irfansharif a=irfansharif Regression test for #98422 which disables {split,replicate,mvccGC} queues until subscribed to span configs. We're only slightly modifying the existing merge-queue specific TestBlockedKVSubscriberDisablesMerges. Release note: None 98823: sql: address a couple of old TODOs r=yuzefovich a=yuzefovich This commit addresses a couple of old TODOs in the `connExecutor.dispatchToExecutionEngine` method. First, it simply removes the now-stale TODO about needing to `Step()` the txn before executing the query. The TODO mentioned things about the old way FK checks were performed, but we now always run checks in a deferred fashion, and perform the `Step`s correctly there for cascades and checks, thus, the TODO can be removed. (The TODO itself explicitly mentions that it can be removed once we do checks and cascades in the deferred fashion.) Second, it addresses a TODO about how some plan information is saved. Up until 961e66f the cleanup of `planNode` tree and `flow` infra was intertwined, so in 6ae4881 in order to "sample" the plan with correct info (some of which is only available _after_ the execution is done) we delegated that sampling to `planTop.close`. However, with `planNode` tree and `flow` cleanups refactored and de-coupled, we no longer have to close the plan early on. As a result, we now close the plan in a defer right after we create it (now it's the only place we do so), and this commit makes it so that we record the plan info explicitly right after returning from the execution engine, thus, addressing the second TODO. Epic: None Release note: None 98896: tracing: fix WithEventListeners option with verbose parent r=yuzefovich a=yuzefovich Previously, when using `WithEventListeners` span option and not explicitly specifying `WithRecording` option, we could incorrectly use the structured recording. In particular, this was the case when the parent has verbose recording, and this is now fixed. At the moment, this span option is used only by the backup and restore processors, so the impact would be that their verbose recording wouldn't be included into the session recording (perhaps we don't actually expose that anyway). Epic: None Release note: None 99079: sql: fix hint for unknown udfs in views and funcs r=rharding6373 a=rharding6373 Fixes a hint for some instances of an "unknown function" error that was not properly applied previously. Informs: #99002 Epic: None Release note: None 99095: ui: fix inflight request replacement, cleanup database details api r=THardy98 a=THardy98 Changes to allow in-flight request replacement in this PR: #98331 caused breaking changes to `KeyedCachedDataReducer`s. Despite the added `allowReplacementRequests` flag being `false`, in-flight requests would be replaced (specifically, when fetching for initial state). This would prevent all requests but the last one from being stored in the reducer state and leave the remaining requests in a permanent `inFlight` status. Consequently, our pages would break as the permanent `inFlight` status on these requests would prevent us from ever fetching data, putting these pages in a permanent loading state. This PR adds checks to ensure `allowReplacementRequests` is enabled when we receive a response, before deciding whether to omit it from the reducer state. It's important to note that this is still not an ideal solution for `KeyedCachedDataReducer`s, should we ever want to replace inflight requests for one. The checks to replace an in-flight request still occur at the reducer-level, whereas in a `KeyedCachedDataReducer` we'd like them to occur at the state-level (i.e. instead of replacement checks for all requests across the reducer, we should check for requests that impact a specific key's state). This way we scope replacements to each key in the reducer, allowing us to fetch data for each key independently (which is the intended functionality). Enabling this for a `KeyedCachedDataReducer` without making this change I believe will cause the breaking behaviour above. This PR also includes some cleanup to the database details API, namely: - camel casing response fields - encapsulating the database span stats response into its own object (instead of across multiple objects), this helps scope any errors deriving from the query to the single stats object **BEFORE** https://www.loom.com/share/57c9f278376a4551977398316d6ed7b6 **AFTER** https://www.loom.com/share/b2d4e074afca4b42a4f711171de3fd72 Release note (bug fix): Fixed replacement of in-flight requests for `KeyedCachedDataReducer`s to prevent permanent loading on requests stuck on an `inFlight` status. 99097: upgrades: fix backfilling of special roles in user ID migrations r=rafiss a=andyyang890 upgrades: fix system.privileges user ID migration This patch adds a missing case for the public role in the backfilling portion of the system.privileges user ID migration. Release note: None --- upgrades: fix system.database_role_settings user ID migration This patch adds a missing case for the empty role in the backfilling portion of the system.database_role_settings user ID migration. Release note: None --- Part of #87079 99113: server: use no-op cost controller for shared-process tenants r=knz a=stevendanna In the long run, all rate limiting of tenants should be controlled by a tenant capability. However, at the moment, we do not have the infrastructure to read tenant capabilities from the tenant process. Since, the main user of shared-process tenants is the new, experimental unified-architecture were we do not anticipate needing the cost controller for most users, this PR injects a no-op cost controller for shared-process tenants. Epic: CRDB-23559 Release note: None 99120: tracing: allow collection of stack history for Structured spans as well r=dt a=adityamaru Previously, we mandated that the span must be Verbose to be able to collect and emit a Structured event containing stack history. This change permits Structured spans to also collect stack history. Release note: None Epic: none 99123: sql: skip flaky TestSQLStatsDataDriven r=ericharmeling a=ericharmeling This commit skips TestSQLStatsDataDriven. This test is flaky. Part of #89861. Epic: none Release note: None Co-authored-by: Jackson Owens <[email protected]> Co-authored-by: irfan sharif <[email protected]> Co-authored-by: Yahor Yuzefovich <[email protected]> Co-authored-by: rharding6373 <[email protected]> Co-authored-by: Thomas Hardy <[email protected]> Co-authored-by: Andy Yang <[email protected]> Co-authored-by: Steven Danna <[email protected]> Co-authored-by: adityamaru <[email protected]> Co-authored-by: Eric Harmeling <[email protected]>

j82w · 2023-07-06T13:30:36Z

Duplicate: #89861

fantapop added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-sql-table-stats Table statistics (and their automatic refresh). labels Oct 12, 2022

michae2 added the T-sql-queries SQL Queries Team label Oct 12, 2022

maryliag assigned matthewtodd Oct 13, 2022

blathers-crl bot added the T-sql-observability label Oct 13, 2022

matthewtodd added A-sql-observability Related to observability of the SQL layer and removed T-sql-queries SQL Queries Team A-sql-table-stats Table statistics (and their automatic refresh). labels Oct 13, 2022

adityamaru added a commit to adityamaru/cockroach that referenced this issue Oct 27, 2022

persistedsqlstats: skip logical_plan_sampling_for_explicit_txn

057a415

Skipping the test since its flaky. Informs: cockroachdb#89861 Release note: None

adityamaru mentioned this issue Oct 27, 2022

persistedsqlstats: skip logical_plan_sampling_for_explicit_txn #90744

Merged

adityamaru added the skipped-test label Oct 27, 2022

blathers-crl bot pushed a commit that referenced this issue Nov 10, 2022

persistedsqlstats: skip logical_plan_sampling_for_explicit_txn

1106366

Skipping the test since its flaky. Informs: #89861 Release note: None

blathers-crl bot mentioned this issue Nov 10, 2022

release-22.2: persistedsqlstats: skip logical_plan_sampling_for_explicit_txn #91683

Merged

maryliag unassigned matthewtodd Feb 27, 2023

maryliag assigned j82w Mar 6, 2023

j82w mentioned this issue Mar 9, 2023

sql: enable TestSQLStatsDataDriven #98337

Merged

craig bot closed this as completed in 6286ba1 Mar 15, 2023

andrewbaptist reopened this Mar 18, 2023

ericharmeling added a commit to ericharmeling/cockroach that referenced this issue Mar 21, 2023

sql: skip flaky TestSQLStatsDataDriven

f141902

This commit skips TestSQLStatsDataDriven. This test is flaky. Part of cockroachdb#89861. Epic: none Release note: None

ericharmeling mentioned this issue Mar 21, 2023

sql: skip flaky TestSQLStatsDataDriven #99123

Merged

maryliag added T-cluster-observability and removed A-sql-observability Related to observability of the SQL layer T-sql-observability labels Jul 5, 2023

j82w closed this as completed Jul 6, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TestSQLStatsDataDriven is flaky #89861

TestSQLStatsDataDriven is flaky #89861

fantapop commented Oct 12, 2022 •

edited

Loading

ajwerner commented Oct 25, 2022

knz commented Oct 26, 2022

adityamaru commented Oct 27, 2022

andrewbaptist commented Mar 18, 2023

j82w commented Jul 6, 2023

TestSQLStatsDataDriven is flaky #89861

TestSQLStatsDataDriven is flaky #89861

Comments

fantapop commented Oct 12, 2022 • edited Loading

ajwerner commented Oct 25, 2022

knz commented Oct 26, 2022

adityamaru commented Oct 27, 2022

andrewbaptist commented Mar 18, 2023

j82w commented Jul 6, 2023

fantapop commented Oct 12, 2022 •

edited

Loading