Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: TestTelemetry failed #75190

Closed
cockroach-teamcity opened this issue Jan 20, 2022 · 7 comments · Fixed by #75210
Closed

sql: TestTelemetry failed #75190

cockroach-teamcity opened this issue Jan 20, 2022 · 7 comments · Fixed by #75210
Assignees
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. T-sql-queries SQL Queries Team

Comments

@cockroach-teamcity
Copy link
Member

sql.TestTelemetry failed with artifacts on master @ 54c5a2c645756c8cbd3cf7dff1fd354c290b013f:

=== RUN   TestTelemetry
    test_log_scope.go:79: test logs captured to: /go/src/github.com/cockroachdb/cockroach/artifacts/logTestTelemetry2553740697
    test_log_scope.go:80: use -show-logs to present logs inline
=== CONT  TestTelemetry
    telemetry_test.go:45: -- test log scope end --
test logs left over in: /go/src/github.com/cockroachdb/cockroach/artifacts/logTestTelemetry2553740697
--- FAIL: TestTelemetry (56.18s)
=== RUN   TestTelemetry/planning/server
    telemetry.go:90: 
        testdata/telemetry/planning:191: SELECT x FROM (VALUES (1)) AS b(x) WHERE EXISTS(SELECT * FROM (VALUES (1)) AS a(x) WHERE a.x = b.x)
        expected:
        sql.plan.subquery.correlated
        
        found:
        sql.plan.subquery
        sql.plan.subquery.correlated
        --- FAIL: TestTelemetry/planning/server (2.11s)
=== RUN   TestTelemetry/planning
    --- FAIL: TestTelemetry/planning (3.50s)
Help

See also: How To Investigate a Go Test Failure (internal)
Parameters in this failure:

  • GOFLAGS=-json

/cc @cockroachdb/sql-queries

This test on roachdash | Improve this report!

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Jan 20, 2022
@blathers-crl blathers-crl bot added the T-sql-queries SQL Queries Team label Jan 20, 2022
@cockroach-teamcity
Copy link
Member Author

sql.TestTelemetry failed with artifacts on master @ 506412ffdb6d0f6d4952d2599b81cb9294a123a1:

=== RUN   TestTelemetry
    test_log_scope.go:79: test logs captured to: /go/src/github.com/cockroachdb/cockroach/artifacts/logTestTelemetry2017684981
    test_log_scope.go:80: use -show-logs to present logs inline
=== CONT  TestTelemetry
    telemetry_test.go:45: -- test log scope end --
test logs left over in: /go/src/github.com/cockroachdb/cockroach/artifacts/logTestTelemetry2017684981
--- FAIL: TestTelemetry (70.18s)
=== RUN   TestTelemetry/planning/server
    telemetry.go:90: 
        testdata/telemetry/planning:191: SELECT x FROM (VALUES (1)) AS b(x) WHERE EXISTS(SELECT * FROM (VALUES (1)) AS a(x) WHERE a.x = b.x)
        expected:
        sql.plan.subquery.correlated
        
        found:
        sql.plan.subquery
        sql.plan.subquery.correlated
        --- FAIL: TestTelemetry/planning/server (2.15s)
=== RUN   TestTelemetry/planning
    --- FAIL: TestTelemetry/planning (3.93s)
Help

See also: How To Investigate a Go Test Failure (internal)
Parameters in this failure:

  • TAGS=deadlock

  • GOFLAGS=-parallel=4

This test on roachdash | Improve this report!

@irfansharif
Copy link
Contributor

+cc @rytaft (are we expecting this after #75145?)

@rytaft
Copy link
Collaborator

rytaft commented Jan 20, 2022

+cc @rytaft (are we expecting this after #75145?)

Thanks, @irfansharif. I'm not sure why this test just suddenly started to flake (we haven't changed anything here recently), but I've opened another PR which should hopefully help.

@irfansharif
Copy link
Contributor

Could it be due to #73876? Here's an e-mail sent out to KV members, but perhaps should be disseminated more widely.

TL;DR: We're merging a PR with a wide blast radius. Failure modes to look out
for and tell me/Arul/#zone-configs about:
- Panics/assertions/test flakes in pkg/spanconfig/... and pkg/ccl/spanconfigccl
- Tests where range counts don't converge as expected, or as quickly as
  expected.
- Tests where the right range is not found, possibly because the right range
  boundaries were not split upon (yet).
- "Trace analysis" test failures that check round trip time for various things
  (pkg/ccl/multiregionccl), finding a different number of RTTs than the test
  expects.
- Version upgrade failures relating to pkg/spanconfig and friends.
- Tests generally taking a lot longer to run than before, possibly timing out
  test packages/etc.

We're about to land *: enable span configs by default, which switches over all
our tests to stop using the system config span in favor of the span configs
infrastructure. This has a pretty wide blast radius, and while we've
future-proofed our test suite as much as we could during review, it's possible
we missed something. This email talks through a few failure modes we can expect
in CI in the coming weeks, why they happen, and what to do about them.

A lot of tests we've written in CRDB implicitly rely on synchrony assumptions
that no longer hold. They're typically around how quickly splits are induced
for new tables, or how quickly configs are applied. To see why, consider unit
tests that make use of a single node or do a lot of heavy lifting through the
first node in the test cluster (our testing libraries default to the first node
unless opted out explicitly). In either form, when we were using gossip, the
effects of a config change appeared basically instantaneously. There was always
delay with the split/merge queue applying these changes, but that too happened
near instantaneously (single digit milliseconds). The span configs
infrastructure uses rangefeeds internally, and propagation of updates is
subject to the closed timestamp target interval, which today defaults to 3s.
This is fine for production use, but can surface synchrony assumptions made in
tests that aren't so easily masked over. In addition to causing tests to flake,
the PR might cause tests to slow down. For a bunch of these tests we've gone
and lowered the closed timestamp target duration, but it's easy to miss things
and no good way to survey all the tests that now take a while longer to
complete (is there?). Another example are tests that use scratch ranges and
apply configs to it. Previously RANGE DEFAULT applied to the entire keyspace,
but now it does only to keyspaces that the SQL layer knows how to encode. For
tests that care about configuring scratch ranges, there's a testing knob, but
it's an opt-in one.

For most of these tests, making them work with the span configs infra was/is
just a matter of slapping a SucceedsSoon block selectively. But still, this is
hard to see looking at opaque timeouts and slower-than-before tests. If you see
anything similar to the above, tell me/Arul/#zone-configs about it and we'll
happily investigate. Given the blast radius, we'd be lucky if we didn't trip up
anything. The PR enabling was written to be easily revertable, and it's
possible we'll have to. Let us know.

@rytaft
Copy link
Collaborator

rytaft commented Jan 20, 2022

Thanks for the info, @irfansharif. I'm not too familiar with how TestTelemetry works, but it seems to rely on this diagnostics reporter:

type Reporter struct {

Perhaps that would be affected by #73876? At any rate, the fixes I'm making to deflake this test should work, but it might fail again further down in the file. This failure is just a bit below the last failure that I fixed. I can't trigger this failure with make stress, so it's hard to know what will cause issues.

@irfansharif
Copy link
Contributor

Should we be liberal with our allow-lists throughout the file then? Would it be an improper test if we did so? I'll try looking through the diagnostic report to see if anything appears obvious.

@rytaft
Copy link
Collaborator

rytaft commented Jan 20, 2022

I don't think that would be a problem. But since I've already posted #75210, let's just see if it flakes again, in which case I'll update the rest of the file with more specific allow-lists.

craig bot pushed a commit that referenced this issue Jan 20, 2022
75191: cli/demo: fix connection string for shared DB server r=otan,knz a=rafiss

fixes #74162

This is needed since the workload fixtures are not installed on the
shared server, so it only has the default databases.

Release note: None

75210: sql: deflake TestTelemetry r=rytaft a=rytaft

This commit deflakes TestTelemetry by adding a more precise
feature-allowlist.

Fixes #75190

Release note: None

Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: Rebecca Taft <[email protected]>
@craig craig bot closed this as completed in 4af0f77 Jan 20, 2022
gtr pushed a commit to gtr/cockroach that referenced this issue Jan 24, 2022
This commit deflakes TestTelemetry by adding a more precise
feature-allowlist.

Fixes cockroachdb#75190

Release note: None
@mgartner mgartner moved this to Done in SQL Queries Jul 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. T-sql-queries SQL Queries Team
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

3 participants