ccl/changefeedccl/cdceval: TestEvaluator failed #108348

Closed
cockroach-teamcity opened this issue Aug 8, 2023 · 0 comments · Fixed by #108375
cockroach-teamcity commented Aug 8, 2023

ccl/changefeedccl/cdceval.TestEvaluator failed with artifacts on master @ 1382b26a97bf6f70a07d363dc319283c173359eb:

=== RUN   TestEvaluator
    test_log_scope.go:167: test logs captured to: /artifacts/tmp/_tmp/4f079cd13f8a776a99f45e473cd32df8/logTestEvaluator4214666620
    test_log_scope.go:81: use -show-logs to present logs inline
=== CONT  TestEvaluator
    expr_eval_test.go:628: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/4f079cd13f8a776a99f45e473cd32df8/logTestEvaluator4214666620
--- FAIL: TestEvaluator (16.07s)
=== RUN   TestEvaluator/user_defined_function
    expr_eval_test.go:573: timeout reading row
    --- FAIL: TestEvaluator/user_defined_function (5.04s)

Parameters: TAGS=bazel,gss,deadlock , stress=true

Help

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/cdc

This test on roachdash | Improve this report!

Jira issue: CRDB-30434

Epic CRDB-11732

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. T-cdc labels Aug 8, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.2 milestone Aug 8, 2023
@blathers-crl blathers-crl bot added the A-cdc Change Data Capture label Aug 8, 2023
@miretskiy miretskiy self-assigned this Aug 8, 2023
craig bot pushed a commit that referenced this issue Aug 8, 2023
106130: sql: add extra information to protocol errors in bind r=rafiss a=cucaroach

A user is running into mysterious protocol errors when using prepared
statements. Add some extra information to the error message to help
guide the investigation.

Informs: https://github.com/cockroachlabs/support/issues/2184
Release note: none
Epic: none


107912: sql: fix exec+audit logs for BEGIN, COMMIT, SET stmts r=rafiss a=rafiss

Epic: None
Release note (bug fix): Fixed a bug where BEGIN, COMMIT, SET, ROLLBACK, and SAVEPOINT statements would not be written to the execution or audit logs.

108093: jobs: avoid crdb_internal.system_jobs in gc-jobs r=miretskiy a=stevendanna

crdb_internal.system_jobs is a virtual table that joins information from the jobs table and the job_info table.

For the previous query,

    SELECT id, payload, status FROM "".crdb_internal.system_jobs
    WHERE (created < $1) AND (id > $2)
    ORDER BY id
    LIMIT $3

this is a little suboptimal because:

- We don't make use of the progress column so any read of that is useless.

- While the crdb_internal.system_jobs virtual table has a virtual index on job id, and EXPLAIN will even claim that it will be used:

      • limit
      │ count: 100
      │
      └── • filter
          │ filter: created < '2023-07-20 07:29:01.17001'
          │
          └── • virtual table
                table: system_jobs@system_jobs_id_idx
                spans: [/101 - ]

  This is actually a lie. A virtual index can only handle single-key spans (a short illustration follows this list). As a result, the unconstrained query is used:
```
    WITH
        latestpayload AS (SELECT job_id, value FROM system.job_info AS payload WHERE info_key = 'legacy_payload' ORDER BY written DESC),
        latestprogress AS (SELECT job_id, value FROM system.job_info AS progress WHERE info_key = 'legacy_progress' ORDER BY written DESC)
    SELECT
       DISTINCT(id), status, created, payload.value AS payload, progress.value AS progress,
                created_by_type, created_by_id, claim_session_id, claim_instance_id, num_runs, last_run, job_type
    FROM system.jobs AS j
    INNER JOIN latestpayload AS payload ON j.id = payload.job_id
    LEFT JOIN latestprogress AS progress ON j.id = progress.job_id
```
  which has a full scan of the jobs table and two full scans of the job_info table:

      • distinct
      │ distinct on: id, value, value
      │
      └── • merge join
          │ equality: (job_id) = (id)
          │
          ├── • render
          │   │
          │   └── • filter
          │       │ estimated row count: 7,318
          │       │ filter: info_key = 'legacy_payload'
          │       │
          │       └── • scan
          │             estimated row count: 14,648 (100% of the table; stats collected 39 minutes ago; using stats forecast for 2 hours in the future)
          │             table: job_info@primary
          │             spans: FULL SCAN
          │
          └── • merge join (right outer)
              │ equality: (job_id) = (id)
              │ right cols are key
              │
              ├── • render
              │   │
              │   └── • filter
              │       │ estimated row count: 7,317
              │       │ filter: info_key = 'legacy_progress'
              │       │
              │       └── • scan
              │             estimated row count: 14,648 (100% of the table; stats collected 39 minutes ago; using stats forecast for 2 hours in the future)
              │             table: job_info@primary
              │             spans: FULL SCAN
              │
              └── • scan
                    missing stats
                    table: jobs@primary
                    spans: FULL SCAN

  Because of the limit, I don't think this ends up being as bad as it looks. But it still isn't great.
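To make the single-key-span limitation concrete, here is a minimal sketch (the job ids are illustrative): a point lookup on the virtual index produces a single-key span that can be served directly, while the range predicate used by gc-jobs cannot, so it falls back to the unconstrained query above.

```sql
-- Served via the virtual index: the predicate pins the scan to a
-- single-key span (/101).
SELECT id, status FROM "".crdb_internal.system_jobs WHERE id = 101;

-- Not a single-key span: the virtual index cannot serve this, so the
-- unconstrained fallback query shown above is used instead.
SELECT id, status FROM "".crdb_internal.system_jobs WHERE id > 100;
```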

In this PR, we replace the crdb_internal.system_jobs query with one that removes the join on the unused progress column and also constrains the scan of the job_info table.

      • distinct
      │ distinct on: id, value
      │
      └── • merge join
          │ equality: (job_id) = (id)
          │ right cols are key
          │
          ├── • render
          │   │
          │   └── • filter
          │       │ estimated row count: 7,318
          │       │ filter: info_key = 'legacy_payload'
          │       │
          │       └── • scan
          │             estimated row count: 14,646 (100% of the table; stats collected 45 minutes ago; using stats forecast for 2 hours in the future)
          │             table: job_info@primary
          │             spans: [/101/'legacy_payload' - ]
          │
          └── • render
              │
              └── • limit
                  │ count: 100
                  │
                  └── • filter
                      │ filter: created < '2023-07-20 07:29:01.17001'
                      │
                      └── • scan
                            missing stats
                            table: jobs@primary
                            spans: [/101 - ]

In a local example, this does seem faster:

    > SELECT id, payload, status, created
    > FROM "".crdb_internal.system_jobs
    > WHERE (created < '2023-07-20 07:29:01.17001') AND (id > 100) ORDER BY id LIMIT 100;

    id | payload | status | created
    -----+---------+--------+----------
    (0 rows)

    Time: 183ms total (execution 183ms / network 0ms)

    > WITH
    > latestpayload AS (
    >     SELECT job_id, value
    >     FROM system.job_info AS payload
    >     WHERE job_id > 100 AND info_key = 'legacy_payload'
    >     ORDER BY written desc
    > ),
    > jobpage AS (
    >     SELECT id, status, created
    >     FROM system.jobs
    >     WHERE (created < '2023-07-20 07:29:01.17001') and (id > 100)
    >     ORDER BY id
    >     LIMIT 100
    > )
    > SELECT distinct (id), latestpayload.value AS payload, status
    > FROM jobpage AS j
    > INNER JOIN latestpayload ON j.id = latestpayload.job_id;
      id | payload | status
    -----+---------+---------
    (0 rows)

    Time: 43ms total (execution 42ms / network 0ms)

Release note: None

Epic: none

108313: cli: debug zip uses read-only range probe r=joshimhoff a=dhartunian

To reduce the chance of corruption issues when run against older clusters, the recently added use of `crdb_internal.probe_ranges` is modified to use a `read` probe instead of a `write` probe.
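For reference, here is a minimal sketch of issuing such a probe directly from SQL, assuming the `crdb_internal.probe_ranges` signature available in recent versions (a per-probe timeout interval plus a probe type):

```sql
-- Read-only probe of every range with a 1s per-probe timeout,
-- reporting only the ranges whose probe returned an error.
SELECT range_id, error
FROM crdb_internal.probe_ranges(INTERVAL '1s', 'read')
WHERE error != ''
LIMIT 10;
```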

See discussion in #107720

Release note: None
Epic: None

108335: kvcoord: Implement CloseStream for MuxRangeFeed r=miretskiy a=miretskiy

Extend the MuxRangeFeed protocol to support an explicit, caller-initiated CloseStream operation.

The caller may decide to stop receiving events for a particular stream that is part of a MuxRangeFeed, and may issue a request to the MuxRangeFeed server to close the stream. The server will cancel the underlying rangefeed and return a `RangeFeedRetryError_REASON_RANGEFEED_CLOSED` error as the response.

Note that the current mux rangefeed client does not use this request. The code to support cancellation is added preemptively, in case this functionality is required in the future to support restarts due to stuck rangefeeds.

Epic: CRDB-26372
Release note: None

108355: roachtest/multitenant-upgrade: hard-code predecessor version temporarily r=knz a=healthy-pod

Hard-code the predecessor release to 23.1.4 until a new patch release (23.1.9) is out, because the test is incompatible with 23.1.{5,6,7,8} due to an erroneous PR merged on the 23.1 branch.

Release note: None
Epic: none

108375: changefeedccl: Increase test utility timeout r=miretskiy a=miretskiy

As observed in #108348, a test failed because it timed out reading a row. Yet, this flake could not be reproduced in over 50k runs. Bump the timeout period to make this flake even less likely.

Fixes #108348

Release note: None

Co-authored-by: Tommy Reilly <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: Steven Danna <[email protected]>
Co-authored-by: David Hartunian <[email protected]>
Co-authored-by: Yevgeniy Miretskiy <[email protected]>
Co-authored-by: healthy-pod <[email protected]>
@craig craig bot closed this as completed in a1cf907 Aug 8, 2023
blathers-crl bot pushed a commit that referenced this issue Aug 18, 2023
As observed in #108348, a test failed because it timed out reading a row. Yet, this flake could not be reproduced in over 50k runs. Bump the timeout period to make this flake even less likely.

Fixes #108348

Release note: None
blathers-crl bot pushed a commit that referenced this issue Aug 18, 2023
blathers-crl bot pushed a commit that referenced this issue Aug 18, 2023