
cli: add probe_range in debug.zip #107720

Merged: 1 commit merged into cockroachdb:master on Jul 28, 2023

Conversation

j82w
Contributor

@j82w j82w commented Jul 27, 2023

PR #79546 introduces crdb_internal.probe_range. This PR adds crdb_internal.probe_range to debug.zip. The LIMIT bounds how long this can take to roughly 1000ms × 100 (about 100 seconds), so that running debug.zip against an unavailable cluster won't take too long.

closes: #80360

Release note (cli change): debug.zip now includes the crdb_internal.probe_range table, limited to 100 rows to keep the query from taking too long.

@j82w j82w requested review from a team July 27, 2023 15:00
@j82w j82w requested a review from a team as a code owner July 27, 2023 15:00
@j82w j82w requested review from abarganier and removed request for a team July 27, 2023 15:00
@blathers-crl

blathers-crl bot commented Jul 27, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Member

This change is Reviewable

Contributor

@maryliag maryliag left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @j82w)


pkg/cli/zip_table_registry.go line 232 at r1 (raw file):

	},
	"crdb_internal.probe_ranges_1s_write_limit_100": {
		customQueryRedacted: `SELECT count(1) FROM crdb_internal.probe_ranges(INTERVAL '1000ms', 'write') WHERE error != '' ORDER BY end_to_end_latency DESC LIMIT 100;

is this right? I'm on master and getting:

[email protected]:26257/movr> SELECT count(1) FROM crdb_internal.probe_ranges(INTERVAL '1000ms', 'write') WHERE error != '' ORDER BY
                        -> end_to_end_latency DESC LIMIT 100;
ERROR: column "end_to_end_latency" does not exist

Contributor

@maryliag maryliag left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @j82w)


pkg/cli/zip_table_registry.go line 232 at r1 (raw file):

	},
	"crdb_internal.probe_ranges_1s_write_limit_100": {
		customQueryRedacted: `SELECT count(1) FROM crdb_internal.probe_ranges(INTERVAL '1000ms', 'write') WHERE error != '' ORDER BY end_to_end_latency DESC LIMIT 100;

I don't see any columns getting marked as sensitive for this table, so you can use customQueryUnredacted here too
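
For illustration, a registry entry along the lines suggested here might look like the sketch below. The struct, map name, and SQL body are placeholders (the exact query was still being revised in this thread); only the two field names, customQueryRedacted and customQueryUnredacted, come from the review itself.

```go
// Simplified sketch only: a stand-in for the real registry entry type in
// pkg/cli/zip_table_registry.go. The struct, the map name, and the SQL body
// are illustrative; only the two field names come from this review thread.
package main

import "fmt"

type tableRegistryEntry struct {
	customQueryRedacted   string
	customQueryUnredacted string
}

// The review notes that no probe_ranges columns are marked sensitive, so the
// redacted and unredacted variants can share the same bounded query.
const probeRangesQuery = `SELECT *
FROM crdb_internal.probe_ranges(INTERVAL '1000ms', 'write')
WHERE error != ''
LIMIT 100;`

var debugZipTables = map[string]tableRegistryEntry{
	"crdb_internal.probe_ranges_1s_write_limit_100": {
		customQueryRedacted:   probeRangesQuery,
		customQueryUnredacted: probeRangesQuery,
	},
}

func main() {
	fmt.Println(debugZipTables["crdb_internal.probe_ranges_1s_write_limit_100"].customQueryUnredacted)
}
```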

Contributor

@maryliag maryliag left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @j82w)


pkg/cli/zip_table_registry.go line 231 at r1 (raw file):

		FROM crdb_internal.cluster_settings`,
	},
	"crdb_internal.probe_ranges_1s_write_limit_100": {

I believe this should be just the name of the table, with no extra info, so no 1s_write_limit_100

@j82w j82w marked this pull request as draft July 27, 2023 17:46
Contributor

@abarganier abarganier left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @j82w and @maryliag)


pkg/cli/zip_table_registry.go line 231 at r1 (raw file):

Previously, maryliag (Marylia Gutierrez) wrote…

I believe this should be just the name of the table, with no extra info, so no 1s_write_limit_100

In the case where no custom queries are defined, this table name is used to construct the query, so it indeed has to be the name of the actual table.

But, if you use custom queries, I don't think it has to be a real table name, because we return the custom query instead. In this case, the name would just be used as the filename that the output is written to.

But in that case, as noted in another comment below, you also need to define a customQueryUnredacted. Otherwise, the name will be used to make a TABLE query, which will error if the name doesn't match any real table.
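
A rough sketch of the behavior described above, assuming it works the way this comment outlines (this is not the actual pkg/cli implementation): without a custom query the entry name is turned into a TABLE query, and with one the name only determines the output filename.

```go
// Rough sketch of the lookup behavior described above, not the actual
// pkg/cli implementation: entry names double as table names only when no
// custom query is registered.
package main

import "fmt"

type tableRegistryEntry struct {
	customQueryUnredacted string
}

// queryFor returns the SQL that debug.zip would run for a registry entry.
func queryFor(name string, e tableRegistryEntry) string {
	if e.customQueryUnredacted != "" {
		// With a custom query, the entry name is only used as the
		// filename the output is written to inside debug.zip.
		return e.customQueryUnredacted
	}
	// Without one, the name itself must be a real table.
	return "TABLE " + name
}

func main() {
	fmt.Println(queryFor("crdb_internal.cluster_settings", tableRegistryEntry{}))
	fmt.Println(queryFor("crdb_internal.probe_ranges_1s_write_limit_100", tableRegistryEntry{
		customQueryUnredacted: "SELECT * FROM crdb_internal.probe_ranges(INTERVAL '1000ms', 'write') WHERE error != '' LIMIT 100",
	}))
}
```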

Contributor Author

@j82w j82w left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @maryliag)


pkg/cli/zip_table_registry.go line 231 at r1 (raw file):

Previously, abarganier (Alex Barganier) wrote…

In the case where no custom queries are defined, this table name is used to construct the query, so it indeed has to be the name of the actual table.

But, if you use custom queries, I don't think it has to be a real table name, because we return the custom query instead. In this case, the name would just be used as the filename that the output is written to.

But in that case, as noted in another comment below, you also need to define a customQueryUnredacted. Otherwise, the name will be used to make a TABLE query, which will error if the name doesn't match any real table.

I did this on purpose to avoid confusion. Almost all the other tables in the debug zip are SELECT * without any constraints. People will assume that is the case for this information too, which will cause confusion. Once they realize it's limited, they will want to know how and by how much. I figured putting it in the table name made it very clear to everyone looking at the file what was actually in it.


pkg/cli/zip_table_registry.go line 232 at r1 (raw file):

Previously, maryliag (Marylia Gutierrez) wrote…

is this right? I'm on master and getting:

[email protected]:26257/movr> SELECT count(1) FROM crdb_internal.probe_ranges(INTERVAL '1000ms', 'write') WHERE error != '' ORDER BY
                        -> end_to_end_latency DESC LIMIT 100;
ERROR: column "end_to_end_latency" does not exist

I meant to upload the PR as a draft.

Contributor

@abarganier abarganier left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @j82w and @maryliag)


pkg/cli/zip_table_registry.go line 231 at r1 (raw file):

Previously, j82w (Jake) wrote…

I did this on purpose to avoid confusion. Almost all the other tables in the debug zip are SELECT * without any constraints. People will assume that is the case for this information too, which will cause confusion. Once they realize it's limited, they will want to know how and by how much. I figured putting it in the table name made it very clear to everyone looking at the file what was actually in it.

Gotcha! To clarify, I'm OK with that approach, we'll just need the customQueryUnredacted in that case 👍

@j82w j82w marked this pull request as ready for review July 27, 2023 22:40
PR #79546 introduces `crdb_internal.probe_range`. This PR adds
`crdb_internal.probe_range` to debug.zip. The LIMIT bounds how long this
can take to roughly 1000ms × 100 (about 100 seconds), so that running
debug.zip against an unavailable cluster won't take too long.

closes: #80360

Release note (cli change): debug.zip now includes the
`crdb_internal.probe_range` table, limited to 100 rows to keep the
query from taking too long.
Contributor

@maryliag maryliag left a comment

:lgtm:

Reviewed 10 of 10 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @j82w)

@j82w
Contributor Author

j82w commented Jul 28, 2023

bors r+

@craig
Contributor

craig bot commented Jul 28, 2023

This PR was included in a batch that was canceled, it will be automatically retried

@craig craig bot merged commit 04c91a5 into cockroachdb:master Jul 28, 2023
@craig
Contributor

craig bot commented Jul 28, 2023

Build succeeded:

@joshimhoff
Collaborator

I bet y'all are aware but probe_ranges recently had an egregious bug: #101549

Can I make a few requests, based on the postmortem that came out of ^^ (https://cockroachlabs.atlassian.net/wiki/spaces/ENG/pages/2985951249)?

  1. Can you get approval from KV regarding this PR? Probing is dangerous. Want to keep them in the loop as the owner of the system under probe.
  2. If there are backports, can you send them to KV for review?
  3. Can you discuss with KV any changes to system tests that we may need to make? This was a postmortem AI from the linked PM.

Thank you for considering!

@nvanbenschoten
Member

Thanks for bringing this to attention @joshimhoff. The change doesn't strike me as inherently dangerous, but I think this may be the first time we are introducing writes into the debug.zip process, meaning that it will now mutate the cluster that it is meant to observe. Is this risky? Maybe. Could it change the state of the cluster in ways that hide the artifacts we are trying to introspect? Maybe.

Did we consider using a read-only probe to start? Are there other parts of debug.zip which write to the cluster?

Independently, I don't know how debug.zip works, but I see that it runs SQL. Is the SQL query guaranteed to be executed by a node with a fix to #101549 if run in a mixed version cluster?

@joshimhoff
Collaborator

joshimhoff commented Aug 3, 2023

Independently, I don't know how debug.zip works, but I see that it runs SQL. Is the SQL query guaranteed to be executed by a node with a fix to #101549 if run in a mixed version cluster?

Great question. Isn't this debug zip code run on the CLI side? I think it is sorta common to run a new cockroach CLI against an old server.

Moving to a read-only probe is a cheap fix that avoids needing to check that the fix for #101549 is running on the server side.

If we want to do write probes, I guess we could do a server version check somehow.

My soft preference is to do read-only probes for now.
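
For concreteness, the read-only variant being proposed here (and later adopted in #108313) only changes the probe type argument from 'write' to 'read'. A minimal sketch, with the projection and LIMIT as placeholders:

```go
// Minimal sketch of the read-only probe query discussed here; the exact
// column list and LIMIT are placeholders. The point is the 'read' probe
// type, which avoids issuing writes against the cluster being inspected.
package main

import "fmt"

const probeRangesReadOnly = `SELECT *
FROM crdb_internal.probe_ranges(INTERVAL '1000ms', 'read')
WHERE error != ''
LIMIT 100;`

func main() {
	// A read probe exercises the same ranges without writing, so
	// debug.zip no longer mutates the cluster it is meant to observe.
	fmt.Println(probeRangesReadOnly)
}
```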

@joshimhoff
Collaborator

Friendly poke, @j82w.

@j82w
Contributor Author

j82w commented Aug 7, 2023

The read-only probe seems like a good idea. I'm not aware of other parts of debug.zip that write to the cluster. I was just trying to improve observability by closing the issue: #80360. I wasn't aware there was a read-only option. Can you file an issue or send a PR to switch it to read-only?

@dhartunian
Collaborator

@j82w to clarify a bit why Josh is concerned here:

The blast radius of the bug that led to #101549 was more or less contained because we only triggered that builtin ourselves on managed clusters. By running it as part of debug.zip we expose all customers on older versions to this bug, via a command that we often tell customers to run. This could result in a bad outcome if someone uses a new CLI build against an older self-hosted cluster.

dhartunian added a commit to dhartunian/cockroach that referenced this pull request Aug 7, 2023
To reduce the chance of corruption issues when run against older
clusters, the recently added use of `crdb_internal.probe_ranges` is
modified to use a `read` probe instead of a `write` probe.

See discussion in cockroachdb#107720

Release note: None
Epic: None
@dhartunian
Collaborator

I put up a PR to modify: #108313

@j82w
Contributor Author

j82w commented Aug 7, 2023

Thanks for the PR. I agree with your concerns. I'm not that familiar with this code, so I was looking for a consensus on whether the correct solution was to revert the PR or to switch it to reads.

dhartunian added a commit to dhartunian/cockroach that referenced this pull request Aug 7, 2023
To reduce the chance of corruption issues when run against older
clusters, the recently added use of `crdb_internal.probe_ranges` is
modified to use a `read` probe instead of a `write` probe.

See discussion in cockroachdb#107720

Release note: None
Epic: None
craig bot pushed a commit that referenced this pull request Aug 8, 2023
106130: sql: add extra information to protocol errors in bind r=rafiss a=cucaroach

A user is running into mysterious protocol errors when using prepared
statements. Add some extra information to the error message to help
guide the investigation.

Informs: https://github.com/cockroachlabs/support/issues/2184
Release note: none
Epic: none


107912: sql: fix exec+audit logs for BEGIN, COMMIT, SET stmts r=rafiss a=rafiss

Epic: None
Release note (bug fix): Fixed a bug where BEGIN, COMMIT, SET, ROLLBACK, and SAVEPOINT statements would not be written to the execution or audit logs.

108093: jobs: avoid crdb_internal.system_jobs in gc-jobs r=miretskiy a=stevendanna

crdb_internal.system_jobs is a virtual table that joins information from the jobs table and the job_info table.

For the previous query,

    SELECT id, payload, status FROM "".crdb_internal.system_jobs
    WHERE (created < $1) AND (id > $2)
    ORDER BY id
    LIMIT $3

this is a little suboptimal because:

- We don't make use of the progress column so any read of that is useless.

- While the crdb_internal.system_jobs virtual table has a virtual index on job id, EXPLAIN will even claim that it will be used:

      • limit
      │ count: 100
      │
      └── • filter
          │ filter: created < '2023-07-20 07:29:01.17001'
          │
          └── • virtual table
                table: system_jobs@system_jobs_id_idx
                spans: [/101 - ]

  This is actually a lie. A virtual index can only handle single-key spans. As a result the unconstrained query is used:
```
    WITH
        latestpayload AS (SELECT job_id, value FROM system.job_info AS payload WHERE info_key = 'legacy_payload' ORDER BY written DESC),
        latestprogress AS (SELECT job_id, value FROM system.job_info AS progress WHERE info_key = 'legacy_progress' ORDER BY written DESC)
    SELECT
       DISTINCT(id), status, created, payload.value AS payload, progress.value AS progress,
                created_by_type, created_by_id, claim_session_id, claim_instance_id, num_runs, last_run, job_type
    FROM system.jobs AS j
    INNER JOIN latestpayload AS payload ON j.id = payload.job_id
    LEFT JOIN latestprogress AS progress ON j.id = progress.job_id
```
  which has a full scan of the jobs table and 2 full scans of the info table:

      • distinct
      │ distinct on: id, value, value
      │
      └── • merge join
          │ equality: (job_id) = (id)
          │
          ├── • render
          │   │
          │   └── • filter
          │       │ estimated row count: 7,318
          │       │ filter: info_key = 'legacy_payload'
          │       │
          │       └── • scan
          │             estimated row count: 14,648 (100% of the table; stats collected 39 minutes ago; using stats forecast for 2 hours in the future)
          │             table: job_info@primary
          │             spans: FULL SCAN
          │
          └── • merge join (right outer)
              │ equality: (job_id) = (id)
              │ right cols are key
              │
              ├── • render
              │   │
              │   └── • filter
              │       │ estimated row count: 7,317
              │       │ filter: info_key = 'legacy_progress'
              │       │
              │       └── • scan
              │             estimated row count: 14,648 (100% of the table; stats collected 39 minutes ago; using stats forecast for 2 hours in the future)
              │             table: job_info@primary
              │             spans: FULL SCAN
              │
              └── • scan
                    missing stats
                    table: jobs@primary
                    spans: FULL SCAN

  Because of the limit, I don't think this ends up being as bad as it looks. But it still isn't great.

In this PR, we replace the crdb_internal.system_jobs query with one that removes the join on the unused progress field and also constrains the scan of the job_info table.

      • distinct
      │ distinct on: id, value
      │
      └── • merge join
          │ equality: (job_id) = (id)
          │ right cols are key
          │
          ├── • render
          │   │
          │   └── • filter
          │       │ estimated row count: 7,318
          │       │ filter: info_key = 'legacy_payload'
          │       │
          │       └── • scan
          │             estimated row count: 14,646 (100% of the table; stats collected 45 minutes ago; using stats forecast for 2 hours in the future)
          │             table: job_info@primary
          │             spans: [/101/'legacy_payload' - ]
          │
          └── • render
              │
              └── • limit
                  │ count: 100
                  │
                  └── • filter
                      │ filter: created < '2023-07-20 07:29:01.17001'
                      │
                      └── • scan
                            missing stats
                            table: jobs@primary
                            spans: [/101 - ]

In a local example, this does seem faster:

    > SELECT id, payload, status, created
    > FROM "".crdb_internal.system_jobs
    > WHERE (created < '2023-07-20 07:29:01.17001') AND (id > 100) ORDER BY id LIMIT 100;

    id | payload | status | created
    -----+---------+--------+----------
    (0 rows)

    Time: 183ms total (execution 183ms / network 0ms)

    > WITH
    > latestpayload AS (
    >     SELECT job_id, value
    >     FROM system.job_info AS payload
    >     WHERE job_id > 100 AND info_key = 'legacy_payload'
    >     ORDER BY written desc
    > ),
    > jobpage AS (
    >     SELECT id, status, created
    >     FROM system.jobs
    >     WHERE (created < '2023-07-20 07:29:01.17001') and (id > 100)
    >     ORDER BY id
    >     LIMIT 100
    > )
    > SELECT distinct (id), latestpayload.value AS payload, status
    > FROM jobpage AS j
    > INNER JOIN latestpayload ON j.id = latestpayload.job_id;
      id | payload | status
    -----+---------+---------
    (0 rows)

    Time: 43ms total (execution 42ms / network 0ms)

Release note: None

Epic: none

108313: cli: debug zip uses read-only range probe r=joshimhoff a=dhartunian

To reduce the chance of corruption issues when run against older clusters, the recently added use of `crdb_internal.probe_ranges` is modified to use a `read` probe instead of a `write` probe.

See discussion in #107720

Release note: None
Epic: None

108335: kvcoord: Implement CloseStream for MuxRangeFeed r=miretskiy a=miretskiy

Extend the MuxRangeFeed protocol to support an explicit,
caller-initiated CloseStream operation.

The caller may decide to stop receiving events for a particular stream, which is part of MuxRangeFeed. The caller may issue a request to the MuxRangeFeed server to close the stream. The server will cancel the underlying rangefeed and return a `RangeFeedRetryError_REASON_RANGEFEED_CLOSED` error as a response.

Note: the current mux rangefeed client does not use this request. The code to support cancellation is added preemptively in case this functionality is required in the future to support restarts due to stuck rangefeeds.

Epic: CRDB-26372
Release note: None

108355: roachtest/multitenant-upgrade: hard-code predecessor version temporarily r=knz a=healthy-pod

Hard-code the predecessor release to 23.1.4 until a new patch release (23.1.9) is out, because the test is incompatible with 23.1.{5,6,7,8} due to an erroneous PR merged on the 23.1 branch.

Release note: None
Epic: none

108375: changefeedccl: Increase test utility timeout r=miretskiy a=miretskiy

As observed in #108348, a test failed because it timed out reading a
row. Yet, this flake could not be reproduced in over 50k runs. Bump the
timeout period to make this flake even less likely.

Fixes #108348

Release note: None

Co-authored-by: Tommy Reilly <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: Steven Danna <[email protected]>
Co-authored-by: David Hartunian <[email protected]>
Co-authored-by: Yevgeniy Miretskiy <[email protected]>
Co-authored-by: healthy-pod <[email protected]>