
cli: add probe_range in debug.zip #107720

Merged: 1 commit merged into cockroachdb:master on Jul 28, 2023

Conversation

j82w
Contributor

@j82w j82w commented Jul 27, 2023

PR #79546 introduces crdb_internal.probe_range. This PR adds crdb_internal.probe_range to debug.zip. The LIMIT bounds how long this can take to roughly 1000ms × 100 (about 100 seconds), so that running debug.zip against an unavailable cluster won't take too long.

closes: #80360

Release note (cli change): debug.zip now includes the crdb_internal.probe_range table, limited to 100 rows to keep the query from taking too long.

@j82w j82w requested review from a team July 27, 2023 15:00
@j82w j82w requested a review from a team as a code owner July 27, 2023 15:00
@j82w j82w requested review from abarganier and removed request for a team July 27, 2023 15:00
@blathers-crl

blathers-crl bot commented Jul 27, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Member

This change is Reviewable

Contributor

@maryliag maryliag left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @j82w)


pkg/cli/zip_table_registry.go line 232 at r1 (raw file):

	},
	"crdb_internal.probe_ranges_1s_write_limit_100": {
		customQueryRedacted: `SELECT count(1) FROM crdb_internal.probe_ranges(INTERVAL '1000ms', 'write') WHERE error != '' ORDER BY end_to_end_latency DESC LIMIT 100;

is this right? I'm on master and getting:

[email protected]:26257/movr> SELECT count(1) FROM crdb_internal.probe_ranges(INTERVAL '1000ms', 'write') WHERE error != '' ORDER BY
                        -> end_to_end_latency DESC LIMIT 100;
ERROR: column "end_to_end_latency" does not exist

Contributor

@maryliag maryliag left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @j82w)


pkg/cli/zip_table_registry.go line 232 at r1 (raw file):

	},
	"crdb_internal.probe_ranges_1s_write_limit_100": {
		customQueryRedacted: `SELECT count(1) FROM crdb_internal.probe_ranges(INTERVAL '1000ms', 'write') WHERE error != '' ORDER BY end_to_end_latency DESC LIMIT 100;

I don't see any columns getting marked as sensitive for this table, so you can use customQueryUnredacted here too
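
For illustration, a registry entry along the lines suggested here might look like the sketch below. The struct, map name, and SQL body are placeholders (the exact query was still being revised in this thread); only the two field names, customQueryRedacted and customQueryUnredacted, come from the review itself.

```go
// Simplified sketch only: a stand-in for the real registry entry type in
// pkg/cli/zip_table_registry.go. The struct, the map name, and the SQL body
// are illustrative; only the two field names come from this review thread.
package main

import "fmt"

type tableRegistryEntry struct {
	customQueryRedacted   string
	customQueryUnredacted string
}

// The review notes that no probe_ranges columns are marked sensitive, so the
// redacted and unredacted variants can share the same bounded query.
const probeRangesQuery = `SELECT *
FROM crdb_internal.probe_ranges(INTERVAL '1000ms', 'write')
WHERE error != ''
LIMIT 100;`

var debugZipTables = map[string]tableRegistryEntry{
	"crdb_internal.probe_ranges_1s_write_limit_100": {
		customQueryRedacted:   probeRangesQuery,
		customQueryUnredacted: probeRangesQuery,
	},
}

func main() {
	fmt.Println(debugZipTables["crdb_internal.probe_ranges_1s_write_limit_100"].customQueryUnredacted)
}
```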

Contributor

@maryliag maryliag left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @j82w)


pkg/cli/zip_table_registry.go line 231 at r1 (raw file):

		FROM crdb_internal.cluster_settings`,
	},
	"crdb_internal.probe_ranges_1s_write_limit_100": {

I believe this should be just the name of the table, with no extra info, so no 1s_write_limit_100

@j82w j82w marked this pull request as draft July 27, 2023 17:46
Contributor

@abarganier abarganier left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @j82w and @maryliag)


pkg/cli/zip_table_registry.go line 231 at r1 (raw file):

Previously, maryliag (Marylia Gutierrez) wrote…

I believe this should be just the name of the table, with no extra info, so no 1s_write_limit_100

In the case where no custom queries are defined, this table name is used to construct the query, so it indeed has to be the name of the actual table.

But, if you use custom queries, I don't think it has to be a real table name, because we return the custom query instead. In this case, the name would just be used as the filename that the output is written to.

But in that case, as noted in another comment below, you also need to define a customQueryUnredacted. Otherwise, the name will be used to make a TABLE query, which will error if the name doesn't match any real table.
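
A rough sketch of the behavior described above, assuming it works the way this comment outlines (this is not the actual pkg/cli implementation): without a custom query the entry name is turned into a TABLE query, and with one the name only determines the output filename.

```go
// Rough sketch of the lookup behavior described above, not the actual
// pkg/cli implementation: entry names double as table names only when no
// custom query is registered.
package main

import "fmt"

type tableRegistryEntry struct {
	customQueryUnredacted string
}

// queryFor returns the SQL that debug.zip would run for a registry entry.
func queryFor(name string, e tableRegistryEntry) string {
	if e.customQueryUnredacted != "" {
		// With a custom query, the entry name is only used as the
		// filename the output is written to inside debug.zip.
		return e.customQueryUnredacted
	}
	// Without one, the name itself must be a real table.
	return "TABLE " + name
}

func main() {
	fmt.Println(queryFor("crdb_internal.cluster_settings", tableRegistryEntry{}))
	fmt.Println(queryFor("crdb_internal.probe_ranges_1s_write_limit_100", tableRegistryEntry{
		customQueryUnredacted: "SELECT * FROM crdb_internal.probe_ranges(INTERVAL '1000ms', 'write') WHERE error != '' LIMIT 100",
	}))
}
```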

Contributor Author

@j82w j82w left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @maryliag)


pkg/cli/zip_table_registry.go line 231 at r1 (raw file):

Previously, abarganier (Alex Barganier) wrote…

In the case where no custom queries are defined, this table name is used to construct the query, so it indeed has to be the name of the actual table.

But, if you use custom queries, I don't think it has to be a real table name, because we return the custom query instead. In this case, the name would just be used as the filename that the output is written to.

But in that case, as noted in another comment below, you also need to define a customQueryUnredacted. Otherwise, the name will be used to make a TABLE query, which will error if the name doesn't match any real table.

I did this on purpose to avoid confusion. Almost all the other tables in the debug zip are SELECT * without any constraints. People will assume that is the case for this information too, which will cause confusion. Once they realize it's limited, they will want to know how and by how much. I figured putting it in the table name made it very clear to everyone looking at the file what was actually in it.


pkg/cli/zip_table_registry.go line 232 at r1 (raw file):

Previously, maryliag (Marylia Gutierrez) wrote…

is this right? I'm on master and getting:

[email protected]:26257/movr> SELECT count(1) FROM crdb_internal.probe_ranges(INTERVAL '1000ms', 'write') WHERE error != '' ORDER BY
                        -> end_to_end_latency DESC LIMIT 100;
ERROR: column "end_to_end_latency" does not exist

I meant to upload the PR as a draft.

Contributor

@abarganier abarganier left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @j82w and @maryliag)


pkg/cli/zip_table_registry.go line 231 at r1 (raw file):

Previously, j82w (Jake) wrote…

I did this on purpose to avoid confusion. Almost all the other tables in the debug zip are SELECT * without any constraints. People will assume that is the case for this information too, which will cause confusion. Once they realize it's limited, they will want to know how and by how much. I figured putting it in the table name made it very clear to everyone looking at the file what was actually in it.

Gotcha! To clarify, I'm OK with that approach, we'll just need the customQueryUnredacted in that case 👍

@j82w j82w marked this pull request as ready for review July 27, 2023 22:40
PR #79546 introduces `crdb_internal.probe_range`. This PR adds
`crdb_internal.probe_range` to debug.zip. The LIMIT bounds how long this
can take to roughly 1000ms × 100 (about 100 seconds), so that running
debug.zip against an unavailable cluster won't take too long.

closes: #80360

Release note (cli change): debug.zip now includes the
`crdb_internal.probe_range` table, limited to 100 rows to keep the
query from taking too long.
Contributor

@maryliag maryliag left a comment

:lgtm:

Reviewed 10 of 10 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @j82w)

@j82w
Contributor Author

j82w commented Jul 28, 2023

bors r+

@craig
Contributor

craig bot commented Jul 28, 2023

This PR was included in a batch that was canceled, it will be automatically retried

@craig craig bot merged commit 04c91a5 into cockroachdb:master Jul 28, 2023
@craig
Contributor

craig bot commented Jul 28, 2023

Build succeeded:

@joshimhoff
Collaborator

I bet y'all are aware but probe_ranges recently had an egregious bug: #101549

Can I make a few requests, based on the postmortem that came out of ^^ (https://cockroachlabs.atlassian.net/wiki/spaces/ENG/pages/2985951249)?

  1. Can you get approval from KV regarding this PR? Probing is dangerous. Want to keep them in the loop as the owner of the system under probe.
  2. If there are backports, can you send them to KV for review?
  3. Can you discuss with KV any changes to system tests that we may need to make? This was a postmortem AI from the linked PM.

Thank you for considering!

@nvanbenschoten
Member

Thanks for bringing this to attention @joshimhoff. The change doesn't strike me as inherently dangerous, but I think this may be the first time we are introducing writes into the debug.zip process, meaning that it will now mutate the cluster that it is meant to observe. Is this risky? Maybe. Could it change the state of the cluster in ways that hide the artifacts we are trying to introspect? Maybe.

Did we consider using a read-only probe to start? Are there other parts of debug.zip which write to the cluster?

Independently, I don't know how debug.zip works, but I see that it runs SQL. Is the SQL query guaranteed to be executed by a node with a fix to #101549 if run in a mixed version cluster?

@joshimhoff
Collaborator

joshimhoff commented Aug 3, 2023

Independently, I don't know how debug.zip works, but I see that it runs SQL. Is the SQL query guaranteed to be executed by a node with a fix to #101549 if run in a mixed version cluster?

Great question. Isn't this debug zip code run on the CLI side? I think it is sorta common to run a new cockroach CLI against an old server.

Moving to a read-only probe is a cheap fix that avoids needing to check that the fix for #101549 is running on the server side.

If we want to do write probes, I guess we could do a server version check somehow.

My soft preference is to do read-only probes for now.
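
For concreteness, the read-only variant being proposed here (and later adopted in #108313) only changes the probe type argument from 'write' to 'read'. A minimal sketch, with the projection and LIMIT as placeholders:

```go
// Minimal sketch of the read-only probe query discussed here; the exact
// column list and LIMIT are placeholders. The point is the 'read' probe
// type, which avoids issuing writes against the cluster being inspected.
package main

import "fmt"

const probeRangesReadOnly = `SELECT *
FROM crdb_internal.probe_ranges(INTERVAL '1000ms', 'read')
WHERE error != ''
LIMIT 100;`

func main() {
	// A read probe exercises the same ranges without writing, so
	// debug.zip no longer mutates the cluster it is meant to observe.
	fmt.Println(probeRangesReadOnly)
}
```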

@joshimhoff
Collaborator

Friendly poke, @j82w.

@j82w
Contributor Author

j82w commented Aug 7, 2023

The read-only probe seems like a good idea. I'm not aware of other parts of debug.zip that write to the cluster. I was just trying to improve observability by closing the issue: #80360. I wasn't aware there was a read-only option. Can you file an issue or send a PR to switch it to read-only?

@dhartunian
Collaborator

@j82w to clarify a bit why Josh is concerned here:

The blast radius of the bug that led to #101549 was more or less contained because we only triggered that builtin ourselves on managed clusters. By running it as part of debug.zip we expose all customers on older versions to this bug, via a command that we often tell customers to run. This could result in a bad outcome if someone uses a new CLI build against an older self-hosted cluster.

dhartunian added a commit to dhartunian/cockroach that referenced this pull request Aug 7, 2023
To reduce the chance of corruption issues when run against older
clusters, the recently added use of `crdb_internal.probe_ranges` is
modified to use a `read` probe instead of a `write` probe.

See discussion in cockroachdb#107720

Release note: None
Epic: None
@dhartunian
Collaborator

I put up a PR to modify: #108313

@j82w
Contributor Author

j82w commented Aug 7, 2023

Thanks for the PR. I agree with your concerns. I'm not that familiar with this code, so I was looking for a consensus on whether the correct solution was to revert the PR or to switch it to reads.

dhartunian added a commit to dhartunian/cockroach that referenced this pull request Aug 7, 2023
To reduce the chance of corruption issues when run against older
clusters, the recently added use of `crdb_internal.probe_ranges` is
modified to use a `read` probe instead of a `write` probe.

See discussion in cockroachdb#107720

Release note: None
Epic: None
craig bot pushed a commit that referenced this pull request Aug 8, 2023
106130: sql: add extra information to protocol errors in bind r=rafiss a=cucaroach

A user is running into mysterious protocol errors when using prepared
statements. Add some extra information to the error message to help
guide the investigation.

Informs: https://github.com/cockroachlabs/support/issues/2184
Release note: none
Epic: none


107912: sql: fix exec+audit logs for BEGIN, COMMIT, SET stmts r=rafiss a=rafiss

Epic: None
Release note (bug fix): Fixed a bug where BEGIN, COMMIT, SET, ROLLBACK, and SAVEPOINT statements would not be written to the execution or audit logs.

108093: jobs: avoid crdb_internal.system_jobs in gc-jobs r=miretskiy a=stevendanna

crdb_internal.system_jobs is a virtual table that joins information from the jobs table and the job_info table.

For the previous query,

    SELECT id, payload, status FROM "".crdb_internal.system_jobs
    WHERE (created < $1) AND (id > $2)
    ORDER BY id
    LIMIT $3

this is a little suboptimal because:

- We don't make use of the progress column so any read of that is useless.

- While the crdb_internal.system_jobs virtual table has a virtual index on job id, EXPLAIN will even claim that it will be used:

      • limit
      │ count: 100
      │
      └── • filter
          │ filter: created < '2023-07-20 07:29:01.17001'
          │
          └── • virtual table
                table: system_jobs@system_jobs_id_idx
                spans: [/101 - ]

  This is actually a lie. A virtual index can only handle single-key spans. As a result the unconstrained query is used:
```
    WITH
        latestpayload AS (SELECT job_id, value FROM system.job_info AS payload WHERE info_key = 'legacy_payload' ORDER BY written DESC),
        latestprogress AS (SELECT job_id, value FROM system.job_info AS progress WHERE info_key = 'legacy_progress' ORDER BY written DESC)
    SELECT
       DISTINCT(id), status, created, payload.value AS payload, progress.value AS progress,
                created_by_type, created_by_id, claim_session_id, claim_instance_id, num_runs, last_run, job_type
    FROM system.jobs AS j
    INNER JOIN latestpayload AS payload ON j.id = payload.job_id
    LEFT JOIN latestprogress AS progress ON j.id = progress.job_id
```
  which has a full scan of the jobs table and 2 full scans of the info table:

      • distinct
      │ distinct on: id, value, value
      │
      └── • merge join
          │ equality: (job_id) = (id)
          │
          ├── • render
          │   │
          │   └── • filter
          │       │ estimated row count: 7,318
          │       │ filter: info_key = 'legacy_payload'
          │       │
          │       └── • scan
          │             estimated row count: 14,648 (100% of the table; stats collected 39 minutes ago; using stats forecast for 2 hours in the future)
          │             table: job_info@primary
          │             spans: FULL SCAN
          │
          └── • merge join (right outer)
              │ equality: (job_id) = (id)
              │ right cols are key
              │
              ├── • render
              │   │
              │   └── • filter
              │       │ estimated row count: 7,317
              │       │ filter: info_key = 'legacy_progress'
              │       │
              │       └── • scan
              │             estimated row count: 14,648 (100% of the table; stats collected 39 minutes ago; using stats forecast for 2 hours in the future)
              │             table: job_info@primary
              │             spans: FULL SCAN
              │
              └── • scan
                    missing stats
                    table: jobs@primary
                    spans: FULL SCAN

  Because of the limit, I don't think this ends up being as bad as it looks. But it still isn't great.

In this PR, we replace the crdb_internal.system_jobs query with one that removes the join on the unused progress field and also constrains the scan of the job_info table.

      • distinct
      │ distinct on: id, value
      │
      └── • merge join
          │ equality: (job_id) = (id)
          │ right cols are key
          │
          ├── • render
          │   │
          │   └── • filter
          │       │ estimated row count: 7,318
          │       │ filter: info_key = 'legacy_payload'
          │       │
          │       └── • scan
          │             estimated row count: 14,646 (100% of the table; stats collected 45 minutes ago; using stats forecast for 2 hours in the future)
          │             table: job_info@primary
          │             spans: [/101/'legacy_payload' - ]
          │
          └── • render
              │
              └── • limit
                  │ count: 100
                  │
                  └── • filter
                      │ filter: created < '2023-07-20 07:29:01.17001'
                      │
                      └── • scan
                            missing stats
                            table: jobs@primary
                            spans: [/101 - ]

In a local example, this does seem faster:

    > SELECT id, payload, status, created
    > FROM "".crdb_internal.system_jobs
    > WHERE (created < '2023-07-20 07:29:01.17001') AND (id > 100) ORDER BY id LIMIT 100;

    id | payload | status | created
    -----+---------+--------+----------
    (0 rows)

    Time: 183ms total (execution 183ms / network 0ms)

    > WITH
    > latestpayload AS (
    >     SELECT job_id, value
    >     FROM system.job_info AS payload
    >     WHERE job_id > 100 AND info_key = 'legacy_payload'
    >     ORDER BY written desc
    > ),
    > jobpage AS (
    >     SELECT id, status, created
    >     FROM system.jobs
    >     WHERE (created < '2023-07-20 07:29:01.17001') and (id > 100)
    >     ORDER BY id
    >     LIMIT 100
    > )
    > SELECT distinct (id), latestpayload.value AS payload, status
    > FROM jobpage AS j
    > INNER JOIN latestpayload ON j.id = latestpayload.job_id;
      id | payload | status
    -----+---------+---------
    (0 rows)

    Time: 43ms total (execution 42ms / network 0ms)

Release note: None

Epic: none

108313: cli: debug zip uses read-only range probe r=joshimhoff a=dhartunian

To reduce the chance of corruption issues when run against older clusters, the recently added use of `crdb_internal.probe_ranges` is modified to use a `read` probe instead of a `write` probe.

See discussion in #107720

Release note: None
Epic: None

108335: kvcoord: Implement CloseStream for MuxRangeFeed r=miretskiy a=miretskiy

Extend the MuxRangeFeed protocol to support an explicit,
caller-initiated CloseStream operation.

The caller may decide to stop receiving events for a particular stream, which is part of MuxRangeFeed. The caller may issue a request to the MuxRangeFeed server to close the stream. The server will cancel the underlying rangefeed and return a `RangeFeedRetryError_REASON_RANGEFEED_CLOSED` error as a response.

Note: the current mux rangefeed client does not use this request. The code to support cancellation is added preemptively in case this functionality is required in the future to support restarts due to stuck rangefeeds.

Epic: CRDB-26372
Release note: None

108355: roachtest/multitenant-upgrade: hard-code predecessor version temporarily r=knz a=healthy-pod

Hard-code the predecessor release to 23.1.4 until a new patch release (23.1.9) is out, because the test is incompatible with 23.1.{5,6,7,8} due to an erroneous PR merged on the 23.1 branch.

Release note: None
Epic: none

108375: changefeedccl: Increase test utility timeout r=miretskiy a=miretskiy

As observed in #108348, a test failed because it timed out reading a
row. Yet, this flake could not be reproduced in over 50k runs. Bump the
timeout period to make this flake even less likely.

Fixes #108348

Release note: None

Co-authored-by: Tommy Reilly <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: Steven Danna <[email protected]>
Co-authored-by: David Hartunian <[email protected]>
Co-authored-by: Yevgeniy Miretskiy <[email protected]>
Co-authored-by: healthy-pod <[email protected]>