cli: add probe_range in debug.zip #107720
Conversation
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @j82w)
pkg/cli/zip_table_registry.go
line 232 at r1 (raw file):
}, "crdb_internal.probe_ranges_1s_write_limit_100": { customQueryRedacted: `SELECT count(1) FROM crdb_internal.probe_ranges(INTERVAL '1000ms', 'write') WHERE error != '' ORDER BY end_to_end_latency DESC LIMIT 100;
is this right? I'm on master and getting:
[email protected]:26257/movr> SELECT count(1) FROM crdb_internal.probe_ranges(INTERVAL '1000ms', 'write') WHERE error != '' ORDER BY
-> end_to_end_latency DESC LIMIT 100;
ERROR: column "end_to_end_latency" does not exist
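(For anyone reproducing this: a quick way to see which columns the builtin actually exposes is below. On a recent master the latency column appears to be named `end_to_end_latency_ms` rather than `end_to_end_latency`, but treat that as an assumption to verify.)

```sql
-- Sketch: inspect the columns returned by the range-probe builtin.
SELECT * FROM crdb_internal.probe_ranges(INTERVAL '1000ms', 'write') LIMIT 1;
```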
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @j82w)
pkg/cli/zip_table_registry.go
line 232 at r1 (raw file):
}, "crdb_internal.probe_ranges_1s_write_limit_100": { customQueryRedacted: `SELECT count(1) FROM crdb_internal.probe_ranges(INTERVAL '1000ms', 'write') WHERE error != '' ORDER BY end_to_end_latency DESC LIMIT 100;
I don't see any columns getting marked as sensitive for this table, so you can use customQueryUnredacted here too.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @j82w)
pkg/cli/zip_table_registry.go
line 231 at r1 (raw file):
FROM crdb_internal.cluster_settings`, }, "crdb_internal.probe_ranges_1s_write_limit_100": {
I believe this should be just the name of the table, with no extra info, so no 1s_write_limit_100
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @j82w and @maryliag)
pkg/cli/zip_table_registry.go
line 231 at r1 (raw file):
Previously, maryliag (Marylia Gutierrez) wrote…
I believe this should be just the name of the table, with no extra info, so no 1s_write_limit_100
In the case where no custom queries are defined, this table name is used to construct the query, so it indeed has to be the name of the actual table.
But, if you use custom queries, I don't think it has to be a real table name, because we return the custom query instead. In this case, the name would just be used as the filename that the output is written to.
But in that case, as noted in another comment below, you also need to define a customQueryUnredacted. Otherwise, the name will be used to make a TABLE query, which will error if the name doesn't match any real table.
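To make that concrete, here is a quick sketch of what that fallback amounts to when the registry key is a synthetic name (illustrative only; the exact error text will differ):

```sql
-- Sketch: with no custom query defined, the registry key is queried as a table.
TABLE crdb_internal.probe_ranges_1s_write_limit_100;
-- Fails: no table with that name exists. The real object is the
-- crdb_internal.probe_ranges() builtin, which only a custom query can call.
```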
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @maryliag)
pkg/cli/zip_table_registry.go
line 231 at r1 (raw file):
Previously, abarganier (Alex Barganier) wrote…
In the case where no custom queries are defined, this table name is used to construct the query, so it indeed has to be the name of the actual table.
But, if you use custom queries, I don't think it has to be a real table name, because we return the custom query instead. In this case, the name would just be used as the filename that the output is written to.
But in that case, as noted in another comment below, you also need to define a customQueryUnredacted. Otherwise, the name will be used to make a TABLE query, which will error if the name doesn't match any real table.
I did this on purpose to avoid confusion. Almost all the other tables in debug zip are select * without any constraints. People will assume that is the case for this information too which will cause confusion. Once they realize it's limited they will want to know how and by how much. I figured putting it in the table name made it very clear to everyone looking at the file what was actually in it.
pkg/cli/zip_table_registry.go
line 232 at r1 (raw file):
Previously, maryliag (Marylia Gutierrez) wrote…
is this right? I'm on master and getting:
[email protected]:26257/movr> SELECT count(1) FROM crdb_internal.probe_ranges(INTERVAL '1000ms', 'write') WHERE error != '' ORDER BY -> end_to_end_latency DESC LIMIT 100; ERROR: column "end_to_end_latency" does not exist
I meant to upload the PR as a draft.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @j82w and @maryliag)
pkg/cli/zip_table_registry.go
line 231 at r1 (raw file):
Previously, j82w (Jake) wrote…
I did this on purpose to avoid confusion. Almost all the other tables in debug zip are select * without any constraints. People will assume that is the case for this information too which will cause confusion. Once they realize it's limited they will want to know how and by how much. I figured putting it in the table name made it very clear to everyone looking at the file what was actually in it.
Gotcha! To clarify, I'm OK with that approach, we'll just need the customQueryUnredacted in that case 👍
PR #79546 introduces `crdb_internal.probe_range`. This PR adds `crdb_internal.probe_range` to the debug.zip. The LIMIT gives a rough ~1000ms*100 upper bound on how long this can take, so that running debug.zip against an unavailable cluster won't take too long.

closes: #80360

Release note (cli change): The debug.zip now includes the `crdb_internal.probe_range` table, limited to 100 rows to keep the query from taking too long.
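A rough sketch of the bounded query the description refers to (the exact shape and the `end_to_end_latency_ms` column name are assumed here, not copied from the final diff): with LIMIT 100 and a 1000ms probe interval, the worst case works out to roughly 100 × 1s ≈ 100 seconds.

```sql
-- Sketch of the bounded probe query debug.zip would run (assumed shape):
SELECT range_id, error, end_to_end_latency_ms
FROM crdb_internal.probe_ranges(INTERVAL '1000ms', 'write')
WHERE error != ''
ORDER BY end_to_end_latency_ms DESC
LIMIT 100;  -- caps the worst case at roughly 100 probes x 1s each
```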
Reviewed 10 of 10 files at r2, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @j82w)
bors r+
This PR was included in a batch that was canceled, it will be automatically retried
Build succeeded:
I bet y'all are aware, but can I make a few requests, based on the postmortem that came out of ^^ (https://cockroachlabs.atlassian.net/wiki/spaces/ENG/pages/2985951249)?
Thank you for considering!
Thanks for bringing this to attention @joshimhoff. The change doesn't strike me as inherently dangerous, but I think this may be the first time we are introducing writes into the debug.zip path. Did we consider using a read-only probe to start? Are there other parts of debug.zip that issue writes? Independently, I don't know how debug.zip works, but I see that it runs SQL. Is the SQL query guaranteed to be executed by a node with a fix to #101549 if run in a mixed version cluster?
Great question. Isn't this debug zip code run on the CLI side? I think it is sorta common to run a newer CLI against an older cluster. Moving to a read-only probe is a cheap fix to avoid needing to check that #101549 is running on the server side. If we want to do write probes, I guess we could do a server version check somehow. My soft preference is to do read-only probes for now.
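(If we did want to keep write probes, one possible pre-flight check from the CLI side could look roughly like the sketch below; this is just an illustration of "a server version check somehow", not something the PR implements.)

```sql
-- Sketch: ask the server for its active cluster version before issuing write probes.
SHOW CLUSTER SETTING version;
```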
Friendly poke, @j82w. |
The read-only probe seems like a good idea. I'm not aware of other parts of debug.zip that do writes.
@j82w to clarify a bit why Josh is concerned here: The blast radius of the bug that led to #101549 was more or less contained because we only triggered that builtin ourselves on managed clusters. By running it as part of debug zip we expose all customers on older versions to this bug. Moreover, this would be via a command that we often tell customers to run. Could result in a bad outcome if someone uses a new CLI build to run on an older self-hosted cluster.
To reduce the chance of corruption issues when run against older clusters, the recently added use of `crdb_internal.probe_ranges` is modified to use a `read` probe instead of a `write` probe. See discussion in cockroachdb#107720
Release note: None
Epic: None
I put up a PR to modify: #108313
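(For reference, the read-only variant amounts to swapping the probe type argument; this is a sketch, see #108313 for the actual diff.)

```sql
-- Sketch: only the probe type changes, 'write' -> 'read'.
SELECT * FROM crdb_internal.probe_ranges(INTERVAL '1000ms', 'read')
WHERE error != '' LIMIT 100;
```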
Thanks for the PR. I agree with your concerns. I'm not that familiar with this code, so I was looking for a consensus on whether the correct solution was to revert the PR or to switch it to reads.
106130: sql: add extra information to protocol errors in bind r=rafiss a=cucaroach — A user is running into mysterious protocol errors when using prepared statements; add extra information to the error message to help guide the investigation. Informs: https://github.com/cockroachlabs/support/issues/2184

107912: sql: fix exec+audit logs for BEGIN, COMMIT, SET stmts r=rafiss a=rafiss — Release note (bug fix): Fixed a bug where BEGIN, COMMIT, SET, ROLLBACK, and SAVEPOINT statements would not be written to the execution or audit logs.

108093: jobs: avoid crdb_internal.system_jobs in gc-jobs r=miretskiy a=stevendanna — The previous gc-jobs query against crdb_internal.system_jobs read the unused progress column and, because the virtual index on job id can only handle single-key spans, fell back to full scans of system.jobs and system.job_info. Replace it with a query that drops the progress join and constrains the job_info scan; in a local example the rewritten query ran in ~43ms vs ~183ms for the original.

108313: cli: debug zip uses read-only range probe r=joshimhoff a=dhartunian — To reduce the chance of corruption issues when run against older clusters, the recently added use of `crdb_internal.probe_ranges` is modified to use a `read` probe instead of a `write` probe. See discussion in #107720

108335: kvcoord: Implement CloseStream for MuxRangeFeed r=miretskiy a=miretskiy — Extend the MuxRangeFeed protocol to support an explicit, caller-initiated CloseStream operation; the server cancels the underlying rangefeed and returns a `RangeFeedRetryError_REASON_RANGEFEED_CLOSED` error. The current mux rangefeed client does not use this request; the cancellation code is added pre-emptively in case it is needed to support restarts due to stuck rangefeeds.

108355: roachtest/multitenant-upgrade: hard-code predecessor version temporarily r=knz a=healthy-pod — Hard-code the predecessor release to 23.1.4 until a new patch release (23.1.9) is out, because the test is incompatible with 23.1.{5,6,7,8} due to an erroneous PR merged on the 23.1 branch.

108375: changefeedccl: Increase test utility timeout r=miretskiy a=miretskiy — A test failed because it timed out reading a row, yet the flake could not be reproduced in over 50k runs; bump the timeout to make it even less likely. Fixes #108348

Co-authored-by: Tommy Reilly, Rafi Shamim, Steven Danna, David Hartunian, Yevgeniy Miretskiy, healthy-pod