cli: implement fallback query for transaction_contention_events when statement times out #123964

xinhaoz · 2024-05-10T19:01:54Z

#118478 aimed to add the query texts for transactions in crdb_internal.transaction_contention_events to the debug zip. We've seen this query time out on clusters experiencing a lot of contention. The join that collects the blocking transaction's query strings results in a full scan of the system.statement_statistics table and should be removed. We can investigate other options or queries that may preserve some of this information.

Update May 16:
We'll do 3 things to improve upon what was added:

1. Increase the default timeout. A couple of other queries (e.g. the cluster locks table) are also timing out, so we should just increase the default timeout. Note that there's also a CLI flag for users to set this.
2. Include a fallback query, in addition to the one that was added above, that collects only the non-expensive information.
3. Improve the runtime of the added query by using the index on statement fingerprint id (we'll have to join on the txn stats table to accomplish this).

Jira issue: CRDB-38628

Epic CRDB-35278


126352: cli: add fallback query support for debug zip r=xinhaoz a=dhartunian

Previously, when SQL queries for dumping tables to debug zip would fail, we would have no follow-up. Engineers can now define "fallback" queries for tables in debug zip in order to make a second attempt with a simpler query.

Often we want to run a more complex query to gather more debug data, but these queries can fail when the cluster is experiencing problems. This change gives us a chance to define a simpler approach that can be attempted when necessary.

In order to define a fallback, there are two new optional fields in the `TableRegistryConfig` struct for redacted and unredacted queries respectively.

Debug zip output will still include the failed attempts at the original query along with the error message file as before. If a fallback query is defined, that query will produce its own output (and error) file with an additional `.fallback` suffix added to the base table name to identify it.

Resolves: #123964
Epic: CRDB-35278
Release note: None

126354: ui: alter role events render correctly r=xinhaoz a=dhartunian

Previously, ALTER ROLE events without role options would render with an "undefined" option in the event log on the DB Console. This change amends the rendering logic to correctly render events without any options.

Resolves #124871
Epic: None
Release note (bug fix,ui change): ALTER ROLE events in the DB Console event log now render correctly when the event does not contain any role options.

126486: kvserver/rangefeed: remove lockedRangefeedStream r=nvanbenschoten a=wenyihu6

**kvserver: wrap kvpb.RangeFeedEventSink in Stream**

Previously, we declared the same interface signature twice: once in kvpb.RangeFeedEventSink and again in rangefeed.Stream. This patch embeds kvpb.RangeFeedEventSink inside rangefeed.Stream, making rangefeed.Stream a superset of kvpb.RangeFeedEventSink. This approach makes sense, as each rangefeed server stream should be a rangefeed event sink, capable of making thread-safe rangefeed event sends.

Epic: none
Release note: none

---

**kvserver/rangefeed: remove lockedRangefeedStream**

Previously, we created separate locked rangefeed streams for each individual rangefeed stream to ensure Send can be called concurrently, as the underlying grpc stream is not thread safe. However, since the introduction of the mux rangefeed support, we already have a dedicated lock for the underlying mux stream, making the Send method on each rangefeed stream thread safe already. This patch removes the redundant locks from each individual rangefeed stream.

Epic: none
Release note: none

126487: kvserver/rangefeed: remove non-mux rangefeed metrics r=nvanbenschoten a=wenyihu6

Previously, we removed non-mux rangefeed code in #125610. However, that patch forgot to remove non-mux rangefeed metrics. This patch removes these metrics as they are no longer needed.

Epic: none
Release note: none

126498: status: fix TestTenantStatusAPI test r=xinhaoz a=dhartunian

Previously, this test would use a single connection, cancel it, and then use the connection to verify the cancellation. The test is adjusted here to use two separate sessions, one to cancel for testing, and another to observe the cancellation.

Resolves: #125404
Epic: None
Release note: None

126524: sql: unskip Insights test r=dhartunian a=dhartunian

This test has been flaky for a while because the async tagging of the TransactionID to the insight sometimes takes too long to complete. This change removes that check and unskips the test so that we can catch regressions for this feature. In the future we may want to write a separate test to verify the async transactionID tagging separately.

Resolves: #125771
Resolves: #121986
Epic: None
Release note: None

126533: kv: hook Raft StoreLiveness into storeliveness package r=nvanbenschoten a=nvanbenschoten

Fixes #125242.

This commit adds a `replicaRLockedStoreLiveness` adapter type to hook the raft store liveness into the storeliveness package. This is currently unused.

Release note: None

126536: roachpb: add Leader lease type definition r=nvanbenschoten a=nvanbenschoten

Fixes #125225.

This commit adds a new `Term` field to the Lease struct. This field defines the term of the raft leader that a leader lease is associated with. The lease is valid for as long as the raft leader has a guarantee from store liveness that it remains the leader under this term. The lease is invalid if the raft leader loses leadership (i.e. changes its term). The field is not yet used.

Release note: None

Co-authored-by: David Hartunian <[email protected]>
Co-authored-by: Wenyi Hu <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>


xinhaoz added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label May 10, 2024

xinhaoz self-assigned this May 10, 2024

xinhaoz added the T-observability label May 10, 2024

xinhaoz changed the title ~~cli: unredacted debug zip for transaction_contention_events table too slow~~ cli: unredacted debug zip query for transaction_contention_events table too slow May 10, 2024

exalate-issue-sync bot changed the title ~~cli: unredacted debug zip query for transaction_contention_events table too slow~~ cli: unredacted debug zip for transaction_contention_events table too slow May 10, 2024

exalate-issue-sync bot changed the title ~~cli: unredacted debug zip for transaction_contention_events table too slow~~ cli: unredacted debug zip query for transaction_contention_events table too slow May 10, 2024

exalate-issue-sync bot changed the title ~~cli: unredacted debug zip query for transaction_contention_events table too slow~~ cli: implement fallback query for transaction_contention_events when statement times out May 28, 2024

exalate-issue-sync bot assigned dhartunian and unassigned xinhaoz May 28, 2024

exalate-issue-sync bot unassigned dhartunian Jun 10, 2024

exalate-issue-sync bot assigned dhartunian Jun 25, 2024

dhartunian mentioned this issue Jun 27, 2024

cli: add fallback query support for debug zip #126352

Merged

craig bot closed this as completed in 535bef4 Jul 2, 2024

dhartunian mentioned this issue Jul 9, 2024

release-24.1: cli: add fallback query support for debug zip #126893

Merged

dhartunian mentioned this issue Jul 12, 2024

release-23.2: cli: add fallback query support for debug zip #127068

Merged

dhartunian mentioned this issue Jul 12, 2024

release-23.1: cli: add fallback query support for debug zip #127069

Merged

lunevalex added the branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 label Aug 1, 2024

dhartunian mentioned this issue Aug 2, 2024

release-23.1: cli: add fallback query support for debug zip #128235

Merged
