cli: implement fallback query for transaction_contention_events when statement times out #123964

Closed
xinhaoz opened this issue May 10, 2024 · 0 comments · Fixed by #126352
Labels: branch-release-23.1, C-bug, T-observability

xinhaoz commented May 10, 2024

#118478 aimed to add the query texts for transactions in crdb_internal.transaction_contention_events to the debug zip. We've seen this query time out on clusters experiencing a lot of contention. The join that collects the blocking transaction's query strings results in a full scan of the system.statement_statistics table and should be removed. We can investigate other options or queries that may preserve some of this information.

Update May 16:
We'll do 3 things to improve upon what was added:

  1. Increase the default timeout - a couple of other queries (e.g. the cluster locks table) are also timing out, so we should just increase the default timeout. Note that there's also a CLI flag for users to set this.
  2. Include a fallback query, in addition to the one that was added above, that collects only the inexpensive information
  3. Improve runtime of the added query by using the index on statement fingerprint id (we'll have to join on the txn stats table to accomplish this)

Jira issue: CRDB-38628

Epic CRDB-35278

@xinhaoz xinhaoz added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label May 10, 2024
@xinhaoz xinhaoz self-assigned this May 10, 2024
@xinhaoz xinhaoz changed the title from "cli: unredacted debug zip for transaction_contention_events table too slow" to "cli: unredacted debug zip query for transaction_contention_events table too slow" May 10, 2024
@exalate-issue-sync exalate-issue-sync bot changed the title from "cli: unredacted debug zip query for transaction_contention_events table too slow" to "cli: unredacted debug zip for transaction_contention_events table too slow" May 10, 2024
@exalate-issue-sync exalate-issue-sync bot changed the title from "cli: unredacted debug zip for transaction_contention_events table too slow" to "cli: unredacted debug zip query for transaction_contention_events table too slow" May 10, 2024
@exalate-issue-sync exalate-issue-sync bot changed the title from "cli: unredacted debug zip query for transaction_contention_events table too slow" to "cli: implement fallback query for transaction_contention_events when statement times out" May 28, 2024
@exalate-issue-sync exalate-issue-sync bot assigned dhartunian and unassigned xinhaoz May 28, 2024
craig bot pushed a commit that referenced this issue Jul 2, 2024
126352: cli: add fallback query support for debug zip r=xinhaoz a=dhartunian

Previously, when SQL queries for dumping tables to debug zip would fail, we would have no follow-up. Engineers can now define "fallback" queries for tables in debug zip in order to make a second attempt with a simpler query. Often we want to run a more complex query to gather more debug data but these queries can fail when the cluster is experiencing problems. This change gives us a chance to define a simpler approach that can be attempted when necessary.

In order to define a fallback, there are two new optional fields in the `TableRegistryConfig` struct for redacted and unredacted queries respectively.

Debug zip output will still include the failed attempts at the original query along with the error message file as before. If a fallback query is defined, that query will produce its own output (and error) file with an additional `.fallback` suffix added to the base table name to identify it.
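As a rough sketch of the shape described above, the following shows a config with two optional fallback fields and the `.fallback` suffix on the output name. `tableConfig` is a simplified stand-in for `TableRegistryConfig`; its field names, the helper functions, and the exact file-naming scheme are assumptions made for illustration, not the actual code from the PR.

```go
package main

import "fmt"

// tableConfig is a simplified stand-in for the TableRegistryConfig struct
// named in the commit message. All field names here are assumptions.
type tableConfig struct {
	queryUnredacted         string
	queryRedacted           string
	queryUnredactedFallback string // optional; empty means no fallback
	queryRedactedFallback   string // optional; empty means no fallback
}

// fallbackQuery returns the fallback for the requested mode and whether
// one is defined at all.
func (c tableConfig) fallbackQuery(redact bool) (string, bool) {
	q := c.queryUnredactedFallback
	if redact {
		q = c.queryRedactedFallback
	}
	return q, q != ""
}

// outputBaseName returns the base name used for a table's output file,
// appending ".fallback" for the second attempt so both results can coexist.
func outputBaseName(table string, isFallback bool) string {
	if isFallback {
		return table + ".fallback"
	}
	return table
}

func main() {
	cfg := tableConfig{
		queryUnredacted:         "SELECT /* expensive join */ 1",
		queryUnredactedFallback: "SELECT /* cheap scan */ 1",
	}
	table := "crdb_internal.transaction_contention_events"
	if q, ok := cfg.fallbackQuery(false); ok {
		fmt.Println(outputBaseName(table, true), "<-", q)
	}
	if _, ok := cfg.fallbackQuery(true); !ok {
		fmt.Println("no redacted fallback defined")
	}
}
```

Keeping the fallback fields optional means tables without a fallback behave exactly as before.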

Resolves: #123964
Epic: CRDB-35278

Release note: None

126354: ui: alter role events render correctly r=xinhaoz a=dhartunian

Previously, ALTER ROLE events without role options would render with an "undefined" option in the event log on the DB Console. This change amends the rendering logic to correctly render events without any options.

Resolves #124871
Epic: None

Release note (bug fix,ui change): ALTER ROLE events in the DB Console event log now render correctly when the event does not contain any role options.

126486: kvserver/rangefeed: remove lockedRangefeedStream r=nvanbenschoten a=wenyihu6

**kvserver: wrap kvpb.RangeFeedEventSink in Stream**

Previously, we declared the same interface signature twice: once in
kvpb.RangeFeedEventSink and again in rangefeed.Stream. This patch embeds
kvpb.RangeFeedEventSink inside rangefeed.Stream, making rangefeed.Stream a
superset of kvpb.RangeFeedEventSink. This approach makes sense, as each
rangefeed server stream should be a rangefeed event sink, capable of making
thread-safe rangefeed event sends.
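
The embedding pattern this describes can be illustrated with a small standalone sketch. `eventSink` and `stream` play the roles of `kvpb.RangeFeedEventSink` and `rangefeed.Stream`; the method sets, the `Disconnect` method, and `bufferedStream` are hypothetical and do not mirror the real interfaces.

```go
package main

import "fmt"

// event is a placeholder for a rangefeed event payload.
type event struct{ key string }

// eventSink plays the role of kvpb.RangeFeedEventSink in this sketch.
type eventSink interface {
	Send(e event) error
}

// stream plays the role of rangefeed.Stream. Embedding eventSink pulls in
// Send, so the signature is declared only once. Disconnect is a hypothetical
// extra method that makes stream a strict superset of eventSink.
type stream interface {
	eventSink
	Disconnect(err error)
}

// bufferedStream is a toy implementation that records what it was sent.
type bufferedStream struct {
	events []event
	closed bool
}

func (b *bufferedStream) Send(e event) error {
	if b.closed {
		return fmt.Errorf("stream closed")
	}
	b.events = append(b.events, e)
	return nil
}

func (b *bufferedStream) Disconnect(err error) { b.closed = true }

// publish only needs the sink half of the interface.
func publish(sink eventSink, keys ...string) error {
	for _, k := range keys {
		if err := sink.Send(event{key: k}); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	var s stream = &bufferedStream{}
	// A stream is accepted wherever an eventSink is expected.
	fmt.Println(publish(s, "a", "b"))
	s.Disconnect(nil)
	fmt.Println(publish(s, "c"))
}
```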

Epic: none
Release note: none

---

**kvserver/rangefeed: remove lockedRangefeedStream**

Previously, we created separate locked rangefeed streams for each individual
rangefeed stream to ensure Send can be called concurrently as the underlying
grpc stream is not thread safe. However, since the introduction of the mux
rangefeed support, we already have a dedicated lock for the underlying mux
stream, making the Send method on each rangefeed stream thread safe already.
This patch removes the redundant locks from each individual rangefeed stream.

Epic: none
Release note: none

126487: kvserver/rangefeed: remove non-mux rangefeed metrics r=nvanbenschoten a=wenyihu6

Previously, we removed non-mux rangefeed code in
#125610. However, that patch forgot
to remove non-mux rangefeed metrics. This patch removes these metrics as they
are no longer needed.

Epic: none
Release note: none

126498: status: fix TestTenantStatusAPI test r=xinhaoz a=dhartunian

Previously, this test would use a single connection, cancel it, and then use the connection to verify the cancellation.

The test is adjusted here to use two separate sessions, one to cancel for testing, and another to observe the cancellation.

Resolves: #125404
Epic: None

Release note: None

126524: sql: unskip Insights test r=dhartunian a=dhartunian

This test has been flaky for a while because the async tagging of the TransactionID to the insight sometimes takes too long to complete. This change removes that check and unskips the test so that we can catch regressions for this feature. In the future we may want to write a separate test to verify the async TransactionID tagging.

Resolves: #125771
Resolves: #121986

Epic: None
Release note: None

126533: kv: hook Raft StoreLiveness into storeliveness package r=nvanbenschoten a=nvanbenschoten

Fixes #125242.

This commit adds a `replicaRLockedStoreLiveness` adapter type to hook the raft store liveness into the storeliveness package.

This is currently unused.

Release note: None

126536: roachpb: add Leader lease type definition r=nvanbenschoten a=nvanbenschoten

Fixes #125225.

This commit adds a new `Term` field to the Lease struct. This field defines the term of the raft leader that a leader lease is associated with. The lease is valid for as long as the raft leader has a guarantee from store liveness that it remains the leader under this term. The lease is invalid if the raft leader loses leadership (i.e. changes its term).
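
The validity rule stated above can be written down as a small predicate. This is a sketch only: `lease` is a stripped-down stand-in for the Lease struct, and `leaderState`, its fields, and `leaderLeaseValid` are hypothetical names that do not come from the commit.

```go
package main

import "fmt"

// lease is a simplified stand-in for the Lease struct. Only the Term field
// is taken from the commit message.
type lease struct {
	Term uint64 // term of the raft leader the lease is associated with
}

// leaderState is a hypothetical view of what a replica knows about itself.
type leaderState struct {
	currentTerm uint64
	isLeader    bool
	// supported reports whether store liveness currently guarantees that
	// this replica remains leader under currentTerm.
	supported bool
}

// leaderLeaseValid applies the rule from the commit message: the lease holds
// only while the leader is still leading under the lease's term and has the
// store liveness guarantee. A term change invalidates the lease.
func leaderLeaseValid(l lease, s leaderState) bool {
	return s.isLeader && s.supported && s.currentTerm == l.Term
}

func main() {
	l := lease{Term: 5}
	fmt.Println(leaderLeaseValid(l, leaderState{currentTerm: 5, isLeader: true, supported: true}))
	fmt.Println(leaderLeaseValid(l, leaderState{currentTerm: 6, isLeader: true, supported: true}))
}
```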

The field is not yet used.

Release note: None

Co-authored-by: David Hartunian
Co-authored-by: Wenyi Hu
Co-authored-by: Nathan VanBenschoten
@craig craig bot closed this as completed in 535bef4 Jul 2, 2024
dhartunian added a commit to dhartunian/cockroach that referenced this issue Jul 9, 2024
Previously, when SQL queries for dumping tables to debug zip would
fail, we would have no follow-up. Engineers can now define "fallback"
queries for tables in debug zip in order to make a second attempt
with a simpler query. Often we want to run a more complex query to
gather more debug data but these queries can fail when the cluster
is experiencing problems. This change gives us a chance to define a
simpler approach that can be attempted when necessary.

In order to define a fallback, there are two new optional fields in
the `TableRegistryConfig` struct for redacted and unredacted queries
respectively.

Debug zip output will still include the failed attempts at the
original query along with the error message file as before. If a
fallback query is defined, that query will produce its own output (and
error) file with an additional `.fallback` suffix added to the base
table name to identify it.

Resolves: cockroachdb#123964
Epic: CRDB-35278

Release note: None
dhartunian added a commit to dhartunian/cockroach that referenced this issue Jul 12, 2024
dhartunian added a commit to dhartunian/cockroach that referenced this issue Jul 18, 2024
dhartunian added a commit to dhartunian/cockroach that referenced this issue Jul 18, 2024
dhartunian added a commit to dhartunian/cockroach that referenced this issue Jul 18, 2024
@lunevalex lunevalex added the branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 label Aug 1, 2024
dhartunian added a commit to dhartunian/cockroach that referenced this issue Aug 5, 2024
dhartunian added a commit to dhartunian/cockroach that referenced this issue Aug 5, 2024