-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sqlstats: improve observability into statement activity update job #119779
Labels
A-observability-inf
C-enhancement
Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
Comments
abarganier
added
C-enhancement
Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
T-observability-inf
labels
Feb 29, 2024
exalate-issue-sync
bot
added
A-observability-inf
and removed
A-observability-inf
T-cluster-observability
labels
Feb 29, 2024
abarganier
added a commit
to abarganier/cockroach
that referenced
this issue
Mar 14, 2024
Addresses: cockroachdb#119779 Currently, the SQL activity update job is lacking observability. While we have a metric for job failures, we've seen instances whe the query run by the job gets caught in a retry loop, meaning the metric is rarely incremented. Therefore, additional metrics, such as counts of successful runs, and the latency of successful runs, will be helpful to further inspect the state of the job. This patch adds metrics for both. Release note (ops change): Two new metrics have been added to track the status of the SQL activity update job, which is used to pre-aggregate top K information within the SQL stats subsytem and write the results to `system.statement_activity` and `system.transaction_activity`. The new metrics are: - `sql.stats.activity_job.runs.successful`: Number of successful runs made by the SQL activity updater job - `sql.stats.activity_job.latency`: The latency of successful runs made by the SQL activity updater job
abarganier
added a commit
to abarganier/cockroach
that referenced
this issue
Mar 14, 2024
Addresses: cockroachdb#119779 We've had escalations recently involving the SQL activity update job running for extended periods of time, such that the signal made to the job indicating a flush has completed was not received because there was no listener. While we've added a default case to prevent this from hanging the flush job, and some logging to go with it, a counter metric indicating when this occurs would also be useful to have when debugging. This patch adds such a counter. Release note (ops change): A new counter metric, `sql.stats.flush.done_signals_ignored`, has been introduced. The metric tracks the number of times the SQL Stats activity update job ignored the signal sent to it indicating a flush has completed. This may indicate that the SQL Activity update job is taking longer than expected to complete.
abarganier
added a commit
to abarganier/cockroach
that referenced
this issue
Mar 18, 2024
Addresses: cockroachdb#119779 Currently, the SQL activity update job is lacking observability. While we have a metric for job failures, we've seen instances whe the query run by the job gets caught in a retry loop, meaning the metric is rarely incremented. Therefore, additional metrics, such as counts of successful runs, and the latency of successful runs, will be helpful to further inspect the state of the job. This patch adds metrics for both. Release note (ops change): Two new metrics have been added to track the status of the SQL activity update job, which is used to pre-aggregate top K information within the SQL stats subsytem and write the results to `system.statement_activity` and `system.transaction_activity`. The new metrics are: - `sql.stats.activity.updates.successful`: Number of successful updates made by the SQL activity updater job. - `sql.stats.activity.update.latency`: The latency of updates made by the SQL activity updater job. Includes failed update attempts.
abarganier
added a commit
to abarganier/cockroach
that referenced
this issue
Mar 18, 2024
Addresses: cockroachdb#119779 We've had escalations recently involving the SQL activity update job running for extended periods of time, such that the signal made to the job indicating a flush has completed was not received because there was no listener. While we've added a default case to prevent this from hanging the flush job, and some logging to go with it, a counter metric indicating when this occurs would also be useful to have when debugging. This patch adds such a counter. Release note (ops change): A new counter metric, `sql.stats.flush.done_signals.ignored`, has been introduced. The metric tracks the number of times the SQL Stats activity update job ignored the signal sent to it indicating a flush has completed. This may indicate that the SQL Activity update job is taking longer than expected to complete.
abarganier
added a commit
to abarganier/cockroach
that referenced
this issue
Mar 18, 2024
Addresses: cockroachdb#119779 The count of unique fingeprints flushed to `system.statement_statistics` and `system.transaction_statistics` is the core component that determines data cardinality within the SQL stats subsystem. Today, we don't have good metrics around this source of cardinality. As we aim to reduce cardinality by improving our fingerprinting algorithms, creating a metric to count the number of unique statement and transaction fingerprints included in each flush of the in-memory SQL stats will be a helpful measurement to benchmark cardinality reduction. This patch adds a new metric to track the # of unique fingerprints (stmt and txn) included in each flush. Release note (ops change): A new counter metric, `sql.stats.flush.fingerprint.count`, has been introduced. The metric tracks the number of unique statement and transaction fingerprints included in the SQL Stats flush.
craig bot
pushed a commit
that referenced
this issue
Mar 18, 2024
120522: pkg/sql: expand metrics used by the SQL stats activity update job r=koorosh a=abarganier pkg/sql: expand metrics used by the SQL stats activity update job Addresses: #119779 Epic: CRDB-24527 Currently, the SQL activity update job is lacking observability. While we have a metric for job failures, we've seen instances whe the query run by the job gets caught in a retry loop, meaning the metric is rarely incremented. Therefore, additional metrics, such as counts of successful runs, and the latency of successful runs, will be helpful to further inspect the state of the job. This patch adds metrics for both. We've also had escalations recently involving the SQL activity update job running for extended periods of time, such that the signal made to the job indicating a flush has completed was not received because there was no listener. While we've added a default case to prevent this from hanging the flush job, and some logging to go with it, a counter metric indicating when this occurs would also be useful to have when debugging. This patch adds such a counter. Finally, we rename the metric counting failures of the job to `sql.stats.activity_job.runs.failed`, as the old metric name was not descriptive. Release note (ops change): Two new metrics have been added to track the status of the SQL activity update job, which is used to pre-aggregate top K information within the SQL stats subsytem and write the results to `system.statement_activity` and `system.transaction_activity`. The new metrics are: - `sql.stats.activity_job.runs.successful`: Number of successful runs made by the SQL activity updater job - `sql.stats.activity_job.latency`: The latency of successful runs made by the SQL activity updater job Release note (ops change): A new counter metric, `sql.stats.flush.done_signals_ignored`, has been introduced. The metric tracks the number of times the SQL Stats activity update job ignored the signal sent to it indicating a flush has completed. This may indicate that the SQL Activity update job is taking longer than expected to complete. Release note (ops change): A new counter metric, `sql.stats.activity_job.runs.failed`, has been introduced to measure the number of runs made by the SQL activity updater job that failed with errors. The SQL activity update job is used to pre-aggregate top K information within the SQL stats subsystem and write the results to `system.statement_activity` and `system.transaction_activity`. Co-authored-by: Alex Barganier <[email protected]>
abarganier
added a commit
to abarganier/cockroach
that referenced
this issue
Mar 19, 2024
Informs: cockroachdb#119779 This patch renames a few of the core metrics used to track the SQL stats flush job to be more in-line with how the SQL activity update job's metrics work. Primarily, it renames them for consistency. It also changes the general `count` metric to only track successful executions, so we can more easily discern between failed and successful executions. Release note (ops change): Metrics used to track the SQL stats subsystem's task that flushes in-memory stats to persisted storage have been reworked slightly to be more consistent with other metrics used in the subsystem. The metrics are: - `sql.stats.flushes.successful`: Number of times SQL Stats are flushed successfully to persistent storage - `sql.stats.flushes.failed`: Number of attempted SQL Stats flushes that failed with errors - `sql.stats.flush.latency`: The latency of SQL Stats flushes to persistent storage. Includes failed flush attempts
craig bot
pushed a commit
that referenced
this issue
Mar 20, 2024
120030: kvserver: opt-in "fast" external file snapshotting r=itsbilal a=stevendanna Like SharedSSTs, we may want to send a pebble.ExternalFile's metadata rather than its content during a snapshot. This is opt-in via a cluster setting and is only attempted when the store appears to actually have external files. Epic: none Release note: None 120709: pkg/sql/sqlstats: rework SQL stats flush metrics r=xinhaoz a=abarganier Informs: #119779 Epic: CRDB-24527 This patch renames a few of the core metrics used to track the SQL stats flush job to be more in-line with how the SQL activity update job's metrics work. Primarily, it renames them for consistency. It also changes the general `count` metric to only track successful executions, so we can more easily discern between failed and successful executions. Release note (ops change): Metrics used to track the SQL stats subsystem's task that flushes in-memory stats to persisted storage have been reworked slightly to be more consistent with other metrics used in the subsystem. The metrics are: - `sql.stats.flushes.successful`: Number of times SQL Stats are flushed successfully to persistent storage - `sql.stats.flushes.failed`: Number of attempted SQL Stats flushes that failed with errors - `sql.stats.flush.latency`: The latency of SQL Stats flushes to persistent storage. Includes failed flush attempts Co-authored-by: Steven Danna <[email protected]> Co-authored-by: Alex Barganier <[email protected]>
abarganier
added a commit
to abarganier/cockroach
that referenced
this issue
May 10, 2024
Addresses: cockroachdb#119779 Currently, the SQL activity update job is lacking observability. While we have a metric for job failures, we've seen instances whe the query run by the job gets caught in a retry loop, meaning the metric is rarely incremented. Therefore, additional metrics, such as counts of successful runs, and the latency of successful runs, will be helpful to further inspect the state of the job. This patch adds metrics for both. Release note (ops change): Two new metrics have been added to track the status of the SQL activity update job, which is used to pre-aggregate top K information within the SQL stats subsytem and write the results to `system.statement_activity` and `system.transaction_activity`. The new metrics are: - `sql.stats.activity.updates.successful`: Number of successful updates made by the SQL activity updater job. - `sql.stats.activity.update.latency`: The latency of updates made by the SQL activity updater job. Includes failed update attempts.
abarganier
added a commit
to abarganier/cockroach
that referenced
this issue
May 10, 2024
Addresses: cockroachdb#119779 We've had escalations recently involving the SQL activity update job running for extended periods of time, such that the signal made to the job indicating a flush has completed was not received because there was no listener. While we've added a default case to prevent this from hanging the flush job, and some logging to go with it, a counter metric indicating when this occurs would also be useful to have when debugging. This patch adds such a counter. Release note (ops change): A new counter metric, `sql.stats.flush.done_signals.ignored`, has been introduced. The metric tracks the number of times the SQL Stats activity update job ignored the signal sent to it indicating a flush has completed. This may indicate that the SQL Activity update job is taking longer than expected to complete.
abarganier
added a commit
to abarganier/cockroach
that referenced
this issue
May 10, 2024
Addresses: cockroachdb#119779 The count of unique fingeprints flushed to `system.statement_statistics` and `system.transaction_statistics` is the core component that determines data cardinality within the SQL stats subsystem. Today, we don't have good metrics around this source of cardinality. As we aim to reduce cardinality by improving our fingerprinting algorithms, creating a metric to count the number of unique statement and transaction fingerprints included in each flush of the in-memory SQL stats will be a helpful measurement to benchmark cardinality reduction. This patch adds a new metric to track the # of unique fingerprints (stmt and txn) included in each flush. Release note (ops change): A new counter metric, `sql.stats.flush.fingerprint.count`, has been introduced. The metric tracks the number of unique statement and transaction fingerprints included in the SQL Stats flush.
abarganier
added a commit
to abarganier/cockroach
that referenced
this issue
May 13, 2024
Addresses: cockroachdb#119779 The count of unique fingeprints flushed to `system.statement_statistics` and `system.transaction_statistics` is the core component that determines data cardinality within the SQL stats subsystem. Today, we don't have good metrics around this source of cardinality. As we aim to reduce cardinality by improving our fingerprinting algorithms, creating a metric to count the number of unique statement and transaction fingerprints included in each flush of the in-memory SQL stats will be a helpful measurement to benchmark cardinality reduction. This patch adds a new metric to track the # of unique fingerprints (stmt and txn) included in each flush. Release note (ops change): A new counter metric, `sql.stats.flush.fingerprint.count`, has been introduced. The metric tracks the number of unique statement and transaction fingerprints included in the SQL Stats flush.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
A-observability-inf
C-enhancement
Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
Is your feature request related to a problem? Please describe.
Currently, the job used to populate the
system.{statement_activity|transaction_activity}
table has limited observability into its runs.For example, we have no information about how long the job takes to run, how many times the job has run, etc. This can make debugging the job quite difficult.
Describe the solution you'd like
Let's gain some easy wins for observability into this system by adding the following:
Describe alternatives you've considered
Another idea would be to add some observability around the channel used to trigger this job [1]. Recent work has been done to not have this flush operation block if there's not a listener on the channel. We could add a metric to count the number of times that
default
case is taken in theselect
clause, to better understand how often the SQL activity update job is still "busy" when the flush operation tries signaling it.Jira issue: CRDB-36309
Epic CRDB-24527
The text was updated successfully, but these errors were encountered: