sqlstats: improve observability into statement activity update job #119779

abarganier · 2024-02-29T19:29:40Z

Is your feature request related to a problem? Please describe.
Currently, the job used to populate the system.{statement_activity|transaction_activity} table has limited observability into its runs.

For example, we have no information about how long the job takes to run, how many times the job has run, etc. This can make debugging the job quite difficult.

Describe the solution you'd like
Let's gain some easy wins for observability into this system by adding the following:

Histogram metric representing job run latencies
Counter metric representing the number of successful job runs.

Describe alternatives you've considered
Another idea would be to add some observability around the channel used to trigger this job [1]. Recent work has been done to not have this flush operation block if there's not a listener on the channel. We could add a metric to count the number of times that default case is taken in the select clause, to better understand how often the SQL activity update job is still "busy" when the flush operation tries signaling it.

Jira issue: CRDB-36309

Epic CRDB-24527

The text was updated successfully, but these errors were encountered:

Addresses: cockroachdb#119779 Currently, the SQL activity update job is lacking observability. While we have a metric for job failures, we've seen instances whe the query run by the job gets caught in a retry loop, meaning the metric is rarely incremented. Therefore, additional metrics, such as counts of successful runs, and the latency of successful runs, will be helpful to further inspect the state of the job. This patch adds metrics for both. Release note (ops change): Two new metrics have been added to track the status of the SQL activity update job, which is used to pre-aggregate top K information within the SQL stats subsytem and write the results to `system.statement_activity` and `system.transaction_activity`. The new metrics are: - `sql.stats.activity_job.runs.successful`: Number of successful runs made by the SQL activity updater job - `sql.stats.activity_job.latency`: The latency of successful runs made by the SQL activity updater job

Addresses: cockroachdb#119779 We've had escalations recently involving the SQL activity update job running for extended periods of time, such that the signal made to the job indicating a flush has completed was not received because there was no listener. While we've added a default case to prevent this from hanging the flush job, and some logging to go with it, a counter metric indicating when this occurs would also be useful to have when debugging. This patch adds such a counter. Release note (ops change): A new counter metric, `sql.stats.flush.done_signals_ignored`, has been introduced. The metric tracks the number of times the SQL Stats activity update job ignored the signal sent to it indicating a flush has completed. This may indicate that the SQL Activity update job is taking longer than expected to complete.

Addresses: cockroachdb#119779 Currently, the SQL activity update job is lacking observability. While we have a metric for job failures, we've seen instances whe the query run by the job gets caught in a retry loop, meaning the metric is rarely incremented. Therefore, additional metrics, such as counts of successful runs, and the latency of successful runs, will be helpful to further inspect the state of the job. This patch adds metrics for both. Release note (ops change): Two new metrics have been added to track the status of the SQL activity update job, which is used to pre-aggregate top K information within the SQL stats subsytem and write the results to `system.statement_activity` and `system.transaction_activity`. The new metrics are: - `sql.stats.activity.updates.successful`: Number of successful updates made by the SQL activity updater job. - `sql.stats.activity.update.latency`: The latency of updates made by the SQL activity updater job. Includes failed update attempts.

Addresses: cockroachdb#119779 We've had escalations recently involving the SQL activity update job running for extended periods of time, such that the signal made to the job indicating a flush has completed was not received because there was no listener. While we've added a default case to prevent this from hanging the flush job, and some logging to go with it, a counter metric indicating when this occurs would also be useful to have when debugging. This patch adds such a counter. Release note (ops change): A new counter metric, `sql.stats.flush.done_signals.ignored`, has been introduced. The metric tracks the number of times the SQL Stats activity update job ignored the signal sent to it indicating a flush has completed. This may indicate that the SQL Activity update job is taking longer than expected to complete.

Addresses: cockroachdb#119779 The count of unique fingeprints flushed to `system.statement_statistics` and `system.transaction_statistics` is the core component that determines data cardinality within the SQL stats subsystem. Today, we don't have good metrics around this source of cardinality. As we aim to reduce cardinality by improving our fingerprinting algorithms, creating a metric to count the number of unique statement and transaction fingerprints included in each flush of the in-memory SQL stats will be a helpful measurement to benchmark cardinality reduction. This patch adds a new metric to track the # of unique fingerprints (stmt and txn) included in each flush. Release note (ops change): A new counter metric, `sql.stats.flush.fingerprint.count`, has been introduced. The metric tracks the number of unique statement and transaction fingerprints included in the SQL Stats flush.

120522: pkg/sql: expand metrics used by the SQL stats activity update job r=koorosh a=abarganier pkg/sql: expand metrics used by the SQL stats activity update job Addresses: #119779 Epic: CRDB-24527 Currently, the SQL activity update job is lacking observability. While we have a metric for job failures, we've seen instances whe the query run by the job gets caught in a retry loop, meaning the metric is rarely incremented. Therefore, additional metrics, such as counts of successful runs, and the latency of successful runs, will be helpful to further inspect the state of the job. This patch adds metrics for both. We've also had escalations recently involving the SQL activity update job running for extended periods of time, such that the signal made to the job indicating a flush has completed was not received because there was no listener. While we've added a default case to prevent this from hanging the flush job, and some logging to go with it, a counter metric indicating when this occurs would also be useful to have when debugging. This patch adds such a counter. Finally, we rename the metric counting failures of the job to `sql.stats.activity_job.runs.failed`, as the old metric name was not descriptive. Release note (ops change): Two new metrics have been added to track the status of the SQL activity update job, which is used to pre-aggregate top K information within the SQL stats subsytem and write the results to `system.statement_activity` and `system.transaction_activity`. The new metrics are: - `sql.stats.activity_job.runs.successful`: Number of successful runs made by the SQL activity updater job - `sql.stats.activity_job.latency`: The latency of successful runs made by the SQL activity updater job Release note (ops change): A new counter metric, `sql.stats.flush.done_signals_ignored`, has been introduced. The metric tracks the number of times the SQL Stats activity update job ignored the signal sent to it indicating a flush has completed. This may indicate that the SQL Activity update job is taking longer than expected to complete. Release note (ops change): A new counter metric, `sql.stats.activity_job.runs.failed`, has been introduced to measure the number of runs made by the SQL activity updater job that failed with errors. The SQL activity update job is used to pre-aggregate top K information within the SQL stats subsystem and write the results to `system.statement_activity` and `system.transaction_activity`. Co-authored-by: Alex Barganier <[email protected]>

Informs: cockroachdb#119779 This patch renames a few of the core metrics used to track the SQL stats flush job to be more in-line with how the SQL activity update job's metrics work. Primarily, it renames them for consistency. It also changes the general `count` metric to only track successful executions, so we can more easily discern between failed and successful executions. Release note (ops change): Metrics used to track the SQL stats subsystem's task that flushes in-memory stats to persisted storage have been reworked slightly to be more consistent with other metrics used in the subsystem. The metrics are: - `sql.stats.flushes.successful`: Number of times SQL Stats are flushed successfully to persistent storage - `sql.stats.flushes.failed`: Number of attempted SQL Stats flushes that failed with errors - `sql.stats.flush.latency`: The latency of SQL Stats flushes to persistent storage. Includes failed flush attempts

120030: kvserver: opt-in "fast" external file snapshotting r=itsbilal a=stevendanna Like SharedSSTs, we may want to send a pebble.ExternalFile's metadata rather than its content during a snapshot. This is opt-in via a cluster setting and is only attempted when the store appears to actually have external files. Epic: none Release note: None 120709: pkg/sql/sqlstats: rework SQL stats flush metrics r=xinhaoz a=abarganier Informs: #119779 Epic: CRDB-24527 This patch renames a few of the core metrics used to track the SQL stats flush job to be more in-line with how the SQL activity update job's metrics work. Primarily, it renames them for consistency. It also changes the general `count` metric to only track successful executions, so we can more easily discern between failed and successful executions. Release note (ops change): Metrics used to track the SQL stats subsystem's task that flushes in-memory stats to persisted storage have been reworked slightly to be more consistent with other metrics used in the subsystem. The metrics are: - `sql.stats.flushes.successful`: Number of times SQL Stats are flushed successfully to persistent storage - `sql.stats.flushes.failed`: Number of attempted SQL Stats flushes that failed with errors - `sql.stats.flush.latency`: The latency of SQL Stats flushes to persistent storage. Includes failed flush attempts Co-authored-by: Steven Danna <[email protected]> Co-authored-by: Alex Barganier <[email protected]>

Addresses: cockroachdb#119779 Currently, the SQL activity update job is lacking observability. While we have a metric for job failures, we've seen instances whe the query run by the job gets caught in a retry loop, meaning the metric is rarely incremented. Therefore, additional metrics, such as counts of successful runs, and the latency of successful runs, will be helpful to further inspect the state of the job. This patch adds metrics for both. Release note (ops change): Two new metrics have been added to track the status of the SQL activity update job, which is used to pre-aggregate top K information within the SQL stats subsytem and write the results to `system.statement_activity` and `system.transaction_activity`. The new metrics are: - `sql.stats.activity.updates.successful`: Number of successful updates made by the SQL activity updater job. - `sql.stats.activity.update.latency`: The latency of updates made by the SQL activity updater job. Includes failed update attempts.

Addresses: cockroachdb#119779 We've had escalations recently involving the SQL activity update job running for extended periods of time, such that the signal made to the job indicating a flush has completed was not received because there was no listener. While we've added a default case to prevent this from hanging the flush job, and some logging to go with it, a counter metric indicating when this occurs would also be useful to have when debugging. This patch adds such a counter. Release note (ops change): A new counter metric, `sql.stats.flush.done_signals.ignored`, has been introduced. The metric tracks the number of times the SQL Stats activity update job ignored the signal sent to it indicating a flush has completed. This may indicate that the SQL Activity update job is taking longer than expected to complete.

Addresses: cockroachdb#119779 The count of unique fingeprints flushed to `system.statement_statistics` and `system.transaction_statistics` is the core component that determines data cardinality within the SQL stats subsystem. Today, we don't have good metrics around this source of cardinality. As we aim to reduce cardinality by improving our fingerprinting algorithms, creating a metric to count the number of unique statement and transaction fingerprints included in each flush of the in-memory SQL stats will be a helpful measurement to benchmark cardinality reduction. This patch adds a new metric to track the # of unique fingerprints (stmt and txn) included in each flush. Release note (ops change): A new counter metric, `sql.stats.flush.fingerprint.count`, has been introduced. The metric tracks the number of unique statement and transaction fingerprints included in the SQL Stats flush.

abarganier added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-observability-inf labels Feb 29, 2024

abarganier self-assigned this Feb 29, 2024

blathers-crl bot added the A-observability-inf label Feb 29, 2024

exalate-issue-sync bot added A-observability-inf and removed A-observability-inf T-cluster-observability labels Feb 29, 2024

abarganier mentioned this issue Mar 14, 2024

pkg/sql: expand metrics used by the SQL stats activity update job #120522

Merged

exalate-issue-sync bot closed this as completed Mar 18, 2024

abarganier mentioned this issue Mar 19, 2024

pkg/sql/sqlstats: rework SQL stats flush metrics #120709

Merged

abarganier mentioned this issue May 10, 2024

release-23.2: pkg/sql: expand metrics used by the SQL stats activity update job #123960

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sqlstats: improve observability into statement activity update job #119779

sqlstats: improve observability into statement activity update job #119779

abarganier commented Feb 29, 2024 •

edited by exalate-issue-sync bot

Loading

sqlstats: improve observability into statement activity update job #119779

sqlstats: improve observability into statement activity update job #119779

Comments

abarganier commented Feb 29, 2024 • edited by exalate-issue-sync bot Loading

abarganier commented Feb 29, 2024 •

edited by exalate-issue-sync bot

Loading