Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sqlstats: improve observability into statement activity update job #119779

Closed
abarganier opened this issue Feb 29, 2024 · 0 comments
Closed

sqlstats: improve observability into statement activity update job #119779

abarganier opened this issue Feb 29, 2024 · 0 comments
Assignees
Labels
A-observability-inf C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Comments

@abarganier
Copy link
Contributor

abarganier commented Feb 29, 2024

Is your feature request related to a problem? Please describe.
Currently, the job used to populate the system.{statement_activity|transaction_activity} table has limited observability into its runs.

For example, we have no information about how long the job takes to run, how many times the job has run, etc. This can make debugging the job quite difficult.

Describe the solution you'd like
Let's gain some easy wins for observability into this system by adding the following:

  • Histogram metric representing job run latencies
  • Counter metric representing the number of successful job runs.

Describe alternatives you've considered
Another idea would be to add some observability around the channel used to trigger this job [1]. Recent work has been done to not have this flush operation block if there's not a listener on the channel. We could add a metric to count the number of times that default case is taken in the select clause, to better understand how often the SQL activity update job is still "busy" when the flush operation tries signaling it.

Jira issue: CRDB-36309

Epic CRDB-24527

@abarganier abarganier added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-observability-inf labels Feb 29, 2024
@abarganier abarganier self-assigned this Feb 29, 2024
abarganier added a commit to abarganier/cockroach that referenced this issue Mar 14, 2024
Addresses: cockroachdb#119779

Currently, the SQL activity update job is lacking observability. While
we have a metric for job failures, we've seen instances whe the query
run by the job gets caught in a retry loop, meaning the metric is rarely
incremented.

Therefore, additional metrics, such as counts of successful runs, and
the latency of successful runs, will be helpful to further inspect the
state of the job.

This patch adds metrics for both.

Release note (ops change): Two new metrics have been added to track the
status of the SQL activity update job, which is used to pre-aggregate
top K information within the SQL stats subsytem and write the results to
`system.statement_activity` and `system.transaction_activity`.

The new metrics are:
- `sql.stats.activity_job.runs.successful`: Number of successful runs made
   by the SQL activity updater job
- `sql.stats.activity_job.latency`: The latency of successful runs made by
  the SQL activity updater job
abarganier added a commit to abarganier/cockroach that referenced this issue Mar 14, 2024
Addresses: cockroachdb#119779

We've had escalations recently involving the SQL activity update job
running for extended periods of time, such that the signal made to the
job indicating a flush has completed was not received because there was
no listener.

While we've added a default case to prevent this from hanging the flush
job, and some logging to go with it, a counter metric indicating when
this occurs would also be useful to have when debugging.

This patch adds such a counter.

Release note (ops change): A new counter metric,
`sql.stats.flush.done_signals_ignored`, has been introduced. The metric
tracks the number of times the SQL Stats activity update job ignored
the signal sent to it indicating a flush has completed. This may
indicate that the SQL Activity update job is taking longer than expected
to complete.
abarganier added a commit to abarganier/cockroach that referenced this issue Mar 18, 2024
Addresses: cockroachdb#119779

Currently, the SQL activity update job is lacking observability. While
we have a metric for job failures, we've seen instances whe the query
run by the job gets caught in a retry loop, meaning the metric is rarely
incremented.

Therefore, additional metrics, such as counts of successful runs, and
the latency of successful runs, will be helpful to further inspect the
state of the job.

This patch adds metrics for both.

Release note (ops change): Two new metrics have been added to track the
status of the SQL activity update job, which is used to pre-aggregate
top K information within the SQL stats subsytem and write the results to
`system.statement_activity` and `system.transaction_activity`.

The new metrics are:
- `sql.stats.activity.updates.successful`: Number of successful updates made
   by the SQL activity updater job.
- `sql.stats.activity.update.latency`: The latency of updates  made by
  the SQL activity updater job. Includes failed update attempts.
abarganier added a commit to abarganier/cockroach that referenced this issue Mar 18, 2024
Addresses: cockroachdb#119779

We've had escalations recently involving the SQL activity update job
running for extended periods of time, such that the signal made to the
job indicating a flush has completed was not received because there was
no listener.

While we've added a default case to prevent this from hanging the flush
job, and some logging to go with it, a counter metric indicating when
this occurs would also be useful to have when debugging.

This patch adds such a counter.

Release note (ops change): A new counter metric,
`sql.stats.flush.done_signals.ignored`, has been introduced. The metric
tracks the number of times the SQL Stats activity update job ignored
the signal sent to it indicating a flush has completed. This may
indicate that the SQL Activity update job is taking longer than expected
to complete.
abarganier added a commit to abarganier/cockroach that referenced this issue Mar 18, 2024
Addresses: cockroachdb#119779

The count of unique fingeprints flushed to `system.statement_statistics`
and `system.transaction_statistics` is the core component that
determines data cardinality within the SQL stats subsystem. Today, we
don't have good metrics around this source of cardinality. As we aim to
reduce cardinality by improving our fingerprinting algorithms, creating
a metric to count the number of unique statement and transaction
fingerprints included in each flush of the in-memory SQL stats will be a
helpful measurement to benchmark cardinality reduction.

This patch adds a new metric to track the # of unique fingerprints (stmt
and txn) included in each flush.

Release note (ops change): A new counter metric,
`sql.stats.flush.fingerprint.count`, has been introduced. The metric
tracks the number of unique statement and transaction fingerprints
included in the SQL Stats flush.
craig bot pushed a commit that referenced this issue Mar 18, 2024
120522: pkg/sql: expand metrics used by the SQL stats activity update job r=koorosh a=abarganier

pkg/sql: expand metrics used by the SQL stats activity update job

Addresses: #119779

Epic: CRDB-24527

Currently, the SQL activity update job is lacking observability. While
we have a metric for job failures, we've seen instances whe the query
run by the job gets caught in a retry loop, meaning the metric is rarely
incremented.

Therefore, additional metrics, such as counts of successful runs, and
the latency of successful runs, will be helpful to further inspect the
state of the job.

This patch adds metrics for both.

We've also had escalations recently involving the SQL activity update job
running for extended periods of time, such that the signal made to the
job indicating a flush has completed was not received because there was
no listener.

While we've added a default case to prevent this from hanging the flush
job, and some logging to go with it, a counter metric indicating when
this occurs would also be useful to have when debugging.

This patch adds such a counter.

Finally, we rename the metric counting failures of the job to `sql.stats.activity_job.runs.failed`,
as the old metric name was not descriptive. 

Release note (ops change): Two new metrics have been added to track the
status of the SQL activity update job, which is used to pre-aggregate
top K information within the SQL stats subsytem and write the results to
`system.statement_activity` and `system.transaction_activity`.

The new metrics are:
- `sql.stats.activity_job.runs.successful`: Number of successful runs made
   by the SQL activity updater job
- `sql.stats.activity_job.latency`: The latency of successful runs made by
  the SQL activity updater job

Release note (ops change): A new counter metric,
`sql.stats.flush.done_signals_ignored`, has been introduced. The metric
tracks the number of times the SQL Stats activity update job ignored
the signal sent to it indicating a flush has completed. This may
indicate that the SQL Activity update job is taking longer than expected
to complete.

Release note (ops change): A new counter metric,
`sql.stats.activity_job.runs.failed`, has been introduced to measure the
number of runs made by the SQL activity updater job that failed with
errors. The SQL activity update job is used to pre-aggregate top K
information within the SQL stats subsystem and write the results to
`system.statement_activity` and `system.transaction_activity`.

Co-authored-by: Alex Barganier <[email protected]>
abarganier added a commit to abarganier/cockroach that referenced this issue Mar 19, 2024
Informs: cockroachdb#119779

This patch renames a few of the core metrics used to track the SQL stats
flush job to be more in-line with how the SQL activity update job's
metrics work.

Primarily, it renames them for consistency. It also changes the general
`count` metric to only track successful executions, so we can more
easily discern between failed and successful executions.

Release note (ops change): Metrics used to track the SQL stats
subsystem's task that flushes in-memory stats to persisted storage have
been reworked slightly to be more consistent with other metrics used in
the subsystem. The metrics are:

- `sql.stats.flushes.successful`: Number of times SQL Stats are flushed
   successfully to persistent storage
- `sql.stats.flushes.failed`: Number of attempted SQL Stats flushes that
   failed with errors
- `sql.stats.flush.latency`: The latency of SQL Stats flushes to
   persistent storage. Includes failed flush attempts
craig bot pushed a commit that referenced this issue Mar 20, 2024
120030: kvserver: opt-in "fast" external file snapshotting  r=itsbilal a=stevendanna

Like SharedSSTs, we may want to send a pebble.ExternalFile's metadata
rather than its content during a snapshot.

This is opt-in via a cluster setting and is only attempted when the
store appears to actually have external files.

Epic: none
Release note: None

120709: pkg/sql/sqlstats: rework SQL stats flush metrics r=xinhaoz a=abarganier

Informs: #119779

Epic: CRDB-24527

This patch renames a few of the core metrics used to track the SQL stats
flush job to be more in-line with how the SQL activity update job's
metrics work.

Primarily, it renames them for consistency. It also changes the general
`count` metric to only track successful executions, so we can more
easily discern between failed and successful executions.

Release note (ops change): Metrics used to track the SQL stats
subsystem's task that flushes in-memory stats to persisted storage have
been reworked slightly to be more consistent with other metrics used in
the subsystem. The metrics are:

- `sql.stats.flushes.successful`: Number of times SQL Stats are flushed
 successfully to persistent storage
- `sql.stats.flushes.failed`: Number of attempted SQL Stats flushes that
 failed with errors
- `sql.stats.flush.latency`: The latency of SQL Stats flushes to
 persistent storage. Includes failed flush attempts

Co-authored-by: Steven Danna <[email protected]>
Co-authored-by: Alex Barganier <[email protected]>
abarganier added a commit to abarganier/cockroach that referenced this issue May 10, 2024
Addresses: cockroachdb#119779

Currently, the SQL activity update job is lacking observability. While
we have a metric for job failures, we've seen instances whe the query
run by the job gets caught in a retry loop, meaning the metric is rarely
incremented.

Therefore, additional metrics, such as counts of successful runs, and
the latency of successful runs, will be helpful to further inspect the
state of the job.

This patch adds metrics for both.

Release note (ops change): Two new metrics have been added to track the
status of the SQL activity update job, which is used to pre-aggregate
top K information within the SQL stats subsytem and write the results to
`system.statement_activity` and `system.transaction_activity`.

The new metrics are:
- `sql.stats.activity.updates.successful`: Number of successful updates made
   by the SQL activity updater job.
- `sql.stats.activity.update.latency`: The latency of updates  made by
  the SQL activity updater job. Includes failed update attempts.
abarganier added a commit to abarganier/cockroach that referenced this issue May 10, 2024
Addresses: cockroachdb#119779

We've had escalations recently involving the SQL activity update job
running for extended periods of time, such that the signal made to the
job indicating a flush has completed was not received because there was
no listener.

While we've added a default case to prevent this from hanging the flush
job, and some logging to go with it, a counter metric indicating when
this occurs would also be useful to have when debugging.

This patch adds such a counter.

Release note (ops change): A new counter metric,
`sql.stats.flush.done_signals.ignored`, has been introduced. The metric
tracks the number of times the SQL Stats activity update job ignored
the signal sent to it indicating a flush has completed. This may
indicate that the SQL Activity update job is taking longer than expected
to complete.
abarganier added a commit to abarganier/cockroach that referenced this issue May 10, 2024
Addresses: cockroachdb#119779

The count of unique fingeprints flushed to `system.statement_statistics`
and `system.transaction_statistics` is the core component that
determines data cardinality within the SQL stats subsystem. Today, we
don't have good metrics around this source of cardinality. As we aim to
reduce cardinality by improving our fingerprinting algorithms, creating
a metric to count the number of unique statement and transaction
fingerprints included in each flush of the in-memory SQL stats will be a
helpful measurement to benchmark cardinality reduction.

This patch adds a new metric to track the # of unique fingerprints (stmt
and txn) included in each flush.

Release note (ops change): A new counter metric,
`sql.stats.flush.fingerprint.count`, has been introduced. The metric
tracks the number of unique statement and transaction fingerprints
included in the SQL Stats flush.
abarganier added a commit to abarganier/cockroach that referenced this issue May 13, 2024
Addresses: cockroachdb#119779

The count of unique fingeprints flushed to `system.statement_statistics`
and `system.transaction_statistics` is the core component that
determines data cardinality within the SQL stats subsystem. Today, we
don't have good metrics around this source of cardinality. As we aim to
reduce cardinality by improving our fingerprinting algorithms, creating
a metric to count the number of unique statement and transaction
fingerprints included in each flush of the in-memory SQL stats will be a
helpful measurement to benchmark cardinality reduction.

This patch adds a new metric to track the # of unique fingerprints (stmt
and txn) included in each flush.

Release note (ops change): A new counter metric,
`sql.stats.flush.fingerprint.count`, has been introduced. The metric
tracks the number of unique statement and transaction fingerprints
included in the SQL Stats flush.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-observability-inf C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
Projects
None yet
Development

No branches or pull requests

1 participant