
RFCS: sql stats persistence #63752

Merged
merged 1 commit on May 22, 2021

Conversation

Contributor

@Azhng Azhng commented Apr 15, 2021

@Azhng Azhng requested a review from a team April 15, 2021 19:29
@cockroach-teamcity
Member

This change is Reviewable

* if such entry exists, the flush operation will aggregate the existing entry.
* if such extry does not exist, the flush operation will insert a new entry.

When DB Console issues fetch requests to CRDB node through HTTP endpoint, the


Do the two tables (sql_stmt_stats and sql_txn_stats) provide access only to the persisted statistics? Have we considered extending this view to represent in-memory statistics as well where when the DB Console (or developer) queries the tables(s), we transparently do a union and/or aggregate across the current in-memory statistics and persisted statistics?

Member

@jordanlewis jordanlewis left a comment


Nice work so far!

nit: you should replace CRDB with CockroachDB

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @Azhng and @kevin-v-ngo)


docs/RFCS/20210330_sql_stats_persistence.md, line 17 at r2 (raw file):

the result, CRDB will gain ability that helps users to easily identify
historical transactions and statements that consumes a disproportionate amount
of cluster resource, even after node crashes and restarts.

grammar nits:

overtime -> over time
As the result -> As a result
gain ability that helps users to easily -> gain the ability to help users easily
that consumes -> that consume
cluster resource -> cluster resources


docs/RFCS/20210330_sql_stats_persistence.md, line 22 at r2 (raw file):

Currently, CRDB stores the statement and transaction metrics in memory. The
retention policy for the in-memory storage is one hour be default. During this

be -> by


docs/RFCS/20210330_sql_stats_persistence.md, line 23 at r2 (raw file):

Currently, CRDB stores the statement and transaction metrics in memory. The
retention policy for the in-memory storage is one hour be default. During this
one-hour period, users can query the this in-memory data structure through DB

the this -> this


docs/RFCS/20210330_sql_stats_persistence.md, line 79 at r2 (raw file):

## Design Overview

Two new system tables `system.experimetnal_sql_stmt_stats` and

Let's remove the experimental prefix from the table name, there is no need for it (and it will cause a migration later!)


docs/RFCS/20210330_sql_stats_persistence.md, line 97 at r2 (raw file):

Previously, kevin-v-ngo wrote…

Do the two tables (sql_stmt_stats and sql_txn_stats) provide access only to the persisted statistics? Have we considered extending this view to represent in-memory statistics as well where when the DB Console (or developer) queries the tables(s), we transparently do a union and/or aggregate across the current in-memory statistics and persisted statistics?

This is a cool idea. I think we would want to implement this via a system view with a different name.
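A rough sketch of what such a view could expose, purely as an illustration: union the persisted rows with the not-yet-flushed in-memory statistics (here surfaced through a hypothetical crdb_internal virtual table), then aggregate per fingerprint. All table and column names below are illustrative, not the final schema.

``` sql
-- hypothetical sketch; crdb_internal.unflushed_sql_stmt_stats and the
-- execution_count column do not exist today and are stand-ins
SELECT fingerprint, app_name, sum(execution_count) AS execution_count
FROM (
    SELECT fingerprint, app_name, execution_count
    FROM system.sql_stmt_stats
    UNION ALL
    SELECT fingerprint, app_name, execution_count
    FROM crdb_internal.unflushed_sql_stmt_stats
) AS combined
GROUP BY fingerprint, app_name;
```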


docs/RFCS/20210330_sql_stats_persistence.md, line 100 at r2 (raw file):

persisted statistics data can be fetched using `AS OF SYSTEM TIME -10s`
queries in order to minimize read-write contention. However, for the most
up-to-date statistics, there are two possible optiuons:

options*


docs/RFCS/20210330_sql_stats_persistence.md, line 107 at r2 (raw file):

   this means that this option also inherit the disadvantage of the existing
   designs, such as data-loss on crashes, inaccurate stats if nodes becomes
   available etc.

I think I like this first option better. We can tune it to be less hungry - have aggressive timeouts and don't fail if one node isn't there.


docs/RFCS/20210330_sql_stats_persistence.md, line 108 at r2 (raw file):

   designs, such as data-loss on crashes, inaccurate stats if nodes becomes
   available etc.
1. we can give DB Console ability to adjust flush interval dynamically.

I don't like how this option has the side effect of changing the sampling rate and increases the disk throughput required.


docs/RFCS/20210330_sql_stats_persistence.md, line 125 at r2 (raw file):

   can become troublesome in a cluster with large number of nodes.

   (Random thoughts: is it possible we can use job system for this?)

I think we could use the job system if there was 1 coordinator node that asks everyone to flush. But if you want everyone to flush on their own accord, I don't think a job is the right thing.


docs/RFCS/20210330_sql_stats_persistence.md, line 254 at r2 (raw file):

The table's primary key is composed of (stmtID, time). This is to avoid

In the current schema, the primary key also includes whether the statement failed, and whether it happened in an implicit transaction. We probably want to preserve those semantics.


docs/RFCS/20210330_sql_stats_persistence.md, line 332 at r2 (raw file):

   table.
1. Insert the new stats we created in stage 2 into the system table with the
   current clock timestamp.

If we do it at the current clock timestamp, do we have the risk of accidentally pushing forward the timestamp over and over, never accumulating a new historical entry? Since other nodes will keep finding an entry in the past windows. It feels like maybe we should keep the original timestamp for these entries.


docs/RFCS/20210330_sql_stats_persistence.md, line 407 at r2 (raw file):

* Should we store some sort of query plan hash so we can track how the query
  plan has changed for the same query fingerprint?

Yes, I think we should start thinking about doing something like this. But I think it would be hard to combine all of this work together. Let's make sure this design is extensible to include information like that, and we can start asking the Queries team to see if they can start generating a plan ID for plans.

Contributor

@maryliag maryliag left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @Azhng and @kevin-v-ngo)


docs/RFCS/20210330_sql_stats_persistence.md, line 25 at r2 (raw file):

one-hour period, users can query the this in-memory data structure through DB
console. However, after the retention period for the collected statistics
expires, users will no longer to be able to access these statistics. There are

nit: users are no longer able to ...


docs/RFCS/20210330_sql_stats_persistence.md, line 35 at r2 (raw file):

   node in the cluster.
   1. Due to this design, if a node becomes unavailable, CRDB will be no longer
      able to provide accurate accouting for statement/transaction statistics.

nit: accounting


docs/RFCS/20210330_sql_stats_persistence.md, line 69 at r2 (raw file):

* Statistics persistence should be low overhead, but the collected statistics
  should also enough resolution to provide meaningful insight into the

nit: should also have enough resolution to provide meaningful insights


docs/RFCS/20210330_sql_stats_persistence.md, line 73 at r2 (raw file):

* There is a need for mechanism to prune old statistics data to reduce the
  burden on storage space.

Will the pruning be configurable by the user? We can have a default time/size, but the user should also be able to define how long/how much they want to store.


docs/RFCS/20210330_sql_stats_persistence.md, line 95 at r2 (raw file):

system tables within the latest time bucket.
* if such entry exists, the flush operation will aggregate the existing entry.
* if such extry does not exist, the flush operation will insert a new entry.

Statements are only aggregated if they're in the same flush operation, correct? We don't want to aggregate every single statement with the same fingerprint.


docs/RFCS/20210330_sql_stats_persistence.md, line 135 at r2 (raw file):

    -- primary key
    fingerprint INT NOT NULL,
    timestamp   TIMESTAMP NOT NULL,

What does this timestamp mean? The time these metrics were inserted?
Considering these are aggregated, we should have information about the first/last time the statement was run.


docs/RFCS/20210330_sql_stats_persistence.md, line 265 at r2 (raw file):

mean of the value of that attribute, and the other one for squared difference.
Similar to statement statistics columns, execution statistics columns are
formatted in similar fasion.

*fashion


docs/RFCS/20210330_sql_stats_persistence.md, line 267 at r2 (raw file):

formatted in similar fasion.

### Example queries that can used to answer query performance related questions:

nit: can be used


docs/RFCS/20210330_sql_stats_persistence.md, line 329 at r2 (raw file):

   to combine two `NumericStats`. (Random thoughts: is it worth it to extend
   the SQL Engine to work with NumericStats?)
1. Delete the stats we fetched from the system table in stage 1 from the system

I think we should delete only after we know for sure the insertion step (the next one on this list) was successful; otherwise you might delete, and if something happens before the next insert, you lose the data.


docs/RFCS/20210330_sql_stats_persistence.md, line 362 at r2 (raw file):

Before we insert the newly combined statistics back into the system table,
we first need to remove the existing entries to avoid duplication. This can be
done using the following statement:

We can have some sort of flag to identify the rows we are handling at the moment, and only delete those after the updated version is inserted. You don't want to risk losing the data before inserting.
Or instead of deleting, you can simply update the existing one; if that fails, you won't lose the existing data.

@Azhng Azhng force-pushed the rfc-stats-persistence branch from c690b27 to eac0f7a Compare April 19, 2021 16:33
Contributor Author

@Azhng Azhng left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis, @kevin-v-ngo, and @maryliag)


docs/RFCS/20210330_sql_stats_persistence.md, line 17 at r2 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

grammar nits:

overtime -> over time
As the result -> As a result
gain ability that helps users to easily -> gain the ability to help users easily
that consumes -> that consume
cluster resource -> cluster resources

Done.


docs/RFCS/20210330_sql_stats_persistence.md, line 22 at r2 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

be -> by

Done.


docs/RFCS/20210330_sql_stats_persistence.md, line 23 at r2 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

the this -> this

Done.


docs/RFCS/20210330_sql_stats_persistence.md, line 73 at r2 (raw file):

Previously, maryliag (Marylia Gutierrez) wrote…

Will the prune be configurable by the user? We can have a default time/size, but the user should also be able to define how long/much they want to store.

Yes I agree.


docs/RFCS/20210330_sql_stats_persistence.md, line 79 at r2 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

Let's remove the experimental prefix from the table name, there is no need for it (and it will cause a migration later!)

Sounds good


docs/RFCS/20210330_sql_stats_persistence.md, line 95 at r2 (raw file):

Previously, maryliag (Marylia Gutierrez) wrote…

Statements are only aggregated if they're on the same flush operation, correct? We don't want to aggregate every single statement with the same fingerprint

Yes, we don't want to aggregate every single statement with the same fingerprint in the entire table.

But we still want to aggregate every single statement with the same fingerprint that was inserted in the current aggregation window.
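As a rough illustration of this aggregate-or-insert step for a single fingerprint in the current window, assuming a primary key of (last_run, fingerprint, app_name) and a hypothetical execution_count column (the real flush also has to merge the NumericStats columns, not just a counter):

``` sql
-- hedged sketch: upsert one row for the current aggregation window
INSERT INTO system.sql_stmt_stats (last_run, fingerprint, app_name, execution_count)
VALUES ($1, $2, $3, $4)
ON CONFLICT (last_run, fingerprint, app_name)
DO UPDATE SET execution_count = sql_stmt_stats.execution_count + excluded.execution_count;
```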


docs/RFCS/20210330_sql_stats_persistence.md, line 97 at r2 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

This is a cool idea. I think we would want to implement this via a system view with a different name.

👍


docs/RFCS/20210330_sql_stats_persistence.md, line 100 at r2 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

options*

Done.


docs/RFCS/20210330_sql_stats_persistence.md, line 107 at r2 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

I think I like this first option better. We can tune it to be less hungry - have aggressive timeouts and don't fail if one node isn't there.

👍


docs/RFCS/20210330_sql_stats_persistence.md, line 108 at r2 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

I don't like how this option has the side effect of changing the sampling rate and increases the disk throughput required.

Hmm, I don't quite understand how adjusting the flush interval leads to a change in the sampling rate?


docs/RFCS/20210330_sql_stats_persistence.md, line 125 at r2 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

I think we could use the job system if there was 1 coordinator node that asks everyone to flush. But if you want everyone to flush on their own accord, I don't think a job is the right thing.

I was just thinking about how we can address transaction conflicts. I suppose we can just restart the transaction after some interval, perhaps with exponential backoff.


docs/RFCS/20210330_sql_stats_persistence.md, line 135 at r2 (raw file):

Previously, maryliag (Marylia Gutierrez) wrote…

what does this timestamp means? The time these metrics were inserted?
considering these are aggregated, we should have information about first/last time the statement was run

The timestamp here means the time of insertion. But yes, this is a good point; I should make this clearer since we will have other additional timestamp data.


docs/RFCS/20210330_sql_stats_persistence.md, line 254 at r2 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

In the current schema, the primary key also includes whether the statement failed, and whether it happened in an implicit transaction. We probably want to preserve those semantics.

We already use the statement failure status and the implicit transaction flag to construct the stmtID. I'm not sure we gain an additional performance benefit by adding those two boolean columns to the primary key.


docs/RFCS/20210330_sql_stats_persistence.md, line 265 at r2 (raw file):

Previously, maryliag (Marylia Gutierrez) wrote…

*fashion

Done.


docs/RFCS/20210330_sql_stats_persistence.md, line 329 at r2 (raw file):

Previously, maryliag (Marylia Gutierrez) wrote…

I think we should delete only after we know for sure the insertion step (the next one on this list) was successful, otherwise you might delete, and if something happens before the next insert you lost the data

The entire flush operation is done within a single transaction. This means that all these steps are done in an atomic fashion.


docs/RFCS/20210330_sql_stats_persistence.md, line 332 at r2 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

If we do it at the current clock timestamp, do we have the risk of accidentally pushing forward the timestamp over and over, never accumulating a new historical entry? Since other nodes will keep finding an entry in the past windows. It feels like maybe we should keep the original timestamp for these entries.

I think using current_timestamp() in this case would be ok. This is because we calculate the aggregate window as

interval_begin = floor(current_timestamp() / <interval>) * <interval>
interval_end = interval_begin + <interval>

aggregate_window = [interval_begin, interval_end)

This means that interval_begin and interval_end are determined only by the value of <interval>; the window boundaries do not shift with current_timestamp().

In other words, the aggregate_window is a fixed window instead of a rolling window.

Since current_timestamp() is monotonically increasing, if current_timestamp() ∈ aggregate_window, the existing entry is updated; if current_timestamp() ∉ aggregate_window, a new entry is created.


I think having an increasing timestamp every time we update the value is not a big issue here. If we update an existing value instead of inserting a new one, we are still within the existing aggregate_window, so the updated value behaves the same as it would with the original insertion timestamp.
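For concreteness, a minimal sketch of the fixed-window computation, assuming a 5-minute (300s) aggregation interval and working in Unix seconds:

``` sql
-- interval_begin/interval_end depend only on the interval size, so every
-- flush that happens inside the same 5-minute bucket targets the same window
SELECT
  floor(extract(epoch FROM now()) / 300) * 300       AS interval_begin_unix,
  floor(extract(epoch FROM now()) / 300) * 300 + 300 AS interval_end_unix;
```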


docs/RFCS/20210330_sql_stats_persistence.md, line 362 at r2 (raw file):

Previously, maryliag (Marylia Gutierrez) wrote…

we can have some sort of flag to identify the rows we are handling at the moment, and only after the updated version is inserted, delete those ones. You don't want to risk losing the data before inserting.
Or instead of deleting you can simply update the existing one, if it fails, you won't lose the existing data

The atomicity of the entire operation is ensured by running all 4 steps in a single transaction.


docs/RFCS/20210330_sql_stats_persistence.md, line 407 at r2 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

Yes, I think we should start thinking about doing something like this. But I think it would be hard to combine all of this work together. Let's make sure this design is extensible to include information like that, and we can start asking the Queries team to see if they can start generating a plan ID for plans.

sounds good.

@Azhng Azhng force-pushed the rfc-stats-persistence branch from eac0f7a to 6999947 Compare April 19, 2021 18:15
Contributor Author

@Azhng Azhng left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis, @kevin-v-ngo, and @maryliag)


docs/RFCS/20210330_sql_stats_persistence.md, line 362 at r2 (raw file):

Previously, Azhng (Archer Zhang) wrote…

The atomicity of the entire operation is ensured through having all 4 steps ran in a single transaction.

My thought on using a single delete instead of individual updates here is that issuing a single large delete creates simpler KV operations in the backend, whereas issuing many individual operations might have more overhead and end up issuing more KV requests.

@jordanlewis what are your thoughts on this? I'm not too familiar with how DML gets translated into KV operations in this case.

@Azhng Azhng force-pushed the rfc-stats-persistence branch 4 times, most recently from 367b578 to 57ffbc7 Compare April 21, 2021 17:50
@Azhng Azhng marked this pull request as ready for review April 21, 2021 17:50
@Azhng Azhng requested a review from a team as a code owner April 21, 2021 17:50
@Azhng Azhng requested review from rytaft and sumeerbhola April 21, 2021 17:50
Contributor

@bdarnell bdarnell left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @Azhng, @jordanlewis, @kevin-v-ngo, @maryliag, @rytaft, and @sumeerbhola)


docs/RFCS/20210330_sql_stats_persistence.md, line 80 at r4 (raw file):

## Design Overview

Two new system tables `system.experimetnal_sql_stmt_stats` and

Renaming this to eventually remove the experimental prefix seems like more trouble than it's worth; I'd just name it as if it's not experimental (there's also a typo here, but it's moot if you accept the main suggestion).


docs/RFCS/20210330_sql_stats_persistence.md, line 89 at r4 (raw file):

1. at the end of a fixed interval determined by a cluster setting. E.g. every 5
   mins.
1. when user explicitly requests all in-memory statistics to be persisted. E.g.

How fast would the fixed interval need to be so that we could avoid building and exposing an explicit "flush" action? (for example, we don't have a flush action for our timeseries metrics that flush every 10s). I would prefer to speed up the automatic flush (every 60 seconds?) and say "wait up to 1 minute for fresh data" instead of giving a more frequent manual option.

I'd also reserve the option to do automatic flushes on different triggers, such as on memory pressure, or after an unusually slow query.
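If the automatic interval does become the main knob, it could be exposed as a cluster setting along these lines (the setting name here is hypothetical):

``` sql
-- hypothetical setting name; the default and bounds would need discussion
SET CLUSTER SETTING sql.stats.flush.interval = '10m';
```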


docs/RFCS/20210330_sql_stats_persistence.md, line 133 at r4 (raw file):

    last_run_at  TIMESTAMP NOT NULL,

    -- stats

That's a lot of columns, and when you have this many columns you've surely missed some. Migrations on system tables are somewhat painful, so I'd generally prefer to store stats in a more extensible format (i.e. json) as much as possible.
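A hedged sketch of what that more extensible layout could look like: keep only the key columns relational and fold the per-window statistics into a single JSONB document (all names illustrative):

``` sql
CREATE TABLE system.sql_stmt_stats (
    last_run    TIMESTAMP NOT NULL,
    fingerprint INT NOT NULL,
    app_name    STRING NOT NULL,
    -- all per-window stats live in one document, so adding a new stat
    -- does not require a system-table migration
    statistics  JSONB NOT NULL,
    PRIMARY KEY (last_run, fingerprint, app_name)
);
```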


docs/RFCS/20210330_sql_stats_persistence.md, line 138 at r4 (raw file):

    max_retries         INT8 NOT NULL,
    num_rows            FLOAT8 NOT NULL,
    num_rows_sd         FLOAT8 NOT NULL,

Storing the standard deviation here is less flexible (it's difficult to aggregate) than storing the sum of squared differences (as we do in the internal roachpb.NumericStat).

I initially interpreted _sd as "standard deviation" instead of "squared differences"; a more explicit name may be a good idea.
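For reference, two (count, mean, squared-differences) summaries compose with the usual parallel-variance merge, which is why the squared differences are the more aggregation-friendly thing to store. A self-contained sketch with made-up numbers (column names are illustrative, mirroring the Mean/SquaredDiffs fields of roachpb.NumericStat):

``` sql
WITH
  a AS (SELECT 10.0 AS count, 2.5 AS mean, 4.0 AS sqdiff),
  b AS (SELECT  5.0 AS count, 3.1 AS mean, 1.2 AS sqdiff)
SELECT
  a.count + b.count                                                   AS count,
  (a.mean * a.count + b.mean * b.count) / (a.count + b.count)         AS mean,
  -- merged sum of squared differences from the mean
  a.sqdiff + b.sqdiff
    + (b.mean - a.mean) ^ 2 * a.count * b.count / (a.count + b.count) AS sqdiff
FROM a, b;
```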


docs/RFCS/20210330_sql_stats_persistence.md, line 242 at r4 (raw file):

The table's primary key is composed of `(app_name, fingerprint, created_at)`.
This is to avoid having all nodes writing stats to the same range at the same

"This" meaning the fact that fingerprint comes before the timestamp in the key?

A fingerprint-first key also makes it easy to explore the history of a given query, but it means you have to do a full table scan to find out all of the most recent fingerprints (i.e. what we currently do with a fan-out query. We should design the schema with the assumption that the fan-out option is unavailable, i.e. in serverless). I think we probably need to support both - maybe a (hash-sharded) index on the timestamp in addition to the PK on the fingerprint? (or should the hash-sharded timestamp index be primary since the statement page load is more important than one query's details?)
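A hedged sketch of those two access paths, assuming a last_run timestamp column (the hash-sharded index syntax below is the existing CockroachDB syntax; index names are illustrative):

``` sql
-- "what ran recently" scans: hash-sharded index on the window timestamp so
-- sequential timestamps don't all land on one range
CREATE INDEX stmt_stats_recent_idx
    ON system.sql_stmt_stats (last_run)
    USING HASH WITH BUCKET_COUNT = 8;

-- "how has this fingerprint behaved over time": plain index by fingerprint
CREATE INDEX stmt_stats_history_idx
    ON system.sql_stmt_stats (fingerprint, last_run);
```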


docs/RFCS/20210330_sql_stats_persistence.md, line 243 at r4 (raw file):

The table's primary key is composed of `(app_name, fingerprint, created_at)`.
This is to avoid having all nodes writing stats to the same range at the same
time, which would result in write-write contentions and range hotspot.

You'll still have contention in a large cluster - each node will be doing a read-modify-write of all the top statement fingerprints. I think it may be better to include a node ID in the PK and all indexes so that different nodes never write to exactly the same key (although this forces you to use more complex aggregation queries for all accesses to this table).


docs/RFCS/20210330_sql_stats_persistence.md, line 256 at r4 (raw file):

### Example queries that can be used to answer query performance related questions:

#### Querying attributes over a time period for a statement.

None of these queries include an app name, but it's the first column of the PK, so they're all full table scans. I think the app name should probably be the last thing in the PK (if it's there at all)


docs/RFCS/20210330_sql_stats_persistence.md, line 293 at r4 (raw file):

SELECT
  fingerprint,
  avg(service_lat) as avg_service_latency,

service_lat is already an average, so this is giving an average-of-averages and a max-of-averages. Instead of average-of-averages, you should recompute the true average. I'd probably replace the max-of-averages with computing a 2 sigma value but that's getting complicated to compute in SQL.

The problem with average-of-averages and max-of-averages is that they're sensitive to differences between different sample periods. If we have a manual flush button, someone jamming on that button will result in a lot of short sample periods, skewing the results. If there's an automatic flush after an exceptionally slow query, that will skew the results higher. Recomputing the summary statistics lets us avoid any problems due to varying sampling periods.


docs/RFCS/20210330_sql_stats_persistence.md, line 319 at r4 (raw file):

1. Insert the new stats we created in stage 2 into the system table with the
   current transaction timestamp.
1. Check if number of rows in the persisted table has exceeded maximum limit.

This should probably be a separate process. Look at how we GC from the system.rangelog table; it's decoupled from how we insert into that table.
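A hedged sketch of what such a decoupled pruning pass might look like; the retention window and batch size here are placeholders that would presumably come from cluster settings:

``` sql
-- run periodically by a single node, independent of the flush path; deleting
-- in small batches keeps each transaction short
DELETE FROM system.sql_stmt_stats
WHERE last_run < now() - INTERVAL '30 days'
ORDER BY last_run
LIMIT 1000;
```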


docs/RFCS/20210330_sql_stats_persistence.md, line 405 at r4 (raw file):

  that has been pruned.

* Currently, we perform aggregation in-memory because of the custom logic in

I think we have to do some sort of pagination here - instead of reading everything in the current aggregation window, use some limit, merge those with our in-memory data, flush them out, and repeat (or conversely, only query those records for which we have an in-memory entry instead of everything from the current window).
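A hedged sketch of that paginated read, resuming from the last primary key seen in the previous batch (bind parameters: $1 = window start, $2/$3 = last fingerprint and app name seen; the statistics column follows the JSONB variant sketched earlier):

``` sql
-- read one page of the current aggregation window; the caller merges these
-- rows with its in-memory entries, flushes, then advances the cursor
SELECT last_run, fingerprint, app_name, statistics
FROM system.sql_stmt_stats
WHERE last_run >= $1
  AND (fingerprint, app_name) > ($2, $3)
ORDER BY fingerprint, app_name
LIMIT 128;
```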

Contributor Author

@Azhng Azhng left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell, @jordanlewis, @kevin-v-ngo, @maryliag, @rytaft, and @sumeerbhola)


docs/RFCS/20210330_sql_stats_persistence.md, line 80 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Renaming this to eventually remove the experimental prefix seems like more trouble than its worth; I'd just name it as if it's not experimental (there's also a typo here, but it's moot if you accept the main suggestion).

Ah good catch, I think the spelling error is why it escaped my search-and-replace 🤦


docs/RFCS/20210330_sql_stats_persistence.md, line 89 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

How fast would the fixed interval need to be so that we could avoid building and exposing an explicit "flush" action? (for example, we don't have a flush action for our timeseries metrics that flush every 10s). I would prefer to speed up the automatic flush (every 60 seconds?) and say "wait up to 1 minute for fresh data" instead of giving a more frequent manual option.

I'd also reserve the option to do automatic flushes on different triggers, such as on memory pressure, or after an unusually slow query.

I agree that we should definitely have the option to do automatic flush on different triggers.

Though I'm not sure I entirely understood the reasoning behind having a smaller fixed flush interval. We can still serve fresh stats with a long flush interval by combining the stats read from a SQL query with the in-memory stats, but that does introduce a dependency on the RPC fanout. Is the reasoning for having a smaller flush interval that we can directly serve fresh stats using only a SQL query, without relying on the RPC fanout?


docs/RFCS/20210330_sql_stats_persistence.md, line 133 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

That's a lot of columns, and when you have this many columns you've surely missed some. Migrations on system tables are somewhat painful, so I'd generally prefer to store stats in a more extensible format (i.e. json) as much as possible.

Interesting, I'll try to build a quick prototype to test out using JSON for all the stats.


docs/RFCS/20210330_sql_stats_persistence.md, line 138 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Storing the standard deviation here is less flexible (it's difficult to aggregate) than storing the sum of squared differences (as we do in the internal roachpb.NumericStat).

I initially interpreted _sd as "standard deviation" instead of "squared differences"; a more explicit name may be a good idea.

Good point


docs/RFCS/20210330_sql_stats_persistence.md, line 242 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

"This" meaning the fact that fingerprint comes before the timestamp in the key?

A fingerprint-first key also makes it easy to explore the history of a given query, but it means you have to do a full table scan to find out all of the most recent fingerprints (i.e. what we currently do with a fan-out query. We should design the schema with the assumption that the fan-out option is unavailable, i.e. in serverless). I think we probably need to support both - maybe a (hash-sharded) index on the timestamp in addition to the PK on the fingerprint? (or should the hash-sharded timestamp index be primary since the statement page load is more important than one query's details?)

We want to eventually be able to examine the performance of a query over an arbitrary period of time. Having a hash-sharded timestamp would make developing the comparison mechanism more difficult.

For supporting statement page loads, we would need to perform a full table scan anyway, regardless of whether we use a hash-sharded index or not, in order to aggregate statement/transaction statistics across all time periods for each fingerprint.

For finding out all of the most recent fingerprints, I think I'm missing something here. I'm not sure why we need to find the most recent fingerprints. DB Console sorts statements/transactions on the statement page based on the selected statistics attribute. I don't think the Statement Page can sort based on the first/last executed time of a statement fingerprint.


docs/RFCS/20210330_sql_stats_persistence.md, line 243 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

You'll still have contention in a large cluster - each node will be doing a read-modify-write of all the top statement fingerprints. I think it may be better to include a node ID in the PK and all indexes so that different nodes never write to exactly the same key (although this forces you to use more complex aggregation queries for all accesses to this table).

Good point. I think I will update the PK to be (fingerprint, timestamp, app_name, node_id)


docs/RFCS/20210330_sql_stats_persistence.md, line 256 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

None of these queries include an app name, but it's the first column of the PK, so they're all full table scans. I think the app name should probably be the last thing in the PK (if it's there at all)

Good point. I will move app_name to the latter part of the PK.


docs/RFCS/20210330_sql_stats_persistence.md, line 293 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

service_lat is already an average, so this is giving an average-of-averages and a max-of-averages. Instead of average-of-averages, you should recompute the true average. I'd probably replace the max-of-averages with computing a 2 sigma value but that's getting complicated to compute in SQL.

The problem with average-of-average and max-of-averages is that they're sensitive to differences between different sample periods. If we have a manual flush button, someone jamming on that button will result in a lot of short sample periods, skewing the results. If there's an automatic flush after an exceptionally slow query, that will skew the results higher. Recomputing the summary statistics lets us avoid any problems due to varying sampling periods.

I see 🤔 , I will come back and update these queries.


docs/RFCS/20210330_sql_stats_persistence.md, line 319 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This should probably be a separate process. Look at how we GC from the system.rangelog table; it's decoupled from how we insert into that table.

I was reading the code for GC of the system.rangelog table. I have a few questions:

  • It wasn't immediately obvious to me, but I was wondering how fast the number of entries in system.rangelog and system.eventlog grows when a cluster is under load?
  • The GC mechanism ensures that only the node that is the leaseholder of range 1 performs the GC task. Is there a reason the job infrastructure is not used in this case?

I'm curious about the growth rate of the system.rangelog table because the reason GC-on-insert was considered here was that we worried about the rapid growth of the statistics table in a large cluster (i.e. each write for a distinct fingerprint will be amplified by a factor of the size of the cluster). But maybe this problem is already handled for system.rangelog.


docs/RFCS/20210330_sql_stats_persistence.md, line 405 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I think we have to do some sort of pagination here - instead of reading everything in the current aggregation window, use some limit, merge those with our in-memory data, flush them out, and repeat (or conversely, only query those records for which we have an in-memory entry instead of everything from the current window).

True, I will add it in.

Contributor

@knz knz left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @Azhng, @bdarnell, @jordanlewis, @kevin-v-ngo, @maryliag, @rytaft, and @sumeerbhola)


docs/RFCS/20210330_sql_stats_persistence.md, line 80 at r4 (raw file):

Previously, Azhng (Archer Zhang) wrote…

Ah good catch, I think the spelling error is why it escaped my search and replaced 🤦

I agree we should avoid the experimental_ prefix in anything persisted.


docs/RFCS/20210330_sql_stats_persistence.md, line 109 at r4 (raw file):

disadvantage of the existing designs, such as data-loss on crashes, inaccurate
stats if nodes becomes unavailable etc.

Can you extend this RFC with a discussion of what happens when the stats table(s) are unavailable.

  • does the data continue to accumulate in RAM? By how much?
  • how would an operator notice that something is amiss?
  • can the sync-to-table mechanism be temporarily disabled via a cluster setting?

docs/RFCS/20210330_sql_stats_persistence.md, line 319 at r4 (raw file):

how fast does the amount of entries in the system.rangelog and system.eventlog grow when a cluster is under load?

It varies, of course, depending on activity. However it's unlikely to grow by more than a hundred thousand rows per day, even for large clusters. That is not significant growth.

Is there a reason that job infrastructure is not used in this case?

The rangelog is populated "under" SQL, in KV. We do not have a dependency on the job subsystem at that level.

the reason why GC-on-insert was considered here was that we worried about the rapid rate of the growth of the statistics table size in a large cluster

If that is your worry, then the solution to that is a more aggressive stat GC frequency, perhaps 4-8 times per day instead of 1 per day.

Maybe it's also premature optimization? What about running TPC-C without the proposed mechanism, seeing how fast the table grows, and then deciding based on that?


docs/RFCS/20210330_sql_stats_persistence.md, line 364 at r4 (raw file):

``` sql
INSERT INTO system.sql_stmt_stats (...fields...)

Stages 2-4 should be atomic, I think; otherwise you can lose data (if there's a concurrent process doing the same elsewhere).

The atomicity of this entire process should be documented.

Contributor

@knz knz left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @Azhng, @bdarnell, @jordanlewis, @kevin-v-ngo, @maryliag, @rytaft, and @sumeerbhola)


docs/RFCS/20210330_sql_stats_persistence.md, line 293 at r4 (raw file):

Previously, Azhng (Archer Zhang) wrote…

I see 🤔 , I will come back and update these queries.

One trick is to accumulate separately:

  • a running sum of the latencies
  • a running sum of the sampling periods

then you can divide one by the other to get the average at any point, even if the sampling period varies.
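In terms of the proposed per-window columns, that works out to a weighted mean: keep a per-window execution count (a hypothetical column name below) alongside the per-window mean latency, and divide the accumulated totals:

``` sql
-- recompute the true mean across windows by weighting each per-window mean
-- by its execution count, instead of averaging the averages
SELECT
  fingerprint,
  sum(service_lat * execution_count) / sum(execution_count) AS avg_service_latency
FROM system.sql_stmt_stats
WHERE last_run > now() - INTERVAL '1 day'
GROUP BY fingerprint;
```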

Contributor

@bdarnell bdarnell left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @Azhng, @bdarnell, @jordanlewis, @kevin-v-ngo, @maryliag, @rytaft, and @sumeerbhola)


docs/RFCS/20210330_sql_stats_persistence.md, line 89 at r4 (raw file):

Previously, Azhng (Archer Zhang) wrote…

I agree that we should definitely have the option to do automatic flush on different triggers.

Though I'm not sure if I entirely understood the reasoning behind having a smaller fixed flush interval. We can still serve fresh stats with a long flush interval by combing the stats read from a SQL query and the in-memory stats. But this does require dependency on RPC fanout. So is the reasoning for having a smaller flush interval so that we can directly serve fresh stats using only SQL query without relying on the RPC fanout?

Yes, I was assuming based on the motivation section that the goal was to eliminate the RPC fanout completely.

If you're thinking of keeping the fanout, then I don't really see why you'd want an explicit SQL-accessible flush (except I suppose for testing).


docs/RFCS/20210330_sql_stats_persistence.md, line 242 at r4 (raw file):

Having a hash-sharded timestamp would make developing the comparison mechanism more difficult.

I'm not sure what you're worried about with a hash-sharded index, but what I'm suggesting here is a hash-sharded index by timestamp, and a non-hash-sharded index by (fingerprint, timestamp). Investigating the performance of a single statement would use the latter index and wouldn't need to worry about hashing.

For supporting statement page loads, we would need to perform full table scan anyway, regardless of either using hash-sharded index or not, in order to aggregate statement/transaction statistics across all time period for each fingerprint.

Why for all time? I would think we'd only want to show data from the last day or so on the statements page. But we could have a year or more of history saved here, so it's important to be able to filter somehow. This is what the hash-sharded timestamp index I'm proposing is for.

For finding out all of the most recent fingerprints, I think I'm missing something here. I'm not sure why do we need to find the most recent fingerprints.

What I mean by most recent fingerprints is that I'm assuming we only want to show statements that have been executed in the last day; we don't want to clutter the (default) view with statements that haven't been executed in a long time.

@Azhng Azhng force-pushed the rfc-stats-persistence branch from 57ffbc7 to 65ae889 Compare April 22, 2021 17:28
Collaborator

@rytaft rytaft left a comment


Reviewed 1 of 1 files at r6.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @Azhng, @bdarnell, @jordanlewis, @kevin-v-ngo, @maryliag, and @sumeerbhola)


docs/RFCS/20210330_sql_stats_persistence.md, line 15 at r6 (raw file):

restart/upgrade. This feature would also enable users of CockroachDB to examine and
compare the historical statistics of statements and transactions over time. As
a result, CockroachDB will gain the ability to helps users to easily identify

nit: helps -> help


docs/RFCS/20210330_sql_stats_persistence.md, line 26 at r6 (raw file):

console. However, after the retention period for the collected statistics
expires, users are no longer to be able to access these statistics. There are
few significant problems with the current setup:

nit: few -> a few


docs/RFCS/20210330_sql_stats_persistence.md, line 45 at r6 (raw file):

      solely relying on DB console to perform slicing-n-dicing of the
      statistics data is not scalable.
1. As CockroachDB moving toward multi-tenant architecture, relying on fanout RPC

nit: As CockroachDB moving toward multi-tenant architecture -> As CockroachDB is moving toward a multi-tenant architecture


docs/RFCS/20210330_sql_stats_persistence.md, line 50 at r6 (raw file):

   communication implementation, which is not ideal.

The persistence of SQL statistics in a CockroachDB system table can addresses existing

nit: can addresses -> can address


docs/RFCS/20210330_sql_stats_persistence.md, line 54 at r6 (raw file):

1. **Usability**: with DistSQL we will be able to process more complex queries
  to answer the questions users might have for the performance of their queries
  overtime.

overtime -> over time


docs/RFCS/20210330_sql_stats_persistence.md, line 66 at r6 (raw file):

* Collected statistics should be able to answer users' potential questions for
  their queries overtime through both DB Console and SQL shell.

overtime -> over time


docs/RFCS/20210330_sql_stats_persistence.md, line 72 at r6 (raw file):

  query/txn performance.

* There is a need for mechanism to prune old statistics data to reduce the

nit: mechanism -> a mechanism


docs/RFCS/20210330_sql_stats_persistence.md, line 73 at r6 (raw file):

* There is a need for mechanism to prune old statistics data to reduce the
  burden on storage space. The setting for pruning mechanism should also be

nit: pruning mechanism -> the pruning mechanism


docs/RFCS/20210330_sql_stats_persistence.md, line 85 at r6 (raw file):

Currently, each CockroachDB node stores in-memory statistics for transactions and
statements in which the node is the gateway for. The in-memory statistics is

nit: statistics is -> statistics are


docs/RFCS/20210330_sql_stats_persistence.md, line 104 at r6 (raw file):

queries in order to minimize read-write contention. However, for the most
up-to-date statistics, we still need to utilize RPC fanout to retrieve the
in-memory statistics from each node. The pros for this options is that this is

nit: this options -> this option


docs/RFCS/20210330_sql_stats_persistence.md, line 177 at r6 (raw file):

    app_name       STRING NOT NULL,
    fingerprint    INT NOT NULL,
    created_at     TIMESTAMP NOT NULL,

should you also store the start and end time of the bucket? that seems more important than creation time, since one entry could represent the aggregation of several stats collections


docs/RFCS/20210330_sql_stats_persistence.md, line 179 at r6 (raw file):

    created_at     TIMESTAMP NOT NULL,

    -- metadata

Other than statement_ids, the rest of this seems more similar to the stats columns in system.sql_stmt_stats rather than metadata.


docs/RFCS/20210330_sql_stats_persistence.md, line 241 at r6 (raw file):

The table's primary key is composed of (app_name, fingerprint, created_at).

What is the app_name?


docs/RFCS/20210330_sql_stats_persistence.md, line 243 at r6 (raw file):

The table's primary key is composed of `(app_name, fingerprint, created_at)`.
This is to avoid having all nodes writing stats to the same range at the same
time, which would result in write-write contentions and range hotspot.

nit: contentions and range hotspot -> contention and range hotspots


docs/RFCS/20210330_sql_stats_persistence.md, line 246 at r6 (raw file):

The metadata fields record high-level information about the queries with given
statement fingerprint.

How is this aggregated across multiple runs?


docs/RFCS/20210330_sql_stats_persistence.md, line 402 at r6 (raw file):

* Currently, this schema does not enforce foreign key constraints between
  the transaction statistics and statement statistics. This means that it is
  possible for an transaction stats entry to reference a statement stats entry

nit: an transaction -> a transaction


docs/RFCS/20210330_sql_stats_persistence.md, line 420 at r6 (raw file):

  proven)So if we instead delete all the stats stats belonging to the oldest
  aggregation window, we can ensure that all the statement fingerprints
  referenced by transactions are valid in the statement table.

have you considered removing multiple rows at once by aggregating at larger bucket sizes for old data? e.g., if the bucket size for recent data is 5 minutes, you could use a bucket size of 1 hour or larger for historical data. This would allow you to keep around more historical data.

This might also be something you could add later as a future enhancement, and you might want to store it in a different table.


docs/RFCS/20210330_sql_stats_persistence.md, line 437 at r6 (raw file):

* Should we store some sort of query plan hash so we can track how the query
  plan has changed for the same query fingerprint?

how will you aggregate these plans if there are multiple plans per fingerprint per time window?

Contributor

@ajwerner ajwerner left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @Azhng, @bdarnell, @jordanlewis, @kevin-v-ngo, @maryliag, and @sumeerbhola)


docs/RFCS/20210330_sql_stats_persistence.md, line 319 at r4 (raw file):

What about running TPC-C without the mechanism proposed, see how fast the table grows, and then decide based on that?

A different mode to consider is one where an ORM is generating a distinct query per invocation. I suspect that there are use cases where that happens.

Collaborator

@sumeerbhola sumeerbhola left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @Azhng, @bdarnell, @jordanlewis, @kevin-v-ngo, @maryliag, and @rytaft)


docs/RFCS/20210330_sql_stats_persistence.md, line 242 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Having a hash-sharded timestamp would make developing the comparison mechanism more difficult.

I'm not sure what you're worried about with a hash-sharded index, but what I'm suggesting here a hash-sharded index by timestamp, and a non-hash-sharded index by (fingerprint, timestamp). Investigating the performance of a single statement would use the latter index and wouldn't need to worry about hashing.

For supporting statement page loads, we would need to perform full table scan anyway, regardless of either using hash-sharded index or not, in order to aggregate statement/transaction statistics across all time period for each fingerprint.

Why for all time? I would think we'd only want to show data from the last day or so on the statements page. But we could have a year or more of history saved here, so it's important to be able to filter somehow. This is what the hash-sharded timestamp index I'm proposing is for.

For finding out all of the most recent fingerprints, I think I'm missing something here. I'm not sure why do we need to find the most recent fingerprints.

What I mean by most recent fingerprints is that I'm assuming we only want to show statements that have been executed in the last day; we don't want to clutter the (default) view with statements that haven't been executed in a long time.

Given our focus on cloud (serverless or otherwise), have we considered persisting this data outside CockroachDB?
It is typical to store data with a large number of columns in a columnar store with cheap scans and no indexes. And that would decouple availability of this data from availability of the main cluster, which is desirable for observability data.
Data collection in this model typically does not involve any contention since it is akin to appending to a log -- an initial flushed row could correspond to a 1min time bucket. Rows that are older than N hours would be read and rewritten to coarser time buckets.


docs/RFCS/20210330_sql_stats_persistence.md, line 241 at r6 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

What is the app_name?

Same question. Also, is created_at the start time of the bucket?


docs/RFCS/20210330_sql_stats_persistence.md, line 246 at r6 (raw file):

The metadata fields record high-level information about the queries with given
statement fingerprint.

Does having the same statement fingerprint imply that all these metadata values will be the same across statements?

@tbg tbg changed the title rfc stats persistence RFCS: sql stats persistence Apr 23, 2021
@Azhng Azhng force-pushed the rfc-stats-persistence branch from 65ae889 to 373ea01 Compare April 27, 2021 03:18
Contributor Author

@Azhng Azhng left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell, @jordanlewis, @kevin-v-ngo, @knz, @maryliag, @rytaft, and @sumeerbhola)


docs/RFCS/20210330_sql_stats_persistence.md, line 80 at r4 (raw file):

Previously, knz (kena) wrote…

I agree should avoid the experimental_ prefix in anything persisted.

Done.


docs/RFCS/20210330_sql_stats_persistence.md, line 89 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Yes, I was assuming based on the motivation section that the goal was to eliminate the RPC fanout completely.

If you're thinking of keeping the fanout, then I don't really see why you'd want an explicit SQL-accessible flush (except I suppose for testing).

I see. Will remove the manual trigger 👍


docs/RFCS/20210330_sql_stats_persistence.md, line 109 at r4 (raw file):

Previously, knz (kena) wrote…

Can you extend this RFC with a discussion of what happens when the stats table(s) are unavailable.

  • does the data continue to accumulate in RAM? By how much?
  • how would an operator notice that something is amiss?
  • can the sync-to-table mechanism be temporarily disabled via a cluster setting?

Good point. Will add those components in.


docs/RFCS/20210330_sql_stats_persistence.md, line 242 at r4 (raw file):

Previously, sumeerbhola wrote…

Given our focus on cloud (serverless or otherwise), have we considered persisting this data outside CockroachDB?
It is typical to store data with a large number of columns in a columnar store with cheap scans and no indexes. And that would decouple availability of this data from availability of the main cluster, which is desirable for observability data.
Data collection in this model typically does not involve any contention since it is akin to appending to a log -- an initial flushed row could correspond to a 1min time bucket. Rows that are older than N hours would be read and rewritten to coarser time buckets.

  • Updated the RFC to use a hash-sharded PK on the timestamp to better serve loading the statement page
  • Introduced a (fingerprint, timestamp) index to better serve loading the performance of a query over time

As for persisting the data outside of CRDB, @jordanlewis, @kevin-v-ngo, do you have any thoughts on this?


docs/RFCS/20210330_sql_stats_persistence.md, line 319 at r4 (raw file):

This should probably be a separate process. Look at how we GC from the system.rangelog table; it's decoupled from how we insert into that table.

Decoupled GC operation from flush operation

A different mode to consider is one where an ORM is generating a distinct query per invocation. I suspect that there are use cases where that happens.

Is there a workload I can run to simulate this type of scenario?


docs/RFCS/20210330_sql_stats_persistence.md, line 364 at r4 (raw file):

Previously, knz (kena) wrote…

The stages 2-4 should be atomic I think, otherwise you can lose data (if there's a concurrent process doing the same elsewhere).

The atomicity of this entire process should be documented.

Sounds good.


docs/RFCS/20210330_sql_stats_persistence.md, line 54 at r6 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

overtime -> over time

Done.


docs/RFCS/20210330_sql_stats_persistence.md, line 66 at r6 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

overtime -> over time

Done.


docs/RFCS/20210330_sql_stats_persistence.md, line 177 at r6 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

should you also store the start and end time of the bucket? that seems more important than creation time, since one entry could represent the aggregation of several stats collections

Makes sense. I think I can store two timestamps, first_run and last_run, and the last_run timestamp can replace created_at as part of the primary key.


docs/RFCS/20210330_sql_stats_persistence.md, line 179 at r6 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Other than statement_ids, the rest of this seems more similar to the stats columns in system.sql_stmt_stats rather than metadata.

Ah good catch. I should have deleted this line.


docs/RFCS/20210330_sql_stats_persistence.md, line 241 at r6 (raw file):

What is the app_name?

app_name is the name of the application that issued the query. Since the same query issued by different applications will have the same fingerprint, we added app_name here to disambiguate them.

Same question. Also, is created_at the start time of the bucket?

created_at was originally the timestamp at which the entry was created. But I think it makes more sense to replace it with a last_run timestamp.

The time bucket is implicitly calculated from the aggregation interval, i.e. [ts - (ts % interval), ts - (ts % interval) + interval).


docs/RFCS/20210330_sql_stats_persistence.md, line 246 at r6 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

How is this aggregated across multiple runs?

The idea is that when the metadata fields change, the fingerprint changes as well. However, currently the full_scan, distSQL, vec and opt fields are not used when calculating the fingerprint. This can be solved by introducing the plan_hash field into the PK and index.


docs/RFCS/20210330_sql_stats_persistence.md, line 246 at r6 (raw file):

Previously, sumeerbhola wrote…

Does having the same statement fingerprint imply that all these metadata values will be the same across statements?

Yes, (mostly). Right now the full_scan, distSQL, vec and opt fields are not used to calculate the fingerprint. I will also introduce another column called plan_hash to reflect changes in those fields.


docs/RFCS/20210330_sql_stats_persistence.md, line 420 at r6 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

have you considered removing multiple rows at once by aggregating at larger bucket sizes for old data? e.g., if the bucket size for recent data is 5 minutes, you could use a bucket size of 1 hour or larger for historical data. This would allow you to keep around more historical data.

This might also be something you could add later as a future enhancement, and you might want to store it in a different table.

Yes this would be ideal. I will add a section describing the downsampling. Though for the MVP I think I will still only keep the simplest approach.


docs/RFCS/20210330_sql_stats_persistence.md, line 437 at r6 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

how will you aggregate these plans if there are multiple plans per fingerprint per time window?

Good point. I think we want to have a separate plan_hash column as part of the PK and index, so the same query with different plans will be grouped separately, but we can still inspect how the plan for a given query changes over a period of time.

I filed this issue to introduce a plan hash

Contributor

@awoods187 awoods187 left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @Azhng, @bdarnell, @jordanlewis, @kevin-v-ngo, @knz, @maryliag, @rytaft, and @sumeerbhola)


docs/RFCS/20210330_sql_stats_persistence.md, line 23 at r7 (raw file):

Currently, CockroachDB stores the statement and transaction metrics in memory. The
retention policy for the in-memory storage is one hour by default. During this
one-hour period, users can query this in-memory data structure through DB

Nit: through the DB Console


docs/RFCS/20210330_sql_stats_persistence.md, line 29 at r7 (raw file):

retention policy for the in-memory storage is one hour by default. During this
one-hour period, users can query this in-memory data structure through DB

Might be worth explaining that people want context to know whether the current results are bad, and the way they can do that is to compare against historical results. In particular, this would be an operator experience, since operators do not always have familiarity with the workload.


docs/RFCS/20210330_sql_stats_persistence.md, line 52 at r7 (raw file):

The persistence of SQL statistics in a CockroachDB system table can address existing
drawbacks. CockroachDB will gain improvement in two areas:
1. **Usability**: with DistSQL we will be able to process more complex queries

Could you explain how this is addressed?


docs/RFCS/20210330_sql_stats_persistence.md, line 80 at r7 (raw file):

## Design Overview

Two new system tables `system.sql_stmt_stats` and

Nit: I'd prefer if we spell out statement and transaction. Our current tables mix the two (node_transactions and node_txn_stats), but I think it's cleaner to spell them both out.


docs/RFCS/20210330_sql_stats_persistence.md, line 94 at r7 (raw file):

are flushed into system tables in one of the following scenarios:
1. at the end of a flush fixed interval (determined by a cluster setting).
1. when we experience memory pressure.

Could we be more precise here? How much pressure?


docs/RFCS/20210330_sql_stats_persistence.md, line 95 at r7 (raw file):

1. at the end of a flush fixed interval (determined by a cluster setting).
1. when we experience memory pressure.
1. when a query takes unusually long to execute.

What is unusually long? That seems like a judgment call that requires context.


docs/RFCS/20210330_sql_stats_persistence.md, line 106 at r7 (raw file):

When DB Console issues fetch requests to CockroachDB node through HTTP endpoint,
the persisted statistics data can be fetched using `AS OF SYSTEM TIME -10s`

Why not use follower reads here?
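A sketch of what the follower-read variant would look like (table name taken from the draft schema; whether this is appropriate for the stats endpoint is exactly the open question):

``` SQL
-- Illustrative: read slightly stale statistics at a timestamp that can be
-- served by follower replicas, avoiding a round trip to the leaseholder.
SELECT * FROM system.sql_stmt_stats
  AS OF SYSTEM TIME follower_read_timestamp();
```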


docs/RFCS/20210330_sql_stats_persistence.md, line 113 at r7 (raw file):

this. Consequentially, this means that this option also inherits the
disadvantage of the existing designs, such as data-loss on crashes, inaccurate
stats if nodes become unavailable, etc.

Just for what is in memory, though, right? How much will be held in memory before it is flushed to disk?


docs/RFCS/20210330_sql_stats_persistence.md, line 121 at r7 (raw file):

``` SQL
CREATE TABLE system.sql_stmt_stats (
    last_run    TIMESTAMP NOT NULL,
```

We generally recommend that users use TIMESTAMPTZ so that nothing gets mixed up with time zones; is that relevant here?
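A minimal sketch of that suggestion (throwaway table name, only the relevant columns shown):

``` SQL
-- Illustrative: TIMESTAMPTZ removes ambiguity when nodes or clients run in
-- different time zones.
CREATE TABLE IF NOT EXISTS sketch_sql_stmt_stats (
    last_run    TIMESTAMPTZ NOT NULL,
    fingerprint INT NOT NULL,
    PRIMARY KEY (last_run, fingerprint)
);
```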


docs/RFCS/20210330_sql_stats_persistence.md, line 122 at r7 (raw file):

CREATE TABLE system.sql_stmt_stats (
    last_run    TIMESTAMP NOT NULL,
    fingerprint INT NOT NULL,

Should this be fingerprint_id? And a UUID? I believe we store transactions as UUIDs given high volumes and to avoid contention.


docs/RFCS/20210330_sql_stats_persistence.md, line 123 at r7 (raw file):

    last_run    TIMESTAMP NOT NULL,
    fingerprint INT NOT NULL,
    app_name    STRING NOT NULL,

What about database or schema?


docs/RFCS/20210330_sql_stats_persistence.md, line 124 at r7 (raw file):

    fingerprint INT NOT NULL,
    app_name    STRING NOT NULL,
    plan_hash   INT NOT NULL,

What is plan_hash? Is it an id for the logical plan? Will the same fingerprint with different plans be stored separately?


docs/RFCS/20210330_sql_stats_persistence.md, line 228 at r7 (raw file):

    PRIMARY KEY (last_run, fingerprint, plan_hash, app_name) 
      USING HASH WITH BUCKET_COUNT = 8,

very cool to see hash sharded indexes in use!


docs/RFCS/20210330_sql_stats_persistence.md, line 234 at r7 (raw file):

CREATE TABLE system.sql_txn_stats (
    last_run    TIMESTAMP NOT NULL,
    fingerprint INT NOT NULL,

Same ID comment as above.


docs/RFCS/20210330_sql_stats_persistence.md, line 283 at r7 (raw file):

            "max_retries":         { "type": "number" },
            "num_rows":            { "$ref": "#/definitions/numeric_stats" },
            "service_lat":         { "$ref": "#/definitions/numeric_stats" },

Nit: I think we should spell out latency. One way to think about this is that we will be reading these values forever, so it's worth making them super easy to consume, as opposed to saving a few keystrokes the one time we implement this.


docs/RFCS/20210330_sql_stats_persistence.md, line 340 at r7 (raw file):

The first two columns of the primary keys for both tables contain
`last_run` time stamp and `fingerprint`. The primary key utilizes hash-sharding
with 8 buckets. There are two reasons for this design:

Why 8 buckets? Will this need to change with data size?


docs/RFCS/20210330_sql_stats_persistence.md, line 373 at r7 (raw file):

### Example queries that can be used to answer query performance-related questions:

#### Querying attributes over a time period for a statement.

Should these be defined views that we implement on top of the table? The advantage is that it makes it easily discoverable in the SQL shell how to find the right info, and the views can still power the DB Console.
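A rough sketch of the idea (the view name and column list are mine, not from the RFC):

``` SQL
-- Illustrative: a predefined view over the persisted table that both the SQL
-- shell and the DB Console could query.
CREATE VIEW statement_stats_overview AS
  SELECT last_run, fingerprint, app_name, plan_hash
    FROM system.sql_stmt_stats;
```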


docs/RFCS/20210330_sql_stats_persistence.md, line 395 at r7 (raw file):

SELECT DISTINCT
  fingerprint,
  plan

I think this should be plan_id, per the comment above.


docs/RFCS/20210330_sql_stats_persistence.md, line 532 at r7 (raw file):

operation.

Also, the primary keys for statistics tables include the field for `node_id`,

Do we need to collect tenant_id? How does this all work in multi-tenant serverless? I'd like to see some discussion on that


docs/RFCS/20210330_sql_stats_persistence.md, line 566 at r7 (raw file):

  when GC job downsamples statistics, how much larger do we want to increase
  the aggregation interval by.
* `sql.stats.gc.max_agg_interval`: this settings dictates maximum interval for

Nit: s/settings/setting/


docs/RFCS/20210330_sql_stats_persistence.md, line 611 at r7 (raw file):

degrade our service quality gracefully.

* Read path: if we are to lose quorum, CockroahcDB will reject any future

Nit: CockroachDB


docs/RFCS/20210330_sql_stats_persistence.md, line 652 at r7 (raw file):

* Since stats entries are not mission-critical data for the operation of the
  database, we can perhaps tolerate a certain degree of inconsistency of data.

I don't think we can. It hurts our brand of always being consistent, and observability is a key part of a mission-critical application as well as of the quality customers perceive.

Member

@jordanlewis jordanlewis left a comment

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @awoods187, @Azhng, @bdarnell, @jordanlewis, @kevin-v-ngo, @knz, @maryliag, @rafiss, and @sumeerbhola)


docs/RFCS/20210330_sql_stats_persistence.md, line 142 at r16 (raw file):

Previously, kevin-v-ngo wrote…

Are we following any system table naming conventions by prefixing with 'sql'? If not, let's try and be consistent with our crdb_internal tables today for in-memory statistics. "system.statement_statistics" and "system.transaction_statistics".

+1


docs/RFCS/20210330_sql_stats_persistence.md, line 144 at r16 (raw file):

Previously, kevin-v-ngo wrote…

What does fingerprint_id map to in crdb_internal.node_statement_statistics? key? How does the user join the in-memory view we have today with this view?

Also curious about this.


docs/RFCS/20210330_sql_stats_persistence.md, line 251 at r16 (raw file):

    plan BYTES NOT NULL,

    PRIMARY KEY (aggregated_ts, fingerprint_id, plan_hash, app_name, node_id)

Why would we have plan_hash in the key? Wouldn't we want it in the value, so we could see how the plan changes given a fingerprint over time?

Collaborator

@rafiss rafiss left a comment

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @awoods187, @Azhng, @bdarnell, @jordanlewis, @kevin-v-ngo, @knz, @maryliag, @rafiss, and @sumeerbhola)


docs/RFCS/20210330_sql_stats_persistence.md, line 658 at r13 (raw file):

Previously, Azhng (Archer Zhang) wrote…

Makes sense. This is indeed a better approach. I added a section in the Future Work to incorporate it later. Though I couldn't find the name of this algo either :/

this algorithm sounds like exponential smoothing https://en.wikipedia.org/wiki/Exponential_smoothing

https://en.wikipedia.org/wiki/Moving_average#Exponential_moving_average is a more specific example.
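For reference, the recurrence both links describe is (my own formulation of the standard definition):

``` latex
% Exponentially weighted moving average of a statistic x_t,
% with smoothing factor 0 < \alpha \le 1:
s_0 = x_0, \qquad s_t = \alpha \, x_t + (1 - \alpha) \, s_{t-1}
```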


docs/RFCS/20210330_sql_stats_persistence.md, line 136 at r14 (raw file):

Previously, Azhng (Archer Zhang) wrote…

Interesting 🤔 . Though do we have any existing infrastructure set up that I can directly build on top of as of today? Or are you suggesting to build something new from scratch?

If it's the latter, then I don't think the SQL Observability team has enough capacity and bandwidth to take ownership of it for this release cycle.

I wasn't imagining anything complicated -- it could just be a new database that you make, that's only usable by root (or something). I could see this opening up many more problems than it's worth though!


docs/RFCS/20210330_sql_stats_persistence.md, line 579 at r14 (raw file):

Previously, Azhng (Archer Zhang) wrote…

how do we know what the oldest data is?

We can use the first_executed_at column. Since we don't overwrite existing rows that fall outside the current aggregation window (other than during the cleanup/compaction/downsampling operation), we will have multiple entries per fingerprint ID in our system table. This is what enables us to compare historical data for a specific fingerprint in the future. It also means that finding the oldest data is relatively simple, because the timestamp is the first column in our primary key.

Ah I see, so I guess I was assuming this was going to be cleaned up based on least-recently-used -- but it looks like instead it's more like a FIFO cleanup.
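A sketch of what that FIFO cleanup might look like (table name, ordering column, and batch size are assumptions on my part; CockroachDB's DELETE supports ORDER BY with LIMIT):

``` SQL
-- Illustrative: drop the oldest aggregation windows first once the table
-- exceeds its configured size budget.
DELETE FROM system.statement_statistics
  ORDER BY aggregated_ts ASC
  LIMIT 1024;
```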

@Azhng Azhng force-pushed the rfc-stats-persistence branch from e6f18f5 to 2dc0781 Compare May 20, 2021 01:43
Contributor Author

@Azhng Azhng left a comment

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @awoods187, @Azhng, @bdarnell, @jordanlewis, @kevin-v-ngo, @knz, @maryliag, @rafiss, and @sumeerbhola)


docs/RFCS/20210330_sql_stats_persistence.md, line 658 at r13 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

this algorithm sounds like exponential smoothing https://en.wikipedia.org/wiki/Exponential_smoothing

https://en.wikipedia.org/wiki/Moving_average#Exponential_moving_average is a more specific example.

Thanks! Added the link to the article.


docs/RFCS/20210330_sql_stats_persistence.md, line 136 at r14 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

I wasn't imagining anything complicated -- it could just be a new database that you make, that's only usable by root (or something). I could see this opening up many more problems than it's worth though!

Hmm. Well, I will defer this to either the Schema team or the Queries team to make this call 😛.


docs/RFCS/20210330_sql_stats_persistence.md, line 579 at r14 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

Ah I see, so I guess I was assuming this was going to be cleaned up based on least-recently-used -- but it looks like instead it's more like a FIFO cleanup.

Yep


docs/RFCS/20210330_sql_stats_persistence.md, line 144 at r16 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

Also curious about this.

Kevin and I chatted offline about this because Reviewable discarded all my comments when I published it for some reason. 😢

fingerprint_id maps to the statement_id column that currently exists in crdb_internal.node_statement_statistics. However, fingerprint_id is of type BYTES whereas statement_id has type STRING, so joining them might require a cast somewhere.

However, since we are already introducing a new system view that exposes cluster-level stats, we can just ensure that the types in the new system view and the system tables match. We can also deprecate the old node-local system view and tell users to use the new one.
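To make the cast concrete, a join between the two could look roughly like this (whether statement_id is the hex encoding of fingerprint_id is my assumption, not something the RFC states):

``` SQL
-- Illustrative: bridging the BYTES fingerprint_id in the persisted table and
-- the STRING statement_id in the in-memory view.
SELECT p.fingerprint_id, m.count
  FROM system.statement_statistics AS p
  JOIN crdb_internal.node_statement_statistics AS m
    ON encode(p.fingerprint_id, 'hex') = m.statement_id;
```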


docs/RFCS/20210330_sql_stats_persistence.md, line 251 at r16 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

Why would we have plan_hash in the key? Wouldn't we want it in the value, so we could see how the plan changes given a fingerprint over time?

We want it in the key precisely because we want to see plan changes for a given fingerprint_id. We do not use plan_hash when computing the fingerprint_id, which means that a statement with different plans will have the same fingerprint_id. So in order to distinguish between the same statement with different plans, we need to have the plan_hash in the primary key.

Member

@jordanlewis jordanlewis left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner, @awoods187, @Azhng, @bdarnell, @jordanlewis, @kevin-v-ngo, @knz, @maryliag, @rafiss, and @sumeerbhola)


docs/RFCS/20210330_sql_stats_persistence.md, line 251 at r16 (raw file):

Previously, Azhng (Archer Zhang) wrote…

We want it in the key precisely because we want to see plan changes for a given fingerprint_id. We do not use plan_hash when computing the fingerprint_id, which means that a statement with different plans will have the same fingerprint_id. So in order to distinguish between the same statement with different plans, we need to have the plan_hash in the primary key.

If it's in the (aggregate) value, you'd be able to see at what point the plan_hash changes by looking at which bucket the value changes, right? Maybe you'd want the aggregate function to be an ARRAY_AGG - basically store all values of plan hash that you see per fingerprint. It seems overkill to have it in the key, to me, since the key is the scaling parameter of the size of the data, isn't it?
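Concretely, the alternative being suggested might read something like this (my sketch; the grouping columns are taken from the r16 schema above):

``` SQL
-- Illustrative: keep plan_hash out of the primary key and instead collect the
-- distinct plan hashes observed per fingerprint and aggregation window.
SELECT aggregated_ts,
       fingerprint_id,
       array_agg(DISTINCT plan_hash) AS plan_hashes
  FROM system.statement_statistics
 GROUP BY aggregated_ts, fingerprint_id;
```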

Member

@yuzefovich yuzefovich left a comment

Nice work! As usual, some nits from me :)

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner, @awoods187, @Azhng, @bdarnell, @jordanlewis, @kevin-v-ngo, @knz, @maryliag, @rafiss, and @sumeerbhola)


docs/RFCS/20210330_sql_stats_persistence.md, line 249 at r16 (raw file):

Previously, kevin-v-ngo wrote…

BYTES or JSONB?

I think it should be BYTES because it is likely to be the encoded plan string (maybe the DistSQL diagram from EXPLAIN (DISTSQL)).
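For context, the statement that produces that diagram uses standard CockroachDB syntax:

``` SQL
-- Illustrative: EXPLAIN (DISTSQL) returns the physical plan diagram as an
-- encoded URL string, which is why BYTES seems like a natural storage type.
EXPLAIN (DISTSQL) SELECT * FROM system.users;
```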


docs/RFCS/20210330_sql_stats_persistence.md, line 25 at r17 (raw file):

this one-hour period, the user can query statistics stored in memory through the
DB Console. However, after the retention period for the collected statistics
expires, users are no longer to be able to access these statistics. There are a

nit: s/no longer to be able/no longer able/.


docs/RFCS/20210330_sql_stats_persistence.md, line 85 at r17 (raw file):

We will also introduce the following new cluster settings:
* `sql.stats.flush_interval`: this dictates how often each node flushes

nit: I want to point out that we already have several cluster settings with the sql.stats prefix, and those existing settings are about table statistics. I wonder whether changing the second word in the setting name is worth it, in order to not clash with table stats.


docs/RFCS/20210330_sql_stats_persistence.md, line 88 at r17 (raw file):

  stats to system tables.
* `sql.stats.memory_limit`: this setting limits the amount of statistics data
  each node stores locally in their memories.

nit: s/memories/memory/.


docs/RFCS/20210330_sql_stats_persistence.md, line 127 at r17 (raw file):

It is also worth noting that keeping the RPC fanout in this case will not make
the existing situation worse since the response from RPC fanout will still only
contain the statistics store in-memory in all other nodes.

nit: s/store/stored/.


docs/RFCS/20210330_sql_stats_persistence.md, line 552 at r17 (raw file):

### Writing in-memory stats to system tables

When we flush in-memory stats to a system table, we executed everything within

nit: s/executed/execute/.


docs/RFCS/20210330_sql_stats_persistence.md, line 697 at r17 (raw file):

* In order to retrieve the most up-to-date statistics that are yet to be
  flushed to system table, we would be fall back to using RPC fanout to contact

nit: s/would be fall/would fall/.


docs/RFCS/20210330_sql_stats_persistence.md, line 705 at r17 (raw file):

* Instead of deleting the oldest stats entries from the system table in the
  stage 5 of the flush operation, we can alternatively delete all stats in the

What is "stage 5"?

Contributor Author

@Azhng Azhng left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner, @awoods187, @Azhng, @bdarnell, @jordanlewis, @kevin-v-ngo, @knz, @maryliag, @rafiss, @sumeerbhola, and @yuzefovich)


docs/RFCS/20210330_sql_stats_persistence.md, line 251 at r16 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

If it's in the (aggregate) value, you'd be able to see at what point the plan_hash changes by looking at which bucket the value changes, right? Maybe you'd want the aggregate function to be an ARRAY_AGG - basically store all values of plan hash that you see per fingerprint. It seems overkill to have it in the key, to me, since the key is the scaling parameter of the size of the data, isn't it?

Hmm. I think the issue with that is that the execution stats for different plans of the same fingerprint would all be merged into a single entry. That makes it difficult for users to understand how the performance of a query has changed across different plan_hash values.


docs/RFCS/20210330_sql_stats_persistence.md, line 85 at r17 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: I want to point out that we already have several cluster settings with sql.stats prefix, and the existing settings are about the table statistics. I wonder whether changing the second word here in the setting name is worth it in order to not clash with table stats.

Hmm, I suppose we could call them sql.exec_stats, but we already use ExecStats to refer to the sampled statistics. I'm kind of out of ideas here (the only other name I can come up with is sql.sql_stats 🤦). Any name suggestions?


docs/RFCS/20210330_sql_stats_persistence.md, line 705 at r17 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

What is "stage 5"?

Oops, this should have been the "cleanup operation". "Stage 5" was referring to the last step of the flush operation in an earlier version of this RFC; it has since been removed and updated.

Fixed.

@Azhng Azhng force-pushed the rfc-stats-persistence branch from 2dc0781 to 7fe37aa Compare May 20, 2021 05:35
Collaborator

@rytaft rytaft left a comment

Reviewed 1 of 1 files at r18.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner, @awoods187, @Azhng, @bdarnell, @jordanlewis, @kevin-v-ngo, @knz, @maryliag, @rafiss, @sumeerbhola, and @yuzefovich)

@Azhng
Contributor Author

Azhng commented May 22, 2021

Big thanks to everyone who helped review and improve this RFC! Y'all are amazing ❤️

bors r+

Azhng added a commit to Azhng/cockroach that referenced this pull request May 22, 2021
This PR creates two new system tables: system.sql_statement_stats
and system.sql_transaction_stats per RFC cockroachdb#63752.

This is the initial step that addresses cockroachdb#64743.

Release note: None
@craig
Contributor

craig bot commented May 22, 2021

Build succeeded:

@craig craig bot merged commit 3cafc38 into cockroachdb:master May 22, 2021
Azhng added a commit to Azhng/cockroach that referenced this pull request Jun 23, 2021
This PR creates two new system tables: system.statement_stats
and system.transaction_stats per RFC cockroachdb#63752.

This is the initial step that addresses cockroachdb#64743.

Release note: None
Azhng added a commit to Azhng/cockroach that referenced this pull request Jul 5, 2021
This PR creates two new system tables: system.statement_stats
and system.transaction_stats per RFC cockroachdb#63752.

This is the initial step that addresses cockroachdb#64743.

Release note: None
craig bot pushed a commit that referenced this pull request Jul 6, 2021
65374: sql: create system tables for SQL stats r=Azhng a=Azhng

This PR creates two new system tables: system.sql_statement_stats
and system.sql_transaction_stats per RFC #63752.

This is the initial step that addresses #64743.

Release note: None

67175: cloud: bump orchestrator version to 21.1.5 r=JuanLeon1 a=JuanLeon1



67194: sqlsmith: make TLP predicates immutable r=mgartner a=mgartner

Previously, `GenerateTLP` could generate predicates that were not
immutable. For example, a predicate could include `random()`. The TLP
method only works correctly if predicates are immutable. A non-immutable
predicate can cause TLP to erroneously report a correctness bug.

Release note: None

Co-authored-by: Azhng <[email protected]>
Co-authored-by: Juan Leon <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>

Successfully merging this pull request may close these issues.

sql: statement stats persistence RFC