
rfcs: partial statistics collection #75625

Merged
merged 1 commit into from
Jun 1, 2022

Conversation

mgartner
Collaborator

Release note: None

@mgartner mgartner requested a review from a team January 27, 2022 18:44
@mgartner mgartner requested a review from a team as a code owner January 27, 2022 18:44
@cockroach-teamcity
Member

This change is Reviewable

Member

@yuzefovich yuzefovich left a comment

Nice work! Did a quick pass, and I'll defer to others for more careful review.

Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mgartner)


docs/RFCS/20220126_incremental_statistics_collection.md, line 88 at r1 (raw file):

This specifies that the database should use an index on a to collect statistics

nit: highlight column "a" as `a` throughout this paragraph.


docs/RFCS/20220126_incremental_statistics_collection.md, line 125 at r1 (raw file):

```sql
CREATE STATISTICS my_stat FROM t@a_b_idx
WITH OPTIONS INCREMENTAL GREATER THAN 1 AND LESS THAN 10
```

This incremental option defines the extreme values only for the first column in the index (a in this case), right? Do we want to allow constraining values for multiple columns from the index? Or is it captured by the CONSTRAINT option mentioned above?


docs/RFCS/20220126_incremental_statistics_collection.md, line 145 at r1 (raw file):

will return an error). Next, we will update the statistic with the new
incremental info (details on this below), and update the corresponding row in
system.table_statistics. To indicate that this statistic includes an incremental

nit: highlight the table name.


docs/RFCS/20220126_incremental_statistics_collection.md, line 446 at r1 (raw file):

decide to go the route of mutation sampling.

# Unresolved questions

For full stats collection we use inconsistent scans and AOST to reduce the impact on foreground traffic; do we plan to do the same for incremental stats? If we do, then we might want to delay the incremental stats job for some period (e.g., because we want the mutation queries that triggered the incremental stats collection to be visible to the scan).
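(For reference, full stats collection can already be pinned to a historical timestamp with AOST; the second statement below is only a sketch of how the same pattern might carry over to a partial collection, not settled syntax.)

```sql
-- Existing syntax: full stats collection at a historical timestamp, which is
-- how the automatic stats job reduces contention with foreground traffic.
CREATE STATISTICS full_stat FROM t AS OF SYSTEM TIME '-30s';

-- Hypothetical sketch: the same AOST pattern applied to a partial collection
-- (the INCREMENTAL option and its placement are illustrative only).
CREATE STATISTICS partial_stat ON a FROM t
    WITH OPTIONS INCREMENTAL
    AS OF SYSTEM TIME '-30s';
```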

Collaborator

@michae2 michae2 left a comment

This makes a lot of sense!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mgartner)


docs/RFCS/20220126_incremental_statistics_collection.md, line 86 at r1 (raw file):

```sql
CREATE STATISTICS my_stat ON a FROM t WITH OPTIONS INCREMENTAL
```

bikeshedding: This implementation (collecting stats for particular ranges of values) seems more like "partial" stats (in the "partial index" sense) rather than "incremental" stats (in the "incremental backup" sense) to me. What I mean is: the INCREMENTAL syntax makes me think we will scan the whole table and only collect stats on "new" values, like incremental backup. But this implementation has nothing to do with MVCC versions of a table. It uses logical conditions, similar to partial indexes. In fact, it could be extended to use any predicate, not just < or >, similar to partial indexes (and stats collected on other predicates might also be useful, if they match a query predicate). So personally I would try to use syntax matching partial indexes.
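(To make the analogy concrete, here is a side-by-side sketch; the partial-index statement is real syntax, while the predicate-based stats statement is purely illustrative.)

```sql
-- Real partial-index syntax: the WHERE predicate limits which rows are indexed.
CREATE INDEX t_recent_idx ON t (a) WHERE a > 100;

-- Hypothetical stats analogue in the same spirit: collect stats only over the
-- rows matching the predicate (illustration only, not proposed syntax).
CREATE STATISTICS my_stat ON a FROM t WHERE a > 100;
```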

docs/RFCS/20220126_incremental_statistics_collection.md, line 228 at r1 (raw file):

Alternatively, we require that users provide a specific value (or values) rather
than a range for incremental stats, and determine the range to scan from the
boundaries of the existing histogram bucket that the value belongs to. This
would ensure that existing and new bucket boundaries always line up.

This strategy could also be used when the user provides a range, by extending the range in both directions to bucket boundaries.

docs/RFCS/20220126_incremental_statistics_collection.md, line 243 at r1 (raw file):

best we can do in most cases is compare the previous values to the incremental
values, and if the previous values are smaller, we can replace them with the
incremental values. For example, if the previous values for the full table were

It's too bad we have to throw away useful information, just to splice the new stats into the old stats. (Especially since we're throwing away the most recent information to favor the old information!) If a query were only touching the area covered by the new stats, it would be nice if the optimizer could use the new stats regardless of whether the old stats had histograms or not.

Perhaps instead of updating the old stats we could create a new row in system.table_statistics (marked as incremental). Then the optimizer would have to decide how to combine the old and new stats for a given query. This would move that splicing logic from stats collection to planning, but would allow us to retain the new stats regardless of how the old stats looked. Additionally, system.table_statistics would then keep all of the incremental stats collections, allowing someone to query them to learn how a table had changed over time, or when incremental stats had been collected.
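(A rough sketch of what the retained-history idea might look like from the SQL level; SHOW STATISTICS is real syntax, but the marker column in the second query is hypothetical and the other column names are only approximate.)

```sql
-- Real today: inspect the stats the optimizer currently has for a table.
SHOW STATISTICS FOR TABLE t;

-- Sketch of the retained-history idea: each partial collection is a separate
-- row, distinguishable by a hypothetical marker such as "partialPredicate"
-- (does not exist today; other column names approximate system.table_statistics).
SELECT name, "columnIDs", "rowCount", "distinctCount", "createdAt", "partialPredicate"
  FROM system.table_statistics
 WHERE "tableID" = 52  -- placeholder: the table's descriptor ID
 ORDER BY "createdAt" DESC;
```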


docs/RFCS/20220126_incremental_statistics_collection.md, line 381 at r1 (raw file):

We don’t support maintenance of such backing samples today, so this solution
would require a lot more effort to implement than the solutions proposed in this
RFC.

I'm really glad we're not trying to keep live stats. This leads to query plans changing all the time without warning, not to mention requires adding complex coordination or additional writes to all mutations. From personal experience it's a pain to debug and help customers understand. The approach in this RFC seems like it will still be easy to understand, debug, and implement, like our current system.

Collaborator

@rharding6373 rharding6373 left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mgartner)


docs/RFCS/20220126_incremental_statistics_collection.md, line 228 at r1 (raw file):

than a range for incremental stats, and determine the range to scan from the
boundaries of the existing histogram bucket that the value belongs to. This
would ensure that existing and new bucket boundaries always line up.

Is there a situation in which we may want to split buckets during incremental stats collection?


docs/RFCS/20220126_incremental_statistics_collection.md, line 247 at r1 (raw file):

`row_count = 200` and `distinct_count = 20`, we can just update the counts to
match the incremental values. This is safe since we know that the new counts for
the full table will be at least as large as the counts from the subset of the

If the customer is also deleting rows from the table, perhaps at the other extreme end of the range, then this wouldn't necessarily be true, would it?

Collaborator

@rharding6373 rharding6373 left a comment

Nice writeup!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mgartner)

Collaborator

@rytaft rytaft left a comment

Thanks for the reviews, and thanks to @mgartner for co-authoring and getting this RFC posted!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mgartner, @michae2, @rharding6373, and @yuzefovich)


docs/RFCS/20220126_incremental_statistics_collection.md, line 86 at r1 (raw file):

Previously, michae2 (Michael Erickson) wrote…

bikeshedding: This implementation (collecting stats for particular ranges of values) seems more like "partial" stats (in the "partial index" sense) rather than "incremental" stats (in the "incremental backup" sense) to me. What I mean is: the INCREMENTAL syntax makes me think we will scan the whole table and only collect stats on "new" values, like incremental backup. But this implementation has nothing to do with MVCC versions of a table. It uses logical conditions, similar to partial indexes. In fact, it could be extended to use any predicate, not just < or >, similar to partial indexes (and stats collected on other predicates might also be useful, if they match a query predicate). So personally I would try to use syntax matching partial indexes.

I like the idea of using partial index syntax! Something like:

```sql
CREATE STATISTICS my_stat ON a FROM t WHERE a > 5 AND a < 10;
```

We'll have to think about how this would look with the phase 1 plan, though, where the range isn't explicitly specified. Maybe we could just say WITH OPTIONS PARTIAL instead of INCREMENTAL. Although if we do that, then we should probably also modify the range-based command for consistency, like this:

```sql
CREATE STATISTICS my_stat ON a FROM t WITH OPTIONS PARTIAL WHERE a > 5 AND a < 10;
```

docs/RFCS/20220126_incremental_statistics_collection.md, line 125 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

This incremental option defines the extreme values only for the first column in the index (a in this case), right? Do we wanna allow constraining values for multiple columns from the index? Or is it captured by CONSTRAINT option mentioned above?

A potential problem with constraining more than the first column is that then we won't be able to cleanly update the histogram for the first column (at least not until we support multi-column histograms).

We might want to consider making a special exception for hash and partitioned indexes where we allow constraining the second column but not the first.


docs/RFCS/20220126_incremental_statistics_collection.md, line 228 at r1 (raw file):

Previously, rharding6373 (Rachael Harding) wrote…

Is there a situation in which we may want to split buckets during incremental stats collection?

I think that would be desirable if there are a lot of rows added in a given range. We'd need to have some limit on the total number of buckets, though.


docs/RFCS/20220126_incremental_statistics_collection.md, line 228 at r1 (raw file):

Previously, michae2 (Michael Erickson) wrote…

This strategy could also be used when the user provides a range, by extending the range in both directions to bucket boundaries.

+1 I don't see why we need to require that the users provide specific value(s)


docs/RFCS/20220126_incremental_statistics_collection.md, line 243 at r1 (raw file):

Previously, michae2 (Michael Erickson) wrote…

It's too bad we have to throw away useful information, just to splice the new stats into the old stats. (Especially since we're throwing away the most recent information to favor the old information!) If a query were only touching the area covered by the new stats, it would be nice if the optimizer could use the new stats regardless of whether the old stats had histograms or not.

Perhaps instead of updating the old stats we could create a new row in system.table_statistics (marked as incremental). Then the optimizer would have to decide how to combine the old and new stats for a given query. This would move that splicing logic from stats collection to planning, but would allow us to retain the new stats regardless of how the old stats looked. Additionally, system.table_statistics would then keep all of the incremental stats collections, allowing someone to query them to learn how a table had changed over time, or when incremental stats had been collected.

That's a good idea. If we go this route we'd probably want to do the splicing when updating the stats cache to avoid redundant work in the optimizer.


docs/RFCS/20220126_incremental_statistics_collection.md, line 247 at r1 (raw file):

Previously, rharding6373 (Rachael Harding) wrote…

If the customer is also deleting rows from the table, perhaps at the other extreme end of the range, then wouldn't this not necessarily be true?

No, it's still true. Think about it this way:

Suppose that when the original stats were collected, values ranged from 0 to 100, and there were a total of 1000 rows.

When the incremental stats are collected, suppose values now range from 50 to 150 (in your scenario where the customer is deleting rows on one end of the range). For the incremental collection, we're only scanning from 100 to 150, but suppose there are a total of 2000 rows in this range. We know that there are at least 2000 rows in the whole range from 50 to 150, even though we only scanned the upper half. Therefore it's safe to update the row count from the old value of 1000 to the new value of 2000.


docs/RFCS/20220126_incremental_statistics_collection.md, line 381 at r1 (raw file):

Previously, michae2 (Michael Erickson) wrote…

I'm really glad we're not trying to keep live stats. This leads to query plans changing all the time without warning, not to mention requires adding complex coordination or additional writes to all mutations. From personal experience it's a pain to debug and help customers understand. The approach in this RFC seems like it will still be easy to understand, debug, and implement, like our current system.

👍


docs/RFCS/20220126_incremental_statistics_collection.md, line 446 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

For full stats collection we use the inconsistent scans and AOST in order to reduce the impact on the foreground traffic, do we plan to do the same for incremental stats? If we do, then we might want to delay the incremental stats job for some period (e.g. because we want mutation queries that triggered the incremental stats collection to be visible by the scan).

Yep, we should use the same approach that we use with full stats collections, where we delay the start of collection by the amount of time in the AOST clause

Member

@nvanbenschoten nvanbenschoten left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mgartner, @michae2, @rharding6373, and @yuzefovich)


docs/RFCS/20220126_incremental_statistics_collection.md, line 86 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

I like the idea of using partial index syntax! Something like:

CREATE STATISTICS my_stat ON a FROM t WHERE a > 5 AND a < 10;

We'll have to think about how this would look with the phase 1 plan, though, where the range isn't explicitly specified. Maybe we could just say WITH OPTIONS PARTIAL instead of INCREMENTAL. Although if we do that, then we should probably also modify the range-based command for consistency, like this:

CREATE STATISTICS my_stat ON a FROM t WITH OPTIONS PARTIAL WHERE a > 5 AND a < 10;

I had similar thoughts to @michae2. Upon hearing about this project, I imagined the implementation being something like 1) open a time-bound iterator between (last_stats_time, now-30s] on the primary index, 2) iterate while throttling, 3) sample values, 4) merge back in with the rest of the stats somehow. The "somehow" here is completely handwavy — I don't know whether the data model of table statistics supports incremental maintenance.

If it does, the key idea here would be that by only collecting a slice of stats for a particular window of MVCC time (i.e. the new versions), we can use a time-bound MVCC iterator. This allows us to avoid scanning most files in each range's LSM. As a first-order approximation for short enough time windows (in reality, this is all data-dependent), we've seen that this can reduce the cost of iteration by around two orders of magnitude, making it sufficiently cheap to run in an incremental manner.

Out of curiosity, have we explored similar ideas in the past? You discuss "mutation sampling" below, which seems related. The original SQL Optimizer Statistics RFC also discussed range-level compaction-time stats maintenance.

@rytaft
Collaborator

rytaft commented Jan 31, 2022


docs/RFCS/20220126_incremental_statistics_collection.md, line 86 at r1 (raw file):
This is a cool idea, but I'm a bit skeptical it will work (if I understand correctly how the time bound iterator works). In order for this to work, we'd need to maintain both positive and negative samples when performing the scan to account for the values that were both added and deleted, thus enabling a correct update of the histogram. That in itself isn't a problem, but the issue is in how we would find the negative samples. For each deleted or updated value seen by the iterator, we'd need to retrieve the previous value of that key to know which value to store as a negative sample, which (if it's even possible) seems like it would reduce the performance benefit of the time bound iterator.

This also wouldn't work for updating distinct counts, so we'd have to just assume that the fraction of distinct values didn't change. It's also not guaranteed to work if we add new types of stats (e.g., heavy hitters) in the future.

A similar idea that might work a bit better was proposed by @cucaroach -- he suggested using CDC to maintain up-to-date stats outside of the database (e.g., using an approach from one of the papers below for maintaining up-to-date stats over streaming data), and then periodically (e.g., when the q-error for the delta between the current and last inserted stats passes some threshold) update the system.table_statistics table with the updated stats. I think there is still a lot of complexity with this approach, but it could potentially be a useful future enterprise feature for customers who need very accurate stats.

Anyway, let me know your thoughts. I think it would be awesome if we could maintain perpetually up-to-date stats, but I think doing so is a lot more complicated than the partial collections proposed here. The benefit of the approaches in this RFC (especially phase 1) is that they fit in nicely with how we're already collecting stats and allow us to reuse most of that existing infrastructure. Phase 1 directly attacks a problem we've seen with customers who are continually adding new data at the extremes of a column (e.g., recent timestamps), so it should allow us to provide immediate customer value.

Out of curiosity, have we explored similar ideas in the past? You discuss "mutation sampling" below, which seems related.

@andy-kimball proposed this idea of mutation sampling back when we were first discussing a plan for automatic stats collection. We decided to go with the approach we use today since CREATE STATISTICS was already implemented, and it seemed lower risk.

Member

@nvanbenschoten nvanbenschoten left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @cucaroach, @mgartner, @michae2, @rharding6373, and @yuzefovich)


docs/RFCS/20220126_incremental_statistics_collection.md, line 86 at r1 (raw file):

For each deleted or updated value seen by the iterator, we'd need to retrieve the previous value of that key to know which value to store as a negative sample, which (if it's even possible) seems like it would reduce the performance benefit of the time bound iterator.

Good point! This is a similar problem to what we are running into in #69191. The idea that PR takes is to use a separate non-time-bound iterator to look up the previous version for each new version that the TBI finds. It does reduce the performance benefit of the time-bound iterator, but in the vast majority of cases, this effect should be marginal. Even with the second iterator, this approach still performs O(num_new_versions) work, as opposed to the O(num_all_versions) work that a full stats computation performs.

A similar idea that might work a bit better was proposed by @cucaroach -- he suggested using CDC to maintain up-to-date stats outside of the database

Aren't these two approaches analogous? A CDC-based approach will require the WITH DIFF option, which means that we'd be paying for the same per-version lookup either way.
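(For reference, the changefeed option in question looks roughly like this; the sink address is a placeholder and the stats-maintaining consumer on the other end is entirely hypothetical.)

```sql
-- Real changefeed syntax: `diff` emits the previous row value alongside each
-- change, which is exactly the negative sample a CDC-based stats maintainer
-- would need; `resolved` provides periodic watermarks it could checkpoint on.
CREATE CHANGEFEED FOR TABLE t
  INTO 'kafka://stats-maintainer:9092'   -- placeholder sink
  WITH diff, resolved = '30s';
```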

With either approach, the challenge seems to be incremental maintenance of stats.

The benefits I see to the periodic TBI scan approach are that 1) we avoid pushing any work on to writers, 2) we can sample earlier in the process (e.g. before hitting the network and even before the prev version lookup), 3) we can throttle earlier in the process, and 4) there are likely some efficiency gains from processing a batch of new versions at once. The benefit of a CDC-based approach would be latency (assuming we're updating system.table_statistics very frequently with the derived stats). There are some similarities here to the micro-batch processing vs stream processing split.

using an approach from one of the papers below for maintaining up-to-date stats over streaming data

Do you happen to have a link to this?

Member

@RaduBerinde RaduBerinde left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @cucaroach, @mgartner, @michae2, @rharding6373, and @yuzefovich)


docs/RFCS/20220126_incremental_statistics_collection.md, line 29 at r1 (raw file):

For non-extreme values, we can take a cue from the user in order to determine
which portion of an index to scan: which part of the index are they using for
their queries? If data is indexed by timestamp and they are primarily scanning

[nit] The timestamp situation described here should be covered by the min/max; it doesn't belong in the paragraph about non-extreme values.


docs/RFCS/20220126_incremental_statistics_collection.md, line 86 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

For each deleted or updated value seen by the iterator, we'd need to retrieve the previous value of that key to know which value to store as a negative sample, which (if it's even possible) seems like it would reduce the performance benefit of the time bound iterator.

Good point! This is a similar problem to what we are running into in #69191. The idea that PR takes is to use a separate non-time-bound iterator to look up the previous version for each new that the TBI finds. It does reduce the performance benefit of the time-bound iterator, but in the vast majority of cases, this effect should be marginal. Even with the second iterator, this approach still performs O(num_new_versions) work, as opposed to the O(num_all_versions) work that a full stats computation performs.

A similar idea that might work a bit better was proposed by @cucaroach -- he suggested using CDC to maintain up-to-date stats outside of the database

Aren't these two approaches analogous? A CDC-based approach will require the WITH DIFF option, which means that we'd be paying for the same per-version lookup either way.

With either approach, the challenge seems to be incremental maintenance of stats.

The benefits I see to the periodic TBI scan approach are that 1) we avoid pushing any work on to writers, 2) we can sample earlier in the process (e.g. before hitting the network and even before the prev version lookup), 3) we can throttle earlier in the process, and 4) there are likely some efficiency gains from processing a batch of new versions at once. The benefit of a CDC-based approach would be latency (assuming we're updating system.table_statistics very frequently with the derived stats). There are some similarities here to the micro-batch processing vs stream processing split.

using an approach from one of the papers below for maintaining up-to-date stats over streaming data

Do you happen to have a link to this?

I think there's value to the incremental MVCC-based approach, even if we only use it to update histograms. I could envision something where we still take full stats periodically, and do the incremental thing more frequently in-between. It all depends on how efficient the incremental scan is in practice. In any case, what is being proposed in this RFC is more lightweight, and we'd likely want to keep it even if we have the MVCC-based incremental approach one day.

As for naming, I agree that "incremental" in this RFC can be confusing, maybe we should rename to "index-based histogram updates" or similar.

Member

@RaduBerinde RaduBerinde left a comment

Nice work! It looks like a fairly contained project and it would address a lot of known pain points.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @cucaroach, @mgartner, @michae2, @rharding6373, and @yuzefovich)


docs/RFCS/20220126_incremental_statistics_collection.md, line 456 at r1 (raw file):

issues. We don’t want to waste work, however, so if two different nodes
simultaneously request an incremental refresh on the same range of the same
table, we should probably cancel one of the requests.

Increasing the traffic to the jobs table might also be an issue if we use a full-fledged job for this.

Collaborator

@rytaft rytaft left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @cucaroach, @mgartner, @michae2, @RaduBerinde, @rharding6373, and @yuzefovich)


docs/RFCS/20220126_incremental_statistics_collection.md, line 86 at r1 (raw file):

... It does reduce the performance benefit of the time-bound iterator, but in the vast majority of cases, this effect should be marginal....

Great to know -- sounds like this will be easier than I thought.

Aren't these two approaches analogous?

I'm not too familiar with how CDC works, but based on what you're saying it sounds like they are.

The benefits I see to the periodic TBI scan approach are...

Yeah, this does sound promising. I do agree with @RaduBerinde, though, that it would be good to start with something very lightweight like what we propose in this RFC, but I think we can add this TBI/MVCC and/or CDC approach as the next step. Perhaps "phase 3"...?

3) we can throttle earlier in the process

Can you clarify this?

Do you happen to have a link to this?

I added comments with links for all the papers. It's certainly not an exhaustive list of papers and there are probably things that we missed, but it's a start. 1, 5, and 7 are probably most relevant to this discussion.


docs/RFCS/20220126_incremental_statistics_collection.md, line 374 at r1 (raw file):

1. Gibbons, Phillip B., Yossi Matias, and Viswanath Poosala. "Fast incremental
   maintenance of approximate histograms." In _VLDB_, vol. 97, pp. 466-475. 1997.

http://www.mathcs.emory.edu/~cheung/Courses/584/Syllabus/papers/Histogram/Gibbons-fast-incr-histogram.pdf


docs/RFCS/20220126_incremental_statistics_collection.md, line 385 at r1 (raw file):

2. Gibbons, Phillip B., and Yossi Matias. "New sampling-based summary statistics
   for improving approximate query answers." In _Proceedings of the 1998 ACM
   SIGMOD international conference on Management of data_, pp. 331-342. 1998.

https://www.researchgate.net/profile/Yossi-Matias/publication/2634326_New_Sampling-Based_Summary_Statistics_for_Improving_Approximate_Query_Answers/links/0fcfd50baafa968854000000/New-Sampling-Based-Summary-Statistics-for-Improving-Approximate-Query-Answers.pdf


docs/RFCS/20220126_incremental_statistics_collection.md, line 395 at r1 (raw file):

3. Cormode, Graham, and Shan Muthukrishnan. "What's hot and what's not: tracking
   most frequent items dynamically." _ACM Transactions on Database Systems
   (TODS)_ 30, no. 1 (2005): 249-278.

http://www.mathcs.emory.edu/~cheung/Courses/584/Syllabus/papers/Histogram/2005-Cormode-Histogram.pdf


docs/RFCS/20220126_incremental_statistics_collection.md, line 404 at r1 (raw file):

4. Aboulnaga, Ashraf, and Surajit Chaudhuri. "Self-tuning histograms: Building
   histograms without looking at data." _ACM SIGMOD Record_ 28, no. 2 (1999):
   181-192.

https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.39.4864&rep=rep1&type=pdf


docs/RFCS/20220126_incremental_statistics_collection.md, line 418 at r1 (raw file):

   Muthukrishnan, and Martin J. Strauss. "Fast, small-space algorithms for
   approximate histogram maintenance." In _Proceedings of the thirty-fourth
   annual ACM symposium on Theory of computing_, pp. 389-398. 2002.

http://perso.ens-lyon.fr/pierre.borgnat/MASTER2/gilbert_ggikms_stoc2002.pdf


docs/RFCS/20220126_incremental_statistics_collection.md, line 429 at r1 (raw file):

6. Thaper, Nitin, Sudipto Guha, Piotr Indyk, and Nick Koudas. "Dynamic
   multidimensional histograms." In _Proceedings of the 2002 ACM SIGMOD
   international conference on Management of data_, pp. 428-439. 2002.

https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.132.2078&rep=rep1&type=pdf


docs/RFCS/20220126_incremental_statistics_collection.md, line 439 at r1 (raw file):

7. Donjerkovic, Donko, Yannis Ioannidis, and Raghu Ramakrishnan. Dynamic
   histograms: Capturing evolving data sets. University of Wisconsin-Madison
   Department of Computer Sciences, 1999.

https://minds.wisconsin.edu/bitstream/handle/1793/60206/TR1396.pdf?sequence=1


docs/RFCS/20220126_incremental_statistics_collection.md, line 456 at r1 (raw file):

Previously, RaduBerinde wrote…

Increasing the traffic to the jobs table might also be an issue if we use a full-fledged job for this.

Good point. We could consider just running incremental stats as a normal query, although then we wouldn't get the benefit that the jobs table provides of preventing multiple nodes from triggering stats at the same time.

@rytaft
Collaborator

rytaft commented Jan 31, 2022


docs/RFCS/20220126_incremental_statistics_collection.md, line 86 at r1 (raw file):

As for naming, I agree that "incremental" in this RFC can be confusing, maybe we should rename to "index-based histogram updates" or similar.

One advantage of this proposal is that it also supports accurate distinct count updates when a histogram is available. But agreed that we should change the name ... maybe "index-based partial stats collection"?

Contributor

@cucaroach cucaroach left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @cucaroach, @mgartner, @michae2, @RaduBerinde, @rharding6373, @rytaft, and @yuzefovich)


docs/RFCS/20220126_incremental_statistics_collection.md, line 456 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Good point. We could consider just running incremental stats as a normal query, although then we wouldn't get the benefit that the jobs table provides of preventing multiple nodes from triggering stats at the same time.

Should we have a "detached" mode like restore so the user could do either?

Contributor

@msirek msirek left a comment

Great feature idea and writeup!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @cucaroach, @mgartner, @michae2, @RaduBerinde, @rharding6373, @rytaft, and @yuzefovich)


docs/RFCS/20220126_incremental_statistics_collection.md, line 86 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

As for naming, I agree that "incremental" in this RFC can be confusing, maybe we should rename to "index-based histogram updates" or similar.

One advantage of this proposal is that it also supports accurate distinct count updates when a histogram is available. But agreed that we should change the name ... maybe "index-based partial stats collection"?

The benefit of range predicates would be speed, since a full scan is not needed, and covering whole contiguous ranges of values. If we support any predicate, even weird stuff like WHERE (a % 2) = 0, it would be good to see how this affects the building and merging of buckets.
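(To make the distinction concrete, a sketch assuming the predicate-based syntax discussed above, which is itself still hypothetical:)

```sql
-- A range predicate maps to a single contiguous span of the index on a, so only
-- that slice needs to be scanned (roughly: spans [/101 - /200]).
CREATE STATISTICS s1 ON a FROM t WHERE a > 100 AND a <= 200;

-- A predicate like (a % 2) = 0 cannot be turned into index spans: it would force
-- a scan of the whole index with per-row filtering, and the resulting buckets
-- would not line up with contiguous ranges of the existing histogram.
CREATE STATISTICS s2 ON a FROM t WHERE (a % 2) = 0;
```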


docs/RFCS/20220126_incremental_statistics_collection.md, line 228 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

+1 I don't see why we need to require that the users provide specific value(s)

+1 Supporting only whole pre-existing bucket ranges under the covers allows for easier merging using exact bucket replacement.


docs/RFCS/20220126_incremental_statistics_collection.md, line 243 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

That's a good idea. If we go this route we'd probably want to do the splicing when updating the stats cache to avoid redundant work in the optimizer.

This seems like a very straightforward and powerful way to keep stats updated. This approach could basically extend the automatic statistics feature with finer granularity. While scanning an index on column a values, access the spliced histogram for that column and detect when the actual number of rows in each bucket range deviates a certain percentage amount from what's recorded in the bucket. Then schedule incremental stats collection on the exact range of values for that bucket (or recalculate the bucket during the scan). Do this for all buckets covered by the scan. At the end, write out a new partial histogram for column a to system.table_statistics. When the stats cache is repopulated for this column, look at all histograms for this column since we've held onto the old ones, and for each bucket keep only the one with the most recent timestamp. Eventually, if none of the buckets in the original histogram are being used, that row in system.table_statistics can be deleted. Or, if only 20% of the buckets in that original row are used, schedule incremental stats collection on those buckets so that all of the buckets can now be covered by fresh stats. Maybe there would be issues with skew if some buckets start getting more rows than others and we no longer maintain the equi-depth property. Some other technique might be needed to rebalance rows between buckets. This might not be as lightweight as scanning only the max and min portions of an index, but maybe the scan could be done less frequently, or random smaller ranges of rows could be scanned.

Collaborator Author

@mgartner mgartner left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @cucaroach, @mgartner, @michae2, @RaduBerinde, @rharding6373, @rytaft, and @yuzefovich)


docs/RFCS/20220126_incremental_statistics_collection.md, line 86 at r1 (raw file):

Previously, msirek (Mark Sirek) wrote…

The benefit of range predicates would be speed since a full scan is not needed, and covering whole contiguous ranges of values. If we support any predicate, even weird stuff like WHERE (a % 2) = 0 it would be good to see how this effects building of buckets and merging of buckets.

I had a thought similar to @michae2 as well, but agree with @msirek that we'd have to be careful to only allow predicates that result in an efficient index scan. And we'd have to determine what the spans to scan actually are. Obviously the optimizer can do this, but CREATE STATISTICS statements circumvent the optimizer entirely, so I figured it'd be easier to restrict the syntax to expressions that will be easy to turn into index spans. But implementation details aside, the WHERE syntax feels more natural and would be preferable.

I'll throw some other syntax ideas into the pot:

```sql
-- Phase 1 syntax for scanning index extremes.
CREATE PARTIAL STATISTICS my_stat ON a FROM t AT EXTREMES

-- Or, get rid of "AT EXTREMES"?
CREATE PARTIAL STATISTICS my_stat ON a FROM t

-- Or, get rid of "PARTIAL" in favor of "AT EXTREMES".
CREATE STATISTICS my_stat ON a FROM t AT EXTREMES
```

docs/RFCS/20220126_incremental_statistics_collection.md, line 125 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

A potential problem with constraining more than the first column is that then we won't be able to cleanly update the histogram for the first column (at least not until we support multi-column histograms).

We might want to consider making a special exception for hash and partitioned indexes where we allow constraining the second column but not the first.

There's also the issue that in a multi-column index, new maximums/minimums are only guaranteed to be on the extreme ends of the index if they are in the first column of the index. For example, an INDEX (a, b) won't necessarily have maximum b values at the high end of the index. To find new extreme values of b, you'd have to scan every value of a. For hash-sharded and partitioned indexes, we can scan all possible shard/partition column values since there is a finite number of them. But we can't do the same for other multi-column indexes.
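(A tiny illustration of this point; the table and values are made up.)

```sql
-- Illustrative only. With INDEX (a, b), rows are ordered by a first, then b:
--   (1, 500), (1, 900), (2, 7), (3, 42)
-- The maximum b (900) sits in the middle of the index under a = 1, so scanning
-- only the high end of the index (a = 3) would miss it; finding new extremes of
-- b requires visiting every value of a. Hash-sharded and partitioned indexes are
-- the exception only because the leading shard/partition column has a small,
-- fixed set of values that can all be scanned.
CREATE TABLE t (a INT, b INT, INDEX a_b_idx (a, b));
```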


docs/RFCS/20220126_incremental_statistics_collection.md, line 228 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

I think that would be desirable if there are a lot of rows added in a given range. We'd need to have some limit on the total number of buckets, though.

I think we could follow a similar strategy as mentioned in the next paragraph: Split into new buckets with depths as close as possible to the depths of the original histogram's buckets, but at some point stop splitting if the number of histogram buckets reaches a limit.


docs/RFCS/20220126_incremental_statistics_collection.md, line 243 at r1 (raw file):

Previously, msirek (Mark Sirek) wrote…

This seems like a very straightforward and powerful way to keep stats updated. This approach could basically extend the automatic statistics feature with finer granularity. While scanning an index on column a values, access the spliced histogram for that column and detect when the actual number of rows in each bucket range deviates a certain percentage amount from what's recorded in the bucket. Then schedule incremental stats collection on the exact range of values for that bucket (or recalculate the bucket during the scan). Do this for all buckets covered by the scan. At the end, write out a new partial histogram for column a to system.table_statistics. When the stats cache is repopulated for this column, look at all histograms for this column since we've held onto the old ones, and for each bucket keep only the one with the most recent timestamp. Eventually, if none of the buckets in the original histogram are being used, that row in system.table_statistics can be deleted. Or, if only 20% of the buckets in that original row are used, schedule incremental stats collection on those buckets so that all of the buckets can now be covered by fresh stats. Maybe there would be issues with skew if some buckets start getting more rows than others and we no longer maintain the equi-depth property. Some other technique might be needed to rebalance rows between buckets. This might not be as lightweight as scanning only the max and min portions of an index, but maybe the scan could be done less frequently, or random smaller ranges of rows could be scanned.

I like the idea of moving the splicing/merging of full and partial statistics to a pre-processing step when the stats cache is populated. I'm already planning on adding some new infrastructure to give stats pre-processing a home. We already do some preprocessing that is similar: we create a NULL histogram bucket in-memory based on the null_count of a column. And I'll soon be adding in-memory histogram buckets when we can extrapolate new buckets for ascending keys.


docs/RFCS/20220126_incremental_statistics_collection.md, line 247 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

No, it's still true. Think about it this way:

Suppose that when the original stats were collected, values ranged from 0 to 100, and there were a total of 1000 rows.

When the incremental stats are collected, suppose values now range from 50 to 150 (in your scenario where the customer is deleting rows on one end of the range). For the incremental collection, we're only scanning from 100 to 150, but suppose there are a total of 2000 rows in this range. We know that there are at least 2000 rows in the whole range from 50 to 150, even though we only scanned the upper half. Therefore it's safe to update the row count from the old value of 1000 to the new value of 2000.

I think the key part is that we can use the new row_count and distinct_count, but we can't add the new counts with the old counts. In your example, imagine that 900 rows have a value of 0, and the rest are evenly distributed from 1 to 100, for a total of 1000 rows. Then all rows with values 0 to 50 are deleted, and 1000 rows are added in the range 101 to 150. Adding the old and new row counts would result in 2000 rows, when in reality there are 1050 rows.

So I think the next paragraph is not necessarily correct, unless we're ok with the misestimations that this can cause.


docs/RFCS/20220126_incremental_statistics_collection.md, line 385 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

https://www.researchgate.net/profile/Yossi-Matias/publication/2634326_New_Sampling-Based_Summary_Statistics_for_Improving_Approximate_Query_Answers/links/0fcfd50baafa968854000000/New-Sampling-Based-Summary-Statistics-for-Improving-Approximate-Query-Answers.pdf

Do you want me to add these links to the RFC?

Collaborator

@rytaft rytaft left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @cucaroach, @mgartner, @michae2, @RaduBerinde, @rharding6373, and @yuzefovich)


docs/RFCS/20220126_incremental_statistics_collection.md, line 247 at r1 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

I think the key part is that we can use the new row_count and distinct_count, but we can't add the new counts with the old counts. In your example, imagine that 900 rows have a value of 0, and the rest are evenly distributed from 1 to 100, for a total of 1000 rows. Then all rows with values 0 to 50 are deleted, and 1000 rows are added in the range 101 to 150. Adding the old and new row counts would result in 2000 rows, when in reality there are 1050 rows.

So I think the next paragraph is not necessarily correct, unless we're ok with the misestimations that this can cause.

I wasn't suggesting adding the row counts here, though, since I thought @rharding6373's comment was referring to the case of a general range scan when no histogram is available and adding the counts is not possible. For this case, in your example, the row_count wouldn't change, since the number of rows in the incremental collection is the same as the original row count, so we'd keep it the same.

But we do add rows when there is a histogram available or when we know we're scanning data that's never been scanned before, and you are right that we could see some inaccuracies if there were significant changes to portions of the table that were not scanned by incremental stats.

I think to handle this case we might want to go a step further than just starting the scan from the last minimum/maximum seen: we should also find the current minimum/maximum, and if it's greater/less than the previous, we should delete the old buckets accordingly.
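(For what it's worth, finding the current extremes is itself cheap when the column is indexed, so the partial-stats job could do this as a preliminary step.)

```sql
-- Real syntax: with an index on a, these resolve to cheap index-boundary
-- lookups, which the job could use to decide which stale buckets to drop.
SELECT min(a), max(a) FROM t;
```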


docs/RFCS/20220126_incremental_statistics_collection.md, line 385 at r1 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

Do you want me to add these links to the RFC?

Yes please!


docs/RFCS/20220126_incremental_statistics_collection.md, line 456 at r1 (raw file):

Previously, cucaroach (Tommy Reilly) wrote…

Should we have a "detached" mode like restore so the user could do either?

I think RESTORE still runs as a job whether or not DETACHED is used, so it wouldn't be quite the same. But we could consider something similar.

Contributor

@msirek msirek left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @cucaroach, @mgartner, @michae2, @RaduBerinde, @rharding6373, and @yuzefovich)


docs/RFCS/20220126_incremental_statistics_collection.md, line 86 at r1 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

I had a thought similar to @michae2 as well, but agree with @msirek that we'd have to be careful to only allow predicates that result in an efficient index scan. And we'd have to determine what the spans to scan actually are. Obvioulsy the optimizer can do this, but CREATE STATISTICS statements circumvent the optimizer entirely, so I figured it'd be easier to restrict the syntax to expressions that will be easy to turn into index spans. But implementation details aside, the WHERE syntax feels more natural and would be preferrable.

I'll throw some other syntax ideas into the pot:

-- Phase 1 syntax for scanning index extremes.
CREATE PARTIAL STATISTICS my_stat ON a FROM t AT EXTREMES

-- Or, get rid of "AT EXTREMES?
CREATE PARTIAL STATISTICS my_stat ON a FROM t

-- Or, get rid of "PARTIAL" in favor of "AT EXTREMES".
CREATE STATISTICS my_stat ON a FROM t AT EXTREMES

Do we have a common theme or guidance for new features we create towards flexibility vs. ease of use? For example, some companies may choose to make their SQL extensions as flexible as possible and not limit what the user can do, but then write guides on how to best utilize the feature so it performs well. This path would give the most options but require more customer education. Does anyone know if CRL has such a theme or guidance for new features?

Collaborator

@rytaft rytaft left a comment

Thanks for all the great reviews and discussion, and sorry it took us so long to update the doc. I think I've addressed all the comments.

Reviewed 1 of 1 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @cucaroach, @michae2, @rharding6373, and @yuzefovich)


docs/RFCS/20220126_incremental_statistics_collection.md line 29 at r1 (raw file):

Previously, RaduBerinde wrote…

[nit] The timestamp situation described here should be covered by the min/max, it doesn't belong in the paragraph about non-extreme values

Done.


docs/RFCS/20220126_incremental_statistics_collection.md line 86 at r1 (raw file):

Previously, msirek (Mark Sirek) wrote…

Do we have a common theme or guidance for new features we create towards flexibility vs. ease of use? For example, some companies may choose to make their SQL extensions as flexible as possible and not limit what the user can do, but then write guides on how to best utilize the feature, so it performs well. This path this would give the most options but require more customer education. Does anyone know if CRL has such a theme or guidance for new features?

Looks like your comment got cut off, @msirek. But if I understand your question, we don't have any hard and fast rule about flexibility v. ease of use, as far as I know. In general I think our preference is to err on the side of ease of use, but provide some options for power users to have flexibility as well.

For this feature, I think the WHERE syntax (for Phase 2) would give us the most flexibility and would be natural for users. To ensure that only predicates that fully constrain the index are used, we can return an error in all other cases, similar to #80499.

For Phase 1, I don't feel strongly about whether we use PARTIAL or AT EXTREMES or some combination. I changed the examples to use both, but I don't think we need to decide on the final syntax here.

I think I've now made changes throughout the doc that cover most of the discussion in this thread.


docs/RFCS/20220126_incremental_statistics_collection.md line 88 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: highlight column "a" as a throughout this paragraph.

Done.


docs/RFCS/20220126_incremental_statistics_collection.md line 125 at r1 (raw file):
I added a clarification:

To start, we will only allow predicates that constrain either the first index
column or the first column after any hash or partition columns. This restriction
is needed to ensure that we can accurately update the histogram for the constrained
column.
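(A sketch of what that restriction means for a hash-sharded index; the index statement is real syntax, while the predicate-based stats statement remains hypothetical.)

```sql
-- Real syntax: a hash-sharded index stores a hidden shard column ahead of a,
-- so a is the first column after the hash column.
CREATE INDEX t_a_hash_idx ON t (a) USING HASH;

-- Hypothetical partial-stats statement allowed under the restriction above,
-- since the predicate constrains a, the first column after the shard column.
CREATE STATISTICS s ON a FROM t WHERE a > 100;
```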


docs/RFCS/20220126_incremental_statistics_collection.md line 145 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: highlight the table name.

Done.


docs/RFCS/20220126_incremental_statistics_collection.md line 228 at r1 (raw file):

Previously, msirek (Mark Sirek) wrote…

+1 Supporting whole pre-existing bucket ranges only under the covers allows for easier merging using exact bucket replacement.

Done.


docs/RFCS/20220126_incremental_statistics_collection.md line 228 at r1 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

I think we could follow a similar strategy as mentioned in the next paragraph: Split into new buckets with depths as close as possible to the depths of the original histogram's buckets, but at some point stop splitting if the number of histogram buckets reaches a limit.

Done.


docs/RFCS/20220126_incremental_statistics_collection.md line 243 at r1 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

I like the idea of moving the splicing/merging of full and partial statistics to a pre-process step when the stats cache is populated. I'm already planning on adding some new infrastructure to give stats pre-process a home. We already do some preprocessing that is similar: we create a NULL histogram bucket in-memory based on the null_count of a column. And I'll soon be adding in-memory histogram buckets when we can extrapolate new buckets for ascending keys.

I changed the text to mention that the splicing would happen when populating the stats cache.


docs/RFCS/20220126_incremental_statistics_collection.md line 247 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

I wasn't suggesting adding the row counts here, though, since I thought @rharding6373's comment was referring to the case of a general range scan when no histogram is available and adding the counts is not possible. For this case, in your example, the row_count wouldn't change, since the number of rows in the incremental collection is the same as the original row count, so we'd keep it the same.

But we do add rows when there is a histogram available or when we know we're scanning data that's never been scanned before, and you are right that we could see some inaccuracies if there were significant changes to portions of the table that were not scanned by incremental stats.

I think to handle this case we might want to go a step further than just starting the scan from the last minimum/maximum seen: we should also find the current minimum/maximum, and if it's greater/less than the previous, we should delete the old buckets accordingly.

Added some text to this effect.


docs/RFCS/20220126_incremental_statistics_collection.md line 374 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

http://www.mathcs.emory.edu/~cheung/Courses/584/Syllabus/papers/Histogram/Gibbons-fast-incr-histogram.pdf

Done.


docs/RFCS/20220126_incremental_statistics_collection.md line 385 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Yes please!

Done.


docs/RFCS/20220126_incremental_statistics_collection.md line 395 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

http://www.mathcs.emory.edu/~cheung/Courses/584/Syllabus/papers/Histogram/2005-Cormode-Histogram.pdf

Done.


docs/RFCS/20220126_incremental_statistics_collection.md line 404 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.39.4864&rep=rep1&type=pdf

Done.


docs/RFCS/20220126_incremental_statistics_collection.md line 418 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

http://perso.ens-lyon.fr/pierre.borgnat/MASTER2/gilbert_ggikms_stoc2002.pdf

Done.


docs/RFCS/20220126_incremental_statistics_collection.md line 429 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.132.2078&rep=rep1&type=pdf

Done.


docs/RFCS/20220126_incremental_statistics_collection.md line 439 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

https://minds.wisconsin.edu/bitstream/handle/1793/60206/TR1396.pdf?sequence=1

Done.


docs/RFCS/20220126_incremental_statistics_collection.md line 446 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Yep, we should use the same approach that we use with full stats collections, where we delay the start of collection by the amount of time in the AOST clause

Done.


docs/RFCS/20220126_incremental_statistics_collection.md line 456 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

I think RESTORE still runs as a job whether or not DETACHED is used, so it wouldn't be quite the same. But we could consider something similar.

Done.

@rytaft rytaft changed the title rfcs: incremental statistics collection rfcs: partial statistics collection May 8, 2022
@rytaft rytaft force-pushed the rfc-incremental-statistics branch from b9ccc6e to 723c086 Compare May 8, 2022 01:25
Collaborator

@rytaft rytaft left a comment

Reviewed 1 of 1 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @cucaroach, @michae2, @rharding6373, and @yuzefovich)

Collaborator

@rytaft rytaft left a comment

I plan to merge this at the end of this coming week. Please provide any additional comments before then.

@msirek since you'll be working on this (at least Phase 1), please make sure you agree with the plan or suggest improvements.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @cucaroach, @michae2, @rharding6373, and @yuzefovich)

Release note: None

Co-authored-by: Rebecca Taft <[email protected]>
@rytaft rytaft force-pushed the rfc-incremental-statistics branch from 723c086 to c07cfd3 Compare June 1, 2022 17:45
Collaborator

@rytaft rytaft left a comment

Seeing no objections I'm going to go ahead and merge this.

bors r+

Reviewed 1 of 1 files at r4, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @cucaroach, @michae2, @rharding6373, and @yuzefovich)

@craig
Contributor

craig bot commented Jun 1, 2022

Build succeeded:

@craig craig bot merged commit f026bc1 into cockroachdb:master Jun 1, 2022
@mgartner mgartner deleted the rfc-incremental-statistics branch August 8, 2022 20:03