sql: implement sql stats compaction for clean up job #69302
Adding some additional context for this issue, as a follow-up to the original RFC. As of #68401, we started enforcing a limit on the maximum number of rows in the system tables that persist SQL statistics. As a running example, suppose a workload generates 1500 unique statement fingerprints per hour and the row cap is 1 million rows (roughly a 1 GB storage cap).
Note: without bumping the storage cap, the amount of history we can retain decreases as the number of nodes in the cluster increases. This is because traffic is routed to the cluster through a load balancer, so almost all of the 1500 unique fingerprints will be present in the in-memory store of every node, and each node flushes its stats to disk periodically. Since the primary key of the table includes the node ID, each node writes its own copy of every fingerprint's row. The number of nodes in the cluster therefore becomes an amplification factor and can limit the usefulness of the new persisted SQL stats system.

### Compaction across node boundary

A very natural first step here would be compacting statistics across different nodes. To reuse the previous example, this means that with the same 1 GB storage cap we can store up to 27 days (3.8 weeks) worth of historical data regardless of the cluster size!

### Implementation

The implementation of this is rather simple: merge the rows that differ only in their node ID into a single row per aggregation interval and fingerprint. Since we already have a scheduled job set up for this cleanup, we can hook into that job directly.
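To make the shape of the merge concrete, here is a minimal sketch. It assumes a simplified placeholder schema (a `stats` table keyed on `aggregated_ts`, `fingerprint_id`, `app_name`, `node_id`, with plain numeric counters) rather than the real `system.statement_statistics` schema, whose counters live in a JSON payload and would therefore need to be combined in Go instead of with `SUM()`:

```go
package compaction

import (
	"context"
	"database/sql"
	"time"
)

// compactInterval merges rows that differ only in their node column into a
// single row stored under a sentinel node ID (0 here), then removes the
// per-node originals in the same transaction. The "stats" table and its
// numeric columns are simplified placeholders for illustration only.
func compactInterval(ctx context.Context, db *sql.DB, cutoff time.Time) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op if the transaction commits

	// Fold every node's row for a given (interval, fingerprint, app) into one.
	if _, err := tx.ExecContext(ctx, `
INSERT INTO stats (aggregated_ts, fingerprint_id, app_name, node_id, exec_count, total_latency)
SELECT aggregated_ts, fingerprint_id, app_name, 0, sum(exec_count), sum(total_latency)
FROM stats
WHERE aggregated_ts < $1 AND node_id != 0
GROUP BY aggregated_ts, fingerprint_id, app_name
ON CONFLICT (aggregated_ts, fingerprint_id, app_name, node_id) DO UPDATE
SET exec_count = stats.exec_count + excluded.exec_count,
    total_latency = stats.total_latency + excluded.total_latency`, cutoff); err != nil {
		return err
	}

	// The per-node rows are now redundant; delete them.
	if _, err := tx.ExecContext(ctx, `
DELETE FROM stats WHERE aggregated_ts < $1 AND node_id != 0`, cutoff); err != nil {
		return err
	}
	return tx.Commit()
}
```

Running the insert and the delete in a single transaction keeps the table consistent if the job is interrupted partway through an interval.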
### Self Throttling

Although the new implementation outlined above can largely avoid contending with the foreground traffic, we will be consuming a lot more compute and IO resources to perform this compaction. In addition, we would be issuing transactions that delete a large number of keys, which has negative performance implications. Careful steps need to be taken to avoid excessive CPU/IO consumption during the compaction, which could otherwise cause latency spikes in the foreground traffic. A few safeguards should be built into the implementation above to keep the background job non-disruptive; one such safeguard is sketched below.
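One such safeguard, shown here with illustrative knobs (`batchSize` and `pause` are not existing cluster settings, and `stats` is the same placeholder table as above), is to delete in small, paced batches rather than in one large transaction:

```go
package compaction

import (
	"context"
	"database/sql"
	"time"
)

// deleteInBatches removes already-merged per-node rows in small chunks and
// pauses between chunks, so the cleanup never issues one giant transaction or
// monopolizes CPU/IO. batchSize and pause are illustrative knobs; a real
// implementation would likely read them from cluster settings.
func deleteInBatches(ctx context.Context, db *sql.DB, cutoff time.Time, batchSize int, pause time.Duration) error {
	for {
		// LIMIT bounds the number of keys (and write intents) touched per
		// statement, keeping each transaction small.
		res, err := db.ExecContext(ctx,
			`DELETE FROM stats WHERE aggregated_ts < $1 AND node_id != 0 LIMIT $2`,
			cutoff, batchSize)
		if err != nil {
			return err
		}
		affected, err := res.RowsAffected()
		if err != nil {
			return err
		}
		if affected == 0 {
			return nil // nothing left to clean up for this cutoff
		}
		// Yield between batches so foreground traffic is not starved.
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(pause):
		}
	}
}
```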
### Downsampling

This section discusses a potential implementation of downsampling. It builds on top of the compaction technique described in the previous section and provides more detail for the original RFC. The goal of downsampling is to use the storage space even more efficiently so that we can store more statistics. With the technique described in the previous sections, we will already be able to store more than 3.8 weeks' worth of statistics under a cap of 1 million rows per table (estimated to be 1 GB * replication factor). The relevant quantities are the row cap, the number of unique fingerprints generated per aggregation interval, and the length of the aggregation interval itself.
Currently, the maximum amount of time we can retain statistics is roughly the row cap divided by the number of rows written per aggregation interval, multiplied by the length of that interval (the rows written per interval being the unique fingerprint count, further multiplied by the node count if rows are not merged across nodes). For downsampling we additionally need to describe how the aggregation interval coarsens with the age of the data.
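As a rough illustration of that arithmetic using the numbers from this thread (the helper and variable names below are mine, not from the RFC or the codebase):

```go
package compaction

import "time"

// maxRetention estimates how long persisted stats fit under the row cap.
// Before cross-node compaction every node writes its own row per fingerprint,
// so retention shrinks linearly with cluster size; after compaction the node
// count is effectively 1.
func maxRetention(rowCap, fingerprintsPerInterval, nodeCount int, aggInterval time.Duration) time.Duration {
	rowsPerInterval := fingerprintsPerInterval * nodeCount
	intervals := rowCap / rowsPerInterval
	return time.Duration(intervals) * aggInterval
}

// With the numbers used in this thread: 1,000,000 / 1,500 ≈ 666 one-hour
// intervals ≈ 27 days (the ~3.8 weeks quoted above) once rows are merged
// across nodes, but only ~66 hours on a 10-node cluster without the merge.
```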
### Idea

Currently, both statement and transaction statistics are stored at a single, fixed aggregation interval. The idea is to re-aggregate older rows into progressively coarser intervals. This means we would have different segments of data stored at different resolutions, where the fresher the data, the more fine-grained its resolution. To use our previous example, assume the workload generates 1500 unique fingerprints per hour; because each coarser segment costs progressively fewer rows per hour of wall-clock time, the total amount of time we can store statistics for grows well beyond the flat-resolution limit. To give a concrete example, suppose our initial data resolution is 1 hour. In this scheme, we would store the initial week (150 hours) of data at a 1 hour interval; once the data is older than that, it would be stored at a 2 hour interval, and we would keep doubling the coarseness of the data resolution as the statistics become stale.

### Implementation

Implementing this downsampling scheme is straightforward. We can leverage the same infrastructure described in the previous section for job scheduling and self-throttling. The remaining piece is to compute the list of time intervals to re-aggregate based on the age of the data, as sketched below.
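A minimal sketch of that interval computation, assuming one particular reading of the doubling scheme (each successive 150-hour segment of age doubles the interval); the helper functions and the segment length are illustrative, not existing APIs:

```go
package compaction

import "time"

// resolutionFor returns the aggregation interval to use for data of a given
// age, doubling the coarseness for every segmentLen of age beyond the newest
// segment. With base = 1h and segmentLen = 150h this mirrors the example
// above: the newest 150 hours stay at 1-hour resolution, the next segment is
// re-aggregated to 2 hours, then 4 hours, and so on.
func resolutionFor(age, segmentLen, base time.Duration) time.Duration {
	res := base
	if segmentLen <= 0 {
		return res
	}
	for age >= segmentLen {
		age -= segmentLen
		res *= 2
	}
	return res
}

// bucketStart rounds an aggregated timestamp down to the start of its
// (possibly coarsened) interval; the downsampling pass would group rows by
// this value before merging them, reusing the cross-node merge machinery.
func bucketStart(ts time.Time, res time.Duration) time.Time {
	return ts.Truncate(res)
}
```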
Currently, the SQL Stats cleanup job implements a very simple heuristic that deletes the rows with the oldest `aggregated_ts`. We would eventually want to implement compaction of the older rows by merging multiple rows into a single row, which would let us store more data for a longer duration.
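For reference, the heuristic described here amounts to a bounded delete of the oldest aggregation windows, roughly like the placeholder statement below (not the job's actual query):

```go
package compaction

// deleteOldestBatch sketches the existing heuristic: drop the rows with the
// oldest aggregated_ts, a batch at a time, until the table is back under its
// row cap. The table name and statement are placeholders, not the real
// cleanup query.
const deleteOldestBatch = `
DELETE FROM stats
ORDER BY aggregated_ts ASC
LIMIT $1
`
```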
Jira issue: CRDB-9536