# Row Level TTL

* Feature Name: Row-level TTL
* Status: in-progress
* Start Date: 2021-12-14
* Authors: Oliver Tan
* RFC PR: [#75189]
* Cockroach Issue: [#20239]

# Summary

Row-level "time to live" (TTL) is a mechanism in which rows from a table
automatically get deleted once they surpass an expiration time (the "TTL").
This has been a [commonly requested feature][#20239].

This RFC proposes a CockroachDB-level mechanism to support row-level TTL,
where rows will be deleted after a certain period of time. As a further
extension in a later release, rows will automatically be hidden once their
TTL expires.

The following declarative syntax will initialize a table with row-level TTL:

```sql
CREATE TABLE tbl (
  id INT PRIMARY KEY,
  text TEXT
) TTL '5 minutes'
```

By implementing row-level TTL, we save developers from writing a complex
scheduled job and additional application logic to handle this functionality.
It also brings us to parity with other database systems which support this
feature.

**Note**: this does NOT cover partition-level deletes, where deletes can be
"accelerated" by deleting entire "partitions" at a time.

# Motivation

Today, developers who need row-level TTL must roll their own mechanism for
deleting rows, as well as add application logic to filter out the expired
rows.

The job itself can get complex to write. We have a [guide][cockroach TTL advice]
which was difficult to perfect: when we implemented it ourselves, we found
multiple performance-related issues.
Developers have to implement and manage several knobs to balance deletion
time against performance on foreground traffic (traffic from the
application):

* how to shard out the deletes and run them in parallel
* how many rows to SELECT and DELETE at once
* a rate limit to control the rate of SELECTs and DELETEs

Furthermore, developers must productionize the jobs themselves, adding more
moving pieces to their production environment.

Row-level TTL is requested often enough to warrant native support in
CockroachDB, sparing developers another moving part and a swath of complex
logic.

# Technical design

## Syntax and Table Metadata

### Table Creation

TTLs are defined on a per-table basis. Developers can create a table with a
TTL using the following syntax:

```sql
CREATE TABLE tbl (
  id INT PRIMARY KEY,
  text TEXT
) TTL
```

Users can add additional options to control their TTL (see the full list of
options below):

```sql
CREATE TABLE tbl (
  id INT PRIMARY KEY,
  text TEXT
) TTL (expiration_time = '5 minutes', delete_batch_size = 1, job_cron = '* * 1 * *')
```

This automatically creates a repeating scheduled job for the given table, as
well as adding the HIDDEN column `crdb_internal_expiration` which represents
the TTL:

```sql
CREATE TABLE tbl (
  id INT PRIMARY KEY,
  text TEXT,
  crdb_internal_expiration TIMESTAMPTZ NOT NULL DEFAULT current_timestamp() + '5 minutes' ON UPDATE current_timestamp() + '5 minutes'
) TTL (expiration_time = '5 minutes')
```

A user can instead designate an existing column as the TTL expiration column
by setting the `expiration_column` option:

```sql
CREATE TABLE tbl (
  id INT PRIMARY KEY,
  text TEXT,
  other_ts_field TIMESTAMPTZ
) TTL (expiration_column = 'other_ts_field')
```

TTL metadata is stored on the TableDescriptor:

```protobuf
message TableDescriptor {
  // …
  message TTL {
    option (gogoproto.equal) =
true;

    message RowLevelTTL {
      // DeleteBatchSize is the number of rows to delete in each batch.
      optional int64 delete_batch_size = 1; // defaults to 100
      // SelectBatchSize is the number of rows to select at a time.
      optional int64 select_batch_size = 2; // defaults to 500
      // MaximumRowsDeletedPerSecond controls the number of rows to delete per second.
      optional int64 maximum_rows_deleted_per_second = 3; // defaults to 0 (no limit)
      // RangeConcurrency controls the number of ranges to delete from at a time.
      optional int64 range_concurrency = 4; // defaults to 8
    }

    // DurationExpr is the automatically assigned interval for when the TTL should apply to a row.
    optional string duration_expr = 1 [(gogoproto.nullable) = false];
    optional string deletion_cron = 2 [(gogoproto.nullable) = false];
    oneof details {
      RowLevelTTL row_level_ttl = 3 [(gogoproto.customname) = "RowLevelTTL"];
      // reserved for partitioning
    }
  }
  optional TTL ttl = 47 [(gogoproto.customname) = "TTL"];
}
```

#### User override for the `crdb_internal_expiration` column

In an ideal world, the user could simply update the `crdb_internal_expiration`
value to "override" the TTL. However, with our current defaults, this value
would get overwritten whenever the row is updated, due to the `ON UPDATE`
clause. The `ON UPDATE` expression should ideally be:

```sql
CASE WHEN crdb_internal_expiration IS NULL
-- NULL means the row should never be TTL'd
THEN NULL
-- otherwise, take whichever is larger: the user-set crdb_internal_expiration
-- value or an updated TTL based on current_timestamp().
ELSE greatest(crdb_internal_expiration, current_timestamp() + 'table ttl value')
END
```

However, due to the limitation that `ON UPDATE` cannot reference a table
expression, this cannot yet be implemented. This can change when we implement
[triggers](https://github.com/cockroachdb/cockroach/issues/28296).
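To illustrate the limitation, consider a manual override attempt (a sketch
only; the row values are hypothetical, and it assumes the default
`crdb_internal_expiration` column from above):

```sql
-- Hypothetical: a user pushes one row's expiration out a year.
UPDATE tbl SET crdb_internal_expiration = current_timestamp() + '1 year'
WHERE id = 1;

-- Caveat: with the current defaults, any later UPDATE to this row
-- re-evaluates the ON UPDATE expression and resets the expiration,
-- silently discarding the override.
UPDATE tbl SET text = 'updated' WHERE id = 1;
```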
A workaround may be planned after the first iteration, but it is not deemed a
blocking feature for the first iteration of TTL.

### Applying or Altering TTL for a table

TTL can be configured using `ALTER TABLE`:

```sql
ALTER TABLE tbl SET TTL; -- introduce a TTL. no-op if there is already a TTL
ALTER TABLE tbl SET TTL (expiration_column = 'other_ts_field');
-- introduce a TTL with the given option OR apply the given option to the TTL
ALTER TABLE tbl SET TTL (expiration_column = NULL); -- drop an option from the TTL
```

Note that initially, any job-related options applied to the TTL will not take
effect while the job is running; the user must restart the job for the
settings to apply. A HINT will be displayed to the user when this is
required.

### Dropping TTL for a table

TTL can be dropped using `ALTER TABLE`:

```sql
ALTER TABLE tbl DROP TTL
```

The DROP will not drop any TTL columns. However, re-applying TTL to the table
will not succeed while the old column remains, as doing so may risk expiring
all the rows immediately. Instead, the user must either DROP the existing TTL
column or specify it via the `expiration_column` option. A hint will be
provided detailing this.

## The Deletion Job

When a TTL table is created, we create a SCHEDULED JOB which defaults to
running daily. Row deletion is handled by this scheduled job in the jobs
framework; the job gets created (or modified) by the syntax detailed in the
previous section. The deletes are SQL-based, meaning they will show up on
changefeeds as regular deletes issued by the root user.

### Job Algorithm

On each job invocation, the deletion job performs the following:
* Establish a "start time".
* Paginate over all ranges for the table, allocating them to a "range worker
  pool".
* Each range worker then performs the following:
  * Traverse the range in order of PRIMARY KEY:
    * Issue a SQL SELECT query AS OF SYSTEM TIME '-30s' for up to
      `select_batch_size` PRIMARY KEY entries whose TTL column is less than
      the established start time (note we must establish the start time
      before running, or the job could potentially never finish).
    * Issue a SQL DELETE for up to `delete_batch_size` entries from the
      table, using the PRIMARY KEYs selected above.

### Admission Control

To ensure the deletion job does not affect foreground traffic, we plan on
using [admission control] on the SQL transactions at the lowest possible
priority (-256).

From previous experimentation, this largely reduces the number of knobs users
need to configure, making this a "configure once, walk away" process which
does not need heavy monitoring.

### Rate Limiting

In an ideal world, admission control would reject traffic and force a backoff
and retry when it detects heavy load, and rate limiting would not be needed.
However, we will introduce a rate limiter option just in case admission
control is not adequate, especially while it is newly exposed. We can revisit
this at a later point.

### Controlling Delete Rate

While our aim is for foreground traffic to be unaffected while the deletion
rate cleans up "enough", we expect developers will still need some tinkering
to find the right rate for their workload.

As such, users have knobs they can use to control the delete rate, including:
* how often the deletion job runs (controls the amount of "junk" data left)
* GC time (when tombstones are removed and space is therefore reclaimed)
* the size of the ranges, which has knock-on effects for `range_concurrency`
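For example, a deployment whose deletions are falling behind might raise
range concurrency while capping the overall delete rate (the values here are
hypothetical; defaults are listed in the options table below):

```sql
-- Hypothetical tuning: more parallelism across ranges, capped at
-- 5000 row deletions per second.
ALTER TABLE tbl SET TTL (range_concurrency = 16, maximum_rows_deleted_per_second = 5000);
```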
As part of the `TTL (option = value, …)` syntax, users can control the
following options:

Option | Description
--- | ---
select_batch_size | How many expired rows to fetch from a range at a time. Defaults to 500.
delete_batch_size | How many rows to delete at a time. Defaults to 100.
range_concurrency | How many ranges are worked on concurrently. Defaults to 8.
admission_control_priority | Priority of the admission control to use when deleting. Set to a high value if deletion traffic is the priority.
select_as_of_system_time | AS OF SYSTEM TIME to use for the SELECT clause. Defaults to `-30s`. Proposed as a hidden option, as I'm not sure anyone will actually need it.
maximum_rows_deleted_per_second | Maximum number of rows to delete per second (acts as the rate limit). Defaults to 0 (signifying no limit).
pause | Prevents the TTL job from executing. Already-running jobs will need to be stopped using CANCEL JOB.
job_cron | Frequency at which the job runs. Defaults to daily.

#### A note on delete_batch_size

Deleting rows on tables with secondary indexes lays down more intents, hence
causing higher contention. In this case, there is an argument for deleting
fewer rows at a time (or indeed, just one row at a time!) to avoid this.
However, testing showed deletes could never catch up in that regime, and
realistically only a very small number of rows can be deleted that way. 100
was chosen as the balance for now.

## Filtering out Expired Rows

Rows that have passed their TTL could optionally be removed from all SQL
query results. However, this is not planned to be implemented for the first
release of TTL.

## Foreign Keys

To avoid additional complexity in the initial implementation, foreign keys to
or from TTL tables will not be permitted. More thought has to be put into ON
DELETE/ON UPDATE CASCADE behavior before we can look at allowing this
functionality.
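Were the row filtering described above implemented, it would be roughly
equivalent to implicitly appending a predicate like the following to queries
against a TTL table (a sketch only; not part of the first release):

```sql
-- Sketch: hide expired rows. A NULL expiration means the row is
-- never TTL'd, so it always remains visible.
SELECT * FROM tbl
WHERE crdb_internal_expiration IS NULL
   OR crdb_internal_expiration >= current_timestamp();
```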
## Observability

As mentioned above, we expect users will need to tinker with the deletion job
to ensure foreground traffic is not affected.

To monitor "lag rate", we will expose a metric similar to
`estimated_row_count` called `estimated_active_row_count`, which counts the
number of rows estimated not to have expired their TTL. Users will be able to
poll this to determine the effectiveness of the TTL deletion job.

We will also expose a number of metrics on the job to help gauge deletion
progress:
* selection and deletion rate
* selection and deletion latency

## Alternatives

### KV Approach to Row Expiration

There was a major competitor to the "SQL-based deletion approach": having KV
be the layer responsible for TTL. The existing GC job would be responsible
for automatically clearing these rows. The KV layer would not return any row
whose TTL has expired, in theory also removing contention from intent
resolution and thus reducing foreground performance issues.

The main downside of this approach is that it is not "SQL aware", resulting
in a few problems:
* Secondary index entries would need some reconciliation to be garbage
  collected. As KV is not aware of which secondary index entries correspond
  to expired primary index rows, this would mean blocking secondary indexes
  on row-level TTL tables.
* If we wanted CDC to work, tombstones would need to be written by the KV GC
  process. This adds further complexity to CDC.

As row-level TTL is a "SQL level" feature, it makes sense that something in
the SQL layer would be most appropriate to handle it. See the
[comparison doc] for other observations.

### Alternative TTL columns

Another proposal for TTL columns was to have two columns:
* a `last_updated` column, which is the last update timestamp of the row.
* a `ttl` column, which is the interval after `last_updated` before the row
  gets deleted.
An advantage here is that it is easy to specify an "override": bump the `ttl`
interval value to some high number, and `last_updated` still gets updated.

We opted against this as we felt it was less obvious than an absolute
timestamp for the long-term vision.

### Alternative TTL syntax

An alternative SQL syntax for TTL would be to use the existing
`WITH (storage_parameter = value)` syntax PostgreSQL provides. However, this
experience is perceived as a little clunky for enabling or disabling TTL
compared to the currently proposed syntax, as it is unclear that the table's
columns themselves may be modified.

## Future Improvements

* Special annotation on TTL deletes for CDC: we may determine in future that
  DELETEs issued by a TTL job are a "special type" of delete that users can
  process.
* Secondary index deletion improvements: in future, we can remove the
  performance hit when deleting secondary indexes by doing a delete on the
  secondary index entry before the delete on the primary index entry. This is
  predicated on filtering out expired rows working.
* Admission control improvements: in future, we can explore applying
  backpressure from admission control into the job to automatically tune the
  deletion rate to least affect foreground traffic.

## Open Questions

N/A

[#20239]: https://github.com/cockroachdb/cockroach/issues/20239
[#75189]: https://github.com/cockroachdb/cockroach/pull/75189
[cockroach TTL advice]: https://www.cockroachlabs.com/docs/stable/bulk-delete-data.html
[admission control]: https://github.com/cockroachdb/cockroach/blob/master/docs/tech-notes/admission_control.md
[comparison doc]: https://docs.google.com/document/d/1HkFg3S-k3s2PahPRQhTgUkCR4WIAtjkSNVylarMC-gY/edit#heading=h.o6cn5faoiokv