From d815af6e0ca18b34c8422816dc19cb521cfd28e5 Mon Sep 17 00:00:00 2001
From: Oliver Tan
Date: Mon, 31 Jan 2022 00:22:46 +1100
Subject: [PATCH] rfc: row level TTL

Release note: None
---
 docs/RFCS/20220120_row_level_ttl.md | 411 ++++++++++++++++++++++++++++
 1 file changed, 411 insertions(+)
 create mode 100644 docs/RFCS/20220120_row_level_ttl.md

diff --git a/docs/RFCS/20220120_row_level_ttl.md b/docs/RFCS/20220120_row_level_ttl.md
new file mode 100644
index 000000000000..5f6cd48d2001
--- /dev/null
+++ b/docs/RFCS/20220120_row_level_ttl.md
@@ -0,0 +1,411 @@
# Row Level TTL
* Feature Name: Row-level TTL
* Status: in-progress
* Start Date: 2021-12-14
* Authors: Oliver Tan
* RFC PR: [#75189]
* Cockroach Issue: [#20239]

# Summary
Row-level "time to live" (TTL) is a mechanism by which rows in a table are
automatically deleted once they surpass an expiration time (the "TTL"). This
has been a [commonly requested feature][#20239].

This RFC proposes a CockroachDB-level mechanism to support row-level TTL,
where rows are deleted after a certain period of time. As a further extension
in a later release, rows will be automatically hidden once they have passed
their TTL and before they have been physically deleted.

The following declarative syntax initializes a table with row-level TTL:
```sql
CREATE TABLE tbl (
  id INT PRIMARY KEY,
  text TEXT
) WITH (ttl_expire_after = '5 minutes')
```
By implementing row-level TTL, we save developers from writing a complex
scheduled job and additional application logic to handle this functionality.
It also brings us to parity with other database systems that support this
feature.

**Note**: this does NOT cover partition-level deletes, where deletes can be
"accelerated" by deleting entire "partitions" at a time.

# Motivation
Today, developers who need row-level TTL must roll their own mechanism for
deleting rows, as well as add application logic to filter out the expired
rows.

The deletion job itself is complex to write. We have a [guide][cockroach TTL advice]
that was difficult to perfect: when we implemented it ourselves, we found
multiple performance issues. Developers have to implement and manage several
knobs to balance deletion time against the performance of foreground traffic
(traffic from the application):
* how to shard out the deletes and run them in parallel
* how many rows to SELECT and DELETE at once
* a rate limit to control the rate of SELECTs and DELETEs

Furthermore, developers must productionize these jobs themselves, adding more
moving pieces to their production environment.

Row-level TTL is requested often enough to warrant native support in
CockroachDB, sparing developers the complexity of another moving part and the
associated application logic, and making data easy.

# Technical design

At a high level, a user specifies a TTL at the table level using SQL syntax.
A scheduled background job then deletes any rows whose TTL has expired, using
SQL DELETE statements.

## Syntax and Table Metadata

### Table Creation
TTLs are defined on a per-table basis. Developers can create a table with a
TTL using the following syntax, extending the `storage_parameter` syntax in
PostgreSQL:

```sql
CREATE TABLE tbl (
  id INT PRIMARY KEY,
  text TEXT
) WITH (ttl_expire_after = '5 minutes')
```

This automatically creates a repeating scheduled job for the given table, as
well as adding the hidden column `crdb_internal_expiration` to represent the
row's TTL expiration time:

```sql
CREATE TABLE tbl (
  id INT PRIMARY KEY,
  text TEXT,
  crdb_internal_expiration TIMESTAMPTZ
    NOT VISIBLE
    NOT NULL
    DEFAULT current_timestamp() + '5 minutes'
    ON UPDATE current_timestamp() + '5 minutes'
) WITH (ttl_expire_after = '5 minutes')
```
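To illustrate the semantics of the DEFAULT and ON UPDATE expressions above,
a hypothetical session might look like the following sketch (the returned
timestamp is illustrative):

```sql
INSERT INTO tbl (id, text) VALUES (1, 'hello');

-- The DEFAULT expression stamps new rows 5 minutes into the future; hidden
-- columns can still be selected explicitly.
SELECT crdb_internal_expiration FROM tbl WHERE id = 1;
--   crdb_internal_expiration
-- 2022-01-31 00:27:46+00:00    (i.e. current_timestamp() + '5 minutes')

-- Any subsequent write refreshes the expiration via the ON UPDATE expression.
UPDATE tbl SET text = 'world' WHERE id = 1;
```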
TTL metadata is stored on the TableDescriptor:
```protobuf
message TableDescriptor {
  message RowLevelTTL {
    // DurationExpr is the automatically assigned interval for when the TTL should apply to a row.
    optional string duration_expr = 1 [(gogoproto.nullable)=false];
    // DeletionCron is the cron-syntax scheduling of the deletion job.
    optional string deletion_cron = 2 [(gogoproto.nullable)=false];
    // DeletionPause is true if the TTL job should not run.
    // Intended to be a temporary pause.
    optional bool deletion_pause = 3 [(gogoproto.nullable)=false];
    // DeleteBatchSize is the number of rows to delete in each batch.
    optional int64 delete_batch_size = 4;
    // SelectBatchSize is the number of rows to select at a time.
    optional int64 select_batch_size = 5;
    // MaxRowsDeletedPerSecond controls the number of rows to delete per second.
    // At zero, it imposes no limit.
    optional int64 max_rows_deleted_per_second = 6;
    // RangeConcurrency controls the number of ranges to delete at a time.
    // Defaults to 0 (number of CPU cores).
    optional int64 range_concurrency = 7;
  }

  // ...
  optional RowLevelTTL row_level_ttl = 47 [(gogoproto.customname)="RowLevelTTL"];
  // ...
}
```

### Applying or Altering TTL for a table
TTL can be configured using `ALTER TABLE`:

```sql
-- adding or changing TTL on a table
ALTER TABLE tbl SET (ttl_expire_after = '5 minutes');
-- adding TTL with other options
ALTER TABLE tbl SET (ttl_expire_after = '5 minutes', ttl_select_batch_size = 200);
-- restore the default value for a TTL option, reusing the PostgreSQL RESET syntax.
ALTER TABLE tbl RESET (ttl_select_batch_size);
```

Note that options changed while the deletion job is running do not take
effect until the job is restarted; a HINT will be displayed to the user when
this is required.

### Dropping TTL for a table
TTL can be dropped using `ALTER TABLE`:

```sql
ALTER TABLE tbl RESET (ttl_expire_after)
```

The RESET will also drop the TTL column.

### Limitations on the `crdb_internal_expiration` column
* The column can be used in indexes, constraints and primary keys.
* The DEFAULT and ON UPDATE expressions may be overridden, but are rewritten
  if `ttl_expire_after` is set.
* The column cannot be renamed.

## The Deletion Job
Row deletion is handled by a scheduled job in the jobs framework. The
scheduled job gets created (or modified) by the syntax detailed in the
previous section; there is one scheduled job per table with TTL.

The deletes issued by the job are SQL-based deletes, meaning they will show
up on changefeeds as regular deletes issued by the root user (we may choose
to create a new user for this in the future). Using SQL-based deletes has the
nice property of handling deletes on secondary index entries automatically,
and can form the basis for correctly handling foreign keys or triggers in the
future.

### Job Algorithm

On each invocation, the deletion job performs the following:
* Establish a "start time".
* Paginate over all ranges for the table. Allocate these to a "range worker
  pool".
* Each range worker then performs the following:
  * Traverse the range in order of PRIMARY KEY:
    * Issue a SQL SELECT query with AS OF SYSTEM TIME '-30s', fetching up to
      select_batch_size PRIMARY KEY entries whose TTL column is less than the
      established start time (note that the start time must be established
      before the job runs, or the job could potentially never finish).
    * Issue a SQL DELETE of up to delete_batch_size entries from the table
      using the PRIMARY KEYs selected above. Since a row may be updated
      between the SELECT and the DELETE, the DELETE also includes a clause to
      ensure the row has truly expired its TTL. A sketch of both queries
      appears after this list.
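The per-batch queries might look like the following sketch, assuming the
default `crdb_internal_expiration` column and default batch sizes (`$1`
stands for the start time established at the beginning of the job; the exact
statements are an implementation detail):

```sql
-- Fetch up to ttl_select_batch_size expired primary keys, reading at a
-- historical timestamp to reduce contention with foreground traffic.
SELECT id
FROM tbl
AS OF SYSTEM TIME '-30s'
WHERE crdb_internal_expiration < $1
ORDER BY id
LIMIT 500;

-- Delete up to ttl_delete_batch_size of the selected keys, re-checking the
-- expiration in case a row was updated after the SELECT.
DELETE FROM tbl
WHERE id IN (1, 2, 3) -- primary keys returned by the SELECT above
  AND crdb_internal_expiration < $1;
```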
### Admission Control
To ensure the deletion job does not affect foreground traffic, we plan on
using [admission control] on the SQL transactions at a low priority value
(`-200`). This leaves room for lower values in the future.

From previous experimentation, this largely obviates the knobs users may
otherwise need to configure, making this a "configure once, walk away"
process which does not need heavy monitoring.

### Transaction Priority
We will also run the DELETE queries under a lower transaction priority so
that any contending transaction can abort the DELETE and force it to retry,
instead of waiting on its locks.

### Rate Limiting
In an ideal world, admission control would reject traffic and force a
backoff and retry when it detects heavy load, and rate limiting would not be
needed. However, we will introduce a rate limiter option in case admission
control is not adequate, especially in its infancy of being exposed. We can
revisit this at a later point.

### Controlling Delete Rate
While our aim is for foreground traffic to be unaffected provided the
deletion job manages to clean up "enough", we expect developers will still
need to tune the delete rate for their workload.

As such, users have knobs that control the delete rate, including:
* how often the deletion job runs (controls the amount of "junk" data left)
* table GC time (when tombstones are removed and space is therefore reclaimed)
* the size of the ranges on the table, which has knock-on effects for
  `ttl_range_concurrency`.

As part of the `(option = value, …)` syntax, users can also control the
following on the deletion job itself (see the example after this table):

Option | Description
--- | ---
`ttl_expire_after` | When a TTL expires. Accepts any interval. Defaults to `'30 days'`. Minimum of `'5 minutes'`.
`ttl_expiration_expression` | If set, uses the specified expression as the TTL expiration. Defaults to using the `crdb_internal_expiration` column.
`ttl_select_batch_size` | How many expired rows to fetch from a range at a given time. Defaults to `500`. Must be at least `1`.
`ttl_delete_batch_size` | How many rows to delete at a time. Defaults to `100`. Must be at least `1`.
`ttl_range_concurrency` | How many ranges to work on concurrently. Defaults to the number of CPU cores. Must be at least `1`.
`ttl_admission_control_priority` | Priority of the admission control to use when deleting. Set to a high value if deletion traffic is the priority.
`ttl_select_as_of_system_time` | The AS OF SYSTEM TIME value to use for the SELECT clause. Defaults to `-30s`. Proposed as a hidden option, as it is unclear anyone will actually need it.
`ttl_maximum_rows_deleted_per_second` | Maximum number of rows to delete per second (acts as the rate limit). Defaults to `0` (no limit).
`ttl_pause` | Pauses the TTL job from executing.
`ttl_job_cron` | Frequency at which the job runs, specified in cron syntax.
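For example, a sketch pairing a long TTL with a nightly deletion schedule and
a rate limit (the values are illustrative):

```sql
ALTER TABLE tbl SET (
  ttl_expire_after = '90 days',
  ttl_job_cron = '0 2 * * *', -- run the deletion job daily at 02:00
  ttl_maximum_rows_deleted_per_second = 1000
);
```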
#### A note on delete_batch_size
Deleting rows from a table with secondary indexes lays down more intents,
causing higher contention. There is an argument for deleting fewer rows at a
time (or indeed, just one row at a time!) to avoid this. However, testing
showed deletes could never catch up in this case, as realistically only a
very small number of rows can be deleted this way. `100` was set as the
balance for now.

## Filtering out Expired Rows
Rows that have expired their TTL can be optionally removed from all SQL query
results. Work for this is required at the optimizer layer. However, this is
not planned to be implemented for the first release of TTL.
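Conceptually, once implemented, queries against a TTL table would behave as
if the expiration predicate were appended automatically. A sketch of the
intended semantics (not of the optimizer implementation):

```sql
-- What the user writes:
SELECT * FROM tbl;
-- What would effectively be evaluated:
SELECT * FROM tbl WHERE crdb_internal_expiration > current_timestamp();
```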
## Overriding the default value for the `crdb_internal_expiration` column

If we were to stick with one column, the user could simply update the
`crdb_internal_expiration` value to "override" the TTL. However, with our
current defaults, this value gets overwritten whenever the row is updated,
due to the `ON UPDATE` clause. The `ON UPDATE` should ideally be:

```sql
CASE WHEN crdb_internal_expiration IS NULL
-- NULL means the row should never be TTL'd
THEN NULL
-- otherwise, take whichever is larger: the user-set crdb_internal_expiration
-- value or an updated TTL based on the current_timestamp.
ELSE greatest(crdb_internal_expiration, current_timestamp() + 'table ttl value')
END
```

However, due to the limitation that `ON UPDATE` cannot reference a table
expression, this cannot yet be implemented. This can change when we implement
[triggers](https://github.com/cockroachdb/cockroach/issues/28296).

To counteract this, we will add a `ttl_expiration_expression` option which
specifies an expression to use as the TTL expiration, replacing the default
expiration condition. The expression must resolve to a TIMESTAMPTZ.

For example, we will have:

```sql
CREATE TABLE foo (
  i INT PRIMARY KEY,
  should_delete BOOL
) WITH (
  ttl_expire_after = '1m',
  ttl_expiration_expression = 'if(should_delete, crdb_internal_expiration, NULL)'
);
```

This would change the deletion job to use `if(should_delete,
crdb_internal_expiration, NULL)` as the expiration timestamp for choosing
rows to delete. Users can add other columns to use as part of their TTL
definition if they so choose.

## Foreign Keys
To avoid additional complexity in the initial implementation, foreign keys to
or from TTL tables will not be permitted. More thought has to be put into ON
DELETE/ON UPDATE CASCADE behavior before we can look at allowing this
functionality.

## Introspection
The TTL definition for the table will appear in `SHOW CREATE TABLE`. The
options for the TTL job on the table can be found on the `pg_class` table
under `reloptions`.
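For example, a sketch of retrieving those options:

```sql
-- TTL storage parameters appear under reloptions for the table.
SELECT reloptions FROM pg_class WHERE relname = 'tbl';
```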
## Observability
As mentioned above, we expect users may need to tinker with the deletion job
to ensure foreground traffic is not affected.

We will also expose a number of metrics on the job to help gauge deletion
progress, each exposed as a prometheus metric with a `tablename` label.

The following can be monitored using metrics graphs on the DB console:
* selection and deletion rate (`sql.ttl.select_rate`, `sql.ttl.delete_rate`)
* selection and deletion latency (`sql.ttl.select_latency`, `sql.ttl.delete_latency`)

In the future, we may also introduce a mechanism to monitor "lag rate" by
exposing a denormalized table metadata field or metric similar to
`estimated_row_count` called `estimated_expired_row_count`, which estimates
the number of rows that have expired their TTL but have not yet been deleted.
Users will be able to poll this to determine the effectiveness of the TTL
deletion job. This can be handled by the automatic statistics job.

## Alternatives

### KV Approach to Row Expiration
There was a major competitor to the "SQL-based deletion approach": having the
KV layer be responsible for TTL. The existing GC job would be responsible for
automatically clearing these rows. The KV layer would not return any row
which has expired its TTL, in theory also removing contention from intent
resolution and thus reducing foreground performance issues.

The main downside of this approach is that it is not "SQL aware", resulting
in a few problems:
* Secondary index entries would need some reconciliation to be garbage
  collected. As KV is not aware of which rows in the secondary index need to
  be deleted alongside the primary index, this would mean blocking TTL tables
  from having secondary indexes.
  * We can make all of these changes atomically without needing to coordinate
    with concurrent transactions. If we did filtering during query execution
    and we made a change to the semantics of what was filtered, we might need
    to be careful about how we roll out those versions to users. Namely, we
    cannot start deleting data until nobody thinks that data might be live.
    This toolkit exists if we choose to go down that path.
* If we wanted CDC to work, tombstones would need to be written by the KV GC
  process. This adds further complexity to CDC.

As row-level TTL is a "SQL level" feature, it makes sense that something in
the SQL layer would be most appropriate to handle it. See the [comparison
doc] for other observations.

### Alternative TTL columns
Another proposal for TTL columns was to have two columns:
* a `last_updated` column, which is the last update timestamp of the row.
* a `ttl` column, which is the interval after `last_updated` at which the row
  gets deleted.

An advantage here is that it is easy to specify an "override": bump the `ttl`
interval value to some high number, and `last_updated` still gets updated.

We opted against this as we felt it was less obvious than the long-term
vision based on an absolute timestamp.

### Alternative TTL SQL syntax
An alternative SQL syntax for TTL would be to make TTL its own clause in
CREATE TABLE / ALTER TABLE.

The syntax would look as follows:

```sql
-- creating a table with TTL
CREATE TABLE tbl (
  id INT PRIMARY KEY,
  text TEXT
) TTL;
-- creating a table with TTL options
CREATE TABLE tbl (
  id INT PRIMARY KEY,
  text TEXT
) TTL (ttl_expire_after = '5 mins', delete_batch_size = 1, job_cron = '* * 1 * *');
-- adding a new TTL
ALTER TABLE tbl SET TTL; -- starts a TTL with a default of 30 days.
ALTER TABLE tbl SET TTL (ttl_expire_after = '5 mins', delete_batch_size = 1, job_cron = '* * 1 * *');
-- dropping a TTL
ALTER TABLE tbl DROP TTL;
```

The main attraction of this is that it makes clear TTL is being added or
dropped. However, we have a mild preference for re-using the
`storage_parameter` `WITH` syntax from PostgreSQL, and that's what we ended
up going with.

## Future Improvements

### CDC DELETE annotations
We may determine in the future that DELETEs issued by a TTL job are a
"special type" of delete that users can process separately.

### Secondary index deletion improvements
We can improve deletion speed on secondary indexes if we split the deletes
for secondary indexes and the primary index. This way, we have a lot more
flexibility in terms of which writes we batch.

This allows us to throw thousands of rows into a big buffer, decompose their
individual writes, bucket them by range, and send a batch per range. This has
two benefits, both of which are very significant:
* we can easily build up large batches of 100s of writes for each range,
  which will stay batched all the way through Raft and down to Pebble.
* these batches don’t run a distributed transaction protocol, so they hit the
  1PC fast path instead of writing intents, running a 2-phase commit
  protocol, and then cleaning up those intents.

This is predicated on filtering out expired rows working; otherwise, users
could miss entries when querying the secondary index as opposed to the
primary index.

### Improve the deletion loop
We can speed up deletion by using an index on the TTL column, if one has been
created for the table.

## Open Questions
N/A

[#20239]: https://github.com/cockroachdb/cockroach/issues/20239
[#75189]: https://github.com/cockroachdb/cockroach/pull/75189
[cockroach TTL advice]: https://www.cockroachlabs.com/docs/stable/bulk-delete-data.html
[admission control]: https://github.com/cockroachdb/cockroach/blob/master/docs/tech-notes/admission_control.md
[comparison doc]: https://docs.google.com/document/d/1HkFg3S-k3s2PahPRQhTgUkCR4WIAtjkSNVylarMC-gY/edit#heading=h.o6cn5faoiokv