Release note: None
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,335 @@ | ||
# Row Level TTL | ||
* Feature Name: Row-level TTL | ||
* Status: in-progress | ||
* Start Date: 2021-12-14 | ||
* Authors: Oliver Tan | ||
* RFC PR: [#75189](https://github.com/cockroachdb/cockroach/pull/75189)
* Cockroach Issue: [#20239](https://github.com/cockroachdb/cockroach/issues/20239)
|
||
# Summary | ||
Row-level "time to live" (TTL) is a mechanism in which rows from a table | ||
automatically get deleted once the row surpasses an expiration time (the "TTL"). | ||
This has been a [commonly requested feature](https://github.com/cockroachdb/cockroach/issues/20239).
|
||
This RFC proposes a CockroachDB-level mechanism to support row-level TTL, where
rows will be deleted after a certain period of time. As a further extension in a
later release, rows will automatically be hidden once their TTL has expired.
|
||
The following declarative syntax initializes a table with row-level TTL:
```sql | ||
CREATE TABLE tbl ( | ||
id INT PRIMARY KEY, | ||
text TEXT | ||
) TTL '5 minutes' | ||
``` | ||
By implementing row-level TTL, we are saving developers from writing a complex | ||
scheduled job and additional application logic to handle this functionality. It | ||
also brings us to parity with other database systems which support this feature. | ||
|
||
**Note**: this does NOT cover partition-level deletes, where deletes can be
"accelerated" by deleting entire "partitions" at a time.
|
||
# Motivation | ||
Today, developers who need row-level TTL must roll their own mechanism for
deleting rows, and must add application logic to filter out the expired rows.
|
||
The job itself can get complex to write. We have a
[guide](https://www.cockroachlabs.com/docs/stable/bulk-delete-data.html)
which was difficult to perfect - when we implemented it ourselves, we found
multiple performance issues. Developers have to implement and manage several
knobs to balance deletion time against the impact on foreground traffic
(traffic from the application):
* how to shard out the delete & run the deletes in parallel | ||
* how many rows to SELECT and DELETE at once | ||
* a rate limit to control the rate of SELECTs and DELETEs | ||
|
||
Furthermore, developers must productionize the jobs themselves, adding more
moving pieces to their production environment.
|
||
Row-level TTL is requested often enough to warrant being supported natively by
CockroachDB. Native support saves developers from maintaining another moving
part with complex logic, and helps make data easy.
|
||
# Technical design | ||
|
||
## Syntax and Table Metadata | ||
|
||
### Table Creation | ||
TTLs are defined on a per-table basis. Developers can create a table with a TTL | ||
using the following syntax: | ||
|
||
```sql | ||
CREATE TABLE tbl ( | ||
id INT PRIMARY KEY, | ||
text TEXT | ||
) TTL | ||
``` | ||
|
||
Users can add additional options to control their TTL (see full list of options | ||
below): | ||
|
||
```sql | ||
CREATE TABLE tbl ( | ||
id INT PRIMARY KEY, | ||
text TEXT | ||
) TTL (expiration_time = '5 minutes', delete_batch_size = 1, job_cron = '* * 1 * *') | ||
``` | ||
|
||
This automatically creates a repeating scheduled job for the given table, and
adds the HIDDEN column `crdb_internal_expiration` to hold each row's expiration
time:
|
||
```sql | ||
CREATE TABLE tbl ( | ||
id INT PRIMARY KEY, | ||
text TEXT, | ||
crdb_internal_expiration TIMESTAMPTZ NOT NULL DEFAULT current_timestamp() + '5 minutes' ON UPDATE current_timestamp() + '5 minutes' | ||
) TTL (expiration_time = '5 minutes') | ||
``` | ||
|
||
A user can override the `crdb_internal_expiration` expression by using the
`expiration_column` option to name their own column as the TTL column:
|
||
```sql | ||
CREATE TABLE tbl ( | ||
id INT PRIMARY KEY, | ||
text TEXT, | ||
other_ts_field TIMESTAMPTZ | ||
) TTL (expiration_column = 'other_ts_field') | ||
``` | ||
|
||
TTL metadata is stored on the TableDescriptor: | ||
```protobuf
message TableDescriptor {
  // …
  message TTL {
    option (gogoproto.equal) = true;
    message RowLevelTTL {
      // DeleteBatchSize is the number of rows to delete in each batch.
      optional int64 delete_batch_size = 1; // defaults to 100
      // SelectBatchSize is the number of rows to select at a time.
      optional int64 select_batch_size = 2; // defaults to 500
      // MaximumRowsDeletedPerSecond controls the number of rows to delete per second.
      optional int64 maximum_rows_deleted_per_second = 3; // defaults to 0 (no limit)
      // RangeConcurrency controls the number of ranges to delete at a time. Defaults to 8.
      optional int64 range_concurrency = 4;
    }
    // DurationExpr is the automatically assigned interval for when the TTL should apply to a row.
    optional string duration_expr = 1 [(gogoproto.nullable)=false];
    // DeletionCron is the cron schedule on which the deletion job runs.
    optional string deletion_cron = 2 [(gogoproto.nullable)=false];
    oneof details {
      RowLevelTTL row_level_ttl = 3 [(gogoproto.customname)="RowLevelTTL"];
      // reserved for partitioning
    }
  }
  optional TTL ttl = 47 [(gogoproto.customname)="TTL"];
}
```
|
||
#### User override for the `crdb_internal_expiration` column | ||
In an ideal world, the user can simply update the `crdb_internal_expiration` | ||
value to "override" the TTL. However, with our current defaults, this value | ||
would get overridden if the row is updated due to the `ON UPDATE` clause. The | ||
`ON UPDATE` should ideally be: | ||
|
||
```sql | ||
CASE WHEN crdb_internal_expiration IS NULL
  -- NULL means the row should never be TTL'd
  THEN NULL
  -- otherwise, take whichever is larger: the user-set crdb_internal_expiration
  -- value or an updated TTL based on the current_timestamp.
  ELSE greatest(crdb_internal_expiration, current_timestamp() + 'table ttl value')
END
``` | ||
|
||
However, because `ON UPDATE` cannot reference a table expression, this cannot
yet be implemented. This can change once we implement
[triggers](https://github.com/cockroachdb/cockroach/issues/28296).
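The intended override rule can be sketched outside of SQL. The following is a minimal Python illustration of the `CASE` expression above, not the proposed implementation; the function name `next_expiration` and its parameters are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def next_expiration(
    current: Optional[datetime], now: datetime, table_ttl: timedelta
) -> Optional[datetime]:
    """Expiration a row should carry after an UPDATE (hypothetical helper)."""
    if current is None:
        # NULL means the row should never be TTL'd, so it stays NULL.
        return None
    # Keep whichever is later: the user-set value or now + table TTL.
    return max(current, now + table_ttl)


now = datetime(2022, 1, 1, tzinfo=timezone.utc)
ttl = timedelta(minutes=5)
user_override = now + timedelta(days=30)

kept = next_expiration(user_override, now, ttl)                    # override survives
refreshed = next_expiration(now - timedelta(hours=1), now, ttl)    # bumped to now + TTL
never = next_expiration(None, now, ttl)                            # stays None
```

With the current `ON UPDATE current_timestamp() + '5 minutes'` default, the first case would instead be reset to `now + ttl`, which is the problem this section describes.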
|
||
A workaround may be planned after the first iteration, but it is not deemed | ||
a blocking feature for the first iteration of TTL. | ||
|
||
### Applying or Altering TTL for a table | ||
TTL can be configured using `ALTER TABLE`: | ||
|
||
```sql | ||
ALTER TABLE tbl SET TTL; -- introduce a TTL. no-op if there is already a TTL | ||
ALTER TABLE tbl SET TTL (expiration_column = 'other_ts_field'); | ||
-- introduce a TTL with the given option OR apply the given option to the TTL | ||
ALTER TABLE tbl SET TTL (expiration_column = NULL); -- drop an option from the TTL
``` | ||
|
||
Note that initially, changes to job-related TTL options will not apply to a job
that is already running; the user must restart the job for the new settings to
take effect. A HINT will be displayed to the user if this is required.
|
||
### Dropping TTL for a table | ||
TTL can be dropped using `ALTER TABLE`: | ||
|
||
```sql | ||
ALTER TABLE tbl DROP TTL | ||
``` | ||
|
||
The DROP will not drop any TTL columns. However, re-adding a TTL to the table
will not succeed while the existing TTL column remains, as it may risk expiring
all the rows immediately. Instead, the user must either DROP the existing TTL
column or specify it using the `expiration_column` option. A hint will be
provided detailing this.

When a TTL table is created, we create a SCHEDULED JOB which defaults to running
daily.
|
||
## The Deletion Job | ||
|
||
Row deletion is handled by a scheduled job in the jobs framework. The scheduled
job gets created (or modified) by the syntax detailed in the previous section.
Rows are removed with SQL DELETE statements, meaning the deletes will show up on
changefeeds as regular deletes by the root user.
|
||
### Job Algorithm | ||
|
||
On each job invocation, the deletion job performs the following:
* Establish a "start time".
* Paginate over all ranges for the table. Allocate these to a "range worker pool".
* Each range worker then performs the following:
  * Traverse the range in order of PRIMARY KEY:
    * Issue a SQL `SELECT ... AS OF SYSTEM TIME '-30s'` query on the PRIMARY
      KEYs, up to `select_batch_size` entries, in which the TTL column is less
      than the established start time (the start time must be established
      before the job runs, or the job could potentially never finish).
    * Issue a SQL DELETE for up to `delete_batch_size` entries from the table
      using the PRIMARY KEYs selected above.
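The per-range loop above can be sketched as follows. This is a minimal, single-range illustration in Python that uses an in-memory SQLite table as a stand-in for CockroachDB, so there is no `AS OF SYSTEM TIME` and no range worker pool. The function name `delete_expired`, the `expiration` column name, and the re-check of the expiration in the DELETE are assumptions made for the example, not part of the proposal.

```python
import sqlite3


def delete_expired(conn, start_time, select_batch_size=500, delete_batch_size=100):
    """Delete rows with expiration < start_time; returns the number deleted."""
    deleted = 0
    last_id = None  # pagination cursor over the PRIMARY KEY
    while True:
        # SELECT step: next page of expired primary keys, in PRIMARY KEY order.
        rows = conn.execute(
            "SELECT id FROM tbl "
            "WHERE (:last IS NULL OR id > :last) AND expiration < :start "
            "ORDER BY id LIMIT :lim",
            {"last": last_id, "start": start_time, "lim": select_batch_size},
        ).fetchall()
        if not rows:
            return deleted
        ids = [r[0] for r in rows]
        last_id = ids[-1]
        # DELETE step: remove the selected keys in smaller batches.
        for i in range(0, len(ids), delete_batch_size):
            batch = ids[i:i + delete_batch_size]
            placeholders = ",".join("?" * len(batch))
            # Assumption: re-check the expiration, since a row may have been
            # updated between the (stale) SELECT and the DELETE.
            cur = conn.execute(
                f"DELETE FROM tbl WHERE id IN ({placeholders}) AND expiration < ?",
                (*batch, start_time),
            )
            conn.commit()
            deleted += cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (id INTEGER PRIMARY KEY, expiration INTEGER NOT NULL)")
# Even ids expire at t=50, odd ids at t=500.
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?)",
    [(i, 50 if i % 2 == 0 else 500) for i in range(1000)],
)
deleted = delete_expired(conn, start_time=100, select_batch_size=120, delete_batch_size=50)
remaining = conn.execute("SELECT count(*) FROM tbl").fetchone()[0]
```

Because the start time is fixed before the loop begins, rows that expire while the job is running are left for the next invocation, which is what lets the loop terminate.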
|
||
### Admission Control | ||
To ensure the deletion job does not affect foreground traffic, we plan on using
[admission control](https://github.com/cockroachdb/cockroach/blob/master/docs/tech-notes/admission_control.md)
on a SQL transaction at the lowest possible value (-256).
|
||
From previous experimentation, this largely reduces the number of knobs users
may need to configure, making this a "configure once, walk away" process which
does not need heavy monitoring.
|
||
### Rate Limiting | ||
In an ideal world, admission control could reject traffic and force a backoff
and retry when it detects heavy load, making rate limiting unnecessary. However,
we will introduce a rate limiter option in case admission control is not
adequate, especially while it is newly exposed. We can revisit this at a later
point.
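As an illustration of how a `maximum_rows_deleted_per_second` limit could pace delete batches, here is a minimal Python sketch of a simple pacing limiter. It is an assumption about one possible approach, not the limiter the job will use; `RowRateLimiter` and `FakeClock` are hypothetical names, and the fake clock exists only to make the example deterministic.

```python
import time


class RowRateLimiter:
    """Paces row deletions; a rate of 0 means no limit (hypothetical helper)."""

    def __init__(self, max_rows_per_second, clock=time.monotonic, sleep=time.sleep):
        self.rate = max_rows_per_second
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = clock()  # earliest time the next batch may start

    def acquire(self, rows):
        """Wait until a batch of `rows` may start; returns seconds waited."""
        if self.rate <= 0:
            return 0.0
        now = self.clock()
        wait = max(0.0, self.next_allowed - now)
        if wait > 0:
            self.sleep(wait)
        # Reserve the time this batch "costs" at the configured rate.
        self.next_allowed = max(now, self.next_allowed) + rows / self.rate
        return wait


class FakeClock:
    """Deterministic clock so the example does not actually sleep."""

    def __init__(self):
        self.t = 0.0

    def now(self):
        return self.t

    def sleep(self, seconds):
        self.t += seconds


clock = FakeClock()
limiter = RowRateLimiter(200, clock=clock.now, sleep=clock.sleep)
# Three batches of 100 rows at 200 rows/s: the 2nd and 3rd each wait 0.5s.
waits = [limiter.acquire(100) for _ in range(3)]

unlimited = RowRateLimiter(0, clock=clock.now, sleep=clock.sleep)
no_wait = unlimited.acquire(1_000_000)
```

In a real job the limiter would be shared across range workers, so the limit applies to the table as a whole rather than to each worker.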
|
||
### Controlling Delete Rate | ||
Whilst our aim is for foreground traffic to be unaffected while the deletion
rate is high enough to clean up "enough" rows, we expect that developers will
still need to tune the deletion rate for their workload.
|
||
As such, users have knobs they can use to control the delete rate, including:
* how often the deletion job runs (controls the amount of "junk" data left)
* GC time (when tombstones are removed and space is therefore reclaimed)
* the size of the ranges, which has knock-on effects for `range_concurrency`.
|
||
As part of the `TTL (option = value, …)` syntax, users can control the following | ||
options: | ||
|
||
Option | Description
--- | ---
select_batch_size | How many expired rows to fetch from the range at a time. Defaults to 500.
delete_batch_size | How many rows to delete at a time. Defaults to 100.
range_concurrency | How many ranges are worked on concurrently. Defaults to 8.
admission_control_priority | Priority of the admission control to use when deleting. Set to a high value if deletion traffic is the priority.
select_as_of_system_time | AS OF SYSTEM TIME to use for the SELECT clause. Defaults to `-30s`. Proposed as a hidden option, as it is unclear whether anyone will need it.
maximum_rows_deleted_per_second | Maximum number of rows to be deleted per second (acts as the rate limit). Defaults to 0 (signifying no limit).
pause | Pauses the TTL job from executing. Jobs that are already running must be stopped using CANCEL JOB.
job_cron | Frequency at which the job runs, as a cron expression.
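To make the defaults concrete, the following Python sketch models the options that have a stated default in the table above. It is illustrative only: `TTLOptions` and `parse_ttl_options` are hypothetical names, the validation rules are assumptions, and options without a stated default (`admission_control_priority`, `pause`, `job_cron`) are deliberately left out.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class TTLOptions:
    select_batch_size: int = 500
    delete_batch_size: int = 100
    range_concurrency: int = 8
    select_as_of_system_time: str = "-30s"
    maximum_rows_deleted_per_second: int = 0  # 0 means no limit


def parse_ttl_options(raw):
    """Build TTLOptions from a dict of user-supplied options (hypothetical)."""
    known = {f.name for f in fields(TTLOptions)}
    unknown = set(raw) - known
    if unknown:
        raise ValueError(f"unknown TTL option(s): {sorted(unknown)}")
    opts = TTLOptions(**raw)
    # Assumed validation: batch sizes and concurrency must be positive.
    for name in ("select_batch_size", "delete_batch_size", "range_concurrency"):
        if getattr(opts, name) < 1:
            raise ValueError(f"{name} must be at least 1")
    if opts.maximum_rows_deleted_per_second < 0:
        raise ValueError("maximum_rows_deleted_per_second must not be negative")
    return opts


# Mirrors `TTL (delete_batch_size = 1)`; every other option keeps its default.
opts = parse_ttl_options({"delete_batch_size": 1})
```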
|
||
#### A note on delete_batch_size | ||
Deleting rows from tables with secondary indexes lays more intents, hence
causing higher contention. In this case, there is an argument for deleting
fewer rows at a time (or indeed, just one row at a time!) to avoid this.
However, testing showed that deletes could never catch up in this case, as
realistically only a very small number of rows can be deleted in this way.
"100" was set as the balance for now.
|
||
## Filtering out Expired Rows | ||
Rows whose TTL has expired can optionally be removed from all SQL query
results. However, this is not planned to be implemented for the first release of
TTL.
|
||
## Foreign Keys | ||
To avoid additional complexity in the initial implementation, foreign keys to or
from TTL tables will not be permitted. More thought has to be put into ON
DELETE/ON UPDATE CASCADEs before we can look at allowing this functionality.
|
||
## Observability | ||
As mentioned above, we expect users will need to tinker with the deletion job to | ||
ensure foreground traffic is not affected. | ||
|
||
To monitor "lag rate", we will expose a metric similar to `estimated_row_count`
called `estimated_active_row_count`, which estimates the number of rows whose
TTL has not yet expired. Users will be able to poll this to determine the
effectiveness of the TTL deletion job.
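The relationship between the two counts can be illustrated with a small Python sketch. It computes exact counts over an in-memory list, whereas the proposed metric is an estimate; the function name mirrors the metric but is otherwise hypothetical, and treating a missing expiration as "never expires" follows the NULL convention described earlier.

```python
from datetime import datetime, timedelta, timezone


def estimated_active_row_count(expirations, now):
    """Count rows that have not expired; None means the row never expires."""
    return sum(1 for e in expirations if e is None or e >= now)


now = datetime(2022, 1, 20, tzinfo=timezone.utc)
expirations = [
    now - timedelta(minutes=10),  # expired, not yet deleted
    now - timedelta(seconds=1),   # expired, not yet deleted
    now + timedelta(minutes=5),   # still active
    None,                         # never expires
]
total = len(expirations)
active = estimated_active_row_count(expirations, now)
lag = total - active  # expired rows the deletion job has yet to remove
```

A `lag` that keeps growing between polls suggests the deletion job is not keeping up with the rate at which rows expire.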
|
||
We will also expose a number of metrics on the job to help gauge deletion | ||
progress: | ||
* selection and deletion rate | ||
* selection and deletion latency | ||
|
||
## Alternatives | ||
|
||
### KV Approach to Row Expiration | ||
There was a major competitor to the "SQL-based deletion approach" - having the | ||
KV be the layer responsible for TTL. The existing GC job would be responsible | ||
for automatically clearing these rows. The KV layer would not return any row | ||
which has expired from the TTL, in theory also removing contention from intent | ||
resolution thus reducing foreground performance issues. | ||
|
||
The main downside of this approach is that it is not "SQL aware", resulting in
a few problems:
* Secondary index entries would need some reconciliation to be garbage
  collected. As KV is not aware which secondary index entries need to be
  deleted along with a row in the primary index, this would in effect block
  tables with secondary indexes from using row-level TTL.
* If we wanted CDC to work, tombstones would need to be written by the KV GC
  process. This adds further complexity to CDC.
|
||
As row-level TTL is a "SQL level" feature, it makes sense that something in the
SQL layer would be most appropriate to handle it. See the
[comparison doc](https://docs.google.com/document/d/1HkFg3S-k3s2PahPRQhTgUkCR4WIAtjkSNVylarMC-gY/edit#heading=h.o6cn5faoiokv)
for other observations.
|
||
|
||
### Alternative TTL columns | ||
Another proposal for TTL columns was to have two columns:
* a `last_updated` column which holds the timestamp of the last update to the
  row.
* a `ttl` column which holds the interval after `last_updated` at which the row
  gets deleted.
|
||
An advantage here is that it is easy to specify an "override" - bump the | ||
`ttl` interval value to be some high number, and `last_updated` still gets | ||
updated. | ||
|
||
We opted against this as we felt that, in the long term, it was less obvious
than a single absolute expiration timestamp.
|
||
### Alternative TTL syntax | ||
An alternative SQL syntax for TTL would be to use the existing | ||
`WITH (storage_parameter = value)` syntax PostgreSQL provides. However, | ||
the experience of this is perceived as a little clunky in terms of enabling | ||
or disabling TTL compared to the currently proposed syntax as it is unclear | ||
that the table columns themselves may be modified. | ||
|
||
## Future Improvements | ||
* Special annotation on TTL deletes for CDC: we may determine in future that | ||
DELETEs issued by a TTL job are a "special type" of delete that users can | ||
process. | ||
* Secondary index deletion improvements: in future, we can remove the
  performance hit of deleting secondary index entries by deleting the
  secondary index entry before deleting the primary index entry. This is
  predicated on filtering out expired rows working.
* Admission control improvements: In future, we can explore applying | ||
backpressure from admission control into the job to automatically tune | ||
deletion rate to least affect foreground traffic. | ||
|
||
## Open Questions | ||
N/A | ||
|
||
[#20239]: https://github.com/cockroachdb/cockroach/issues/20239
[#75189]: https://github.com/cockroachdb/cockroach/pull/75189
[cockroach TTL advice]: https://www.cockroachlabs.com/docs/stable/bulk-delete-data.html
[admission control]: https://github.com/cockroachdb/cockroach/blob/master/docs/tech-notes/admission_control.md
[comparison doc]: https://docs.google.com/document/d/1HkFg3S-k3s2PahPRQhTgUkCR4WIAtjkSNVylarMC-gY/edit#heading=h.o6cn5faoiokv