rfc: row level TTL
Release note: None
otan committed Jan 20, 2022
1 parent d990758 commit d70c8bb
Showing 1 changed file with 335 additions and 0 deletions.
335 changes: 335 additions & 0 deletions docs/RFCS/20220120_row_level_ttl.md
@@ -0,0 +1,335 @@
# Row Level TTL
* Feature Name: Row-level TTL
* Status: in-progress
* Start Date: 2021-12-14
* Authors: Oliver Tan
* RFC PR: [#75189]
* Cockroach Issue: [#20239]

# Summary
Row-level "time to live" (TTL) is a mechanism in which rows from a table are
automatically deleted once they pass an expiration time (the "TTL").
This has been a [commonly requested feature][#20239].

This RFC proposes a CockroachDB-level mechanism to support row-level TTL, where
rows are deleted after a certain period of time. As a further extension in a
later release, rows will automatically be hidden once their TTL expires.

The following declarative syntax initializes a table with row-level TTL:
```sql
CREATE TABLE tbl (
  id INT PRIMARY KEY,
  text TEXT
) TTL '5 minutes'
```
By implementing row-level TTL, we are saving developers from writing a complex
scheduled job and additional application logic to handle this functionality. It
also brings us to parity with other database systems which support this feature.

**Note**: this does NOT cover partition-level deletes, where deletes can be
"accelerated" by deleting entire "partitions" at a time.

# Motivation
Today, developers who need row-level TTL must roll their own mechanism for
deleting rows, as well as add application logic to filter out the expired
rows.

The job itself can get complex to write. We have a [guide][cockroach TTL advice]
which was difficult to perfect - when we implemented it ourselves, we found
there were multiple issues related to performance. Developers have to implement
and manage several knobs to balance deletion time against the performance of
foreground traffic (traffic from the application):
* how to shard out the delete & run the deletes in parallel
* how many rows to SELECT and DELETE at once
* a rate limit to control the rate of SELECTs and DELETEs

Furthermore, developers must productionize the jobs themselves, adding more
moving pieces to their production environment.

Row-level TTL is requested often enough to warrant native support in
CockroachDB, sparing developers another moving part and complex logic, and
making data easy.

# Technical design

## Syntax and Table Metadata

### Table Creation
TTLs are defined on a per-table basis. Developers can create a table with a TTL
using the following syntax:

```sql
CREATE TABLE tbl (
  id INT PRIMARY KEY,
  text TEXT
) TTL
```

Users can add additional options to control their TTL (see full list of options
below):

```sql
CREATE TABLE tbl (
  id INT PRIMARY KEY,
  text TEXT
) TTL (expiration_time = '5 minutes', delete_batch_size = 1, job_cron = '* * 1 * *')
```

This automatically creates a repeating scheduled job for the given table, and
adds the HIDDEN column `crdb_internal_expiration` to hold each row's expiration
time:

```sql
CREATE TABLE tbl (
  id INT PRIMARY KEY,
  text TEXT,
  crdb_internal_expiration TIMESTAMPTZ NOT NULL DEFAULT current_timestamp() + '5 minutes' ON UPDATE current_timestamp() + '5 minutes'
) TTL (expiration_time = '5 minutes')
```
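
As a rough illustration of what the `DEFAULT` and `ON UPDATE` expressions above compute, the following Python sketch mirrors `current_timestamp() + '5 minutes'`. The function name `default_expiration` is hypothetical and exists only for this example; it is not part of the proposal.

```python
from datetime import datetime, timedelta, timezone


def default_expiration(now: datetime, ttl: timedelta) -> datetime:
    """Mirror `current_timestamp() + ttl` for an inserted or updated row."""
    return now + ttl


ttl = timedelta(minutes=5)
inserted_at = datetime(2022, 1, 20, 12, 0, tzinfo=timezone.utc)
updated_at = inserted_at + timedelta(minutes=3)

# DEFAULT applies on insert; ON UPDATE re-applies the same expression, so
# every update pushes the expiration out to "now + TTL".
expires_on_insert = default_expiration(inserted_at, ttl)
expires_on_update = default_expiration(updated_at, ttl)
```

Because the same expression runs on every update, a row that keeps being written never expires, while an untouched row expires one TTL interval after its last write.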

A user can override the `crdb_internal_expiration` column by naming another
column to act as the TTL via the `expiration_column` option:

```sql
CREATE TABLE tbl (
  id INT PRIMARY KEY,
  text TEXT,
  other_ts_field TIMESTAMPTZ
) TTL (expiration_column = 'other_ts_field')
```

TTL metadata is stored on the TableDescriptor:
```protobuf
message TableDescriptor {
  // …
  message TTL {
    option (gogoproto.equal) = true;
    message RowLevelTTL {
      // DeleteBatchSize is the number of rows to delete in each batch.
      optional int64 delete_batch_size = 1; // defaults to 100
      // SelectBatchSize is the number of rows to select at a time.
      optional int64 select_batch_size = 2; // defaults to 500
      // MaximumRowsDeletedPerSecond controls the number of rows to delete per second.
      optional int64 maximum_rows_deleted_per_second = 3; // defaults to 0 (no limit)
      // RangeConcurrency controls the number of ranges to delete at a time. Defaults to 8.
      optional int64 range_concurrency = 4;
    }
    // DurationExpr is the automatically assigned interval for when the TTL should apply to a row.
    optional string duration_expr = 1 [(gogoproto.nullable) = false];
    // DeletionCron is the cron expression for how often the deletion job runs.
    optional string deletion_cron = 2 [(gogoproto.nullable) = false];
    oneof details {
      RowLevelTTL row_level_ttl = 3 [(gogoproto.customname) = "RowLevelTTL"];
      // reserved for partitioning
    }
  }
  optional TTL ttl = 47 [(gogoproto.customname) = "TTL"];
}
```

#### User override for the `crdb_internal_expiration` column
In an ideal world, the user can simply update the `crdb_internal_expiration`
value to "override" the TTL. However, with our current defaults, this value
would get overridden if the row is updated due to the `ON UPDATE` clause. The
`ON UPDATE` should ideally be:

```sql
CASE WHEN crdb_internal_expiration IS NULL
  -- NULL means the row should never be TTL'd
  THEN NULL
  -- otherwise, take whatever's larger: the user-set crdb_internal_expiration
  -- value or an updated TTL based on the current_timestamp.
  ELSE greatest(crdb_internal_expiration, current_timestamp() + 'table ttl value')
END
```

However, because `ON UPDATE` cannot reference a table expression, this cannot
yet be implemented. This can change when we implement
[triggers](https://github.com/cockroachdb/cockroach/issues/28296).

A workaround may be planned after the first iteration, but it is not deemed
a blocking feature for the first iteration of TTL.
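
To make the intended semantics of the ideal `ON UPDATE` expression concrete, here is a minimal Python sketch of the same decision. The function `ideal_on_update` and its parameter names are hypothetical, chosen for this example only; the sketch models the logic, not how CockroachDB would evaluate it.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def ideal_on_update(
    current: Optional[datetime], now: datetime, table_ttl: timedelta
) -> Optional[datetime]:
    """Return the expiration a row should carry after an update."""
    if current is None:
        # NULL means the row should never be TTL'd.
        return None
    # Keep whichever is later: the user-set expiration or now + table TTL.
    return max(current, now + table_ttl)


now = datetime(2022, 1, 20, 12, 0, tzinfo=timezone.utc)
table_ttl = timedelta(minutes=5)

never_expires = ideal_on_update(None, now, table_ttl)
user_override_kept = ideal_on_update(now + timedelta(days=30), now, table_ttl)
stale_value_bumped = ideal_on_update(now - timedelta(days=1), now, table_ttl)
```

The middle case is the one the current defaults cannot express: a user-set expiration 30 days out survives the update instead of being reset to five minutes from now.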

### Applying or Altering TTL for a table
TTL can be configured using `ALTER TABLE`:

```sql
-- Introduce a TTL. No-op if there is already a TTL.
ALTER TABLE tbl SET TTL;
-- Introduce a TTL with the given option OR apply the given option to the TTL.
ALTER TABLE tbl SET TTL (expiration_column = 'other_ts_field');
-- Drop an option from the TTL.
ALTER TABLE tbl SET TTL (expiration_column = NULL);
```

Note that initially, any job-related options applied to the TTL will not take
effect whilst the job is running; the user must restart the job for the
settings to apply. A HINT will be displayed to the user if this is required.

### Dropping TTL for a table
TTL can be dropped using `ALTER TABLE`:

```sql
ALTER TABLE tbl DROP TTL
```

The DROP will not drop any TTL columns. However, re-adding a TTL to the table
will not succeed while the old TTL column still exists, as reusing it risks
expiring all the rows immediately. Instead, the user must either DROP the
existing TTL column or specify it using the `expiration_column` option. A hint
will be provided detailing this.

When a TTL table is created, we create a SCHEDULED JOB which defaults to
running daily.

## The Deletion Job

Row deletion is handled by a scheduled job in the jobs framework. The scheduled
job gets created (or modified) by the syntax detailed in the previous section.
The deletes are SQL-based, meaning they will show up on changefeeds as regular
deletes issued by the root user.

### Job Algorithm

On each job invocation, the deletion job performs the following:
* Establish a "start time".
* Paginate over all ranges for the table. Allocate these to a "range worker pool".
* Each range worker then performs the following:
  * Traverse the range in order of PRIMARY KEY:
    * Issue a SQL SELECT AS OF SYSTEM TIME '-30s' query on the PRIMARY KEYs, up to
      select_batch_size entries, in which the TTL column is less than the established
      start time (note we need to establish the start time before we run, or the job
      could potentially never finish).
    * Issue a SQL DELETE of up to delete_batch_size entries from the table using the
      PRIMARY KEYs selected above.
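
The steps above can be sketched in Python against an in-memory stand-in for the table. This is a simplified model under stated assumptions, not the job's implementation: the table is a dict from primary key to expiration, timestamps are plain integers, ranges are half-open key spans, and `AS OF SYSTEM TIME` and admission control are not modelled. The name `run_ttl_job` and its helpers are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_ttl_job(table, ranges, start_time, select_batch_size=500,
                delete_batch_size=100, range_concurrency=8):
    """Delete rows of `table` (dict: primary key -> expiration) that expired
    before `start_time`. `ranges` is a list of half-open (lo, hi) key spans.
    Returns the number of rows deleted."""
    lock = threading.Lock()  # guards the shared dict across range workers

    def select_expired(lo, hi, cursor):
        # Stand-in for the SELECT: expired keys in [lo, hi) past the cursor,
        # in primary key order, capped at select_batch_size.
        with lock:
            keys = sorted(
                k for k, expiration in table.items()
                if lo <= k < hi
                and (cursor is None or k > cursor)
                and expiration < start_time
            )
        return keys[:select_batch_size]

    def delete_rows(keys):
        # Stand-in for the DELETE by primary key.
        with lock:
            return sum(1 for k in keys if table.pop(k, None) is not None)

    def process_range(span):
        lo, hi = span
        cursor, deleted = None, 0
        while True:
            batch = select_expired(lo, hi, cursor)
            if not batch:
                return deleted
            # Delete the selected keys in chunks of delete_batch_size.
            for i in range(0, len(batch), delete_batch_size):
                deleted += delete_rows(batch[i:i + delete_batch_size])
            cursor = batch[-1]

    # The "range worker pool": at most range_concurrency ranges at a time.
    with ThreadPoolExecutor(max_workers=range_concurrency) as pool:
        return sum(pool.map(process_range, ranges))


# Even keys expired at time 5, odd keys expire at time 50; the job starts at 10.
table = {pk: (5 if pk % 2 == 0 else 50) for pk in range(20)}
deleted = run_ttl_job(table, ranges=[(0, 10), (10, 20)], start_time=10,
                      select_batch_size=4, delete_batch_size=2,
                      range_concurrency=2)
```

Fixing `start_time` before the traversal is what bounds the work: rows that expire while the job runs are left for the next invocation, so the job always terminates.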

### Admission Control
To ensure the deletion job does not affect foreground traffic, we plan on using
[admission control][admission control] on a SQL transaction at the lowest
possible priority (-256).

From previous experimentation, this largely reduces the number of knobs users
may need to configure, making this a "configure once, walk away" process which
does not need heavy monitoring.

### Rate Limiting
In an ideal world, admission control could reject traffic and force a backoff
and retry if it detects heavy load, making rate limiting unnecessary. However,
we will introduce a rate limiter option in case admission control is not
adequate, especially while it is newly exposed. We can revisit this at a
later point.
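
One common way to implement such a limit is a token bucket. The sketch below is an assumption about how a rows-per-second cap could behave, not the RFC's chosen design; the class `RowRateLimiter` is hypothetical, and the clock is passed in explicitly so the behavior is deterministic.

```python
class RowRateLimiter:
    """Token bucket capping deleted rows per second; a rate of 0 means no limit."""

    def __init__(self, max_rows_per_second, now=0.0):
        self.rate = max_rows_per_second
        self.tokens = float(max_rows_per_second)  # start with a full bucket
        self.last = now

    def delay_for(self, rows, now):
        """Return how many seconds to wait before deleting `rows` rows."""
        if self.rate == 0:
            return 0.0
        # Refill tokens for the time elapsed, capped at one second's worth.
        self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
        self.last = now
        # Spend tokens; a negative balance is debt the caller must wait out.
        self.tokens -= rows
        if self.tokens >= 0:
            return 0.0
        return -self.tokens / self.rate


limiter = RowRateLimiter(max_rows_per_second=100)
first = limiter.delay_for(100, now=0.0)   # bucket is full, no wait
second = limiter.delay_for(100, now=0.0)  # bucket is empty, wait a second
third = limiter.delay_for(50, now=1.0)    # debt just repaid, wait for 50 rows
unlimited = RowRateLimiter(0).delay_for(1_000_000, now=0.0)
```

A delete worker would sleep for the returned delay before issuing its next DELETE batch.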

### Controlling Delete Rate
Whilst our aim is for foreground traffic to be unaffected whilst the deletion
rate manages to clean up "enough", we expect developers will still need to
tinker to reach a suitable deletion rate for their workload.

As such, users have knobs to control the delete rate, including:
* how often the deletion job runs (controls the amount of "junk" data left)
* GC time (when tombstones are removed and space is therefore reclaimed)
* the size of the ranges, which has knock-on effects for `range_concurrency`.

As part of the `TTL (option = value, …)` syntax, users can control the following
options:

Option | Description
--- | ---
select_batch_size | How many rows to fetch from the range that have expired at a given time. Defaults to 500.
delete_batch_size | How many rows to delete at a time. Defaults to 100.
range_concurrency | How many concurrent ranges are being worked on at a time. Defaults to 8.
admission_control_priority | Priority of the admission control to use when deleting. Set to a high amount if deletion traffic is the priority.
select_as_of_system_time | AS OF SYSTEM TIME to use for the SELECT clause. Defaults to `-30s`. Proposed as a hidden option, as I'm not sure anyone will actually need it.
maximum_rows_deleted_per_second | Maximum number of rows to be deleted per second (acts as the rate limit). Defaults to 0 (signifying none)
pause | Pauses the TTL job from executing. Existing jobs will need to be paused using CANCEL JOB.
job_cron | Frequency the job runs.
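
For illustration, the options with documented defaults can be modelled as a small Python dataclass. The class `TTLOptions` and the helper `deletes_per_select` are hypothetical; `admission_control_priority`, `pause` and `job_cron` are left out because the table above does not state their defaults.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TTLOptions:
    """Defaults taken from the options table; not an actual CockroachDB type."""

    select_batch_size: int = 500
    delete_batch_size: int = 100
    range_concurrency: int = 8
    select_as_of_system_time: str = "-30s"
    maximum_rows_deleted_per_second: int = 0  # 0 signifies no rate limit

    def deletes_per_select(self) -> int:
        """DELETE statements needed to drain one full SELECT batch."""
        # Ceiling division without importing math.
        return -(-self.select_batch_size // self.delete_batch_size)


defaults = TTLOptions()
one_row_at_a_time = TTLOptions(delete_batch_size=1)
```

With the defaults, each SELECT batch of 500 keys is drained by 5 DELETE statements; dropping `delete_batch_size` to 1 turns that into 500 statements per batch.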

#### A note on delete_batch_size
Deleting rows from tables with secondary indexes lays more intents, hence
causing higher contention. In this case, there is an argument to delete
fewer rows at a time (or indeed, just one row at a time!) to avoid this. However,
testing showed deletes could never catch up in this case, as realistically only
a very small number of rows can be deleted this way. "100" was set as the
balance for now.

## Filtering out Expired Rows
Rows whose TTL has expired can optionally be removed from all SQL query
results. However, this is not planned for the first release of TTL.

## Foreign Keys
To avoid additional complexity in the initial implementation, foreign keys to or
from TTL tables will not be permitted. More thought has to be put into ON
DELETE/ON UPDATE CASCADEs before we can look at allowing this functionality.

## Observability
As mentioned above, we expect users will need to tinker with the deletion job to
ensure foreground traffic is not affected.

To monitor "lag rate", we will expose a metric similar to `estimated_row_count`
called `estimated_active_row_count`, which counts the number of rows which are
estimated to not have expired TTL. Users will be able to poll this to determine
the effectiveness of the TTL deletion job.
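
As a sketch of how the two estimates could be combined into a "lag" figure, the Python below subtracts one from the other. The function `expired_backlog` is hypothetical; it assumes both values are polled as plain integers and clamps the result at zero, since the two estimates may not be taken at the same moment.

```python
def expired_backlog(estimated_row_count, estimated_active_row_count):
    """Return (expired rows still present, their share of the table)."""
    backlog = max(estimated_row_count - estimated_active_row_count, 0)
    fraction = backlog / estimated_row_count if estimated_row_count else 0.0
    return backlog, fraction


backlog, fraction = expired_backlog(1000, 900)
```

A backlog that keeps growing between polls suggests the deletion job is not keeping up with the rate at which rows expire.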

We will also expose a number of metrics on the job to help gauge deletion
progress:
* selection and deletion rate
* selection and deletion latency

## Alternatives

### KV Approach to Row Expiration
There was a major competitor to the "SQL-based deletion approach" - having the
KV be the layer responsible for TTL. The existing GC job would be responsible
for automatically clearing these rows. The KV layer would not return any row
which has expired from the TTL, in theory also removing contention from intent
resolution thus reducing foreground performance issues.

The main downside of this approach is that it is not "SQL aware", resulting in
a few problems:
* Secondary index entries would need some reconciliation to be garbage
collected. As KV is not aware of which secondary index entries correspond to
an expired row in the primary index, this would likely mean blocking
secondary indexes on tables with row-level TTL.
* If we wanted CDC to work, tombstones would need to be written by the KV GC
process. This adds further complexity to CDC.

As row-level TTL is a "SQL level" feature, it makes sense that something in the
SQL layer would be most appropriate to handle it. See the [comparison
doc][comparison doc] for other observations.


### Alternative TTL columns
Another proposal for TTL columns was to have two columns:
* a `last_updated` column which is the last update timestamp of the row.
* a `ttl` column which is the interval since last_updated before the row
gets deleted.

An advantage here is that it is easy to specify an "override" - bump the
`ttl` interval value to be some high number, and `last_updated` still gets
updated.

We opted against this as we felt it was less obvious than an absolute
expiration timestamp, which better matches the long-term vision.
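
For comparison, the expiry check under this two-column alternative can be sketched in Python as follows. The function `is_expired` and the sample values are hypothetical and only illustrate how bumping the `ttl` interval acts as an override.

```python
from datetime import datetime, timedelta, timezone


def is_expired(last_updated: datetime, ttl: timedelta, now: datetime) -> bool:
    """A row expires once `ttl` has elapsed since `last_updated`."""
    return last_updated + ttl < now


now = datetime(2022, 1, 20, 12, 0, tzinfo=timezone.utc)
last_updated = now - timedelta(minutes=10)

# With a 5 minute ttl the row is already expired; bumping ttl to roughly
# ten years overrides that without touching last_updated.
default_row = is_expired(last_updated, timedelta(minutes=5), now)
overridden_row = is_expired(last_updated, timedelta(days=3650), now)
```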

### Alternative TTL syntax
An alternative SQL syntax for TTL would be to use the existing
`WITH (storage_parameter = value)` syntax PostgreSQL provides. However,
the experience of this is perceived as a little clunky in terms of enabling
or disabling TTL compared to the currently proposed syntax as it is unclear
that the table columns themselves may be modified.

## Future Improvements
* Special annotation on TTL deletes for CDC: we may determine in future that
DELETEs issued by a TTL job are a "special type" of delete that users can
process.
* Secondary index deletion improvements: in future, we can remove the
performance hit when deleting secondary indexes by doing a delete on the
secondary index entry before the delete on the primary index entry. This is
predicated on filtering out expired rows working.
* Admission control improvements: In future, we can explore applying
backpressure from admission control into the job to automatically tune
deletion rate to least affect foreground traffic.

## Open Questions
N/A

[#20239]: https://github.com/cockroachdb/cockroach/issues/20239
[#75189]: https://github.com/cockroachdb/cockroach/pull/75189
[cockroach TTL advice]: https://www.cockroachlabs.com/docs/stable/bulk-delete-data.html
[admission control]: https://github.com/cockroachdb/cockroach/blob/master/docs/tech-notes/admission_control.md
[comparison doc]: https://docs.google.com/document/d/1HkFg3S-k3s2PahPRQhTgUkCR4WIAtjkSNVylarMC-gY/edit#heading=h.o6cn5faoiokv
