kvserver: AddSSTable option to write at current timestamp #70422

Closed
erikgrinaker opened this issue Sep 20, 2021 · 2 comments · Fixed by #72085
Labels
A-kv-transactions Relating to MVCC and the transactional model. A-storage Relating to our storage engine (Pebble) on-disk storage. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-storage Storage Team

Comments

@erikgrinaker
Contributor

erikgrinaker commented Sep 20, 2021

As described in #69380, the fact that AddSSTable can write keys with timestamps far in the past is problematic, since it violates invariants that other parts of the system rely on, such as MVCC immutability and closed timestamps. To avoid this, AddSSTable should, in the typical case, write at a current (given) HLC timestamp. However, rewriting the keys with a new timestamp is costly, and the whole point of AddSSTable is to write bulk data efficiently. To get around this, we should introduce support for synthesizing the timestamp of the added keys at read time, and only rewrite them during Pebble compactions.
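To make the closed-timestamp problem concrete, here is a minimal Go sketch under simplified assumptions: a timestamp is a (WallTime, Logical) pair ordered like an HLC timestamp, and a range rejects any write at or below its closed timestamp. The `Timestamp` type and the `checkWrite` helper are hypothetical illustrations, not CockroachDB's actual API.

```go
package main

import (
	"errors"
	"fmt"
)

// Timestamp is a simplified HLC timestamp (hypothetical stand-in, not
// CockroachDB's hlc.Timestamp).
type Timestamp struct {
	WallTime int64 // nanoseconds
	Logical  int32
}

// LessEq reports whether t is at or below u, ordering by WallTime then Logical.
func (t Timestamp) LessEq(u Timestamp) bool {
	if t.WallTime != u.WallTime {
		return t.WallTime < u.WallTime
	}
	return t.Logical <= u.Logical
}

var errBelowClosed = errors.New("write at or below closed timestamp")

// checkWrite enforces the closed-timestamp invariant: once a timestamp has
// been closed, no new write may land at or below it.
func checkWrite(writeTS, closedTS Timestamp) error {
	if writeTS.LessEq(closedTS) {
		return errBelowClosed
	}
	return nil
}

func main() {
	closed := Timestamp{WallTime: 1000}
	historical := Timestamp{WallTime: 10} // e.g. an AddSSTable key dated in the past
	current := Timestamp{WallTime: 2000}  // e.g. a key written at the current HLC time
	fmt.Println(checkWrite(historical, closed)) // write at or below closed timestamp
	fmt.Println(checkWrite(current, closed))    // <nil>
}
```

A historical AddSSTable write trips this check, while the same data written at a current timestamp does not, which is why the proposal targets the write timestamp rather than the check.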

Implementation details are intentionally underspecified here, and need to be explored. See also the RFC in #69380 for further info. Functionality that relies on writing at historical timestamps will need to be updated separately with this new behavior.

Epic CRDB-2624

@erikgrinaker erikgrinaker added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-storage Relating to our storage engine (Pebble) on-disk storage. A-kv-transactions Relating to MVCC and the transactional model. T-storage Storage Team labels Sep 20, 2021
jbowens added a commit to jbowens/pebble that referenced this issue Oct 1, 2021
In CockroachDB, there exist processes that build and ingest sstables.
These sstables have timestamp-suffixed MVCC keys. Today, these keys'
timestamps are dated in the past and rewrite history. This rewriting
violates invariants in parts of the system. We would like to support
ingesting these sstables with recent, invariant-maintaining MVCC
timestamps. However, ingestion is used during bulk operations, and
rewriting large sstables' keys with a recent MVCC timestamp is
infeasibly expensive.

This change introduces a facility for constructing an sstable with a
placeholder suffix. When using this facility, a caller specifies a
SuffixPlaceholder write option. The caller is also required to configure
a Comparer that contains a non-nil Split function. When configured with
a suffix placeholder, the sstable writer requires that every suffixed
key (as determined by Split) carry exactly the provided
SuffixPlaceholder as its suffix. An sstable constructed in this fashion is still
incomplete and unable to be read unless explicitly permitted through the
AllowUnreplacedSuffix option.

When a caller would like to complete an sstable constructed with a
suffix placeholder, they may call ReplaceSuffix, providing the
original placeholder value and the replacement value. The placeholder
and replacement values are required to be equal lengths. ReplaceSuffix
performs an O(1) write to record the replacement value.

After a suffix replacement, the resulting sstable is complete, and
sstable readers may read the sstable. Readers will perform a block
transform to replace suffix placeholders with the replacement value on
the fly as blocks are loaded.

Informs cockroachdb/cockroach#70422.
jbowens added a commit to jbowens/pebble that referenced this issue Oct 4, 2021
In CockroachDB, there exist processes that build and ingest sstables.
These sstables have timestamp-suffixed MVCC keys. Today, these keys'
timestamps are dated in the past and rewrite history. This rewriting
violates invariants in parts of the system. We would like to support
ingesting these sstables with recent, invariant-maintaining MVCC
timestamps. However, ingestion is used during bulk operations, and
rewriting large sstables' keys with a recent MVCC timestamp is
infeasibly expensive.

This change introduces a facility for constructing an sstable with a
placeholder suffix. When using this facility, a caller specifies a
SuffixPlaceholder write option. The caller is also required to configure
a Comparer that contains a non-nil Split function. When configured with
a suffix placeholder, the sstable writer requires that all keys'
suffixes (as determined by Split) exactly match the provided
SuffixPlaceholder. An sstable constructed in this fashion is still
incomplete and unable to be read unless explicitly permitted through the
AllowUnreplacedSuffix option.
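The writer-side requirement can be sketched in Go as follows. This is a simplified model, not Pebble's sstable writer: `placeholderWriter`, `splitFn`, and `atSplit` are hypothetical names, and `atSplit` assumes, purely for illustration, that a key's suffix begins at its first `@`. As with a Comparer's Split, the split function returns the prefix length, so `key[n:]` is the suffix.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

// splitFn plays the role of a Comparer's Split function: it returns the
// length of the key's prefix, so key[n:] is the suffix. (Hypothetical type.)
type splitFn func(key []byte) int

// atSplit is an illustrative Split: the suffix starts at the first '@'.
func atSplit(key []byte) int {
	if i := bytes.IndexByte(key, '@'); i >= 0 {
		return i
	}
	return len(key)
}

// placeholderWriter is a hypothetical stand-in for an sstable writer that
// was configured with a SuffixPlaceholder write option.
type placeholderWriter struct {
	split       splitFn
	placeholder []byte
	keys        [][]byte
}

var errSuffixMismatch = errors.New("key suffix does not match SuffixPlaceholder")

// Add accepts a key only if its suffix, as determined by split, is exactly
// the placeholder; keys without a suffix have an empty suffix and are rejected.
func (w *placeholderWriter) Add(key []byte) error {
	if w.split == nil {
		return errors.New("a non-nil Split function is required")
	}
	if !bytes.Equal(key[w.split(key):], w.placeholder) {
		return errSuffixMismatch
	}
	w.keys = append(w.keys, append([]byte(nil), key...))
	return nil
}

func main() {
	w := &placeholderWriter{split: atSplit, placeholder: []byte("@0000")}
	fmt.Println(w.Add([]byte("apple@0000")))  // <nil>
	fmt.Println(w.Add([]byte("banana@0042"))) // key suffix does not match SuffixPlaceholder
}
```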

When a caller would like to complete an sstable constructed with a
suffix placeholder, they may call ReplaceSuffix, providing the
original placeholder value and the replacement value. The placeholder
and replacement values are required to be equal lengths. ReplaceSuffix
performs an O(1) write to record the replacement value.
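A minimal Go model of the ReplaceSuffix contract might look like the sketch below. It assumes, for illustration only, that the replacement is recorded in a small metadata record (`sstMeta`, a hypothetical name) rather than by touching data blocks, which is what makes the cost independent of the number of keys. The `readable` helper is likewise hypothetical and models the AllowUnreplacedSuffix gate.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

// sstMeta is a hypothetical model of the table metadata involved: the
// placeholder chosen at write time and the replacement recorded later.
type sstMeta struct {
	placeholder []byte
	replacement []byte // nil until ReplaceSuffix succeeds
}

// ReplaceSuffix records the replacement value. It never touches data blocks,
// so its cost does not depend on how many keys the table holds.
func (m *sstMeta) ReplaceSuffix(placeholder, replacement []byte) error {
	if !bytes.Equal(placeholder, m.placeholder) {
		return errors.New("placeholder does not match the one used at write time")
	}
	if len(replacement) != len(placeholder) {
		return errors.New("placeholder and replacement must have equal lengths")
	}
	m.replacement = append([]byte(nil), replacement...)
	return nil
}

// readable reports whether a reader may open the table: either the suffix
// has been replaced, or the caller explicitly allows unreplaced suffixes.
func (m *sstMeta) readable(allowUnreplacedSuffix bool) bool {
	return m.replacement != nil || allowUnreplacedSuffix
}

func main() {
	m := &sstMeta{placeholder: []byte("@0000")}
	fmt.Println(m.readable(false))                                  // false
	fmt.Println(m.ReplaceSuffix([]byte("@0000"), []byte("@12345"))) // placeholder and replacement must have equal lengths
	fmt.Println(m.ReplaceSuffix([]byte("@0000"), []byte("@1234")))  // <nil>
	fmt.Println(m.readable(false))                                  // true
}
```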

After a suffix replacement, the resulting sstable is complete, and
sstable readers may read the sstable. Readers detect the sstable
property and apply a block transform to replace suffix placeholders with
the replacement value on the fly as blocks are loaded.
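The read-side block transform can be sketched as a pure function over a block's keys. This is a deliberate simplification: real sstable blocks are prefix-compressed, so an actual transform would have to decode entries, whereas this hypothetical `transformBlock` only models the suffix swap. It again assumes the illustrative `@`-based split.

```go
package main

import (
	"bytes"
	"fmt"
)

// atSplit is an illustrative Split: the suffix starts at the first '@'.
func atSplit(key []byte) int {
	if i := bytes.IndexByte(key, '@'); i >= 0 {
		return i
	}
	return len(key)
}

// transformBlock is a hypothetical, simplified block transform: it returns a
// copy of the block's keys with each placeholder suffix overwritten by the
// replacement, leaving the input keys untouched.
func transformBlock(keys [][]byte, split func([]byte) int, placeholder, replacement []byte) [][]byte {
	out := make([][]byte, len(keys))
	for i, k := range keys {
		nk := append([]byte(nil), k...) // copy so the input block is not mutated
		n := split(nk)
		if bytes.Equal(nk[n:], placeholder) {
			// Equal lengths let us overwrite the suffix in place without
			// shifting bytes or changing the key's length.
			copy(nk[n:], replacement)
		}
		out[i] = nk
	}
	return out
}

func main() {
	block := [][]byte{[]byte("a@0000"), []byte("b@0000")}
	for _, k := range transformBlock(block, atSplit, []byte("@0000"), []byte("@1234")) {
		fmt.Println(string(k))
	}
	// Prints:
	// a@1234
	// b@1234
}
```

Because every key in such a table carries the same placeholder, swapping it for one common replacement leaves the keys' relative order unchanged (they still sort by prefix), so the transformed block remains correctly ordered.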

Informs cockroachdb/cockroach#70422.
jbowens added a commit to jbowens/pebble that referenced this issue Oct 13, 2021
jbowens added a commit to jbowens/pebble that referenced this issue Oct 13, 2021
jbowens added a commit to jbowens/pebble that referenced this issue Oct 13, 2021
jbowens added a commit to jbowens/pebble that referenced this issue Oct 13, 2021
jbowens added a commit to jbowens/pebble that referenced this issue Oct 19, 2021
@erikgrinaker
Contributor Author

erikgrinaker commented Oct 19, 2021

We'll need to figure out how this should interact with concurrent transactions and locks/intents. See the relevant discussion in a comment on #71676. Also related to #71697.

@erikgrinaker erikgrinaker changed the title storage: MVCC timestamp synthesis for AddSSTable storage: AddSSTable option to write at current timestamp Oct 26, 2021
@erikgrinaker
Contributor Author

The read-time timestamp synthesis ran into challenges because of the variable-length MVCC timestamp encoding; see the review on cockroachdb/pebble#1314. We're exploring rewriting the SST timestamps during AddSSTable request evaluation instead (prior to Raft and Pebble ingestion).
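To illustrate why a variable-length timestamp encoding conflicts with the equal-length requirement of ReplaceSuffix in the Pebble commit above, here is a Go sketch built on a simplified, hypothetical suffix layout: 8 bytes of wall time, 4 more bytes of logical time only when the logical component is non-zero, then one trailing length byte. Treat this layout as an assumption modeled loosely on MVCC keys, not as CockroachDB's exact encoding; `encodeTSSuffix` and `canReplaceInPlace` are illustrative names.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodeTSSuffix is a simplified, hypothetical model of a variable-length
// MVCC timestamp suffix: wall time (8 bytes), logical time (4 bytes, only
// if non-zero), then a trailing length byte.
func encodeTSSuffix(wallTime int64, logical int32) []byte {
	buf := make([]byte, 8, 13)
	binary.BigEndian.PutUint64(buf, uint64(wallTime))
	if logical != 0 {
		var l [4]byte
		binary.BigEndian.PutUint32(l[:], uint32(logical))
		buf = append(buf, l[:]...)
	}
	return append(buf, byte(len(buf)+1))
}

// canReplaceInPlace mirrors the equal-length requirement: a placeholder can
// only be swapped for a replacement of exactly the same length.
func canReplaceInPlace(placeholder, replacement []byte) bool {
	return len(placeholder) == len(replacement)
}

func main() {
	placeholder := encodeTSSuffix(1, 0)               // wall time only
	current := encodeTSSuffix(1634567890000000000, 3) // wall time + logical
	fmt.Println(len(placeholder), len(current))          // 9 13
	fmt.Println(canReplaceInPlace(placeholder, current)) // false
}
```

Under this model, the placeholder length would have to be fixed before the final timestamp is known, yet whether that timestamp has a logical component (and thus its encoded length) is only decided later.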

@erikgrinaker erikgrinaker changed the title storage: AddSSTable option to write at current timestamp kvserver: AddSSTable option to write at current timestamp Feb 14, 2022