First draft of key layout using RocksDB #116

Merged (2 commits) on Sep 20, 2022
docs/storage.md: 98 changes (42 additions & 56 deletions)
# Storage

The storage portion will be responsible for data replication, sharding, transactions, and physical data storage.

## Storage API

The exact storage API is still TBD. A first-pass API might look something like this:

```rust
/// Column indices pointing to the columns making up the primary key.
struct PrimaryKeyIndices<'a>(&'a [usize]);

/// A vector of values making up a row's primary key.
struct PrimaryKey(Vec<Value>);

type TableId = String;

pub trait Storage {
    async fn create_table(&self, table: TableId) -> Result<()>;
    async fn drop_table(&self, table: TableId) -> Result<()>;

    /// Insert a row for some primary key.
    async fn insert(&self, table: TableId, pk: PrimaryKeyIndices, row: Row) -> Result<()>;

    /// Get a single row.
    async fn get(&self, table: TableId, pk: PrimaryKey) -> Result<Option<Row>>;

    /// Delete a row.
    async fn delete(&self, table: TableId, pk: PrimaryKey) -> Result<()>;

    /// Scan a table, with some optional arguments.
    async fn scan(
        &self,
        table: TableId,
        begin: Option<PrimaryKey>,
        end: Option<PrimaryKey>,
        filter: Option<ScalarExpr>,
        projection: Option<Vec<ScalarExpr>>,
    ) -> Result<Stream<DataFrame>>;
}
```

See the [DataFrame](https://github.com/GlareDB/glaredb/blob/3577409682122ce046709ae93512499da7253fb7/crates/lemur/src/repr/df/mod.rs), [Value](https://github.com/GlareDB/glaredb/blob/3577409682122ce046709ae93512499da7253fb7/crates/lemur/src/repr/value.rs), and [Scalar](https://github.com/GlareDB/glaredb/blob/3577409682122ce046709ae93512499da7253fb7/crates/lemur/src/repr/expr/scalar.rs) modules for the relevant definitions of `DataFrame`, `Row`, and `ScalarExpr` respectively. Note that this API lacks a few things, including updating tables and providing schema information during table creation. The primary thing to point out is that the storage layer should use the same data representation as the query engine, so that projections and filters (and, in the future, other things like aggregates) can be pushed down as much as possible.
Requirements:
- Use an Arrow-like format as much as possible for easy interoperability with the query engine (see the sketch below)
- Build with future compression in mind
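
To make the Arrow-like requirement concrete, here is a minimal sketch using the `arrow` crate. The schema and column names are invented for illustration, and the project's own columnar type (e.g. lemur's `DataFrame`) may end up looking different; this only shows the kind of columnar batch the storage layer would hand back to the query engine.

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int64Array, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

/// Build a small columnar batch. Keeping stored data in (or close to) this
/// layout makes scan results cheap to hand to the query engine and leaves
/// room for column-level compression later.
fn example_batch() -> Result<RecordBatch, ArrowError> {
    let schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int64, false),
        Field::new("name", DataType::Utf8, true),
    ]));
    RecordBatch::try_new(
        schema,
        vec![
            Arc::new(Int64Array::from(vec![1, 2, 3])) as ArrayRef,
            Arc::new(StringArray::from(vec![Some("a"), None, Some("c")])) as ArrayRef,
        ],
    )
}
```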

## Physical Storage

We should initially use RocksDB for getting to a prototype quickly. The tradeoff here is we might lose out on analytical performance, and lose fine-grained control of reads/writes to and from the disk.

We will use two column families: the default column family for row and index data, and a second column family for transactional information.
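
As a minimal sketch, opening RocksDB with both column families via the `rocksdb` crate could look like the following; the path handling and the `txn_info` name are placeholders, not settled choices.

```rust
use rocksdb::{Options, DB};

/// Open (or create) the database with the two column families: "default"
/// for row and index data, and a hypothetical "txn_info" family for
/// transactional information.
fn open_storage(path: &str) -> Result<DB, rocksdb::Error> {
    let mut opts = Options::default();
    opts.create_if_missing(true);
    opts.create_missing_column_families(true);
    DB::open_cf(&opts, path, ["default", "txn_info"])
}
```

Writes to the transactional column family would then go through `db.cf_handle("txn_info")` and the `*_cf` variants of `put`/`get`.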

Keys should look something like the following:

Default column family:
```
<tenant_id>/<table_id>/<index_id>/<index_value_1>/<index_value_2>/.../<start transaction timestamp>
```

Breaking it down:
- **tenant_id (optional)**: The tenant identifier, enabling multiple cloud users/versions of a table (dev, staging, production) to share storage.
- **table_id**: The table identifier. This should remain constant across the
lifetime of the table, even if the table name is changed.
- **index_id**: The identifier of the index. `0` indicates the primary index; other values indicate secondary indexes.
- **index_value_n**: The column value corresponding to the column in the index, in serialized form.
- **timestamp**: The timestamp for the start of the transaction (a hybrid logical clock).
- The end timestamp lives in a separate column family (transactional information).

- For the primary index, the value will be the serialized row (for now in the `datafusion_row::layout::RowType` format).
- For the secondary indexes, the value will be the composite primary key (just the primary index values).
- The rest of the key is constructed from already-known information; the timestamp comes from the current transaction.
- Key comparison is also important. Ideally, start transaction timestamps would sort in descending order, so that we see the most recent provisional write and latest value first and can decide whether to use it, while still allowing prefix search to work appropriately (see the encoding sketch below).
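
A minimal sketch of how such a key could be encoded, under assumed details (fixed-width big-endian ids, `/` separators for index values, and an inverted start timestamp so newer versions sort first under RocksDB's default bytewise comparator); none of these choices are settled.

```rust
/// Encode a key for the default column family:
/// <tenant_id>/<table_id>/<index_id>/<index_value_1>/.../<start transaction timestamp>
///
/// Big-endian ids keep lexicographic byte order equal to numeric order, and
/// `u64::MAX - start_ts` makes newer start timestamps sort first within a
/// prefix, so the most recent provisional write is seen first while prefix
/// scans still work.
fn encode_key(
    tenant_id: u64,
    table_id: u64,
    index_id: u32,
    index_values: &[Vec<u8>], // column values, already serialized
    start_ts: u64,            // start transaction timestamp (HLC)
) -> Vec<u8> {
    let mut key = Vec::new();
    key.extend_from_slice(&tenant_id.to_be_bytes());
    key.extend_from_slice(&table_id.to_be_bytes());
    key.extend_from_slice(&index_id.to_be_bytes());
    for value in index_values {
        key.push(b'/'); // a real encoding must escape this byte inside values
        key.extend_from_slice(value);
    }
    key.push(b'/');
    key.extend_from_slice(&(u64::MAX - start_ts).to_be_bytes());
    key
}
```

Inverting the timestamp is only one option; a custom comparator (e.g. via `Options::set_comparator` in the `rocksdb` crate) could instead order the timestamp component descending while leaving the stored bytes untouched.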

Transactional information column family:
```
<tenant_id>/<table_id>/<index_id>/<index_value_1>/<index_value_2>/.../<start transaction timestamp>
```
- Same key as the default column family.
- The value will be the commit status and, later, the timestamp at which the mutating operations done in the transaction become visible (i.e. committed); see the sketch below.
- While the transaction is still in progress there is no timestamp.
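
A sketch of what that value could look like; the variant names, the explicit aborted state, and the plain `u64` commit timestamp are assumptions layered on top of the description above.

```rust
/// Commit status stored as the value in the transactional information
/// column family, keyed by the same key as the default column family.
enum TxnStatus {
    /// Transaction still running: no commit timestamp yet.
    InProgress,
    /// Committed: mutations under this start timestamp become visible at
    /// `commit_ts`.
    Committed { commit_ts: u64 },
    /// Rolled back: mutations under this start timestamp should be ignored
    /// and eventually cleaned up.
    Aborted,
}
```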

## Future considerations
- Is there a need for separate LSM trees per table, or per collection of tables within one logical database?
- Should we further partition tables into separate LSM trees?
- Cleanup of old versions (configurable; keep everything as the default for now, though the default should eventually be shorter so this can scale with use in the cloud product)
  - To start we can have two configs:
    - Keep everything
    - Keep a week (or the max transaction length, which can default to a week)
- Should index values in the key be stored in serialized form?
- Should column values be stored concatenated per key?
  - Alternatively, store in an NSM-ish format with the column id before the timestamp/HLC
  - Alternatively, store in a DSM-ish format with the column id before index_id 0
- Should secondary indexes be in a separate column family?
- Investigate use of manual compaction to upload sst files to object store
- Restoring db from object store sst files