# Storage

The storage portion will be responsible for data replication, sharding, transactions, and physical data storage.

## Storage API

A first-pass API for the storage layer might look something like this:

``` rust
/// Column indices pointing to the columns making up the primary key.
struct PrimaryKeyIndices<'a>(&'a [usize]);

/// A vector of values making up a row's primary key.
struct PrimaryKey(Vec<Value>);

type TableId = String;

pub trait Storage {
    async fn create_table(&self, table: TableId) -> Result<()>;
    async fn drop_table(&self, table: TableId) -> Result<()>;

    /// Insert a row for some primary key.
    async fn insert(&self, table: TableId, pk: PrimaryKeyIndices, row: Row) -> Result<()>;

    /// Get a single row.
    async fn get(&self, table: TableId, pk: PrimaryKey) -> Result<Option<Row>>;

    /// Delete a row.
    async fn delete(&self, table: TableId, pk: PrimaryKey) -> Result<()>;

    /// Scan a table, with some optional arguments.
    async fn scan(
        &self,
        table: TableId,
        begin: Option<PrimaryKey>,
        end: Option<PrimaryKey>,
        filter: Option<ScalarExpr>,
        projection: Option<Vec<ScalarExpr>>,
    ) -> Result<Stream<DataFrame>>;
}
```

See the [DataFrame](https://github.com/GlareDB/glaredb/blob/3577409682122ce046709ae93512499da7253fb7/crates/lemur/src/repr/df/mod.rs), [Value](https://github.com/GlareDB/glaredb/blob/3577409682122ce046709ae93512499da7253fb7/crates/lemur/src/repr/value.rs), and [Scalar](https://github.com/GlareDB/glaredb/blob/3577409682122ce046709ae93512499da7253fb7/crates/lemur/src/repr/expr/scalar.rs) modules for the relevant definitions of `DataFrame`, `Row`, and `ScalarExpr` respectively. Note that this API still lacks a few things, including updating tables and providing schema information during table creation. The primary thing to point out is that the storage layer should use the same data representation as the query engine, so that projections and filters (and other things like aggregates in the future) can be pushed down as much as possible.
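
As a concrete illustration, here is a minimal, runnable sketch of how an implementation might derive a `PrimaryKey` from a `Row` using `PrimaryKeyIndices` before encoding a storage key. The `Value` and `Row` stand-ins below are simplified assumptions; the real definitions live in the `lemur` modules linked above.

``` rust
/// Simplified stand-ins for the `lemur` types; assumptions for this sketch.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Int64(i64),
    Utf8(String),
}

type Row = Vec<Value>;

struct PrimaryKeyIndices<'a>(&'a [usize]);
struct PrimaryKey(Vec<Value>);

/// Pick the key columns out of a full row, as `insert` would need to do.
fn primary_key_of(indices: &PrimaryKeyIndices, row: &Row) -> PrimaryKey {
    PrimaryKey(indices.0.iter().map(|&i| row[i].clone()).collect())
}

fn main() {
    let row = vec![Value::Int64(1), Value::Utf8("alice".into())];
    let pk = primary_key_of(&PrimaryKeyIndices(&[0]), &row);
    assert_eq!(pk.0, vec![Value::Int64(1)]);
}
```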
Requirements:
- Use an Arrow-like format as much as possible for easy interoperability with the query engine
- Build with future compression in mind

## Physical Storage

We should initially use RocksDB to get to a prototype quickly. The tradeoff is that we may lose out on analytical performance and lose fine-grained control of reads and writes to and from disk.

We will use two column families: the default column family for row and index data, and a second for transactional information. A sketch of opening them follows.
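
Below is a minimal sketch of opening the database with both column families using the `rocksdb` crate. The name `txn_info` for the transactional column family is an assumption (RocksDB always has a `default` column family).

``` rust
use rocksdb::{ColumnFamilyDescriptor, DB, Options};

/// Open (or create) the database with the two column families described
/// in this document. `txn_info` is a placeholder name for the
/// transactional information column family.
fn open_db(path: &str) -> Result<DB, rocksdb::Error> {
    let mut opts = Options::default();
    opts.create_if_missing(true);
    opts.create_missing_column_families(true);

    let cfs = vec![
        ColumnFamilyDescriptor::new("default", Options::default()),
        ColumnFamilyDescriptor::new("txn_info", Options::default()),
    ];
    DB::open_cf_descriptors(&opts, path, cfs)
}
```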

Keys should look something like the following:

Default column family:
```
<tenant_id>/<table_id>/<index_id>/<index_value_1>/<index_value_2>/.../<start transaction timestamp>
```

Breaking it down:
- **tenant_id (optional)**: The tenant identifier, enabling multiple cloud users or versions of a table (dev, staging, production) to share storage.
- **table_id**: The table identifier. This should remain constant across the
lifetime of the table, even if the table name is changed.
- **index_id**: The identifier of the index. `0` indicates the primary index; other values indicate secondary indexes.
- **index_value_n**: The column value corresponding to the column in the index, in serialized form.
- **timestamp**: The timestamp for the start of the transaction (a hybrid logical clock). The end timestamp is stored in the separate transactional information column family described below.

- For the primary index, the value will be the serialized row (for now in the `datafusion_row::layout::RowType` format).
- For secondary indexes, the value will be the composite primary key (just the primary index values).
- The rest of the key is constructed from already-known information; the timestamp comes from the current transaction.
- Comparison of keys is also important. Ideally, start transaction timestamps would sort in descending order, so we see the most recent provisional write or latest value first and can decide whether to use it, while still allowing prefix search to work appropriately (see the sketch below).
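
Here is a minimal, runnable sketch of one way to encode such keys. It assumes numeric ids, pre-serialized index values, and a `u64` HLC timestamp, and it elides the `/` separators and any framing of variable-length values. Storing the bitwise-inverted timestamp makes RocksDB's default bytewise-ascending comparator return the newest version first while keeping the key prefix intact for prefix scans.

``` rust
/// Encode a storage key. Big-endian encoding preserves numeric ordering
/// under bytewise comparison.
fn encode_key(
    tenant_id: u64,
    table_id: u64,
    index_id: u32,
    index_values: &[Vec<u8>], // already-serialized column values
    start_ts: u64,            // hybrid logical clock timestamp
) -> Vec<u8> {
    let mut key = Vec::new();
    key.extend_from_slice(&tenant_id.to_be_bytes());
    key.extend_from_slice(&table_id.to_be_bytes());
    key.extend_from_slice(&index_id.to_be_bytes());
    for v in index_values {
        key.extend_from_slice(v);
    }
    // Invert the timestamp so newer (larger) timestamps sort first.
    key.extend_from_slice(&(!start_ts).to_be_bytes());
    key
}

fn main() {
    let newer = encode_key(1, 42, 0, &[b"alice".to_vec()], 200);
    let older = encode_key(1, 42, 0, &[b"alice".to_vec()], 100);
    assert!(newer < older); // the most recent version sorts first
}
```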

Transactional information column family:
```
<tenant_id>/<table_id>/<index_id>/<index_value_1>/<index_value_2>/.../<start transaction timestamp>
```
- Same key as the default column family.
- The value will be the commit status and, later, the timestamp at which the mutable operations done in the transaction become visible (i.e. committed).
- While the transaction is still in progress, there is no timestamp (see the sketch below).
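
A minimal sketch of what this value might look like, under the semantics above; the exact representation (including whether aborts are recorded) is an assumption.

``` rust
/// Commit status stored in the transactional information column family.
enum TxnStatus {
    /// Transaction still in progress: no commit timestamp yet.
    InProgress,
    /// Committed: writes become visible at this HLC timestamp.
    Committed { visible_at: u64 },
    /// Aborted (an assumption): provisional writes should be ignored.
    Aborted,
}

/// Decide whether a versioned write is visible to a read at `read_ts`.
fn is_visible(status: &TxnStatus, read_ts: u64) -> bool {
    match status {
        TxnStatus::InProgress | TxnStatus::Aborted => false,
        TxnStatus::Committed { visible_at } => *visible_at <= read_ts,
    }
}

fn main() {
    assert!(is_visible(&TxnStatus::Committed { visible_at: 10 }, 12));
    assert!(!is_visible(&TxnStatus::InProgress, 12));
}
```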

## Future considerations
- Is there a need for separate LSM trees per table, or per collection of tables within one logical database?
- Should we further partition tables into separate LSM trees?
- Cleanup of old versions (configurable; keep everything by default for now, though the default should eventually be shorter so this can scale with use in the cloud product)
- Should index values in the key be in serialized form?
- Should column values be stored concatenated per key?
  - Alternatively, store in an NSM-ish format with the column id before the timestamp/HLC
  - Alternatively, store in a DSM-ish format with the column id before index_id 0
- Should secondary indexes be in a separate column family?
