From d5fd9131e2e1aa1bc3aeac03da40ac9cad6cd175 Mon Sep 17 00:00:00 2001
From: Rustom Shareef <rustom@glaredb.com>
Date: Mon, 19 Sep 2022 17:35:55 -0400
Subject: [PATCH] First draft of key layout using RocksDB

---
 docs/storage.md | 93 ++++++++++++++++++++-----------------------------
 1 file changed, 37 insertions(+), 56 deletions(-)

diff --git a/docs/storage.md b/docs/storage.md
index 80ea64674..991cc2b4a 100644
--- a/docs/storage.md
+++ b/docs/storage.md
@@ -1,73 +1,54 @@
 # Storage
 
-The storage portion will be responsible for data replication, sharding, transactions, and
-physical data storage.
+The storage portion will be responsible for data replication, sharding, transactions, and physical data storage.
 
 ## Storage API
 
-A first-pass API for the storage layer might look something like this:
+TBD
 
-``` rust
-/// Column indices pointing to the columns making up the primary key.
-struct PrimaryKeyIndices<'a>(&[usize]);
-
-/// A vector of value making up a row's primary key.
-struct PrimaryKey(Vec<Value>);
-
-type TableId = String;
-
-pub trait Storage {
-    async fn create_table(&self, table: TableId) -> Result<()>;
-    async fn drop_table(&self, table: TableId) -> Result<()>;
-
-    /// Insert a row for some primary key.
-    async fn insert(&self, table: TableId, pk: PrimaryKeyIndices, row: Row) -> Result<()>;
-
-    /// Get a single row.
-    async fn get(&self, table: TableId, pk: PrimaryKey) -> Result<Option<Row>>;
-
-    /// Delete a row.
-    async fn delete(&self, table: TableId, pk: PrimaryKey) -> Result<()>;
-
-    /// Scan a table, with some optional arguments.
-    async fn scan(
-        &self,
-        table: TableId,
-        begin: Option<PrimaryKey>,
-        end: Option<PrimaryKey>,
-        filter: Option<ScalarExpr>,
-        projection: Option<Vec<ScalarExpr>>
-    ) -> Result<Stream<DataFrame>>;
-}
-```
-
-See the [DataFrame](https://github.com/GlareDB/glaredb/blob/3577409682122ce046709ae93512499da7253fb7/crates/lemur/src/repr/df/mod.rs), [Value](https://github.com/GlareDB/glaredb/blob/3577409682122ce046709ae93512499da7253fb7/crates/lemur/src/repr/value.rs), and [Scalar](https://github.com/GlareDB/glaredb/blob/3577409682122ce046709ae93512499da7253fb7/crates/lemur/src/repr/expr/scalar.rs) modules for the relevant
-definitions for `DataFrame`, `Row`, and `ScalarExpr` respectively. Note that
-this API lacks a few things, including updating tables, and providing schema
-information during table creates. The primary thing to point out with this API
-is that it should use that same data representation as the query engine to
-enable pushing down projections and filters as a much as possible (and other
-things like aggregates in the future).
+Requirements:
+- Use Arrow like format as much as possible for easy
+- Build with future compression in mind
 
 ## Physical Storage
 
-We should initially use RocksDB for getting to a prototype quickly. The
-tradeoff here is we might lose out on analytical performance, and lose
-fine-grained control of reads/writes to and from the disk.
+We should initially use RocksDB for getting to a prototype quickly.  The tradeoff here is we might lose out on analytical performance, and lose fine-grained control of reads/writes to and from the disk.
+
+We shall be using 2 column families.
 
 Keys should look something like the following:
 
+Default column family:
 ```
-<table_id>/<index_id>/<index_value_1>/<index_value_2>/.../<timestamp>
+<tenant_id><table_id>/<index_id>/<index_value_1>/<index_value_2>/.../<start transaction timestamp>
 ```
-
-Breaking it down:
+- **tenant_id(optional)**: The tenant identifier to enable having multiple cloud users/versions of a table (dev, staging, production) share storage
 - **table_id**: The table identifier. This should remain constant across the
   lifetime of the table, even if the table name is changed.
-- **index_id**: The identifier of the index. `0` indicates the primary index,
-  and the value for a primary index should be the serialized row. Other index
-  IDs correspond to additional secondary indexes. Values in such cases should be
-  the composite primary key.
-- **index_value_n**: The column value corresponding to the column in the index.
-- **timestamp**: The transaction timestamp.
+- **index_id**: The identifier of the index. `0` indicates the primary index, other values indicate the secondary indexes.
+- **index_value_n**: The column value corresponding to the column in the index. This will be in serialized form
+- **timestamp**: The timestamp for the start of the transaction (hybrid logical clock)
+- End time stamp in separate column family (Transactional information)
 
+- For the primary index the value will be the serialized row (for now in the datafusion_row::layout::RowType format)
+- For the secondary indexes the values will be should the composite primary key (just the primary index values)
+    - The rest of the key is constructed from already known information. The timestamp from the current transaction
+- Comparison for the keys is also important. Ideally we would want each start transaction timestamps to be descending. This way we see the most recent provisional write and latest value first and make a determination whether we should be using that. While still allowing prefix search to work appropriately
+
+Transactional information column family:
+```
+<tenant_id>/<table_id>/<index_id>/<index_value_1>/<index_value_2>/.../<start transaction timestamp>
+```
+- Same key as default column family
+- Value will be the commit status and later the timestamp at which point the mutable operations done in the transaction are visible (i.e. committed)
+- When the transaction is still in progress there is is no timestamp.
+
+## Future considerations
+- Is there a need to consider separate LSM trees per table or collection of tables within one logical database
+    - Should we further partition tables into separate LSM trees
+- Cleanup of old versions (configurable to keep everything, default for now, though default should be shorter so this can scale with use in cloud product)
+- Should index values in the key be in serialized form
+- Should column values be stored concatenated per key
+    - Alternatively store in NSMish format with column id before timestamp/HCL
+    - Alternatively store in DSMish format with column id before index_id 0
+- Should secondary indexes be in a separate column family