near · frol · Mar 21, 2023 · Sep 29, 2022 · Sep 29, 2022 · Oct 2, 2022
@@ -25,6 +25,7 @@ Changes to the protocol specification and standards are called NEAR Enhancement
 |[0245](https://github.com/near/NEPs/blob/master/neps/nep-0245.md)   | Multi Token Standard | @zcstarr @riqi @jriemann @marcos.sun | Review |
 |[0297](https://github.com/near/NEPs/blob/master/neps/nep-0297.md)   | Events Standard | @telezhnaya | Final |
 |[0330](https://github.com/near/NEPs/blob/master/neps/nep-0330.md)   | Source Metadata | @BenKurrek | Review |
+|[0399](https://github.com/near/NEPs/blob/master/neps/nep-0399.md)   | Flat Storage | @AleksandrLogunov @MinZhang | Draft |
 
 
 

@@ -0,0 +1,340 @@
+---
+NEP: 0399
+Title: Flat Storage
+Author: Aleksandr Logunov <[email protected]> Min Zhang <[email protected]>
+DiscussionsTo: https://github.com/nearprotocol/neps/pull/0399
+Status: Draft
+Type: Protocol Track
+Category: Storage
+Created: 07-Sep-2022
+---
+
+## Summary
+
+Currently, the state of blockchain is stored in our storage in the format of persistent merkelized tries. 
+Although the trie structure is needed to compute state roots and prove the validity of states, it is expensive 
+to read from the trie structure because it requires a traversal from the trie root to the leaf that contains the key
+value pair, which could mean 2 * key_length of disk access in the worst case. 
+
+In addition, we charge receipts by the number of trie nodes they touched (TTN cost),
+which is both confusing to developers and unpredictable. This NEP proposes the idea of FlatStorage,
+which stores a flattened key/value pairs of the current state on disk. This way, any storage read requires at most
+2 disk reads. As a result, we can make storage reads faster, decrease the fees, and get rid of the TTN
+cost.
+
+## Motivation
+
+The motivation of this proposal is to increase performance of storage reads, reduce storage read costs and 
+simplifies how storage fees are charged by getting rid of TTN cost.
+
+## Rationale and alternatives
+
+Q: Why is this design the best in the space of possible designs?
+
+A: There are other ideas for how to improve storage performance, such as using 
+  other database instead of rocksdb, or changing the representation of states
+  to achieve locality of data in the same account. Considering that these ideas 
+  will likely require much more work than FlatStorage, FlatStorage is a good investment
+  of our effort to achieve better storage performances. In addition, the improvement 
+  from FlatStorage can be combined with the improvement brought by these other ideas,
+  so the implementation of FlatStorage won't be rendered obsolete in the future.
+
+Q: What other designs have been considered and what is the rationale for not choosing them?
+
+A: Alternatively, we can still get rid of TTN cost by increasing the base fees for storage reads and writes. However, 
+  this could require increasing the fees by quite a lot, which could end up breaking many contracts.
+Q: What is the impact of not doing this?
+
+A: Storage reads will remain inefficiently implemented and cost more than it could be.
+
+## Specification
+The key idea of FlatStorage is to store a direct mapping from trie keys to values on disk. 
+Here the values of this mapping can be either the value corresponding to the trie key itself, 
+or the value ref, a hash that points to the address of the value. If the value itself is stored, 
+only one disk read is needed to look up a value from flat storage, otherwise two disk reads if the value 
+ref is stored. We will discuss more in the following section for whether we use values or value refs. 
+For the purpose of high level discussion, it suffices to say that with FlatStorage, 
+at most two disk reads are needed to perform a storage read.
+
+The simple design above won't work because there could be forks in the chain. In the following case, FlatStorage 
+must support key value lookups for states of the blocks on both forks.
+```
+           Block B1 - Block B2 - ...
+        / 
+block A 
+        \  Block C1 - Block C2 - ...
+```
+
+The handling of forks will be the main consideration of the following design. More specifically,
+the design should satisfy the following requirements,
+1) It should support concurrent block processing. Blocks on different forks are processed 
+   concurrently in our client code, so the flat storage API must support that.
+2) In case of long forks, block processing time should not be too much longer than the average case. 
+   We don’t want this case to be exploitable. It is acceptable that block processing time is 200ms longer, 
+   which may slow down block production, but probably won’t cause missing blocks and chunks. 
+   It is not acceptable if block processing time is 10s, which may lead to more forks and instability in the network.
+3) The design must be able to decrease storage access cost in all cases, 
+   since we are going to change the storage read fees based on flat storage. 
+   We can't conditionally enable FlatStorage for some blocks and disable it for other, because 
+   the fees we charge must be consistent. 
+
+The mapping of key value pairs FlatStorage stored on disk matches the state at some block. 
+We call this block the head of flat storage, or the flat head. There are two natural options 
+for which block should be the flat head, the chain head, or the last final block. Although
+both options could work, the implementation we propose will use the last final block because 
+it is simpler.
+
+To support key value lookups for other blocks that are not the flat head, FlatStorage will 
+also store key value changes(deltas) per block for these blocks.  
+We call these deltas FlatStorageDelta (FSD). Let’s say the flat storage head is at block h, 
+and we are applying transactions based on block h’. Since h is the last final block, 
+h is an ancestor of h'. To access the state at block h', we need FSDs of all blocks between h and h'.
+Note that all these FSDs must be stored in memory, otherwise, the access of FSDs will trigger 
+more disk reads and we will have to set storage key read fee higher. 
+
+However, the consensus algorithm doesn’t provide any guarantees in the distance of blocks 
+that we need to process since it could be arbitrarily long for a block to be finalized. 
+To solve this problem, we make another proposal (TODO: attach link for the proposal) to 
+set gas limit to zero for blocks with height larger than the latest final block’s height + X. 
+If the gas limit is set to zero for a block, it won't contain any transactions or receipts, 
+and FlatStorage won't need to store the delta for this block.
+With this change, FlatStorage only needs to store FSDs for blocks with height less than the latest 
+final block’s height + X. And since there can be at most one valid block per height, 
+FlatStorage only needs to store at most X FSDs in memory. 
+
+### FSD size estimation
+To set the value of X, we need to see how many block deltas can fit in memory.
+
+We can estimate FSD size using protocol fees. 
+Assume that flat state stores a mapping from keys to value refs. 
+Maximal key length is ~2 KiB which is the limit of contract data key size. 
+During wasm execution, we pay `wasm_storage_write_base` = 64 Ggas per call and 
+`wasm_storage_write_key_byte` = 70 Mgas per key byte. 
+In the extreme case it means that we pay `(64_000 / 2 KiB + 70) Mgas ~= 102 Mgas` per byte. 
+Then the total size of keys changed in a block is at most 
+`block_gas_limit / gas_per_byte * num_shards = (1300 Tgas / 102 Mgas) * 4 ~= 50 MiB`.
+
+To estimate the sizes of value refs, there will be at most 
+`block_gas_limit / wasm_storage_write_base * num_shards
+= 1300 Tgas / 64 Ggas * 4 = 80K` changed entries in a block. 
+Since one value ref takes 40 bytes, limit of total size of changed value refs in a block 
+is then 3.2 MiB. 
+
+To sum it up, we will have < 54 MiB for one block, and ~1.1 GiB for 20 blocks.
+
+Note that if we store a value instead of value ref, size of FSDs can potentially be much larger.
+Because value limit is 4 MiB, we can’t apply previous argument about base cost.
+Since `wasm_storage_write_value_byte` = 31 Mgas, one FSD size can be estimated as 
+`(1300 Tgas / min(storage_write_value_byte, storage_write_key_byte) * num_shards)`, or ~170 MiB,
+which is 3 times higher.
+
+The advantage of storing values instead of value refs is that it saves one disk read if the key has been 
+modified in the recent blocks. It may be beneficial if we get many transactions or receipts touching the same 
+trie keys in consecutive blocks, but it is hard to estimate the value of such benefits without more data. 
+Since storing values will cost much more memory than value refs, we will likely choose to store value refs
+in FSDs and set X to a value between 10 and 20.
+
+### Storage Writes
+Currently, storage writes are charged based on the number of touched trie nodes (TTN cost), because updating the leaf trie
+node which stores the value to the trie key requires updating all trie nodes on the path leading to the leaf node. 
+All writes are committed at once in one db transaction at the end of block processing, outside of runtime after 
+all receipts in a block are executed. However, at the time of execution, runtime needs to calculate the cost, 
+which means it needs to know how many trie nodes the write affects, so runtime will issue a read for every write
+to calculate the TTN cost for the write. Such reads cannot be replaced by a read in FlatStorage because FlatStorage does
+not provide the path to the trie node. 
+
+There are multiple proposals on how storage writes can work with FlatStorage. 
+- Keep it the same. The cost of writes remain the same. Note that this can increase the cost for writes in 
+  some cases, for example, if a contract first read from a key and then writes to the same key in the same chunk.
+  Without FlatStorage, the key will be cached in the chunk cache after the read, so the write will cost less. 
+  With FlatStorage, the read will go through FlatStorage, the write will not find the key in the chunk cache and 
+  it will cost more.
+- Remove the TTN cost from storage write fees. Currently, there are two ideas in this direction.
+    - Charge based on maximum depth of a contract’s state, instead of per-touch-trie node.
+    - Charge based on key length only.
+
+      Both of the above ideas would allow us to remove writes from the critical path of block execution. However, 
+    it is unclear at this point what the new cost would look like and whether further optimizations are needed
+      to bring down the cost for writes in the new cost model. 
+
+### Migration Plan
+// TODO 
+
+## Reference Implementation 
+FlatStorage will implement the following structs.
+
+`FlatStateDelta`: a HashMap that contains state changes introduced in a block. They can be applied
+on top the state at flat head to compute state at another block.
+
+`FlatState`: provides an interface to get value or value references from flat storage. It 
+             will be part of `Trie`, and all trie reads will be directed to the FlatState object. 
+             A `FlatState` object is based on a block `block_hash`, and it provides key value lookups
+             on the state after the block `block_hash` is applied.
+
+`ShardFlatStates`: provides an interface to construct `FlatState` for each shard.
+
+`FlatStorageState`: stores information about the state of the flat storage itself,
+                    for example, all block deltas that are stored in flat storage and the flat 
+                    storage head. `FlatState` can access `FlatStorageState` to get the list of 
+                    deltas it needs to apply on top of state of current flat head in order to 
+                    compute state of a target block.
+
+It may be noted that in this implementation, a separate `FlatState` and `FlatStorageState` 
+will be created for each shard. The reason is that there are two modes of block processing, 
+normal block processing and block catchups. 
+Since they are performed on different ranges of blocks, flat storage need to be able to support
+different range of blocks on different shards. Therefore, we separate the flat storage objects
+used for different shards.
+
+### DB columns
+`DBCol::FlatState` stores a mapping from trie keys to the value corresponding to the trie keys,
+based on the state of the block at flat storage head.
+- *Rows*: trie key (`Vec<u8>`)
+- *Column type*: `ValueOrValueRef`
+
+`DBCol::FlatStateDeltas` stores a mapping from `(shard_id, block_hash)` to the `FlatStateDelta` that stores
+state changes introduced in the given shard of the given block.
+- *Rows*: `{ shard_id, block_hash }`
+- *Column type*: `FlatStateDelta`
+Note that `FlatStateDelta`s needed are stored in memory, so during block processing this column won't be used
+  at all. This column is only used to load deltas into memory at `FlatStorageState` initialization time when node starts. 
+
+`DBCol::FlatStateHead` stores the flat head at different shards. 
+- *Rows*: `shard_id`
+- *Column type*: `CryptoHash`
+Similarly, flat head is also stored in `FlatStorageState` in memory, so this column is only used to initialize
+  `FlatStorageState` when node starts.
+
+### `FlatStateDelta`
+`FlatStateDelta` stores a mapping from trie keys to value refs. If the value is `None`, it means the key is deleted
+in the block.
+```rust
+pub struct FlatStateDelta(HashMap<Vec<u8>, Option<ValueRef>>);
+```
+
+```rust
+pub fn from_state_changes(changes: &[RawStateChangesWithTrieKey]) -> FlatStateDelta
+```
+Converts raw state changes to flat state delta. The raw state changes will be returned as part of the result of 
+`Runtime::apply_transactions`. They will be converted to `FlatStateDelta` to be added
+to `FlatStorageState` during `Chain::post_processblock`.
+
+### ```FlatState```
+```FlatState``` will be created for a shard `shard_id` and a block `block_hash`, and it can perform 
+key value lookup for the state of shard `shard_id` after block `block_hash` is applied.
+```rust
+pub struct FlatState {
+/// Used to access flat state stored at the head of flat storage.
+store: Store,
+/// The block for which key-value pairs of its state will be retrieved. The flat state
+/// will reflect the state AFTER the block is applied.
+block_hash: CryptoHash,
+/// In-memory cache for the key value pairs stored on disk.
+#[allow(unused)]
+cache: FlatStateCache,
+/// Stores the state of the flat storage
+#[allow(unused)]
+flat_storage_state: FlatStorageState,
+}
+```
+
+```FlatState``` will provide the following interface.
+```rust
+pub fn get_ref(
+    &self,
+    key: &[u8],
+) -> Result<Option<ValueOrValueRef>, StorageError>
+```
+Returns the value or value reference corresponding to the given `key`
+for the state that this `FlatState` object represents, i.e., the state that after
+block `self.block_hash` is applied.
+
+`FlatState` will be stored as a field in `Tries`.
+
+###```ShardFlatStates```
+`ShardFlatStates` will be stored as part of `ShardTries`. Similar to how `ShardTries` is used to 
+construct new `Trie` objects given a state root and a shard id, `ShardFlatStates` is used to construct
+a new `FlatState` object given a block hash and a shard id. 
+
+```rust
+pub fn new_flat_state_for_shard(
+    &self,
+    shard_id: ShardId,
+    block_hash: Option<CryptoHash>,
+) -> FlatState
+```
+Creates a new `FlatState` to be used for performing key value lookups on the state of shard `shard_id`
+after block `block_hash` is applied.
+
+```rust
+pub fn get_flat_storage_state_for_shard(
+    &self,
+    shard_id: ShardId,
+) -> Result<FlatStorageState, FlatStorageError>
+```
+Returns the `FlatStorageState` for the shard `shard_id`. This function is needed because even though 
+`FlatStorageState` is part of `Runtime`, `Chain` also needs access to `FlatStorageState` to update flat head. 
+We will also create a function with the same in `Runtime` that calls this function to provide `Chain` to access
+to `FlatStorageState`.
+
+###```FlatStorageState```
+`FlatStorageState` is created per shard. It provides information to which blocks the flat storage 
+on the given shard currently supports and what block deltas need to be applied on top the stored
+flat state on disk to get the state of the target block. 
+
+```rust
+fn get_deltas_between_blocks(
+    &self,
+    target_block_hash: &CryptoHash,
+) -> Result<Vec<Arc<FlatStateDelta>>, FlatStorageError>
+```
+Returns the list of deltas between blocks `target_block_hash`(inclusive) and flat head(exclusive), 
+Returns an error if `target_block_hash` is not a direct descendent of the current flat head.
+This function will be used in `FlatState::get_ref`.
+
+```rust
+fn update_flat_head(&self, new_head: &CryptoHash) -> Result<(), FlatStorageError>
+```
+Updates the head of the flat storage, including updating the flat head in memory and on disk,
+update the flat state on disk to reflect the state at the new head, and gc the `FlatStateDelta`s that 
+are no longer needed from memory and from disk. 
+
+```rust
+fn add_delta(
+    &self,
+    block_hash: &CryptoHash,
+    delta: FlatStateDelta,
+) -> Result<StoreUpdate, FlatStorageError>
+```
+Adds `delta` to `FlatStorageState`, returns a `StoreUpdate` object that includes 
+
+#### Thread Safety
+We should note that the implementation of `FlatStorageState` must be thread safe because it can 
+be concurrently accessed by multiple threads. A node can process multiple blocks at the same time
+if they are on different forks. Therefore, `FlatStorageState` will be guarded by a `RwLock` so its
+access can be shared safely.
+
+```rust
+pub struct FlatStorageState(Arc<RwLock<FlatStorageStateInner>>);
+```
+
+## Drawbacks (Optional)
+
+Why should we *not* do this?
+
+## Unresolved Issues
+
+As we discussed in Section Specification, there are still unanswered questions around how the new cost model for storage
+writes would look like and how the current storage can be upgraded to enabled FlatStorage. We expect to finalize 
+the migration plan before this NEP gets merged, but we might need more time to collect data and measurement around
+storage write costs, which can be only be collected after FlatStorage is partially implemented. 
+
+Another big unanswered question is how FlatStorage would work when challenges are enabled. We consider that to be out of 
+the scope of this NEP because the details of how challenges will be implemented are not clear yet. 
+
+## Future possibilities
+
+## Copyright
+[copyright]: #copyright
+
+Copyright and related rights waived via [CC0](https://creativecommons.org/publicdomain/zero/1.0/).