ADR-040: Storage and SMT State Commitments #8430

Merged
merged 24 commits into from
May 11, 2021
Changes from 6 commits
Commits
11728cf
ADR-040: Storage and SMT State Commitments
robert-zaremba Jan 25, 2021
662ec91
Update docs/architecture/adr-040-storage-and-smt-state-commitments.md
robert-zaremba Jan 26, 2021
5fdbe5d
Update docs/architecture/adr-040-storage-and-smt-state-commitments.md
robert-zaremba Jan 26, 2021
fa8e9e3
Added more details for snapshotting and pruning.
robert-zaremba Jan 26, 2021
864927e
updated links and references
robert-zaremba Jan 26, 2021
78215b2
add blockchains which already use SMT
robert-zaremba Jan 26, 2021
6dd0323
reorganize versioning and pruning
robert-zaremba Jan 27, 2021
250b5ff
Update docs/architecture/adr-040-storage-and-smt-state-commitments.md
robert-zaremba Jan 29, 2021
374916f
Update docs/architecture/adr-040-storage-and-smt-state-commitments.md
robert-zaremba Jan 29, 2021
e90bf8a
adding a paragraph about state management
robert-zaremba Jan 29, 2021
8602b3e
adr-40: update 'accessing old state' section
robert-zaremba Feb 25, 2021
ca39df5
Merge branch 'master' into robert/adr-040
robert-zaremba Apr 23, 2021
aedce21
update based on all recent discussions and validations
robert-zaremba Apr 23, 2021
f704279
adding more explanation about KV interface
robert-zaremba Apr 27, 2021
06d1952
Merge branch 'master' into robert/adr-040
robert-zaremba Apr 27, 2021
7537c84
Apply suggestions from code review
robert-zaremba Apr 28, 2021
1cc123e
Apply suggestions from code review
robert-zaremba Apr 28, 2021
d321dac
review comments
robert-zaremba Apr 28, 2021
80d0122
adding paragraph about commiting to an object without storying it
robert-zaremba Apr 28, 2021
962a28b
review updates
robert-zaremba Apr 30, 2021
bb89798
Apply suggestions from code review
robert-zaremba May 5, 2021
19d2126
review udpates
robert-zaremba May 5, 2021
356f987
adding clarification
robert-zaremba May 7, 2021
42e7f08
Merge branch 'master' into robert/adr-040
robert-zaremba May 11, 2021
3 changes: 2 additions & 1 deletion docs/architecture/README.md
@@ -73,4 +73,5 @@ Read about the [PROCESS](./PROCESS.md).
- [ADR 028: Public Key Addresses](./adr-028-public-key-addresses.md)
- [ADR 032: Typed Events](./adr-032-typed-events.md)
- [ADR 035: Rosetta API Support](./adr-035-rosetta-api-support.md)
- [ADR 037: Governance Split Votes](./adr-037-gov-split-vote.md)
- [ADR 037: Governance Split Votes](./adr-037-gov-split-vote.md)
- [ADR 040: Storage and SMT State Commitments](./adr-040-storage-and-smt-state-commitments.md)
129 changes: 129 additions & 0 deletions docs/architecture/adr-040-storage-and-smt-state-commitments.md
@@ -0,0 +1,129 @@
# ADR 040: Storage and SMT State Commitments

## Changelog

- 2020-01-15: Draft

## Status

DRAFT Not Implemented


## Abstract

Sparse Merkle Tree (SMT) is a version of a Merkle Tree with various storage and performance optimizations. This ADR defines a separation of state commitments from data storage and the SDK transition from IAVL to SMT.
Contributor

I'm wondering if this shouldn't in fact be two ADRs instead? One for separating storage and commitments and one about the SMT.

Collaborator Author

I was also thinking about it. But they are highly related - one cannot be done without other. Hence, I'm proposing here a general design and leave a space for future ADR for RDMS which will introduce SDK breaking changes.

Member

Well we could separate the two with IAVL right? We don't need SMT for that AFAIK...

Collaborator Author

@aaronc, we could describe only SMT here, but it would be only a half-baked idea without a working solution:

  • keeping IAVL (in its current implementation) alongside anything else doesn't make sense because we would double the data.
  • the main value proposition here is to not store objects in SMT (we store only hashes).

Do you have something else in mind?



## Context

Currently, Cosmos SDK uses IAVL for both state commitments and data storage.
Contributor

I would define what state commitments are and how it differs from data storage. It can be concise.

Collaborator Author

Isn't it self explaining? State commitment is a commitment to a state. I can add a link to explain more general commitment schemes.


IAVL has effectively become an orphaned project within the Cosmos ecosystem and it's proven to be an inefficient state commitment.
In the current design, IAVL is used for both data storage and as a Merkle Tree for state commitments. IAVL is meant to be a standalone Merklized key/value database; however, it uses a KV DB engine to store all tree nodes, so each node is stored in a separate record in the KV DB. This causes many inefficiencies and problems:

+ Each object select requires a tree traversal from the root
+ Each edge traversal requires a DB query (nodes are not stored in memory)
Contributor

Are you sure about this?

Collaborator Author

When traversing a tree, we always do a DB query. However, subsequent queries are cached at the SDK level, not the IAVL level. I can add that clarification.

+ Creating snapshots is [expensive](https://github.com/cosmos/cosmos-sdk/issues/7215#issuecomment-684804950). It takes about 30 seconds to export less than 100 MB of state (as of March 2020).
+ Updates in IAVL may trigger tree reorganization and possible O(log(n)) hashes re-computation, which can become a CPU bottleneck.
+ The leaf structure is pretty expensive: it contains the `(key, value)` pair and additional metadata such as height and version. The entire node is hashed, and that hash is used as the key in the underlying database, [ref](https://github.com/cosmos/iavl/blob/master/docs/node/node.md).
Contributor

Can you please elaborate on why it's "expensive".

Collaborator Author

It contains a lot of data, which is not needed in the new structure. We don't really need the metadata in the new structure.


Moreover, the IAVL project lacks support and a maintainer and we already see better and well-established alternatives. Instead of optimizing the IAVL, we are looking into other solutions for both storage and state commitments.


## Decision

We propose to separate the concerns of state commitment (**SC**), needed for consensus, and state storage (**SS**), needed for the state machine. Finally, we replace IAVL with [LazyLedger SMT](https://github.com/lazyledger/smt). LazyLedger SMT is based on the Diem (Jellyfish) design [*]: it uses a compute-optimised SMT that replaces subtrees containing only default values with a single node (the same approach is used by Ethereum2 as well).


### Decouple state commitment from storage

Separation of storage and commitment (by the SMT) will allow us to optimize the different components according to their usage and access patterns.

The SMT will use its own storage, separate from the state machine store (it could use the same database underneath). For every `(key, value)` pair, the SMT will store `hash(key)` in a path and `hash(key, value)` in a leaf.
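
As an illustration of the path/leaf split described above, here is a minimal Go sketch. The ADR does not specify the hash function or how `key` and `value` are combined, so SHA-256 and the length-prefixed encoding are assumptions, and `PathOf` and `LeafOf` are hypothetical names rather than part of any SDK or LazyLedger API.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// Hash is a SHA-256 digest. Using an array keeps values comparable with ==.
type Hash [sha256.Size]byte

// PathOf returns the position of a key in the tree: hash(key).
func PathOf(key []byte) Hash {
	return sha256.Sum256(key)
}

// LeafOf returns the value committed at that position: hash(key, value).
// The key is length-prefixed so that ("ab", "c") and ("a", "bc") produce
// different digests. This encoding is an assumption of the sketch.
func LeafOf(key, value []byte) Hash {
	var n [8]byte
	binary.BigEndian.PutUint64(n[:], uint64(len(key)))
	h := sha256.New()
	h.Write(n[:])
	h.Write(key)
	h.Write(value)
	var out Hash
	copy(out[:], h.Sum(nil))
	return out
}

func main() {
	key, value := []byte("balance/alice"), []byte("100")
	fmt.Printf("path: %x\n", PathOf(key))
	fmt.Printf("leaf: %x\n", LeafOf(key, value))
}
```

Hashing the key spreads entries evenly across the tree, and hashing the pair binds the stored value to its key without the SMT having to hold the object itself.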

Is it possible for us to apply these changes to the IAVL implementation which would remove the state duplication from the implementation?

Member

Out of scope of the refactor. It can be done after this upgrade has been completed and if someone asks for it, otherwise we would look at archiving IAVL

Collaborator Author

Agree with @marbar3778 . IAVL has other drawbacks, and no point to update it.

Member

It would be really great to understand better why we're storing hash(key, value) in the leaf.

Member

Also, in the design I put forth in #9158 and #9156, I was thinking we might store hash(table_name, primary_key) as the key and hash(proto_json(value)) as the value. value in this case actually includes the primary key but not the table name. Is there a way we can just expose a generic merkle KVStore and just allow modules to choose what they do and don't put in the Merkle store?

Collaborator Author

  1. We don't have a concept of table_name at this level, and we don't need it. We don't have things like a primary key either. We operate at the key-value level.
  2. We hash the key to distribute keys evenly in the tree.
  3. The value is already binary, so proto_json doesn't make sense here. The storage doesn't know anything about the value structure: it's just a sequence of bytes.

Is there a way we can just expose a generic merkle KVStore and just allow modules to choose what they do and don't put in the Merkle store?

This is what we are doing. Modules don't even know if there is a merkle tree, and what goes into the merkle tree. Modules only use a generic KVStore interface as it's done today (with caching and key prefixing).

Collaborator Author

why we're storing hash(key, value) in the leaf.

We want to bind a (key, value) pair from the state storage into the state commitment. Since we don't know if the key is part of the value, we just hash both. It's fast. I will add this explanation to the document.

Member

I'm more or less expressing my desire for two methods on sdk.Context: one would return a merkle store directly and the other would return a regular non-merklized KV store. If I'm thinking about what would make proofs easy for clients, I would structure the merklized data using a well-defined specification AND not merklize secondary indexes.

The reason I mention proto JSON is that for SIGN_MODE_TEXTUAL clients will need an algorithm to "canonicalize" proto3 JSON. Clients do not have an algorithm for canonicalizing proto binary, modules do not store proto binary in canonical form, and there haven't really been substantive discussions about changing this. So it is not so easy for clients to verify a proof if they don't have the exact binary that is represented in the KV store, which is not well-defined and probably shouldn't be a detail that is exposed to clients. So I'm proposing that the merklized data follow some well-defined spec, maybe using textual formats, rather than relying on what goes into the KV store.

Rather than specifying this at the framework level, my solution would be for sdk.Context to allow direct access to the merkle tree and the non-merklized KV store separately. Maybe this is out of scope for this ADR, but I would like us to consider this, and as I recall we discussed something like this in the past.

Collaborator Author

@robert-zaremba, Apr 28, 2021

one would return a merkle store directly and the other would return a regular non-merklized KV store.

This ADR is not about introducing storage for data that is not part of the state commitment. The reason we have two data stores (SS and SC) under the hood is efficiency, and the design was inspired by turbo geth. In other words, in this design we have only one external store, which commits and queries committed data. Under the hood, it uses two DBs for efficiency.

Support storage (eg module off chain store) or indexers are out of the scope and are not part of the committed state. We could implement an extension store which will use the state commit store (this ADR) in some way (eg: kind of a subtree, or polynomial commitment).

Collaborator Author

The reason I mention proto JSON is because for SIGN_MODE_TEXTUAL

It's not clear to me why storage should deal with additional logic (e.g. what the data type is) rather than bytes. If a client wants to save data using SIGN_MODE_TEXTUAL, then it will use proto_json, or wrap the bytes with some hint, e.g. {data: []byte, decoder: DecoderEnum}, or just know it from the context (the client knows what it is saving under each key, so it should be able to decide if it needs to recode data: read -> deserialize(&struct) -> serialize_text_mode(&struct)).

Collaborator Author

I will add notes about off-chain store to Further Discussion section.


For data access we propose 2 additional KV buckets:
Member

What is a KV bucket here? this may be nomenclature I am not familiar with

Collaborator Author

Some KV databases use buckets for creating different databases under the same server / engine. Postgresql will call it databases (you can have multiple databases in single Postgresql instance). RocksDB calls it column family.
I will add few words to explain it.

Collaborator Author

Member

could you post this link with a small explainer. The current explainer doesn't explain, it just throws a sentence into the mix

1. B1: `key → value`: the principal object storage, used by a state machine, behind the SDK `KVStore` interface: provides direct access by key and allows prefix iteration (KV DB backend must support it).
2. B2: `hash(key, value) → key`: an index needed to extract a value (through: B2 -> B1) having only a Merkle path. Recall that the SMT will store `hash(key, value)` in its leaves.
Member

What's the need for the reverse index? I'm just wondering what the use case is. I'm imagining mostly we will have (key, value) already from a non-proof query and just want to use SMT to get the proof in which case we don't need the reverse index.

Collaborator Author

If we don't have it, you will always need to know the key, and all proofs will need to include the key (so the index could potentially save space in transactions). I looked at ics23: both key and value are included in ics23 proofs. So maybe we can mark it as optional. I added it here for integrity.

3. we could use more buckets to optimize the app usage if needed.
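
The two buckets can be pictured with in-memory maps. The following Go sketch is only illustrative: `Store`, `Set`, `GetByLeaf`, and `leafHash` are hypothetical names, SHA-256 and the way key and value are combined are assumptions, and a real backend would be a KV database with prefix iteration rather than Go maps.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// Hash is a SHA-256 digest (comparable, so usable as a map key).
type Hash [sha256.Size]byte

// leafHash stands in for hash(key, value). The key is hashed first so the
// concatenation is unambiguous; the exact encoding is an assumption.
func leafHash(key, value []byte) Hash {
	k := sha256.Sum256(key)
	return sha256.Sum256(append(k[:], value...))
}

// Store holds the two buckets proposed for data access.
type Store struct {
	b1 map[string][]byte // B1: key -> value
	b2 map[Hash]string   // B2: hash(key, value) -> key
}

func NewStore() *Store {
	return &Store{b1: map[string][]byte{}, b2: map[Hash]string{}}
}

// Set writes to B1 and keeps the B2 reverse index in sync.
func (s *Store) Set(key, value []byte) {
	if old, ok := s.b1[string(key)]; ok {
		delete(s.b2, leafHash(key, old)) // drop the stale index entry
	}
	s.b1[string(key)] = append([]byte(nil), value...)
	s.b2[leafHash(key, value)] = string(key)
}

// GetByLeaf resolves a leaf hash to its (key, value) pair via B2 -> B1.
func (s *Store) GetByLeaf(leaf Hash) (key, value []byte, ok bool) {
	k, ok := s.b2[leaf]
	if !ok {
		return nil, nil, false
	}
	return []byte(k), s.b1[k], true
}

func main() {
	s := NewStore()
	s.Set([]byte("balance/alice"), []byte("100"))
	leaf := leafHash([]byte("balance/alice"), []byte("100"))
	key, value, ok := s.GetByLeaf(leaf)
	fmt.Printf("%s=%s found=%v\n", key, value, ok)
}
```

In this sketch B2 is maintained on every write, which is the cost of being able to start from a Merkle path alone; marking B2 as optional, as suggested in the discussion, would remove that extra write.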

Above, we propose to use KV DB. However, for state machine we could use RDBMS, which we discuss below.


### Requirements

State Storage requirements:
+ range queries
+ quick (key, value) access
+ creating a snapshot
+ pruning (garbage collection)

State Commitment requirements:
+ fast updates
+ path length should be short
+ creating a snapshot
+ pruning (garbage collection)


### LazyLedger SMT for State Commitment

A Sparse Merkle tree is based on the idea of a complete Merkle tree of an intractable size. The assumption here is that as the size of the tree is intractable, there would only be a few leaf nodes with valid data blocks relative to the tree size, rendering the tree as sparse.
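
One reason a tree of intractable size remains practical is that every all-empty subtree of a given height has the same root hash, so only one default hash per level needs to be known. The Go sketch below computes them for a 256-level tree. SHA-256, the empty-leaf placeholder, and the plain concatenation of children are assumptions of the sketch; LazyLedger SMT and Jellyfish each define their own node encodings.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// depth is the number of levels below the root, one per bit of a SHA-256 path.
const depth = 256

// defaultHashes returns d where d[0] is the hash of an empty leaf and d[i]
// is the root of an all-empty subtree of height i.
func defaultHashes() [depth + 1][32]byte {
	var d [depth + 1][32]byte
	d[0] = sha256.Sum256(nil) // placeholder for "no value" (assumption)
	for i := 1; i <= depth; i++ {
		// Both children of an empty subtree are the same empty subtree.
		pair := make([]byte, 0, 64)
		pair = append(pair, d[i-1][:]...)
		pair = append(pair, d[i-1][:]...)
		d[i] = sha256.Sum256(pair)
	}
	return d
}

func main() {
	d := defaultHashes()
	fmt.Printf("empty leaf: %x\n", d[0])
	fmt.Printf("empty root: %x\n", d[depth])
}
```

With these 257 values precomputed, any empty subtree can be represented by a single known hash instead of being materialized, which is the optimization the Decision section refers to.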


### Snapshots

Snapshots and fast sync are among the core Stargate features. Currently, this feature is implemented through IAVL.
Contributor

I can't see a way to use snapshots/versions with BoltDB API. BoltDB provides "snapshot isolation", but do not support explicit creation/usage of snapshots.

Contributor

That's a good point. We should double-check that the dbs mentioned below actually provide the desired snapshotting mechanism.

Collaborator Author

Thanks for checking. I'm not completely sure. We can check with the hashicorp team. I found this:

  • "If a long running read transaction (for example, a snapshot transaction) is needed, you might want to set DB.InitialMmapSize to a large enough value to avoid potential blocking of write transaction. " source
  • This solved issue could be relevant, because waypoint uses BoltDB and it supports snapshots: Server Snapshot/Restore hashicorp/waypoint#682

Collaborator Author

I'm not sure if we should spend time here discussing the backend. I mean, we need to verify the proposals, but we should have a separate discussion about the "recommended" DB backend.
So, what do you think about shifting the snapshot functionality to DB?


I'm wondering if using database snapshots to access old state roots is overkill, or even necessary for SMT.

By default, you can already access the old state roots in the LazyLedger SMT implementation, because the tree isn't garbage collected currently. Once garbage collection is added, it could be configured to only garbage collect state roots older than a certain version, which would be equivalent to snapshots, no?

Contributor

"If a long running read transaction (for example, a snapshot transaction) is needed, you might want to set DB.InitialMmapSize to a large enough value to avoid potential blocking of write transaction. "

I saw this before, and I'm pretty sure it's about app-level logic executing app-level transaction (iteration over entire KV-store) to generate app-level snapshots.

I'm wondering if using database snapshots to access old state roots is overkill, or even necessary for SMT.

I think we discuss snapshots because of the "storage" part of proposal not "state commitment" part. For SMT, I also don't see a reason to use DB-level snapshots.

So, what do you think about shifting the snapshot functionality to DB?

I agree that it is more than reasonable to use capabilities of DB to ensure we can access&manage previous versions of state.

Collaborator Author

Good point.
We need to have old versions of SMT for snapshots. So, I see two options here:

  1. Whenever we schedule a snapshot, we do it in a same way as for state snapshots: using DB builtin feature. If we need to create a new node, and orphan some other node, then we immediately remove orphan nodes. So, effectively, we won't have old roots because we will remove them immediately.
  2. Extend the current LL SMT and add pruning.

PS: I assume in your current implementation, you don't update nodes - instead you create a new one, right? This allows you to keep all old versions. Why do you do that, instead of updating nodes?

@musalbas, Jan 27, 2021

We have to add pruning to LL SMT anyway, otherwise the state will keep growing.

PS: I assume in your current implementation, you don't update nodes - instead you create a new one, right? This allows you to keep all old versions. Why do you do that, instead of updating nodes?

Nodes are stored in a key => value store where the key is a hash of the node, and the value is the preimage of the hash (i.e. the hashes of the children of the node). Assuming the hashes are collision resistant, you can't "update" a node since the tree is immutable - you can only "create" a new tree with a new root.

Collaborator Author

But you have a path to the updated Leaf, so you can use it to remove old nodes:

  1. Given key, we traverse the tree with path=hash(key)
  2. While going down to a node, we can remember all nodes on the path
  3. When going back and updating (or creating new ones) internal nodes, we can remove old node - we use the path we constructed in the step 2.

Many underlying DB engines support snapshotting. Hence, we propose to reuse that functionality and limit the supported DB engines to ones which support snapshots (Badger, RocksDB, BoltDB) using a _copy on write_ mechanism.

### Pruning

At minimum, SC doesn't need to keep old versions. However, we need to be able to process transactions and roll back state updates if a transaction fails. This can be done in the following way: during transaction processing, we keep all state change requests (writes) in a `CacheWrapper` abstraction (as it's done today). Only when we commit on a root store are all changes written to the SMT.
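
The buffering idea can be sketched as follows. This is not the SDK's `CacheWrapper` implementation: `KV`, `CacheKV`, and their methods are hypothetical stand-ins that only show writes being held back until an explicit flush, and thrown away when a transaction fails.

```go
package main

import "fmt"

// KV is a minimal stand-in for the committed store (not the SDK KVStore).
type KV map[string]string

// CacheKV buffers writes and deletes on top of a parent store.
type CacheKV struct {
	parent  KV
	writes  map[string]string
	deletes map[string]bool
}

func NewCacheKV(parent KV) *CacheKV {
	return &CacheKV{parent: parent, writes: map[string]string{}, deletes: map[string]bool{}}
}

// Get reads through the buffer first, then falls back to the parent.
func (c *CacheKV) Get(k string) (string, bool) {
	if c.deletes[k] {
		return "", false
	}
	if v, ok := c.writes[k]; ok {
		return v, true
	}
	v, ok := c.parent[k]
	return v, ok
}

func (c *CacheKV) Set(k, v string) { delete(c.deletes, k); c.writes[k] = v }
func (c *CacheKV) Delete(k string) { delete(c.writes, k); c.deletes[k] = true }

// Write flushes buffered changes to the parent; call it only on success.
func (c *CacheKV) Write() {
	for k := range c.deletes {
		delete(c.parent, k)
	}
	for k, v := range c.writes {
		c.parent[k] = v
	}
	c.Discard()
}

// Discard drops buffered changes, which rolls back a failed transaction.
func (c *CacheKV) Discard() {
	c.writes, c.deletes = map[string]string{}, map[string]bool{}
}

func main() {
	root := KV{"a": "1"}
	tx := NewCacheKV(root)
	tx.Set("a", "2")
	tx.Discard() // the transaction failed: root is untouched
	fmt.Println(root["a"])
	tx.Set("b", "3")
	tx.Write() // the transaction succeeded: changes reach the root
	fmt.Println(root["b"])
}
```

Because the committed store only ever sees complete, successful batches, the commitment layer never has to undo a partial update and so needs no old versions for rollback.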

We can use the same approach for state machine storage. However, we need to keep a few past versions (configurable by the user, e.g. 10 past versions every 100 blocks) in the form of snapshots. Ideally, we would like to shift that functionality to the DB engine itself.

TODO: Verify which DB engines support that. I'm pretty confident this (pruning and versioning) can and should be offloaded to a DB engine.
Otherwise, the solution is to implement a sort of _mark and sweep GC_: once per defined period, a GC will start, mark old objects and prune them. This will require encoding a version mechanism in a KV store.
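
If versioning does end up encoded in the KV store, the fallback could look roughly like the Go sketch below. Everything here is hypothetical: `VersionedKV`, its methods, and the per-key version list are one possible encoding chosen for illustration, not a design the ADR commits to.

```go
package main

import "fmt"

// entry is one historical value of a key, tagged with the version (block
// height) at which it was written.
type entry struct {
	version uint64
	value   string
}

// VersionedKV keeps every write; entries per key are in increasing version order.
type VersionedKV struct {
	data map[string][]entry
}

func NewVersionedKV() *VersionedKV { return &VersionedKV{data: map[string][]entry{}} }

// Set records a new value. Callers must use non-decreasing versions.
func (s *VersionedKV) Set(key, value string, version uint64) {
	s.data[key] = append(s.data[key], entry{version, value})
}

// Get returns the value visible at the given version, if any.
func (s *VersionedKV) Get(key string, version uint64) (string, bool) {
	entries := s.data[key]
	for i := len(entries) - 1; i >= 0; i-- {
		if entries[i].version <= version {
			return entries[i].value, true
		}
	}
	return "", false
}

// Prune removes entries that are no longer needed to read any version at or
// after oldest, and reports how many entries were removed.
func (s *VersionedKV) Prune(oldest uint64) int {
	removed := 0
	for key, entries := range s.data {
		// Mark: find the newest entry still visible at `oldest`;
		// everything before it has been superseded.
		keepFrom := 0
		for i, e := range entries {
			if e.version <= oldest {
				keepFrom = i
			}
		}
		// Sweep: drop the superseded entries.
		removed += keepFrom
		s.data[key] = append([]entry(nil), entries[keepFrom:]...)
	}
	return removed
}

func main() {
	s := NewVersionedKV()
	s.Set("a", "v1", 1)
	s.Set("a", "v5", 5)
	s.Set("a", "v9", 9)
	fmt.Println("pruned entries:", s.Prune(6))
	v, ok := s.Get("a", 6)
	fmt.Println(v, ok)
}
```

The sketch ignores deletions and runs the sweep inline; a periodic GC as described above would run `Prune` on a schedule with `oldest` derived from the user's retention setting.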



## Consequences


### Backwards Compatibility

This ADR doesn't introduce any SDK level API changes.

We change the storage layout, so a storage migration and a blockchain reboot are required.

### Positive

+ Decoupling state from state commitment introduces better engineering opportunities for further optimizations and better storage patterns.
+ Performance improvements.
+ Joining the SMT-based camp, which has wider and more proven adoption than IAVL. Example projects which decided on SMT: Ethereum2, Diem (Libra), Trillian, Tezos, LazyLedger.

### Negative

+ Storage migration
+ LL SMT doesn't support pruning - we will need to add and test that functionality.

### Neutral

+ Deprecating IAVL, which is one of the core proposals of Cosmos Whitepaper.


## Further Discussions

### RDBMS

Use of an RDBMS instead of a simple KV store for state. Using an RDBMS will require an SDK API breaking change (the `KVStore` interface) and will allow better data extraction and indexing solutions. Instead of saving an object as a single blob of bytes, we could save it as a record in a table in the state storage layer, and as `hash(key, protobuf(object))` in the SMT as outlined above. To verify that an object registered in the RDBMS is the same as the one committed to the SMT, one will need to load it from the RDBMS, marshal it using protobuf, hash it, and do an SMT search.
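
The verification steps (load, marshal, hash, search) can be sketched in Go as follows. All names are hypothetical: `Account` stands in for a table row, `marshal` stands in for protobuf marshalling, and the `committed` map stands in for an SMT membership check. The sketch assumes the encoding is deterministic; if the same record can marshal to different bytes, the recomputed hash will not match the commitment.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// Hash is a SHA-256 digest (comparable, so usable as a map key).
type Hash [sha256.Size]byte

// Account is a hypothetical record as it might be stored in a table row.
type Account struct {
	Address string
	Balance uint64
}

// marshal is a stand-in for protobuf marshalling. It must be deterministic.
func marshal(a Account) []byte {
	buf := make([]byte, 8, 8+len(a.Address))
	binary.BigEndian.PutUint64(buf, a.Balance)
	return append(buf, a.Address...)
}

// commitment computes hash(key, marshal(object)), the value kept in the SMT.
func commitment(key []byte, a Account) Hash {
	k := sha256.Sum256(key)
	return sha256.Sum256(append(k[:], marshal(a)...))
}

// verify reports whether a row loaded from the table matches a committed leaf.
func verify(committed map[Hash]bool, key []byte, row Account) bool {
	return committed[commitment(key, row)]
}

func main() {
	key := []byte("acc/alice")
	committed := map[Hash]bool{commitment(key, Account{"alice", 100}): true}
	fmt.Println(verify(committed, key, Account{"alice", 100})) // row as committed
	fmt.Println(verify(committed, key, Account{"alice", 101})) // row altered afterwards
}
```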


## References

+ [IAVL What's Next?](https://github.com/cosmos/cosmos-sdk/issues/7100)
+ [IAVL overview](https://docs.google.com/document/d/16Z_hW2rSAmoyMENO-RlAhQjAG3mSNKsQueMnKpmcBv0/edit#heading=h.yd2th7x3o1iv) of its state v0.15
+ [State commitments and storage report](https://paper.dropbox.com/published/State-commitments-and-storage-review--BDvA1MLwRtOx55KRihJ5xxLbBw-KeEB7eOd11pNrZvVtqUgL3h)
+ [LazyLedger SMT](https://github.com/lazyledger/smt)
+ Facebook Diem (Libra) SMT [design](https://developers.diem.com/papers/jellyfish-merkle-tree/2021-01-14.pdf)
+ [Trillian Revocation Transparency](https://github.com/google/trillian/blob/master/docs/papers/RevocationTransparency.pdf), [Trillian Verifiable Data Structures](https://github.com/google/trillian/blob/master/docs/papers/VerifiableDataStructures.pdf).