Another possibility for hash_tree_root of dynamic lists #1115

Closed

vbuterin opened this issue May 23, 2019 · 9 comments
Labels
scope:SSZ Simple Serialize

Comments

@vbuterin (Contributor) commented May 23, 2019

Epistemic status: very uncertain whether this is a good idea, but IMO the issues are worth discussing.

Status quo

In English: make a Merkle tree of the data, padding it with zeroes to make the length an exact power of two. Then take the hash of the root of that tree together with the length of the list.

In code:

def merkle_tree(lst):
    # Heap layout: o[1] is the root and the leaves occupy o[N .. 2N-1],
    # where N = next_power_of_2(len(lst)); pad the leaves with zero chunks.
    N = next_power_of_2(len(lst))
    o = [None] * N + lst + [b'\x00' * 32] * (N - len(lst))
    for i in range(N - 1, 0, -1):
        o[i] = hash(o[i*2] + o[i*2+1])
    return o[1]

def hash_tree_root(lst):
    return hash(merkle_tree(lst) + len(lst).to_bytes(32, 'little'))
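
For reference, a minimal sketch of the two helpers these snippets assume (SSZ hashing is SHA-256 over the concatenated 32-byte chunks; next_power_of_2 follows the recursive definition quoted later in this thread):

from hashlib import sha256

def hash(data):
    # SSZ hash: SHA-256 over the concatenated 32-byte chunks
    return sha256(data).digest()

def next_power_of_2(x):
    # Smallest power of two >= x (returns x itself if x is already one)
    return 1 if x <= 1 else 2 * next_power_of_2((x + 1) // 2)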

Example with three-item list:

      R
     / \
    /\  3
   /  \
  /\  /\
 A B  C 0

Alternative option

In English: make an imbalanced tree where the top left node is the length, then the right->left node points to the first 2 elements, then the right->right->left node points to the next 4 elements, and so forth, so at any depth N there's a root of a tree of 2**(N-1) of the elements.

In code:

def merkle_tree(lst, full_length, empty=b'\x00'*32):
    # Merkleize lst as the leftmost leaves of a tree with full_length
    # leaves, padding with the zero-subtree hash for the current depth.
    if full_length == 1:
        return lst[0]
    if len(lst) % 2 == 1:
        lst = lst + [empty]
    return merkle_tree(
        [hash(lst[i] + lst[i+1]) for i in range(0, len(lst), 2)],
        full_length // 2,
        empty=hash(empty + empty)
    )

def hash_tree_root(lst):
    o = b'\x00' * 32
    # Slot 1 holds the length, elements start at slot 2; slot 0 is unused,
    # so the elements covered by level i are exactly lst[2**i : 2**(i+1)].
    lst = [None, len(lst).to_bytes(32, 'little')] + lst
    # Pad with zero chunks so the total length is an exact power of two
    lst += [b'\x00' * 32] * (next_power_of_2(len(lst)) - len(lst))
    depth = len(lst).bit_length() - 1  # log2 of the (power-of-two) length
    for i in range(depth - 1, -1, -1):
        o = hash(merkle_tree(lst[2**i:2**(i+1)], 2**i) + o)
    return o

Example with three-item list:

     R
    / \
   3  /\
     /\ \
    A B /\
       /  \
      /\   0
     /  \  
    /\  /\
   C 0 0 0
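
As a quick check against the diagram (a hypothetical test with made-up 32-byte leaf values, not part of the proposal itself):

A, B, C = bytes([1]) * 32, bytes([2]) * 32, bytes([3]) * 32
ZERO = b'\x00' * 32
# R = hash(len || hash(hash(A||B) || hash(subtree(C,0,0,0) || 0)))
expected = hash(
    (3).to_bytes(32, 'little')
    + hash(hash(A + B) + hash(merkle_tree([C], 4) + ZERO))
)
assert hash_tree_root([A, B, C]) == expected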

Rationale

Currently there is a perfect correspondence between SSZ path (e.g. obj -> obj.member[17].other_member[3]) and tree path (the sequence of left/right descents from the root to that specific value) in all cases except one: dynamic lists. This increases complexity for light client protocol implementers because:

  1. A proof of a single value in an SSZ tree may require multiple logical branches rather than just one, since the verifier needs the length to determine the depth at which a value sits (technically the length always lies along the Merkle branch that proves any item, but it's still extra complexity to extract it).

  2. pop or append operations on lists, if they cross a power-of-two boundary, require a rebalancing of the tree that changes the paths to every item in an entire subtree.

This proposal makes the correspondence between SSZ path and tree path one-to-one in all cases, so that any operation is a single Merkle branch update (or two Merkle branch updates for an append or pop, to also update the length).
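
To illustrate the one-to-one correspondence, here is a hypothetical helper (not part of the proposal text) that computes the generalized index of the j-th list element under the tree built by the code above. Element j occupies slot j + 2 (the length sits at slot 1), so its level and position fall out of the slot number:

def generalized_index(j):
    m = j + 2                        # slot number of element j
    level = m.bit_length() - 1       # which level of the chain it sits on
    pos = m - (1 << level)           # position within that level's subtree
    g = (1 << (level + 1)) - 1       # descend right `level` times from the root
    g = 2 * g                        # one left, into the level's subtree
    return (g << level) + pos        # then `level` more steps down to the leaf

# For the three-item example above: A -> 12, B -> 13, C -> 56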

Weaknesses:

  • More base-protocol and implementation complexity
  • ~1 extra hash per Merkle proof of a single value
  • Still need to check length values if you want to check validity
  • Append and pop still require nontrivial extra logic
vbuterin changed the title from "Proposal for hash_tree_root of dynamic lists" to "Another possibility for hash_tree_root of dynamic lists" on May 23, 2019
@protolambda (Collaborator) commented May 23, 2019

Some comments on the Rationale part, with possible suggestions:

  1. We already mix in the length, right? And getting the next power of 2 is constant-time. The real problem is just the extra logic needed to both check the length and verify the proof (correct?). Maybe we can mix in the next power of 2 as well, enforce that, and use it to prove the leaf together with the length? Locating the item (considering re-balances) still requires some effort, but really is not that complicated, and worth the small cost. Edit: the real challenge is keeping branch paths stable across additions.
  2. List re-balancing is not so bad. I think it is better than having to do extra hashing linearly for each additional element, which would make dynamic lists much less practical than fixed-length lists. Edit: since each next level keeps expanding, it's not linear. But an item at index 0 still has a huge advantage with this design.
  3. I do like how it preserves the binary-tree generalized-index system, but indices can get much longer this way because of the imbalance; IMO, something to avoid at all costs. (E.g. you don't want a 10k-bit bitstring as a generalized index to identify a validator in a large dynamic list, like the registry.) Not linear, but also not perfect.

Also, I considered a similar idea during EDCON in Sydney, though not directly applied to merkle trees: use a bitstring to identify which elements in a structure are present (effectively the same as a generalized index if you only consider 1 element), but for multiple elements. If SSZ-encoded messages are prefixed with such a thing, you effectively get the ability to null/union everything you want (the reason for bringing up the idea: nulls/unions are still not implemented :( ). But for large lists it is not practical, for the same reasons a linear merkle structure is not.

Edit TLDR: I misunderstood the structure. It's not linear; every level keeps expanding, to allow for bigger additions. But higher-level leaves still have some cost benefit over lower-level leaves.

And here's a quick helper function for calculating the next power of 2 for merkleization, for future pseudo-code/experiments. I updated the merkle tree code in the SSZ typing draft PR, but that is still stuck in limbo after I switched back to testing. https://github.com/ethereum/eth2.0-specs/blob/08faa86d706c31cd9690cb483d0df32f66696396/test_libs/pyspec/eth2spec/utils/merkle_minimal.py

@vbuterin (Contributor, Author) commented:

> (e.g. you don't want a 10k-bit bitstring as a generalized index to identify a validator in a large dynamic list, like the registry)

To be clear, there is no linear length blowup like this; the increase in length is only a single hash (this is because the number of leaves at each level of the tree still goes up by 2x per level). If Merkle proofs were length 8 before, on average they'll be ~length 9.

> And here's a quick helper function for calculating the next power of 2 for merkleization,

Ooh, fun! Probably not the best thing to put in code that's meant for people to read (def next_power_of_2(x): return 1 if x == 1 else 2 * next_power_of_2((x+1)//2) is easier to reason about mathematically in your head and understand), but definitely much higher performance.

@dankrad (Contributor) commented May 25, 2019

I think I found a much more elegant way to write next_power_of_2 in Python when working on the phase 1 execution:

def next_power_of_2(v: int) -> int:
    # Note: (v - 1).bit_length() rather than v.bit_length(), so that a v
    # that is already a power of two maps to itself, matching the
    # recursive definition above.
    return 1 << (v - 1).bit_length()
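
A quick hypothetical sanity check against the recursive definition quoted above:

def recursive_next_power_of_2(x):
    return 1 if x == 1 else 2 * recursive_next_power_of_2((x + 1) // 2)

assert all(next_power_of_2(v) == recursive_next_power_of_2(v)
           for v in range(1, 10000))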

@protolambda (Collaborator) commented:

@dankrad Nice! I knew a few other scripting languages track bit length for large integers, but forgot about it here. Here's some testing + benching code: https://gist.github.com/protolambda/9ad3f46665cb9bdfdb38599cbc626c30 The bit-length version outperforms mine by 2x, and the old one by 15x.

Now back to merkle-tree discussion :)

@protolambda (Collaborator) commented May 25, 2019

Another more stylistic (but experimental and somewhat broken) approach:

Act as if it were a fixed-length list of some maximum size (e.g. 32 levels deep), and compute a normal binary-tree merkle root, very similar to what we currently have with the deposit tree. However, we wrap the hash function to make smaller lists less expensive. The new hash-function definition would be:

def combine_chunks(a, b, level, index_at_level, length_at_level):
    # level: depth of this combination node; index_at_level: its index
    # within that level; length_at_level: number of nodes in that level.
    if index_at_level + 1 == length_at_level:
        # b is out of bounds: just propagate a if b is zero.
        assert b == ZERO_CHUNK
        return a
    elif level < MAX_LEVELS:
        return hash(a + b)  # hash of the concatenated chunks, as currently
    else:
        return hash(a + b + to_chunk(index_at_level))
# Note: the last two returns could be combined, see the comments on consistency.

Note: verification is not exactly the same; one needs to mix in the index at leaf level, based on the type.

Now, we can ignore the empty part of the tree, and just supply a bunch of zeroes (not higher-order hashes of zero) for the branch part that completes it to a MAX_LEVELS-deep merkle tree.

               R
              / \
            / \  3
          / \  0
       ...   0
     / \
   /  \  0
  /\  /\
 A B  C 0

Note that we are still mixing in the length, to differentiate a list ending in zeroes from the same list without (some of) those zeroes.

The zeroes in the merkle proof can be compressed where necessary, and hashing with zero is essentially free, so that's great. (Side effect: hashing default lists filled with 0 will be super fast 🎉)

Now leaves have an easily identifiable place, "the same" proof length, and the same generalized index, at all times.
Another nice property is that the generalized index is super easy to compute with the standard formula 2**depth + index, since we have a virtual constant depth of MAX_LEVELS + 1 (don't forget the length mix-in). So the average generalized index looks like 100000.....000001101010 (big-endian here, index at the right end). Given that we are not nesting dynamic lists super deep, maybe 3 levels at most, this inefficiency seems fine.

Open question: if one agrees to standardize the compression of the zero chunks in the proof, we are essentially using the length data to read the proof (much the same as before the unbalanced-tree idea, just now supporting static indices without extra hashing).

I would put MAX_LEVELS at 31 or 63, to make the depth of the tree, with the length mix-in, a nice power of 2.
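
Illustrating the generalized-index formula above (a sketch; MAX_LEVELS = 31 is just the example value suggested here):

MAX_LEVELS = 31

def virtual_generalized_index(index):
    # Constant virtual depth of MAX_LEVELS + 1 (the +1 is the length
    # mix-in), so the leading 1 bit sits at a fixed position for every leaf.
    return 2**(MAX_LEVELS + 1) + index

# e.g. virtual_generalized_index(106) == 0b1000...0001101010 (33 bits)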

TLDR:

  • virtual generalized indices, to keep them in the same place; good for light clients
  • some extra zeroes in the proof, easily compressed if necessary
  • effectively 0 extra hashing; the absolute minimum hashing for a dynamically sized binary merkle tree
  • a maximum dynamic list size, but adjustable per use-case (change MAX_LEVELS)
  • positioning and generalized indices are easy and fast to compute

SHA-256

So, the standard merkle input of 2 chunks is 32*2*8 = 512 bits. However, SHA-256 appends a non-zero delimiter (1 bit) and the input length (64 bits), then pads to a multiple of 512 bits, each 512-bit block being hashed on top of the running state. A 512-bit input therefore already costs two compression blocks, which leaves 512-64-1 = 447 bits in the second block to mix in extra data, like the pair index, essentially for free.

Light clients / Verification flow

Now, we can have a light client ask for a piece of data, with a proof: "hey full node, give me the data at generalized index 1234 please". The light client then verifies the proof like a normal merkle tree, but additionally mixes the expected index in the dynamic list into the hash input during verification, whenever it passes a node that is part of the leaf level of a dynamic list.

Pro: Stable generalized indices, simple proof construction, easy proof compression, and length-independent verification.

Con: verification is more complex. But if we ask for data, we are not asking for it for random reasons; we know the location and typing of what we are asking for, so we can deal with the verification just fine.
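
A rough sketch of what that verification loop could look like (hypothetical helper, not from the thread; it mixes the expected list index into the hash input at the leaf level only, in the spirit of combine_chunks above):

def verify_proof(leaf, branch, gen_index, list_index, root):
    # Standard merkle-branch verification, except that at the leaf level
    # the expected index in the dynamic list is mixed into the hash input.
    value = leaf
    for depth, sibling in enumerate(branch):
        extra = list_index.to_bytes(32, 'little') if depth == 0 else b''
        if (gen_index >> depth) & 1:   # this node is a right child
            value = hash(sibling + value + extra)
        else:
            value = hash(value + sibling + extra)
    return value == root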

Consistent complexity, no typing

We could make the thing more consistent with otherwise-unnecessary mix-ins: mix in the generalized index at every non-out-of-bounds node combination. Now a verifier doesn't have to care about the type anymore; it can just repeatedly mix in the index, for free if the index fits within the 447 bits.

Consistency with normal verification

Instead of mixing in a third argument at the leaf combination level, we could also do an extra hash, and mix in the index of the pair at that level. This keeps the current merkle-proof verification valid, at the cost of 1 hash per pair of leaves.
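
A minimal sketch of this variant (hypothetical; it follows the A+B -> P, P+I -> X scheme spelled out further down the thread):

def combine_leaf_pair(a, b, pair_index):
    # Combine the pair as usual, then mix in the pair's index with one
    # extra hash. Standard merkle-branch verification still works; the
    # branch is just one step longer.
    p = hash(a + b)
    return hash(p + pair_index.to_bytes(32, 'little'))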

@protolambda (Collaborator) commented May 25, 2019

Ok, let's summarize all the variants of merkleization for dynamic-length lists:

Status Quo

  • balanced binary tree
  • generalized indices work, but indices change when the tree balance changes (at power-of-two boundaries)
  • simple, standard verification
  • proofs are dynamically sized (the tree depth changes at power-of-two boundaries)
  • generalized indices are easy to compute, but leaf depth can change

Unbalanced Merkle Tree

  • unbalanced binary tree; cost is approx. 1 extra hash for a proof of 1 element
  • fixed indices, even with append/pop
  • some indices are cheaper, others more expensive to verify
  • simple, standard verification
  • proofs are dynamically sized (larger indices cost more)
  • generalized indices are easy to compute, but leaf depth is variable

Virtually balanced Merkle Tree

Note: the wrapped special-case hash function introduces many edge cases; the design variants still need adjusting

The following applies to all sub-variants:

  • balanced; extra cost depends on the sub-variant, 1 extra hash at most
  • fixed indices
  • all nodes have equal-cost verification
  • a maximum on list contents, which can be high
  • proofs are constant size, but relatively big; the difference is that they are filled with zeroes, which are easy to compress
  • generalized indices are easy to compute, but relatively large (MAX_LEVELS + 1 bits); leaf depth is fixed

Leaf-pair index mix-in

  • 0 extra hashing cost, since indices are mixed into the 447 bits of available SHA-256 input space.
  • Non-standard verification: one needs to add the index to the hash input at the right level(s) of the merkle proof to get the correct hash verification.
  • H(a, 0) -> a, except at leaf level; see "1-hash extra" below for comments on collisions + efficiency.

Consistent index mix-in

  • 0 extra hashing cost; the same available hash input space is used, at every level.
  • Non-standard verification, but no detailed type information is necessary: just know to add the index at the given height, or the generalized index (TBD), to the hash input.

1-hash extra

  • just 1 hash more than the absolute minimum
  • familiar verification: mix in (with an additional hash) the index for each pair of leaves, like a combination of normal chunks. Possibly cheap for multi-proofs of adjacent data.
  • essentially the balanced version of the unbalanced approach. Trade-off: a preset maximum size, and zeroes need to be compressed. Reward: standard generalized-index computation and a balanced tree.
  • H(a, 0) -> a, except at the extra-hash level. This potential case, combined with the index-mix-in collision work-around covered above, means we could re-use some of the hashes (but the tree is not sparse, so the benefit is small).

Sparse merkle-tree alike

  • Also a virtually balanced tree, but with different hashing:
    • fixed indices
    • all nodes have equal-cost verification
    • a maximum on list contents, which can be high
    • similar proof-size and generalized-index-size issues as the other virtual tree; both still within manageable (optimizable) size, see the full comment
  • works for sparse merkle trees; can also be optimized not to, and only handle the H(a, 0) case, not the H(0, b) case
  • Non-standard verification: one needs to deal with 1 or 2 bytes of hash output being thrown away
  • a cheap (but not the cheapest) approach to merkle-tree zeroes

Sources

  • Generalized indices: eth2 light client specs > generalized index
  • 0-based generalized indices: eth2 specs issue #1008
  • SHA-256 pseudocode and performance: the SHA-2 family article on Wikipedia
  • Eth2 exec-spec merkle code: merkle_minimal.py, ssz code > merkleize(chunks)

Credits to Vitalik for pushing so many different ideas forward. Hope my alternative and this summary help us get to a balanced, easy, small, constant-index-supporting merkle tree design.

@vbuterin (Contributor, Author) commented:

> simple, standard verification. Mix in (an additional hash) the index for each pair of leaves. Possibly cheap for multi-proofs of adjacent data.

What do you mean by "the index" here? The index in the list?

I would not call it "standard verification" if the generalized index alone isn't enough to fully define the verification path and the sequence of operations required to verify a branch; extra data being added in at some levels of the tree breaks that invariant...

@protolambda (Collaborator) commented:

> What do you mean by "the index" here? The index in the list?

I think it can be both. It's really more an idea than a verified/tested approach. We can start breaking and fixing things once the basic idea is there.

> I would not call it "standard verification"

Hmm, yes, the hash function is different still. But it's more "standard" than adding additional data to the hash input when combining leaves (like the other variants). Instead, it's like a normal mix-in: A+B->P, P+I->X (A and B are leaves, I is the index). So branch verification doesn't break; it's just one step longer. And what I'm trying to do here is make the leaf elements effectively unique, to avoid collision problems. (Yes, there are issues here still, no free lunch in CS...) Although I'm not sure which exact "attacks" are actually fine. E.g. I'm fine with people supplying shorter proofs when the result of verification for every index is the same. But when leaves are interchangeable, or leaves can be replaced with their hashes, or something else, things could go very wrong.

@JustinDrake (Collaborator) commented:

Closing in favour of issue #1160, which has now achieved rough consensus :)
