Another possibility for hash_tree_root of dynamic lists #1115

Closed

vbuterin opened this issue May 23, 2019 · 9 comments
Labels
scope:SSZ Simple Serialize

Comments

@vbuterin (Contributor) commented May 23, 2019

Epistemic status: very uncertain whether this is a good idea, but IMO the issues are worth discussing.

Status quo

In English: make a Merkle tree of the data, padding it with zeroes to make the length an exact power of two. Then take the hash of the root of that tree together with the length of the list.

In code:

def merkle_tree(lst):
    # Heap layout: o[1] is the root and the leaves occupy o[N .. 2N-1],
    # where N = next_power_of_2(len(lst)); pad the leaves with zero chunks.
    N = next_power_of_2(len(lst))
    o = [None] * N + lst + [b'\x00' * 32] * (N - len(lst))
    for i in range(N - 1, 0, -1):
        o[i] = hash(o[i*2] + o[i*2+1])
    return o[1]

def hash_tree_root(lst):
    return hash(merkle_tree(lst) + len(lst).to_bytes(32, 'little'))
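
For reference, a minimal sketch of the two helpers these snippets assume (SSZ hashing is SHA-256 over the concatenated 32-byte chunks; next_power_of_2 follows the recursive definition quoted later in this thread):

from hashlib import sha256

def hash(data):
    # SSZ hash: SHA-256 over the concatenated 32-byte chunks
    return sha256(data).digest()

def next_power_of_2(x):
    # Smallest power of two >= x (returns x itself if x is already one)
    return 1 if x <= 1 else 2 * next_power_of_2((x + 1) // 2)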

Example with three-item list:

      R
     / \
    /\  3
   /  \
  /\  /\
 A B  C 0

Alternative option

In English: make an imbalanced tree where the top left node is the length, then the right->left node points to the first 2 elements, then the right->right->left node points to the next 4 elements, and so forth, so at any depth N there's a root of a tree of 2**(N-1) of the elements.

In code:

def merkle_tree(lst, full_length, empty=b'\x00'*32):
    # Merkleize lst as the leftmost leaves of a tree with full_length
    # leaves, padding with the zero-subtree hash for the current depth.
    if full_length == 1:
        return lst[0]
    if len(lst) % 2 == 1:
        lst = lst + [empty]
    return merkle_tree(
        [hash(lst[i] + lst[i+1]) for i in range(0, len(lst), 2)],
        full_length // 2,
        empty=hash(empty + empty)
    )

def hash_tree_root(lst):
    o = b'\x00' * 32
    # Slot 1 holds the length, elements start at slot 2; slot 0 is unused,
    # so the elements covered by level i are exactly lst[2**i : 2**(i+1)].
    lst = [None, len(lst).to_bytes(32, 'little')] + lst
    # Pad with zero chunks so the total length is an exact power of two
    lst += [b'\x00' * 32] * (next_power_of_2(len(lst)) - len(lst))
    depth = len(lst).bit_length() - 1  # log2 of the (power-of-two) length
    for i in range(depth - 1, -1, -1):
        o = hash(merkle_tree(lst[2**i:2**(i+1)], 2**i) + o)
    return o

Example with three-item list:

     R
    / \
   3  /\
     /\ \
    A B /\
       /  \
      /\   0
     /  \  
    /\  /\
   C 0 0 0
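
As a quick check against the diagram (a hypothetical test with made-up 32-byte leaf values, not part of the proposal itself):

A, B, C = bytes([1]) * 32, bytes([2]) * 32, bytes([3]) * 32
ZERO = b'\x00' * 32
# R = hash(len || hash(hash(A||B) || hash(subtree(C,0,0,0) || 0)))
expected = hash(
    (3).to_bytes(32, 'little')
    + hash(hash(A + B) + hash(merkle_tree([C], 4) + ZERO))
)
assert hash_tree_root([A, B, C]) == expected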

Rationale

Currently there is a perfect correspondence between SSZ path (e.g. obj -> obj.member[17].other_member[3]) and tree path (the sequence of left/right descents from the root to that specific value) in all cases except one: dynamic lists. This increases complexity for light client protocol implementers because:

  1. A proof of a single value in an SSZ tree may require multiple logical branches rather than just one, since the verifier needs the length to determine the depth at which a value sits (technically the length always lies along the Merkle branch that proves any item, but it's still extra complexity to extract it).

  2. pop or append operations on lists, if they cross a power-of-two boundary, require a rebalancing of the tree that changes the paths to every item in an entire subtree.

This proposal makes the correspondence between SSZ path and tree path one-to-one in all cases, so that any operation is a single Merkle branch update (or two Merkle branch updates for an append or pop, to also update the length).
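
To illustrate the one-to-one correspondence, here is a hypothetical helper (not part of the proposal text) that computes the generalized index of the j-th list element under the tree built by the code above. Element j occupies slot j + 2 (the length sits at slot 1), so its level and position fall out of the slot number:

def generalized_index(j):
    m = j + 2                        # slot number of element j
    level = m.bit_length() - 1       # which level of the chain it sits on
    pos = m - (1 << level)           # position within that level's subtree
    g = (1 << (level + 1)) - 1       # descend right `level` times from the root
    g = 2 * g                        # one left, into the level's subtree
    return (g << level) + pos        # then `level` more steps down to the leaf

# For the three-item example above: A -> 12, B -> 13, C -> 56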

Weaknesses:

  • More base-protocol and implementation complexity
  • ~1 extra hash per Merkle proof of a single value
  • Still need to check length values if you want to check validity
  • Append and pop still require nontrivial extra logic
vbuterin changed the title from "Proposal for hash_tree_root of dynamic lists" to "Another possibility for hash_tree_root of dynamic lists" on May 23, 2019
@protolambda (Collaborator) commented May 23, 2019

Some comments on the Rationale part, with possible suggestions:

  1. We already mix in the length, right? And getting the next power of 2 is constant-time. The real problem is just the extra logic needed to both check the length and verify the proof (correct?). Maybe we can mix in the next power of 2 as well, enforce that, and use it to prove the leaf together with the length? Locating the item (considering re-balances) still requires some effort, but really is not that complicated, and worth the small cost. Edit: the real challenge is keeping branch paths stable across additions.
  2. List re-balancing is not so bad. I think it is better than having to do extra hashing linearly for each additional element, which would make dynamic lists much less practical than fixed-length lists. Edit: since each next level keeps expanding, it's not linear. But an item at index 0 still has a huge advantage with this design.
  3. I do like how it preserves the binary-tree generalized-index system, but indices can get much longer this way because of the imbalance; IMO, something to avoid at all costs. (E.g. you don't want a 10k-bit bitstring as a generalized index to identify a validator in a large dynamic list, like the registry.) Not linear, but also not perfect.

Also, I considered a similar idea during EDCON in Sydney, though not directly applied to merkle trees: use a bitstring to identify which elements in a structure are present (effectively the same as a generalized index if you only consider 1 element), but for multiple elements. If SSZ-encoded messages are prefixed with such a thing, you effectively get the ability to null/union everything you want (the reason for bringing up the idea: nulls/unions are still not implemented :( ). But for large lists it is not practical, for the same reasons a linear merkle structure is not.

Edit TLDR: I misunderstood the structure. It's not linear; every level keeps expanding, to allow for bigger additions. But higher-level leaves still have some cost benefit over lower-level leaves.

And here's a quick helper function for calculating the next power of 2 for merkleization, for future pseudo-code/experiments. I updated the merkle tree code in the SSZ typing draft PR, but that is still stuck in limbo after I switched back to testing. https://github.com/ethereum/eth2.0-specs/blob/08faa86d706c31cd9690cb483d0df32f66696396/test_libs/pyspec/eth2spec/utils/merkle_minimal.py

@vbuterin (Contributor, Author) commented:

> (e.g. you don't want a 10k-bit bitstring as a generalized index to identify a validator in a large dynamic list, like the registry)

To be clear, there is no linear length blowup like this; the increase in length is only a single hash (this is because the number of leaves at each level of the tree still goes up by 2x per level). If Merkle proofs were length 8 before, on average they'll be ~length 9.

> And here's a quick helper function for calculating the next power of 2 for merkleization,

Ooh, fun! Probably not the best thing to put in code that's meant for people to read (def next_power_of_2(x): return 1 if x == 1 else 2 * next_power_of_2((x+1)//2) is easier to reason about mathematically in your head and understand), but definitely much higher performance.

@dankrad (Contributor) commented May 25, 2019

I think I found a much more elegant way to write next_power_of_2 in Python when working on the phase 1 execution:

def next_power_of_2(v: int) -> int:
    # Note: (v - 1).bit_length() rather than v.bit_length(), so that a v
    # that is already a power of two maps to itself, matching the
    # recursive definition above.
    return 1 << (v - 1).bit_length()
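
A quick hypothetical sanity check against the recursive definition quoted above:

def recursive_next_power_of_2(x):
    return 1 if x == 1 else 2 * recursive_next_power_of_2((x + 1) // 2)

assert all(next_power_of_2(v) == recursive_next_power_of_2(v)
           for v in range(1, 10000))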

@protolambda (Collaborator) commented:

@dankrad Nice! I knew a few other scripting languages track bit length for large integers, but forgot about it here. Here's some testing + benching code: https://gist.github.com/protolambda/9ad3f46665cb9bdfdb38599cbc626c30 The bit-length version outperforms mine by 2x, and the old one by 15x.

Now back to merkle-tree discussion :)

@protolambda (Collaborator) commented May 25, 2019

Another more stylistic (but experimental and somewhat broken) approach:

Act as if it were a fixed-length list of some maximum size (e.g. 32 levels deep), and compute a normal binary-tree merkle root, very similar to what we currently have with the deposit tree. However, we wrap the hash function to make smaller lists less expensive. The new hash-function definition would be:

def combine_chunks(a, b, level, index_at_level, length_at_level):
    # level: depth of this combination node; index_at_level: its index
    # within that level; length_at_level: number of nodes in that level.
    if index_at_level + 1 == length_at_level:
        # b is out of bounds: just propagate a if b is zero.
        assert b == ZERO_CHUNK
        return a
    elif level < MAX_LEVELS:
        return hash(a + b)  # hash of the concatenated chunks, as currently
    else:
        return hash(a + b + to_chunk(index_at_level))
# Note: the last two returns could be combined, see the comments on consistency.

Note: verification is not exactly the same; one needs to mix in the index at leaf level, based on the type.

Now, we can ignore the empty part of the tree, and just supply a bunch of zeroes (not higher-order hashes of zero) for the branch part that completes it to a MAX_LEVELS-deep merkle tree.

               R
              / \
            / \  3
          / \  0
       ...   0
     / \
   /  \  0
  /\  /\
 A B  C 0

Note that we are still mixing in the length, to differentiate a list ending in zeroes from the same list without (some of) those zeroes.

The zeroes in the merkle proof can be compressed where necessary, and hashing with zero is essentially free, so that's great. (Side effect: hashing default lists filled with 0 will be super fast 🎉)

Now leaves have an easily identifiable place, "the same" proof length, and the same generalized index, at all times.
Another nice property is that the generalized index is super easy to compute with the standard formula 2**depth + index, since we have a virtual constant depth of MAX_LEVELS + 1 (don't forget the length mix-in). So the average generalized index looks like 100000.....000001101010 (big-endian here, index at the right end). Given that we are not nesting dynamic lists super deep, maybe 3 levels at most, this inefficiency seems fine.

Open question: if one agrees to standardize the compression of the zero chunks in the proof, we are essentially using the length data to read the proof (much the same as before the unbalanced-tree idea, just now supporting static indices without extra hashing).

I would put MAX_LEVELS at 31 or 63, to make the depth of the tree, with the length mix-in, a nice power of 2.
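
Illustrating the generalized-index formula above (a sketch; MAX_LEVELS = 31 is just the example value suggested here):

MAX_LEVELS = 31

def virtual_generalized_index(index):
    # Constant virtual depth of MAX_LEVELS + 1 (the +1 is the length
    # mix-in), so the leading 1 bit sits at a fixed position for every leaf.
    return 2**(MAX_LEVELS + 1) + index

# e.g. virtual_generalized_index(106) == 0b1000...0001101010 (33 bits)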

TLDR:

  • virtual generalized indices, to keep them in the same place; good for light clients
  • some extra zeroes in the proof, easily compressed if necessary
  • effectively 0 extra hashing; the absolute minimum hashing for a dynamically sized binary merkle tree
  • a maximum dynamic list size, but adjustable per use-case (change MAX_LEVELS)
  • positioning and generalized indices are easy and fast to compute

SHA-256

So, the standard merkle input of 2 chunks is 32*2*8 = 512 bits. However, SHA-256 appends a non-zero delimiter (1 bit) and the input length (64 bits), then pads to a multiple of 512 bits, each 512-bit block being hashed on top of the running state. A 512-bit input therefore already costs two compression blocks, which leaves 512-64-1 = 447 bits in the second block to mix in extra data, like the pair index, essentially for free.

Light clients / Verification flow

Now, we can have a light client ask for a piece of data, with a proof: "hey full node, give me the data at generalized index 1234 please". The light client then verifies the proof like a normal merkle tree, but additionally mixes the expected index in the dynamic list into the hash input during verification, whenever it passes a node that is part of the leaf level of a dynamic list.

Pro: Stable generalized indices, simple proof construction, easy proof compression, and length-independent verification.

Con: verification is more complex. But if we ask for data, we are not asking for it for random reasons; we know the location and typing of what we are asking for, so we can deal with the verification just fine.
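
A rough sketch of what that verification loop could look like (hypothetical helper, not from the thread; it mixes the expected list index into the hash input at the leaf level only, in the spirit of combine_chunks above):

def verify_proof(leaf, branch, gen_index, list_index, root):
    # Standard merkle-branch verification, except that at the leaf level
    # the expected index in the dynamic list is mixed into the hash input.
    value = leaf
    for depth, sibling in enumerate(branch):
        extra = list_index.to_bytes(32, 'little') if depth == 0 else b''
        if (gen_index >> depth) & 1:   # this node is a right child
            value = hash(sibling + value + extra)
        else:
            value = hash(value + sibling + extra)
    return value == root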

Consistent complexity, no typing

We could make the thing more consistent with otherwise-unnecessary mix-ins: mix in the generalized index at every non-out-of-bounds node combination. Now a verifier doesn't have to care about the type anymore; it can just repeatedly mix in the index, for free if the index fits within the 447 bits.

Consistency with normal verification

Instead of mixing in a third argument at the leaf combination level, we could also do an extra hash, and mix in the index of the pair at that level. This keeps the current merkle-proof verification valid, at the cost of 1 hash per pair of leaves.
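
A minimal sketch of this variant (hypothetical; it follows the A+B -> P, P+I -> X scheme spelled out further down the thread):

def combine_leaf_pair(a, b, pair_index):
    # Combine the pair as usual, then mix in the pair's index with one
    # extra hash. Standard merkle-branch verification still works; the
    # branch is just one step longer.
    p = hash(a + b)
    return hash(p + pair_index.to_bytes(32, 'little'))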

@protolambda (Collaborator) commented May 25, 2019

Ok, let's summarize all the variants of merkleization for dynamic-length lists:

Status Quo

  • balanced binary tree
  • generalized indices work, but indices change when the tree balance changes (at power-of-two boundaries)
  • simple, standard verification
  • proofs are dynamically sized (the tree depth changes at power-of-two boundaries)
  • generalized indices are easy to compute, but leaf depth can change

Unbalanced Merkle Tree

  • unbalanced binary tree; cost is approx. 1 extra hash for a proof of 1 element
  • fixed indices, even with append/pop
  • some indices are cheaper, others more expensive to verify
  • simple, standard verification
  • proofs are dynamically sized (larger indices cost more)
  • generalized indices are easy to compute, but leaf depth is variable

Virtually balanced Merkle Tree

Note: the wrapped special-case hash function introduces many edge cases; the design variants still need adjusting

The following applies to all sub-variants:

  • balanced; extra cost depends on the sub-variant, 1 extra hash at most
  • fixed indices
  • all nodes have equal-cost verification
  • a maximum on list contents, which can be high
  • proofs are constant size, but relatively big; the difference is that they are filled with zeroes, which are easy to compress
  • generalized indices are easy to compute, but relatively large (MAX_LEVELS + 1 bits); leaf depth is fixed

Leaf-pair index mix-in

  • 0 extra hashing cost, since indices are mixed into the 447 bits of available SHA-256 input space.
  • Non-standard verification: one needs to add the index to the hash input at the right level(s) of the merkle proof to get the correct hash verification.
  • H(a, 0) -> a, except at leaf level; see "1-hash extra" below for comments on collisions + efficiency.

Consistent index mix-in

  • 0 extra hashing cost; the same available hash input space is used, at every level.
  • Non-standard verification, but no detailed type information is necessary: just know to add the index at the given height, or the generalized index (TBD), to the hash input.

1-hash extra

  • just 1 hash more than the absolute minimum
  • familiar verification: mix in (with an additional hash) the index for each pair of leaves, like a combination of normal chunks. Possibly cheap for multi-proofs of adjacent data.
  • essentially the balanced version of the unbalanced approach. Trade-off: a preset maximum size, and zeroes need to be compressed. Reward: standard generalized-index computation and a balanced tree.
  • H(a, 0) -> a, except at the extra-hash level. This potential case, combined with the index-mix-in collision work-around covered above, means we could re-use some of the hashes (but the tree is not sparse, so the benefit is small).

Sparse merkle-tree alike

  • Also a virtually balanced tree, but with different hashing:
    • fixed indices
    • all nodes have equal-cost verification
    • a maximum on list contents, which can be high
    • similar proof-size and generalized-index-size issues as the other virtual tree; both still within manageable (optimizable) size, see the full comment
  • works for sparse merkle trees; can also be optimized not to, and only handle the H(a, 0) case, not the H(0, b) case
  • Non-standard verification: one needs to deal with 1 or 2 bytes of hash output being thrown away
  • a cheap (but not the cheapest) approach to merkle-tree zeroes

Sources

  • Generalized indices: eth2 light client specs > generalized index
  • 0-based generalized indices: eth2 specs issue #1008
  • SHA-256 pseudocode and performance: the SHA-2 family article on Wikipedia
  • Eth2 exec-spec merkle code: merkle_minimal.py, ssz code > merkleize(chunks)

Credits to Vitalik for pushing so many different ideas forward. Hope my alternative and this summary help us get to a balanced, easy, small, constant-index-supporting merkle tree design.

@vbuterin (Contributor, Author) commented:

> simple, standard verification. Mix in (an additional hash) the index for each pair of leaves. Possibly cheap for multi-proofs of adjacent data.

What do you mean by "the index" here? The index in the list?

I would not call it "standard verification" if the generalized index alone isn't enough to fully define the verification path and the sequence of operations required to verify a branch; extra data being added in at some levels of the tree breaks that invariant...

@protolambda (Collaborator) commented:

> What do you mean by "the index" here? The index in the list?

I think it can be both. It's really more an idea than a verified/tested approach. We can start breaking and fixing things once the basic idea is there.

> I would not call it "standard verification"

Hmm, yes, the hash function is different still. But it's more "standard" than adding additional data to the hash input when combining leaves (like the other variants). Instead, it's like a normal mix-in: A+B->P, P+I->X (A and B are leaves, I is the index). So branch verification doesn't break; it's just one step longer. And what I'm trying to do here is make the leaf elements effectively unique, to avoid collision problems. (Yes, there are issues here still, no free lunch in CS...) Although I'm not sure which exact "attacks" are actually fine. E.g. I'm fine with people supplying shorter proofs when the result of verification for every index is the same. But when leaves are interchangeable, or leaves can be replaced with their hashes, or something else, things could go very wrong.

@JustinDrake (Collaborator) commented:

Closing in favour of issue #1160, which has now achieved rough consensus :)
