Cleaner bitfields #1019

JustinDrake · 2019-05-01T15:31:32Z

This draft PR tries to fix a longstanding wart (see #371, #667) regarding bitfields. We define bitfields as lists or vectors of bits (as opposed to bytes and uint64 respectively). Benefits:

Remove ugly-as-hell helper functions verify_bitfield and get_bitfield_bit (save 28 lines)
No more bit manipulation with justification_bitfield—more readable
Merkleisation-friendly justification_bitfield (i.e. justification_bitfield does not change unless a new epoch is justified)
Increased number of bits in justification_bitfield in case dApps and light clients find it useful (set to EPOCHS_PER_HISTORICAL_VECTOR == 8192)
Semantically natural typing (bytes and uint64 were hacks)
Use bit type for custody_bit

This requires two minor SSZ changes:

Add bit as an alias to bool (cosmetic)
Pack lists and vectors of bits similar to packing of uintN (substantive)

This draft PR try to fix a longstanding wart (see #196, #371, #667) regarding bitfields. We define bitfields as lists or vectors of `bool`s, as opposed to `bytes`. Benefits: * Remove ugly-as-hell helper function `verify_bitfield` * No more bit manipulation with `justification_bitfield` * Merkleisation-friendly `justification_bitfield` (i.e. `justification_bitfield` does not change unless a new epoch is justified). * Easy parametrisation of the number of bits in `justification_bitfield` (can be more or fewer than 64—more may be useful for dApps and light clients) * Semantically cleaner typing This requires a minor SSZ change (Merkleisation not affected) where lists and vectors of `bool`s should be packed, similar to packing of `uintN`. We could alias `bool` to `uint1` so that the packing logic only needs to be defined once for `uintN`.

dankrad · 2019-05-01T15:37:28Z

I'm in favour of this because it adds clarity (bit manipulations should ideally account for as little as possible of the spec, as they are very hard to read).

dankrad · 2019-05-01T15:40:54Z

Rename bool to bit (alternatively, add bit as an alias to bool.

I still prefer the "bool" name as it is a functional description (and "bit" an implementation detail) but on the other hand, "bitfield" feels more natural than "boolfield".

JustinDrake · 2019-05-01T15:58:13Z

vbuterin · 2019-05-02T01:18:05Z

Wouldn't this also require a special case in SSZ? It makes the phase 0 spec cleaner but the SSZ serialization more complicated by about the same amount...

Though you could argue that adding the bigint type to SSZ has a similar effect, and I've advocated for that :)

JustinDrake · 2019-05-03T09:05:29Z

Wouldn't this also require a special case in SSZ?

Yes, the substantive change involves packing lists and vectors of bits similarly to packing lists and vectors of uintN (yay for consistency!). My best guess is that it is a 2-liner in SSZ, versus a 32-liner in the state transition function. I'll wait until after the SSZ rewrite is done so we can make a more informed decision on the SSZ complexity increase (cc @protolambda).

dankrad · 2019-05-03T09:11:19Z

My best guess is that it is a 2-liner in SSZ, versus a 32-liner in the state transition function. I'll wait until after the SSZ rewrite is done so we can make a more informed decision on the SSZ complexity increase (cc @protolambda).

I think I would argue that even if it is not a complexity reduction, we should just do it because SSZ is the right place to deal with data types, and this is clearly a question of data types. Not having to worry about bitfields at higher levels of the spec makes it more readable; basically it should be possible to read the Eth2.0 spec without reading SSZ and just assuming that SSZ will do the right thing when it comes to bools/bits.

vbuterin · 2019-05-04T10:22:18Z

Yes, the substantive change involves packing lists and vectors of bits similarly to packing lists and vectors of uintN (yay for consistency!) My best guess is that it is a 2-liner in SSZ, versus a 32-liner in the state transition function.

Probably a bit more than 2 lines. Here's my attempt:

def serialize_bits(bits):
    o = bytearray((len(bits) + 7) // 8)
    for i, bit in enumerate(bits):
        o[i // 8] |= bit << (i % 8)
    return bytes(o)

Note however that this is insufficient because unlike all other SSZ data structures, where we can deduce the number of items from the length, here we cannot do so, because an N byte list could correspond to a [8N .... 8N+7] size bitfield. So we'd need a custom length prefix (we don't need that today because we're feeding in the length from the committee size, so in some sense we have a "static" data type that's not really "static"). So perhaps we'd have to do something like return bytes([len(bits) % 8]) + o

My instinct is still that bitfields in SSZ are on net a good thing, but there definitely are costs to doing it.

basically it should be possible to read the Eth2.0 spec without reading SSZ and just assuming that SSZ will do the right thing when it comes to bools/bits.

Fully agree with this!

vbuterin · 2019-05-04T10:42:25Z

Ooh, I have another idea for how to serialize variable-sized bitfields: use the generalized index scheme from https://github.com/ethereum/eth2.0-specs/blob/dev/specs/light_client/merkle_proofs.md#generalized-merkle-tree-index. As I describe here, there is a natural correspondence between generalized indices and bitfields, in that a generalized index is a bitfield, with 0s representing when to go left and 1s representing when to go right, but with an extra 1 bit to signify where the bitfield starts from. For example, 9 is the generalized index for "the right child of the left child of the left child of the root", 9 in binary is 1001, where the first 1 is the start marker, and then the 001 is the path from root to leaf. One could take this exact scheme and use it to re-interpret any bigint as a bitfield for any other purpose, and then we'd just need SSZ to support bigints.

JustinDrake · 2019-05-04T11:25:07Z

Probably a bit more than 2 lines. Here's my attempt:

Serialisation is not part of consensus :) Serialisation-specific implementation details (such as a custom length prefix—which is a good point) do not contribute towards the "strict" line count. This PR is a natural continuation to #924.

In the SSZ rewrite in progress (where the short Merkleisation-specific code is to be incorporated in 0_beacon_chain.md) the main thing to do is generalise the pack function from bytes to bits. I wouldn't be surprised if it ends up being 1-liner (or even a 0-liner) 😂Proof is in the pudding though.

I have another idea for how to serialize variable-sized bitfields: use the generalized index scheme

Interesting!

and then we'd just need SSZ to support bigints

Can we simply alias bigint and bitfield to list[bit] and call it a day?

vbuterin · 2019-05-04T11:55:12Z

Can we simply alias bigint and bitfield to list[bit] and call it a day?

We could, though I would argue that bigint is the more natural "base-ish" type. The fact is that the way this serialization would work can't avoid being fundamentally different to how every other List[x] works, and we don't want to confuse people by making non-similar things look similar. That said, naming is secondary, if we're ok with (i) the encoding algorithm for bitfields defined by def bitfield_to_int(bits): return (1 << len(bits)) + sum([(1 << i) * bit for i, bit in enumerate(bits)]) or its inverse def int_to_bitfield(n): return [(n >> i) % 2 for i in range(log2(n))] and (ii) adding an SSZ type that's identical to bytes except it enforces that leading zero bytes are illegal, then I'd be ok describing it either way, as they're equivalent under the hood.

the main thing to do is generalise the pack function from bytes to bits

The big problem here is that in pretty much every computer language, and the underlying architectures, things can't be N bits long for N % 8 != 0. The very output of serialize, and the expected input to hash, are both bytes.

Serialisation is not part of consensus :)

Everything I said above applies to tree-hashing too :)

JustinDrake · 2019-05-04T18:53:26Z

def bitfield_to_int(bits): return (1 << len(bits)) + sum([(1 << i) * bit for i, bit in enumerate(bits)])
def int_to_bitfield(n): return [(n >> i) % 2 for i in range(log2(n))]

This is neat :) The main question is whether we want 0-based or 1-based indexing of generalised indices (#1008). An argument in favour of 0-based indexing is that bigints have support for 0, so with 0-based indexing the concepts of "bigint" and "generalised index" become equivalent.

adding an SSZ type that's identical to bytes except it enforces that leading zero bytes are illegal

It's still not clear to me why adding an SSZ type is necessary given we can use existing types.

the expected input to hash is bytes

Right, the current chunkify function should be generalised to right-pad the last chunk with zero bits (as opposed to zero bytes). Maybe replace

def chunkify(byte_string):
    byte_string += b'\x00' * (-len(byte_string) % 32)
    return [byte_string[i:i + 32] for i in range(0, len(byte_string), 32)]

with something like

def chunkify(bits):
    bits += [0b0] * (-len(bits) % 256)
    byte_string = bytes([sum([byte[b] << b for b in range(0, 8)]) for byte in zip(*(iter(bits),) * 8)])
    return [byte_string[i:i + 32] for i in range(0, len(byte_string), 32)]

vbuterin · 2019-05-05T02:23:11Z

I'm really inclined to say that making bits rather than bytes be the base unit of SSZ just for this single use case is overkill.

It's still not clear to me why adding an SSZ type is necessary given we can use existing types.

To preserve the norm that there's only one valid way to serialize something. If we have an injective map from bitfields to bigints, then we want there to be only one valid way to serialize a bigint, which means that deserializing eg. \x05\x00 should return an error (it should be \x05).

An argument in favour of 0-based indexing is that bigints have support for 0, so with 0-based indexing the concepts of "bigint" and "generalised index" become equivalent.

That is interesting. So if we offset everything by 1 then we get a fully bijective map. Though that's only a really strong desideratum if the goal is to make List[bit] be the base type, and I still feel like that would confuse things, and we should stick to SSZ objects all being byte-based, in which case bigint would be the base type, and we'd only need an injective map (ie. each bitfield has a unique corresponding bigint, not necessarily vice versa). In that case, 1-based is better because the bit representation of the bigint would just be the bitfield with an extra 1 on the end, as opposed to do an awkward subtract-and-carry operation on a bitfield to deserialize it.

djrtwo · 2019-05-05T14:04:36Z

I'm really inclined to say that making bits rather than bytes be the base unit of SSZ just for this single use case is overkill.

Late to the party but very much agree with this.

djrtwo · 2019-05-27T14:47:12Z

Are we closing this in favor of the bigint PR?

JustinDrake · 2019-05-27T15:05:26Z

Are we closing this in favor of the bigint PR?

I guess we want to merge the two. The bigint PR current only touches simple-serialize.md. This PR cleans up 0_beacon-chain.md and can be adapted to however bigints get implemented.

djrtwo · 2019-06-25T22:25:00Z

@dankrad Can we close this in favor of your coming PR?

JustinDrake · 2019-06-27T11:42:00Z

Closing in favour of #1224

JustinDrake added 4 commits May 1, 2019 16:43

Update 0_beacon-chain.md

10d38a2

Update 0_beacon-chain.md

c26983e

Update 0_beacon-chain.md

df8c41d

Update 0_beacon-chain.md

c789a42

JustinDrake added 2 commits May 1, 2019 17:05

Update 0_beacon-chain.md

a012dcf

Update 0_beacon-chain.md

5d7f9e4

JustinDrake added the scope:SSZ Simple Serialize label May 1, 2019

protolambda mentioned this pull request May 1, 2019

implement SOS serialization in minimal_ssz.py #1024

Merged

JustinDrake mentioned this pull request May 5, 2019

Little-endian vs big-endian (take 2) #1046

Closed

JustinDrake mentioned this pull request May 6, 2019

Phase 0 state transition checklist—take 5 #1054

Closed

21 tasks

JustinDrake added the milestone:June 30 freeze 🥶 Phase 0 spec freeze for long-lived cross-client testnet label Jun 15, 2019

JustinDrake closed this Jun 27, 2019

djrtwo deleted the JustinDrake-patch-18 branch August 14, 2019 17:29

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cleaner bitfields #1019

Cleaner bitfields #1019

JustinDrake commented May 1, 2019 •

edited

Loading

dankrad commented May 1, 2019 •

edited

Loading

dankrad commented May 1, 2019 •

edited

Loading

JustinDrake commented May 1, 2019

vbuterin commented May 2, 2019

JustinDrake commented May 3, 2019 •

edited

Loading

dankrad commented May 3, 2019

vbuterin commented May 4, 2019

vbuterin commented May 4, 2019 •

edited

Loading

JustinDrake commented May 4, 2019

vbuterin commented May 4, 2019

JustinDrake commented May 4, 2019

vbuterin commented May 5, 2019

djrtwo commented May 5, 2019

djrtwo commented May 27, 2019

JustinDrake commented May 27, 2019

djrtwo commented Jun 25, 2019

JustinDrake commented Jun 27, 2019

Cleaner bitfields #1019

Cleaner bitfields #1019

Conversation

JustinDrake commented May 1, 2019 • edited Loading

dankrad commented May 1, 2019 • edited Loading

dankrad commented May 1, 2019 • edited Loading

JustinDrake commented May 1, 2019

vbuterin commented May 2, 2019

JustinDrake commented May 3, 2019 • edited Loading

dankrad commented May 3, 2019

vbuterin commented May 4, 2019

vbuterin commented May 4, 2019 • edited Loading

JustinDrake commented May 4, 2019

vbuterin commented May 4, 2019

JustinDrake commented May 4, 2019

vbuterin commented May 5, 2019

djrtwo commented May 5, 2019

djrtwo commented May 27, 2019

JustinDrake commented May 27, 2019

djrtwo commented Jun 25, 2019

JustinDrake commented Jun 27, 2019

JustinDrake commented May 1, 2019 •

edited

Loading

dankrad commented May 1, 2019 •

edited

Loading

dankrad commented May 1, 2019 •

edited

Loading

JustinDrake commented May 3, 2019 •

edited

Loading

vbuterin commented May 4, 2019 •

edited

Loading