diff --git a/data-structures.md b/data-structures.md
index 9e7027db6..1a124aafd 100644
--- a/data-structures.md
+++ b/data-structures.md
@@ -24,7 +24,7 @@ To learn more, take a look at the [Address Spec](https://github.com/filecoin-pro
 
 ## CID
 
-For most objects referenced by Filecoin, a Content Identifier (CID for short) is used. This is effectively a hash value, prefixed with its hash function (multihash) prepended with a few extra labels to inform applications about how to deserialize the given data. To learn more, take a look at the [CID Spec](https://github.com/ipld/cid).
+For most objects referenced by Filecoin, a Content Identifier (CID for short) is used. This is effectively a hash value, prefixed with its hash function (multihash) prepended with a few extra labels to inform applications about how to deserialize the given data. To learn more, take a look at the [CID Spec](https://github.com/ipld/cid).
 
 CIDs are serialized by applying binary multibase encoding, then encoding that as a CBOR byte array with a tag of 42.
 
@@ -68,7 +68,7 @@ type Block struct {
 	// MessageReceipts is a set of receipts matching to the sending of the `Messages`.
 	// TODO: should be the same type of merkletree-list thing that the messages are
 	MessageReceipts []MessageReceipt
-
+
 	// The block Timestamp is used to enforce a form of block delay by honest miners.
 	// Unix time UTC timestamp stored as an unsigned integer
 	Timestamp Timestamp
@@ -363,3 +363,59 @@ StateRoot: Cid("zDPWYqFD5abn4FyknPm1PibXdJ2kwRNVPDabKyzfdXVJGjnDuq4B")
 Messages: []SignedMessage{}
 MessageReceipts: []MessageReceipt{}
 ```
+
+## RLE+ Bitset Encoding
+
+RLE+ is a lossless compression format based on [RLE](https://en.wikipedia.org/wiki/Run-length_encoding).
+Its primary goal is to reduce the size in the case of many individual bits, where RLE breaks down quickly,
+while keeping the same level of compression for large sets of contiguous bits.
+
+In tests it has been shown to be more compact than RLE itself, as well as [Concise](https://arxiv.org/pdf/1004.0403.pdf) and [Roaring](https://roaringbitmap.org/).
+
+### Format
+
+The format consists of a header, followed by a series of blocks, of which there are three different types.
+
+The format can be expressed as the following [BNF](https://en.wikipedia.org/wiki/Backus%E2%80%93Naur_form) grammar.
+
+```
+ <encoding> ::= <header> <blocks>
+ <header> ::= <version> <bit>
+ <version> ::= "00"
+ <block> ::= <block_single> | <block_short> | <block_long>
+ <block_single> ::= "1"
+ <block_short> ::= "01" <bit> <bit> <bit> <bit>
+ <block_long> ::= "00" <unsigned_varint>
+ <bit> ::= "0" | "1"
+```
+
+An `<unsigned_varint>` is defined as specified [here](https://github.com/multiformats/unsigned-varint).
+
+#### Header
+
+The header indicates the very first bit of the bit vector to encode. This means the first bit is always
+the same for the encoded and non-encoded form.
+
+#### Blocks
+
+The blocks represent how many bits of the current bit type there are. As `0` and `1` alternate in a bit vector,
+the initial bit, which is stored in the header, is enough to determine whether a length currently refers to
+a run of `0`s or `1`s.
+
+##### Block Single
+
+If the running length of the current bit is only `1`, it is encoded as a single set bit.
+
+##### Block Short
+
+If the running length is less than `16`, it can be encoded into up to four bits, which a short block
+represents. The length is encoded into 4 bits, and prefixed with `01`, to indicate a short block.
+
+##### Block Long
+
+If the running length is `16` or larger, it is encoded into a varint, and then prefixed with `00` to indicate
+a long block.
+
+
+> **Note:** The encoding is unique, so no matter which algorithm for encoding is used, it should produce
+> the same encoding, given the same input.
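
The block selection rules added in this diff can be sketched in Go, the language the spec's struct examples already use. This is only a minimal illustration under stated assumptions, not a reference implementation: `BlockKind`, `Block`, `classify`, and `Runs` are hypothetical names, and the sketch stops at run lengths and block types rather than emitting bits, because the bit order inside a block is not stated in the text.

```go
package main

import "fmt"

// BlockKind names the three block types from the "Blocks" section.
// The type and its constants are hypothetical names for this sketch.
type BlockKind string

const (
	Single BlockKind = "single" // run of exactly 1 bit, written as "1"
	Short  BlockKind = "short"  // run shorter than 16, prefix "01" plus a 4-bit length
	Long   BlockKind = "long"   // run of 16 or more, prefix "00" plus a varint length
)

// Block is one run of identical bits together with its block type.
type Block struct {
	Kind   BlockKind
	Length uint64
}

// classify picks the block type for a run length using the thresholds in the text.
func classify(n uint64) BlockKind {
	switch {
	case n == 1:
		return Single
	case n < 16:
		return Short
	default:
		return Long
	}
}

// Runs splits a bit vector into its first bit (the value the header stores)
// and the alternating runs that follow. An empty input yields no blocks.
func Runs(bits []bool) (first bool, blocks []Block) {
	if len(bits) == 0 {
		return false, nil
	}
	first = bits[0]
	cur, n := bits[0], uint64(0)
	for _, b := range bits {
		if b == cur {
			n++
			continue
		}
		// The bit value flipped: close the current run and start a new one.
		blocks = append(blocks, Block{Kind: classify(n), Length: n})
		cur, n = b, 1
	}
	blocks = append(blocks, Block{Kind: classify(n), Length: n})
	return first, blocks
}

func main() {
	// One 1, then three 0s, then twenty 1s.
	bits := []bool{true, false, false, false}
	for i := 0; i < 20; i++ {
		bits = append(bits, true)
	}
	first, blocks := Runs(bits)
	fmt.Println(first, blocks)
}
```

Because each run length maps to exactly one block type here, two encoders that agree on these thresholds select the same blocks for the same input, which is the property the note relies on.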