From 94d0b193b84227c7d27c8a7478164e2946d5ea42 Mon Sep 17 00:00:00 2001
From: dignifiedquire
Date: Fri, 3 May 2019 17:58:09 +0200
Subject: [PATCH 1/3] add rle+ datastructure

---
 data-structures.md | 59 ++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 57 insertions(+), 2 deletions(-)

diff --git a/data-structures.md b/data-structures.md
index 9e7027db6..7a70905b3 100644
--- a/data-structures.md
+++ b/data-structures.md
@@ -24,7 +24,7 @@ To learn more, take a look at the [Address Spec](https://github.com/filecoin-pro
 
 ## CID
 
-For most objects referenced by Filecoin, a Content Identifier (CID for short) is used. This is effectively a hash value, prefixed with its hash function (multihash) prepended with a few extra labels to inform applications about how to deserialize the given data. To learn more, take a look at the [CID Spec](https://github.com/ipld/cid).
+For most objects referenced by Filecoin, a Content Identifier (CID for short) is used. This is effectively a hash value, prefixed with its hash function (multihash) prepended with a few extra labels to inform applications about how to deserialize the given data. To learn more, take a look at the [CID Spec](https://github.com/ipld/cid).
 
 CIDs are serialized by applying binary multibase encoding, then encoding that as a CBOR byte array with a tag of 42.
 
@@ -68,7 +68,7 @@ type Block struct {
     // MessageReceipts is a set of receipts matching to the sending of the `Messages`.
     // TODO: should be the same type of merkletree-list thing that the messages are
     MessageReceipts []MessageReceipt
-
+
     // The block Timestamp is used to enforce a form of block delay by honest miners.
     // Unix time UTC timestamp stored as an unsigned integer
     Timestamp Timestamp
@@ -363,3 +363,58 @@ StateRoot: Cid("zDPWYqFD5abn4FyknPm1PibXdJ2kwRNVPDabKyzfdXVJGjnDuq4B")
 Messages: []SignedMessage{}
 MessageReceipts: []MessageReceipt{}
 ```
+
+## RLE+ Bitset Encoding
+
+RLE+ is a lossless compression format based on [RLE](https://en.wikipedia.org/wiki/Run-length_encoding).
+It's primary goal is to reduce the size in the case of many individual bits, where RLE breaks down quickly,
+while keeping the same level of compression for large sets of contigous bits.
+
+In tests it has shown to be more compact than RLE iteself, as well as [Concise](https://arxiv.org/pdf/1004.0403.pdf) and [Roaring](https://roaringbitmap.org/).
+
+### Format
+
+The format consists of a header, followed by a series of blocks, of which there are three different types.
+
+The format can be expressed as the following [BNF](https://en.wikipedia.org/wiki/Backus%E2%80%93Naur_form) grammar.
+
+```
+<encoding> ::= <header> <blocks>
+<header> ::= <bit>
+<block> ::= <block_single> | <block_short> | <block_long>
+<block_single> ::= "1"
+<block_short> ::= "01" <bit> <bit> <bit> <bit>
+<block_long> ::= "00" <unsigned_varint>
+<bit> ::= "0" | "1"
+```
+
+An `<unsigned_varint>` is defined as specified [here](https://github.com/multiformats/unsigned-varint).
+
+#### Header
+
+The header indiciates the very first bit of the bit vector to encode. This means the first bit is always
+the same for the encoded and non encoded form.
+
+#### Blocks
+
+The blocks represent how many bits, of the current bit type there are. As `0` and `1` alternate in a bit vector
+the inital bit, which is stored in the header, is enough to determine if a length is currently referencing
+a set of `0`s, or `1`s.
+
+##### Block Single
+
+If the running length of the current bit is only `1`, it is encoded as a single set bit.
+
+##### Block Short
+
+If the running length is less than `16`, it can be encoded into up to four bits, which a short block
+represents. The length is encoded into a 4 bits, and prefixed with `01`, to indicate a short block.
+
+##### Block Long
+
+If the running length is `16` or larger, it is encoded into a varint, and then prefixed with `00` to indicate
+a long block.
+
+
+> **Note:** The encoding is unique, so no matter which algorithm for encoding is used, it should produce
+> the same encoding, given the same input.

From e6a17c5a1184c24229eea746e2d9bf0da61bab8f Mon Sep 17 00:00:00 2001
From: dignifiedquire
Date: Sat, 11 May 2019 19:40:41 +0100
Subject: [PATCH 2/3] add version bits

---
 data-structures.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/data-structures.md b/data-structures.md
index 7a70905b3..6d70583fe 100644
--- a/data-structures.md
+++ b/data-structures.md
@@ -380,7 +380,8 @@ The format can be expressed as the following [BNF](https://en.wikipedia.org/wiki
 
 ```
 <encoding> ::= <header> <blocks>
-<header> ::= <bit>
+<header> ::= <version> <bit>
+<version> ::= "00"
 <block> ::= <block_single> | <block_short> | <block_long>
 <block_single> ::= "1"
 <block_short> ::= "01" <bit> <bit> <bit> <bit>

From a0e3c1b71523ad4310d0433e0ee39d28056e84da Mon Sep 17 00:00:00 2001
From: dignifiedquire
Date: Sat, 11 May 2019 19:41:33 +0100
Subject: [PATCH 3/3] fix typo

---
 data-structures.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/data-structures.md b/data-structures.md
index 6d70583fe..1a124aafd 100644
--- a/data-structures.md
+++ b/data-structures.md
@@ -368,7 +368,7 @@ MessageReceipts: []MessageReceipt{}
 
 RLE+ is a lossless compression format based on [RLE](https://en.wikipedia.org/wiki/Run-length_encoding).
 It's primary goal is to reduce the size in the case of many individual bits, where RLE breaks down quickly,
-while keeping the same level of compression for large sets of contigous bits.
+while keeping the same level of compression for large sets of contiugous bits.
 
 In tests it has shown to be more compact than RLE iteself, as well as [Concise](https://arxiv.org/pdf/1004.0403.pdf) and [Roaring](https://roaringbitmap.org/).
 
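
Review note on the series: the RLE+ format added in patch 1/3, with the `00` version bits from patch 2/3, can be sketched as a small encoder. This is only an illustration under stated assumptions, not the reference implementation. The names `rle_plus_encode` and `unsigned_varint` are hypothetical, the output is a string of `0`/`1` characters rather than packed bytes, and the bit order of the 4-bit length and of the varint bytes (most significant bit first here) is an assumption, since the patch text does not specify it. Empty input is rejected because the header needs a first bit.

```python
from itertools import groupby


def unsigned_varint(n: int) -> bytes:
    """Encode n as an unsigned varint: 7 bits per byte, low group first,
    high bit set on every byte except the last."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def rle_plus_encode(bits: str) -> str:
    """Encode a non-empty string of '0'/'1' characters as an RLE+ bit string."""
    if not bits or set(bits) - {"0", "1"}:
        raise ValueError("expected a non-empty string of '0' and '1'")

    # Header: version bits "00" followed by the first bit of the input.
    out = ["00", bits[0]]

    # Runs alternate between 0s and 1s, so only the run lengths are stored.
    for _, run in groupby(bits):
        length = len(list(run))
        if length == 1:
            out.append("1")  # single block
        elif length < 16:
            # Short block; MSB-first 4-bit length is an assumption.
            out.append("01" + format(length, "04b"))
        else:
            # Long block; writing each varint byte MSB-first is an assumption.
            varint = "".join(format(b, "08b") for b in unsigned_varint(length))
            out.append("00" + varint)
    return "".join(out)


if __name__ == "__main__":
    print(rle_plus_encode("0111"))
```

For the input `0111` the sketch emits the header `000`, a single block `1` for the lone `0`, and a short block `01` + `0011` for the run of three `1`s. Because every run length maps to exactly one block type, two encoders following the same bit-order convention produce identical output, which is the uniqueness property the note in patch 1/3 describes.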