From 94d0b193b84227c7d27c8a7478164e2946d5ea42 Mon Sep 17 00:00:00 2001
From: dignifiedquire <dignifiedquire@users.noreply.github.com>
Date: Fri, 3 May 2019 17:58:09 +0200
Subject: [PATCH 1/3] add rle+ datastructure

---
 data-structures.md | 59 ++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 57 insertions(+), 2 deletions(-)
diff --git a/data-structures.md b/data-structures.md
index 9e7027db6..7a70905b3 100644
--- a/data-structures.md
+++ b/data-structures.md
@@ -24,7 +24,7 @@ To learn more, take a look at the [Address Spec](https://github.com/filecoin-pro
 
 ## CID
 
-For most objects referenced by Filecoin, a Content Identifier (CID for short) is used. This is effectively a hash value, prefixed with its hash function (multihash) prepended with a few extra labels to inform applications about how to deserialize the given data. To learn more, take a look at the [CID Spec](https://github.com/ipld/cid). 
+For most objects referenced by Filecoin, a Content Identifier (CID for short) is used. This is effectively a hash value, prefixed with its hash function (multihash) prepended with a few extra labels to inform applications about how to deserialize the given data. To learn more, take a look at the [CID Spec](https://github.com/ipld/cid).
 
 CIDs are serialized by applying binary multibase encoding, then encoding that as a CBOR byte array with a tag of 42.
 
@@ -68,7 +68,7 @@ type Block struct {
 	// MessageReceipts is a set of receipts matching to the sending of the `Messages`.
 	// TODO: should be the same type of merkletree-list thing that the messages are
 	MessageReceipts []MessageReceipt
-    
+
     // The block Timestamp is used to enforce a form of block delay by honest miners.
     // Unix time UTC timestamp stored as an unsigned integer
     Timestamp Timestamp
@@ -363,3 +363,58 @@ StateRoot:       Cid("zDPWYqFD5abn4FyknPm1PibXdJ2kwRNVPDabKyzfdXVJGjnDuq4B")
 Messages:        []SignedMessage{}
 MessageReceipts: []MessageReceipt{}
 ```
+
+## RLE+ Bitset Encoding
+
+RLE+ is a lossless compression format based on [RLE](https://en.wikipedia.org/wiki/Run-length_encoding).
+It's primary goal is to reduce the size in the case of many individual bits, where RLE breaks down quickly,
+while keeping the same level of compression for large sets of contigous bits.
+
+In tests it has shown to be more compact than RLE iteself, as well as [Concise](https://arxiv.org/pdf/1004.0403.pdf) and [Roaring](https://roaringbitmap.org/).
+
+### Format
+
+The format consists of a header, followed by a series of blocks, of which there are three different types.
+
+The format can be expressed as the following [BNF](https://en.wikipedia.org/wiki/Backus%E2%80%93Naur_form) grammar.
+
+```
+    <encoding> ::= <header> <blocks>
+      <header> ::= <bit>
+      <blocks> ::= <block_single> | <block_short> | <block_long>
+<block_single> ::= "1"
+ <block_short> ::= "01" <bit> <bit> <bit> <bit>
+  <block_long> ::= "00" <unsigned_varint>
+         <bit> ::= "0" | "1"
+```
+
+An `<unsigned_varint>` is defined as specified [here](https://github.com/multiformats/unsigned-varint).
+
+#### Header
+
+The header indicates the very first bit of the bit vector to encode. This means the first bit is always
+the same in the encoded and non-encoded forms.
+
+#### Blocks
+
+Each block represents how many bits of the current bit type there are. As runs of `0`s and `1`s alternate in a bit vector,
+the initial bit, which is stored in the header, is enough to determine whether a length currently refers to
+a run of `0`s or a run of `1`s.
+
+##### Block Single
+
+If the running length of the current bit is only `1`, it is encoded as a single set bit.
+
+##### Block Short
+
+If the running length is greater than `1` but less than `16`, it fits into four bits, which is what a short block
+represents. The length is encoded into 4 bits and prefixed with `01` to indicate a short block.
+
+##### Block Long
+
+If the running length is `16` or larger, it is encoded as an `<unsigned_varint>` and prefixed with `00` to indicate
+a long block.
+
+
+> **Note:** The encoding is unique, so no matter which algorithm for encoding is used, it should produce
+> the same encoding, given the same input.
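Note for reviewers: the block rules added in this patch can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not a reference implementation. The function names (`unsigned_varint`, `encode_block`, `rle_plus_encode`) are hypothetical, the output is a string of "0"/"1" characters that mirrors the BNF rather than packed bytes, and the header is the single first bit as defined in this patch. The section does not fix a bit order, so writing the 4-bit length and each varint byte most significant bit first is an assumption. Following the prose ("a series of blocks"), the sketch emits one block per run.

```python
from itertools import groupby


def unsigned_varint(n):
    """Encode n as an unsigned varint: 7 bits per byte, low group first,
    high bit set on every byte except the last."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_block(length):
    """Pick the block type for one run length (length >= 1)."""
    if length == 1:
        return "1"  # block_single
    if length < 16:
        # block_short: ASSUMPTION, the 4-bit length is written MSB first
        return "01" + format(length, "04b")
    # block_long: ASSUMPTION, each varint byte is written MSB first
    return "00" + "".join(format(b, "08b") for b in unsigned_varint(length))


def rle_plus_encode(bits):
    """Encode a non-empty sequence of 0/1 values as a string of '0'/'1'."""
    if not bits:
        raise ValueError("the empty bit vector is not covered by the text")
    runs = [len(list(group)) for _, group in groupby(bits)]
    header = str(int(bits[0]))  # header = first bit of the vector
    return header + "".join(encode_block(n) for n in runs)


# Runs of 3 ones, 1 zero, 20 ones: short block, single block, long block.
print(rle_plus_encode([1, 1, 1, 0] + [1] * 20))  # prints 101001110000010100
```

Because the block type is chosen purely from the run length, any encoder that follows these rules yields the same output for the same input, which is the uniqueness property the note in the patch describes.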

From e6a17c5a1184c24229eea746e2d9bf0da61bab8f Mon Sep 17 00:00:00 2001
From: dignifiedquire <dignifiedquire@users.noreply.github.com>
Date: Sat, 11 May 2019 19:40:41 +0100
Subject: [PATCH 2/3] add version bits

---
 data-structures.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/data-structures.md b/data-structures.md
index 7a70905b3..6d70583fe 100644
--- a/data-structures.md
+++ b/data-structures.md
@@ -380,7 +380,8 @@ The format can be expressed as the following [BNF](https://en.wikipedia.org/wiki
 
 ```
     <encoding> ::= <header> <blocks>
-      <header> ::= <bit>
+      <header> ::= <version> <bit>
+     <version> ::= "00"
       <blocks> ::= <block_single> | <block_short> | <block_long>
 <block_single> ::= "1"
  <block_short> ::= "01" <bit> <bit> <bit> <bit>
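Note for reviewers: with this change the header becomes three bits, the two version bits followed by the first bit of the vector. A minimal sketch of writing and checking that header, assuming the stream is handled as a string of "0"/"1" characters as in the BNF; the names `VERSION`, `encode_header`, and `decode_header` are hypothetical and not taken from the spec.

```python
VERSION = "00"  # the only version value defined by the grammar


def encode_header(first_bit):
    """Return the header: version bits followed by the first bit."""
    return VERSION + str(int(first_bit))


def decode_header(stream):
    """Split a '0'/'1' string into (first_bit, remaining block bits)."""
    version, first = stream[:2], stream[2:3]
    if version != VERSION:
        raise ValueError("unsupported version bits: " + repr(version))
    if first not in ("0", "1"):
        raise ValueError("stream ends before the first bit")
    return int(first), stream[3:]


print(encode_header(1))            # prints 001
print(decode_header("001010011"))  # prints (1, '010011')
```

Rejecting unknown version bits up front lets a decoder fail early rather than misread the block stream if a later revision changes the format.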

From a0e3c1b71523ad4310d0433e0ee39d28056e84da Mon Sep 17 00:00:00 2001
From: dignifiedquire <dignifiedquire@users.noreply.github.com>
Date: Sat, 11 May 2019 19:41:33 +0100
Subject: [PATCH 3/3] fix typo

---
 data-structures.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/data-structures.md b/data-structures.md
index 6d70583fe..1a124aafd 100644
--- a/data-structures.md
+++ b/data-structures.md
@@ -368,7 +368,7 @@ MessageReceipts: []MessageReceipt{}
 
 RLE+ is a lossless compression format based on [RLE](https://en.wikipedia.org/wiki/Run-length_encoding).
 It's primary goal is to reduce the size in the case of many individual bits, where RLE breaks down quickly,
-while keeping the same level of compression for large sets of contigous bits.
+while keeping the same level of compression for large sets of contiguous bits.
 
 In tests it has shown to be more compact than RLE iteself, as well as [Concise](https://arxiv.org/pdf/1004.0403.pdf) and [Roaring](https://roaringbitmap.org/).