Zipack is a little bit complicated in comparison to other binary format like MessagePack and BSON. I imported some great ideas into zipack to replace UTF-8 and IEEE Floating Point. That's why I call zipack "the perfect encoding" with confidence.
Serialization is conversion from application objects into Zipack types. For example, JavaScript's Array should be converted into Zipack's List. Parsing is inversed process of serialization.
The big-endian VLQ is a technique to encode variable-length quantities on top of 7 bits, it uses the most significant bit of every byte to indicate the continuation of current data. 0xxxxxxx stands for the last byte, 1xxxxxxx means there are bytes to come. The decoder won't stop until 0xxxxxxx, different length of VLQ shall be like this:
- 0xxxxxxx
- 1xxxxxxx 0xxxxxxx
- 1xxxxxxx 1xxxxxxx 0xxxxxxx
- ......
The key utility in zipack is called the biased VLQ natural number which is implemented by Git and of cause Zipack, we call it VLQ's Natural for short. So, the VLQ Natural is the bijective numeration for natural numbers(0,1,2...). The mapping between VLQ and natural numbers is listed below:
natural(decimal) | base128 VLQ (binary) |
---|---|
0 | 0000000 |
1 | 0000001 |
...... | ...... |
127 | 1111111 |
128 | 0000000 0000000 |
129 | 0000000 0000001 |
...... | ...... |
16511 | 1111111 1111111 |
Make sure u understand this concept cause Zipack is heavily based on VLQ's Natural
I abandoned UTF-8 for better compression. Since every Unicode character has an unique number from 0, so Zipack String maps Unicode to VLQ's natural. For example, char 'A' in Unicode is 65, so it's Zipack format is 01000001. Zipack String is used in 2 ways: the String type and the key in Map.
I abandoned IEEE's Floating Point for better compression. Literally every non-integer can be splited by dot(.) into 2 parts: integer part and precision part. Since the trailing '0' in precision part is meaningless(e.g. 0.120=0.12), the heading '0' in natural number is also meaningless(e.g. 012=12). So when we reverse every bit of the precision part, the outcome can be uniquely mapped to a natural number(counting from 1, because 0 is integer). Here is an example encoding 110.0101 in binary:
- remove meaningless '0' in both sides, no changes
- split "110.0101" into "110" and "0101"
- encode 110 into a VLQ's Natural (A)
- reverse "0101" into "1010"
- 1010 - 1 = 1001
- encode 1001 into a VLQ's Natural (B)
- concat A and B seamlessly and output (AB)
Zipack supports 21 data types(including 6 reserved types), their definition is tree.km. Generally every data type contains 3 parts: _CLASS, _LENGTH and _PAYLOAD. Here is their definition table:
_CLASS | _LENGTH | length of | _PAYLOAD | |
---|---|---|---|---|
Mini Natural | 0 | none | \ | 7-bit (0~127) |
VLQ +Int | 1111 1000 | none | \ | VLQ's natural + 128 |
VLQ -Int | 1111 1001 | none | \ | -1 - VLQ's natural |
+Precision | 1111 0010 | none | \ | Zipack's Precision |
-Precision | 1111 0011 | none | \ | Zipack's Precision |
Mini String | 100 | 5-bit (0~31) | count of chars | Zipack's String |
VLQ String | 1111 0101 | VLQ's natural + 32 | count of chars | Zipack's String |
(key in Map) | none | VLQ's natural | count of key's chars | Zipack's String |
Bytes | 1111 0100 | VLQ's natural | count of bytes | the pure bytes |
Boolean | 1111 0000, 1111 0001 | none | \ | \ |
Null | 1111 1010 | none | \ | \ |
Mini List | 101 | 5-bit (0~31) | count of elements | List elements |
VLQ List | 1111 0110 | VLQ's natural + 32 | count of elements | List elements |
Mini Map | 110 | 5-bit (0~31) | count of key-value pairs | key-value pairs |
VLQ Map | 1111 0111 | VLQ's natural + 32 | count of key-value pairs | key-value pairs |
reserved types | FB, FC, FD, FE, FF | \ | \ | \ |
reserved types | 1110 | 4-bit(0~15) | count of objects | countable objects |
a single byte:
+--------+
| |
+--------+
zero or more bytes:
+--------+
| ....
+--------+
a VLQ's Natural:(contains one or more bytes)
+========+
| |
+========+
zero or more VLQ's Natural:
+========+
| ....
+========+
an element that can be any of basic/complex types:
++++++++++
| |
++++++++++
zero or more elements:
++++++++++
| ....
++++++++++
a key-value pair:
+########+
| |
+########+
zero or more key-value pairs:
+########+
| ....
+########+
The Basic types refer to True, False, Null, Integer family, Precision family, String family, Bytes.
True:
+--------+
| F0 |
+--------+
False:
+--------+
| F1 |
+--------+
Null:
+--------+
| FA |
+--------+
Mini Natural: stores 7-bit unsigned integer ranging from 0~127, in a single byte.
+--------+
|0xxxxxxx|
+--------+
VLQ +Int: stores a positive integer on top of Mini Natural(counting from 128), uses a VLQ's Natural to represent the value.
+--------+========+
| F8 | +128 |
+--------+========+
VLQ -Int: stores a negative integer counting from -1.
+--------+========+
| F9 | -1 |
+--------+========+
xxxxxxx is a 7-bit natural unsigned integer.
+128 means the real value of this VLQ should be added 128.
-1 means the real value of this VLQ should be added 1 and negative it.
As I mentioned before, Zipack's non-integer number is encoded using the "reversed bijective" algorithm instead of standard IEEE FLoating Point. A Zipack's Precision number can be stored by 2 VLQ's Natural, one for Integer part(left), the other for Precision part(right).
+Precision: stores a positive non-integer.
+--------+========+========+
| F2 | left | right |
+--------+========+========+
-Precision: stores a negative non-integer.
+--------+========+========+
| F3 | left | right |
+--------+========+========+
left: the integer part.
right: the precision part counting from 1.
Bytes stores pure binary data like files, images, etc. Zipack has NO "Mini Bytes" for this type usually stores large data. It need a VLQ's Natural to represent the count of bytes.
+--------+========+--------+
| F4 | +0 | ....
+--------+========+--------+
+0 means the real value of this VLQ is the count of pure bytes.
As I mentioned before, Zipack's String is based on VLQ instead of standard UTF-8. A Unicode char can be encoded by a VLQ's Natural, so String is seamlessly concatenated chars.
Mini String: stores a Unicode string whose length < 32.
+--------+========+
|100xxxxx| ....
+--------+========+
VLQ String: stores a Unicode string whose length >= 32, the _LENGTH part should count from 32.
+--------+========+========+
| F5 | +32 | ....
+--------+========+========+
xxxxx is a 5-bit unsigned integer representing the count of chars.
+32 means the real value of this VLQ representing the count of chars should be added 32.
Complex types refer to List, Map and reserved types. In comparison to basic types, complex types can nest other basic/complex types like a tree.
Similar to String, Zipack List also defers Mini List and VLQ List.
Mini List: stores a list whose length < 32.
+--------++++++++++
|101xxxxx| ....
+--------++++++++++
VLQ List: stores a list whose length >= 32, the _LENGTH part should count from 32.
+--------+========++++++++++
| F6 | +32 | ....
+--------+========++++++++++
xxxxx is a 5-bit unsigned integer representing the count of elements.
+32 means the real value of this VLQ representing the count of elements should be added 32.
Zipack Map is key-value pairs where the key is a string and the value can be any of Zipack's basic/complex types. Similar to String and List, Map also defers Mini Map and VLQ Map.
key-value pair: stores a string without prefix and an element.
+========+============++++++++++
| +0 | .... | |
+========+============++++++++++
or:
+########+
| |
+########+
Mini Map: stores a map where the count of key-value pairs < 32.
+--------+########+
|110xxxxx| ....
+--------+########+
VLQ Map: stores a map where the count of key-value pairs >= 32.
+--------+========+########+
| F7 | +32 | ....
+--------+========+########+
+0 means the real value of this VLQ represents the length of key.
xxxxx is a 5-bit unsigned integer representing the count of key-value pairs.
+32 means the real value of this VLQ representing the count of key-value pairs should be added 32.
Zipack reserves 6 undefined prefix which can be extended by users:
- 1110
- 1111 1011
- 1111 1100
- 1111 1101
- 1111 1110
- 1111 1111
Zipack Stream is like List, but without prefix(_CLASS and _LENGTH part).
Zipack stream: one or more zipack object concatenated seamlessly.
++++++++++
| ....
++++++++++
-
The reserved types can be used to implement TypedList and OrderedMap which are popular today. TypedList is List with schema; OrderedMap is Map whose key-value pairs has no order. In the future we shall provide guideline for reserved types.
-
Zipack should also act as config file like JSON and YAML, but binary file can not be edited directly. We need a GUI tool to handle this, it can be an editor or a plugin inside IDEs and web browsers.