Version: 1.0.0
This document is a specification for the storage format of the Aggregate Knowledge HyperLogLog implementations.
A hll
is a combination of different set/distinct-value-counting algorithms that can be thought of as a hierarchy, along with rules for moving up that hierarchy. In order to distinguish between said algorithms, we have given them names:
A constant value that denotes the empty set.
An explicit, unique, sorted list of integers in the set, which is maintained up to a fixed cardinality.
A 'lazy', map-based implementation of HyperLogLog, a probabilistic set data structure. Only stores the indices and values of non-zero registers in a map, until the number of non-zero registers exceeds a fixed cardinality.
A fully-materialized, list-based implementation of HyperLogLog. Explicitly stores the value of every register in a list ordered by register index.
For brevity's sake, we'll refer to these different algorithms as 'types'.
These data structures are stored as zero-indexed array of bytes with a schema version specified in the top nibble of the zero-th byte. The schema version decides both the specific storage layout as well as the rules for the interaction of the different types mentioned above.
The storage specification can trivially be converted into a serialization format by simply encoding the arrays of bytes. The storage specification includes how to represent the various set-like structures as arrays of bytes, as well as how to represent the results of operations on multiple sets (such as an undefined result).
NOTE: all examples and diagrams will use a hex format with the high nibble first (on the left) in the octet, and the zero-th octet on the left.
In a pseudo-regex format, the storage layout of the bytes is as follows: ^VS*$
^
symbolizes the beginning of the array of bytes, right before the zero-indexed byteV
is the 'version' byte, indicating the schema version in the top nibble. The bottom nibble is schema version-dependent.S
is a byte specified by the particular schema version. See below.$
symbolizes the end of the array of bytes, right after the highest-indexed byte
Each schema version will elaborate on the above layout.
-
Size/Layout
- Size/Layout
-
Minimum length: 3 bytes
-
Maximum length: 17,179,869,187 bytes ( = (231 * 8) + 3)
-
The schema-specific bytes
S*
have the following structure:PCB*
P
is the required 'parameter' byte. See 'Type-agnostic' subsection.C
is the required 'cutoff' byte. See 'Type-agnostic' subsection.B
is a 'data' byte. The 'data' bytes layout is specified per type. See 'Type-specific' subsection.
-
Type-agnostic
-
The bottom nibble of the 'version' byte
V
indicates the type, as defined by these ordinals:0
- undefined1
-EMPTY
2
-EXPLICIT
3
-SPARSE
4
-FULL
-
The 'parameter' byte
P
encodes the bit-width and number of registers used bySPARSE
andFULL
types.- the highest 3 bits are used to encode the integer value
registerWidth - 1
, and - the remaining 5 bits encode the integer value
log2(numberOfRegisters)
.
- the highest 3 bits are used to encode the integer value
registerWidth
may take values from 1 to 8, inclusive, andlog2(numberOfRegisters)
may take on 4 to 31, inclusive.For example:
P = xA6 = 1010 0110 = 101 00110
thus
registerWidth - 1 = 5 registerWidth = 6
and
log2(numberOfRegisters) = 6 numberOfRegisters = 2^6 = 64
-
The 'cutoff' byte
C
encodes parameters defining theEXPLICIT
toSPARSE
, andSPARSE
toFULL
promotions.- 1 bit (the top bit) of padding,
- 1 bit (second highest bit) indicating the boolean value
sparseEnabled
, and - 6 bits (lowest six bits) as an unsigned big-endian integer
explicitCutoff
that can take on the values0
,63
, or1
to31
inclusive.
-
-
Type-specific
-
undefined/invalid result
Uses no data bytes.
-
EMPTY
Uses no data bytes.
-
EXPLICIT
The layout of the 'data' bytes is:
(B{8}){0,256}
, that is 0-256 blocks of 8 bytes. Each block of 8 bytes represent a signed 64-bit integer (sign bit + 63 value bits). These integers are encoded as big-endian (with sign-bit at highest position), and are the "contents" of the set. (The blocks together represent an array of such integers, stored in ascending order, without duplicates.)data bytes = [-5451491901947305642, 1] # as ascending, signed, decimal-encoded 64-bit values in array = [0xCBA79700677CDEAA, 0x0000000000000001] # as hex-encoded 64-bit values in array = 0xCBA79700677CDEAA0000000000000001 # as hex
-
SPARSE
The data bytes encode the register indices and values in
(log2(numberOfRegisters) + registerWidth)
-bit-wide "short-words". Each short-word is a bit-packed registerindex
/registervalue
pair.- The
index
is encoded in the highestlog2(numberOfRegisters)
bits of the short-word. - The
value
is encoded in the lowestregisterWidth
bits of the short-word.
The short-words are packed into bytes from the top of the zero-th data byte to the bottom of the last data byte, with the high bits of a short-word toward the high bits of a byte.
- If
BITS = (registerWidth + log2(numberOfRegisters)) * numberOfRegisters
is not divisible by 8, thenBITS % 8
padding bits are added to the bottom of the last byte of the array. - The short-words are stored in ascending
index
order.
For example, if
log2(numberOfRegisters) = 11
andregisterWidth = 6
, and if the register index/value pairs are(11, 6)
and(1099, 19)
:= [(11, 6), (1099, 19), padding] # as unsigned decimal-encoded pairs = [(0b00000001011, 0b000110), (0b10001001011, 0b010011), 0b000000] # as two binary-encoded pairs and 6 bits of padding = [(0b00000001011000110), (0b10001001011010011), 0b000000] # as binary-encoded 17-bit short words and 6 bits of padding = [0b00000001, 0b01100011, 0b01000100, 0b10110100, 0b11000000] # as binary-encoded octets in array = [0x01, 0x63, 0x44, 0xB4, 0xC0] # as byte array 0X016344B4C0 # as hex
This encoding was chosen so that when reading bytes as octets in the typical first-octet-is-the-high-4-bits fashion, a octet-to-binary conversion as demonstrated above would yield a high-to-low, left-to-right view of the "short words" and their components.
- The
-
FULL
The data bytes encode the register values in
registerWidth
-bit-wide, big-endian "short-words". The short words are written from the top of the zero-th byte of the array to the bottom of the last byte of the array, with the high bits of a short-word toward the high bits of a byte.- If
BITS = registerWidth * numberOfRegisters
is not divisible by 8, thenBITS % 8
padding bits are added to the bottom of the last byte of the array. - The short-words are stored in ascending index order.
For example, if
registerWidth = 5
andnumberOfRegisters = 4
, and if the register index/value pairs are(0, 0), (1,1), (2,2), (3,3)
:[0, 1, 2, 3, padding] # as unsigned decimal-encoded register values = [0b00000, 0b00001, 0b00010, 0b00011, 0b0000] # as four 5-bit "short words" + 4 bits padding = [0b00000000, 0b01000100, 0b00110000] # as binary-encoded octets in array = [0x00, 0x44, 0x30] # as hex-encoded byte array = 0x004430 # as hex
- If
-
-
Hierarchy
-
The hierarchy is dependent on the 'cutoff' byte
C
. When a set is promoted from one 'type'/algorithm to another, the top nibble of the 'version' byteV
, the 'parameter' byteP
, and the 'cutoff' byteC
all remain the same. The bottom nibble ofV
and the 'data' bytesB
may change. -
When any value is added to an
EMPTY
set,- if
explicitCutoff = 0
andsparseEnabled = 0
, then it is promoted to aFULL
set containing that one value. - if
explicitCutoff = 0
andsparseEnabled = 1
, then it is promoted to aSPARSE
set containing that one value. - if
explicitCutoff > 0
but< 63
, then it is promoted to anEXPLICIT
set containing that one value.
- if
-
When inserting an element into an
EXPLICIT
set,- if
sparseEnabled = 0
andexplicitCutoff = 0
, then it is promoted to aFULL
set. - if
sparseEnabled = 1
andexplicitCutoff = 0
, then it is promoted to aSPARSE
set. - if
sparseEnabled = 0
andexplicitCutoff > 0
and< 63
, and if inserting the element would cause the cardinality to exceed2 ^ (explicitCutoff - 1)
, then it is promoted to aFULL
. - if
sparseEnabled = 1
andexplicitCutoff > 0
and< 63
, and if inserting the element would cause the cardinality to exceed2 ^ (explicitCutoff - 1)
, then it is promoted to aSPARSE
. - if
sparseEnabled = 0
andexplicitCutoff = 63
, then the criteria for promotion is implementation-dependent, as this value ofexplicitCutoff
indicates an 'auto' promotion mode. SincesparseEnabled = 0
the set can only be promoted to aFULL
set. - if
sparseEnabled = 1
andexplicitCutoff = 63
, then the promotion is implementation-dependent, as this value ofexplicitCutoff
indicates an 'auto' promotion mode. SincesparseEnabled = 1
the set can only be promoted to aSPARSE
set.
- if
-
When inserting an element into a
SPARSE
set, if that element would cause the storage size of theSPARSE
set to be greater than that of aFULL
set, then it is promoted to aFULL
set.
-