Skip to content

tuple serialization format

convolvatron edited this page Sep 20, 2018 · 13 revisions

tuples

tuples represent metadata in uniboot. a tuple is a map from utf8 strings (buffers) to a union type containing tuples and buffers. tuples are almost exactly like JSON objects except there is no representation for number, and tuples are allowed to contain cycle references.

in the runtime, each tuple value is associated with a dynamic runtime type (not entirely true, buffer values aren't tagged, which we should fix)

keys which do not exist in a given tuple are assigned a default value of the zero-length buffer

tuples are currently used for filesystem metadata, and stage3 kernel configuration. work is outstanding to integrate heap usage statistics. anticipated future uses are for general remote monitoring and management, distributed storage and transactional storage.

serialization

tuple serialization is used to append data into the filesystem log, and eventually as a distributed monitoring and control protocol. the serialization is tied to the notion of an ordered stream to allow for inter-object references. current examples include the filesystem transaction log and remote tcp connections.

the current reference for the tuple serialization format is in uniboot/runtime/tuple.c

varint encoding

for internal values, the tuple serialization uses an integer encoding where each chunk (nominally a byte)of the value contains an additional bit (in the most significant position), which indicates whether there are additional chunks of bits to be included in the value.

for each chunk of size N, the implementation:

  • shifts the current value left by (N-1)
  • ors the lowermost N-1 bits into the result.
  • decodes the next chunk of the highest bit (N) is a 1
  • returns the value otherwise

its an entirely arbitrary choice as to whether the first chunk encodes the most or least significant bits, currently we choose most.

for example:

  • 5 is encoded as 00001001
  • 129 is encoded as 10000001 00001001

headers

each value in the tuple serialization is started by a common header value. the header contains 3 fields: o type (1 bit, either tuple = 1 or buffer =0 o immediate (1 bit, immediate = 1, reference = 0) o a length (varint as above, where the first chunk is the 6 remaining lower bits of the first header byte, and subsequent bytes thereafter)

7 6 5 4 3 2 1 0
type reference length continuation len4 len3 len2 len1 len0

followed by zero or more:

7 6 5 4 3 2 1 0
length continuation len6 len5 len4 len3 len2 len1 len0

which implies that headers with lengths under 32 can be encoded in a single byte

references

if the header reference bit is set, then the header refers to an object which has already been seen. references :

o provide a first level compression, so that things like keys aren't
  repeated 

o allow mutations to existing tuples within a stream to be represented.

o allow representation of cyclic tuple graphs

both the encoder and the decoder keep a dictionary of mappings between stream indices and runtime objects. on the sender side this is from objects to indices, and on the receiver side for in

there is an open issue as to whether it would be helpful to maintain multiple sets of dictionary indices based on the type. this would result in a slightly more compact representation and provide some internal consistency checking. sets could be just tuples and buffers, but keys could also be maintained disjointly.

it may also be fruitful to waste another bit to exclude an immediate value from being tracked as a reference in some usages.

tuple encoding

the serialization for tuples is a header:

(tuple(1), immediate(1), number of values V) 
  or
(tuple(1), reference(0), dictionary_index) : number of values V : [values]

for each value in V the streams contains a key name (a buffer, immediate or reference), and a value which is either a tuple or a buffer.

deletion of a key is accomplished by setting the key to the empty buffer

buffers

buffer headers are either:

(buffer(0), immedate(1), length of the string) : string bytes or (buffer(0), reference(0), dictionary index1)

numbers

numbers are currently encoded and decoded as base 10 utf8 strings. it may be more convenient to encode them variable length integers