diff --git a/docs/protocols/java_object_graph.md b/docs/protocols/java_object_graph.md
index da7ee0ab64..47cd02eb2d 100644
--- a/docs/protocols/java_object_graph.md
+++ b/docs/protocols/java_object_graph.md
@@ -1,98 +1,259 @@
-# Java Serialization
-The data are serialized using little endian order overall.
+# Fury Java Serialization Specification
+
+## Spec overview
+
+Data is serialized in little-endian byte order overall. If byte swapping is costly, the byte order is encoded as a
+flag in the data.
+
+The overall format is:
+
+```
+| fury header | object ref meta | object class meta | object value data |
+```
+
+## Fury header
+
+The Fury header starts with one byte:
+
+```
+| reserved 4 bits | oob | xlang | endian | null |
+```
+
+- null flag: set when the object is null, unset otherwise. If the object is null, the other bits won't be set.
+- endian flag: set when the system uses little endian, unset otherwise.
+- xlang flag: set when serialization uses the xlang format, unset when serialization uses the Fury Java format.
+- oob flag: set when the passed `BufferCallback` is not null, unset otherwise.
+
+If meta share mode is enabled, an uncompressed little-endian 4-byte value is appended to indicate the start offset of the meta data.
+
+## Reference Meta
+
+Reference tracking handles whether the object is null, and whether to track references for the object, by writing the
+corresponding flags and maintaining internal state.
+
+Reference flags:
+
+| Flag                | Byte Value | Description |
+|---------------------|------------|-------------|
+| NULL FLAG           | `-3`       | This flag indicates that the object is a null value. We don't use another byte to indicate REF, so that we can save one byte. |
+| REF FLAG            | `-2`       | This flag indicates that the object has been written before; Fury writes an unsigned ref id instead of serializing it again. |
+| NOT_NULL VALUE FLAG | `-1`       | This flag indicates that the object is a non-null value and Fury doesn't track references for this type of object. |
+| REF VALUE FLAG      | `0`        | This flag indicates that the object is referenceable and is encountered for the first time. |
+
+When reference tracking is disabled globally, for some type only, or for some type in some context such as a
+field of a class, only `NULL FLAG` and `NOT_NULL VALUE FLAG` will be used.
+
+## Class Meta
+
+Depending on whether meta share mode is enabled, Fury will write class meta differently.
+
+### Schema consistent
+
+If schema consistent mode is enabled globally or for the current class, class meta will be written as follows:
+
+- If the class is registered, it will be written as a little-endian unsigned int: `class_id << 1`, using the Fury unsigned int
+  format.
+- If the class is not registered, Fury will write one byte `0b1` first. Its lowest bit differs from the lowest bit of an encoded
+  class id, which is `0`. Fury can use this bit to determine whether to read the class by class id.
+    - If meta share mode is enabled, the class will be written as an unsigned int.
+    - If meta share mode is not enabled, the class will be written as two enumerated strings:
+        - package name.
+        - class name.
+
+### Schema evolution
+
+If schema evolution mode is enabled globally or for the current class, class meta will be written as follows:
+
+- If meta share mode is not enabled, class meta will be written as in schema consistent mode; field meta such as field type
+  and name will be written when the object value is being serialized, using a key-value like layout.
+- If meta share mode is enabled, the class will be written as an unsigned int.
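As a rough illustration of the reference flags listed under Reference Meta, the following Java sketch picks a flag for each object a writer encounters. The class `RefFlagSketch` and its methods are hypothetical names introduced here, not Fury APIs, and assigning ref ids in first-seen order starting at 0 is an assumption that the text does not state.

```java
import java.util.IdentityHashMap;
import java.util.Map;

/** Hypothetical sketch of reference-flag selection; not Fury's implementation. */
final class RefFlagSketch {
  // Flag values taken from the reference flag table.
  static final byte NULL_FLAG = -3;
  static final byte REF_FLAG = -2;
  static final byte NOT_NULL_VALUE_FLAG = -1;
  static final byte REF_VALUE_FLAG = 0;

  // Identity-based map: two equal but distinct objects are different references.
  private final Map<Object, Integer> writtenIds = new IdentityHashMap<>();

  /** Returns the flag a writer would emit for obj. */
  byte flagFor(Object obj, boolean trackRef) {
    if (obj == null) {
      return NULL_FLAG;
    }
    if (!trackRef) {
      // With tracking disabled only NULL FLAG and NOT_NULL VALUE FLAG are used.
      return NOT_NULL_VALUE_FLAG;
    }
    if (writtenIds.containsKey(obj)) {
      return REF_FLAG; // the writer would follow this with the unsigned ref id
    }
    // Assumption: ids are assigned in first-seen order, starting at 0.
    writtenIds.put(obj, writtenIds.size());
    return REF_VALUE_FLAG;
  }

  /** Returns the assumed ref id of obj, or -1 if it was never tracked. */
  int refIdOf(Object obj) {
    Integer id = writtenIds.get(obj);
    return id == null ? -1 : id;
  }
}
```

Calling `flagFor` twice with the same tracked object yields `REF VALUE FLAG` first and `REF FLAG` afterwards, which is the point at which a writer would emit the ref id instead of the object body.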
+
+## Meta share
+> This mode forbids streaming writes, since it needs to look back and update the offset after the whole object graph
+> has been written and meta collection is finished.
+> TODO: We plan to streamline meta writing, but work has not started yet.
+
+### Schema consistent
+
+
+### Schema evolution
+
+
+## Enumerated String
+
+Enumerated strings are mainly used to encode class names and field names. The format consists of a header and a binary.
+
+The header is written in little-endian order; Fury can read it first to determine how to deserialize the data.
+
+### Header
+
+#### Write by data
+
+If the string hasn't been written before, the data will be written as follows:
+
+```
+| unsigned int: string binary size + 1bit: not written before | 61bits: murmur hash + 3 bits encoding flags | string binary |
+```
+
+| Encoding Flag | Pattern | Encoding Action |
+|---------------|---------|-----------------|
+| 0             | every char is in `a-z._$\|` | `LOWER_SPECIAL` |
+| 1             | every char is in `a-z._$` except first char is upper case | replace the first upper case char with its lower case form, then use `LOWER_SPECIAL` |
+| 2             | every char is in `a-zA-Z._$` | replace every upper case char with `\|` + `lower case`, then use `LOWER_SPECIAL`; use this encoding if it's smaller than Encoding `3` |
+| 3             | every char is in `a-zA-Z._$` | use `LOWER_UPPER_DIGIT_SPECIAL` encoding if it's smaller than Encoding `2` |
+| 4             | any utf-8 char | use `UTF-8` encoding |
+
+#### Write by ref
+If the string has been written before, the data will be written as follows:
+```
+| unsigned int: written string id + 1bit: written before |
+```
+
+### String binary
+
+String binary encoding:
+
+| Algorithm | Pattern | Description |
+|---------------------------|----------------|-------------|
+| LOWER_SPECIAL             | `a-z._$\|`     | every char is written using 5 bits, `a-z`: `0b00000~0b11001`, `._$\|`: `0b11010~0b11101` |
+| LOWER_UPPER_DIGIT_SPECIAL | `a-zA-Z0~9._$` | every char is written using 6 bits, `a-z`: `0b000000~0b011001`, `A-Z`: `0b011010~0b110011`, `0~9`: `0b110100~0b111101`, `._$`: `0b111110~0b1000000` |
+| UTF-8                     | any chars      | UTF-8 encoding |
+
+## Value Format
+
+### Basic types
+
+#### Bool
-## Basic types
-### boolean
 - size: 1 byte
 - format: 0 for `false`, 1 for `true`
-### byte
+#### Byte
+
 - size: 1 byte
 - format: write as pure byte.
-### short
+#### Short
+
 - size: 2 byte
 - byte order: little endian order
-### char
+#### Char
+
 - size: 2 byte
 - byte order: little endian order
-### int
+#### Unsigned int
+
 - size: 1~5 byte
-- positive int format: first bit in every byte indicate whether has next byte. if first bit is set i.e. `b & 0x80 == 0x80`, then next byte should be read util first bit of next byte is unset.
-- Negative number will be converted to positive number by ` (v << 1) ^ (v >> 31)` to reduce cost of small negative numbers.
+- Format: the first bit of every byte indicates whether there is a next byte. If the first bit is set, i.e. `(b & 0x80) == 0x80`, then
+  the next byte should be read, until a byte whose first bit is unset is reached.
+
+#### Signed int
+
+- size: 1~5 byte
+- Format: first convert the number into an unsigned int with the ZigZag algorithm `(v << 1) ^ (v >> 31)`, then encode
+  it as an unsigned int.
+
+#### Unsigned long
+
+- size: 1~9 byte
+- Fury PVL(Progressive Variable-length Long) Encoding:
+    - positive long format: the first bit of every byte indicates whether there is a next byte. If the first bit is set,
+      i.e. `(b & 0x80) == 0x80`, then the next byte should be read, until a byte whose first bit is unset is reached.
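The unsigned and signed int formats above can be sketched in Java as follows. This is a minimal illustration rather than Fury's implementation: `VarIntSketch` and its methods are hypothetical names, and emitting the least-significant 7-bit group first is an assumption, since the text only specifies the continuation bit.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical sketch of the varint formats described above; not Fury's code. */
final class VarIntSketch {
  private VarIntSketch() {}

  /** ZigZag mapping from the spec: small negative ints become small unsigned ints. */
  static int zigZagEncode(int v) {
    return (v << 1) ^ (v >> 31);
  }

  /** Inverse of zigZagEncode. */
  static int zigZagDecode(int u) {
    return (u >>> 1) ^ -(u & 1);
  }

  /**
   * Encodes value, treated as unsigned, in 1 to 5 bytes.
   * Each byte carries 7 payload bits; the high bit (0x80) means "another byte follows".
   * Assumption: the least-significant group is written first.
   */
  static byte[] writeUnsignedVarInt(int value) {
    ByteArrayOutputStream out = new ByteArrayOutputStream(5);
    while ((value & ~0x7F) != 0) {
      out.write((value & 0x7F) | 0x80);
      value >>>= 7;
    }
    out.write(value);
    return out.toByteArray();
  }

  /** Decodes bytes produced by writeUnsignedVarInt. */
  static int readUnsignedVarInt(byte[] bytes) {
    int result = 0;
    int shift = 0;
    for (byte b : bytes) {
      result |= (b & 0x7F) << shift;
      if ((b & 0x80) == 0) {
        break; // continuation bit unset: this was the last byte
      }
      shift += 7;
    }
    return result;
  }

  /** Signed int: ZigZag first, then the unsigned encoding. */
  static byte[] writeSignedVarInt(int value) {
    return writeUnsignedVarInt(zigZagEncode(value));
  }
}
```

With this mapping `-1` zigzags to `1` and therefore fits in a single byte, whereas writing `-1` directly as an unsigned varint would take five bytes.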
+
+#### Signed long
-### long
 - size: 1~9 byte
 - Fury SLI(Small long as int) Encoding:
-  - If long is in [-1073741824, 1073741823], encode as 4 bytes int: `| little-endian: ((int) value) << 1 |`
-  - Otherwise write as 9 bytes: `| 0b1 | little-endian 8bytes long |`
+    - If long is in [-1073741824, 1073741823], encode as 4 bytes int: `| little-endian: ((int) value) << 1 |`
+    - Otherwise write as 9 bytes: `| 0b1 | little-endian 8bytes long |`
 - Fury PVL(Progressive Variable-length Long) Encoding:
-  - positive long format: first bit in every byte indicate whether has next byte. if first bit is set i.e. `b & 0x80 == 0x80`, then next byte should be read util first bit is unset.
-  - Negative number will be converted to positive number by ` (v << 1) ^ (v >> 63)` to reduce cost of small negative numbers.
+    - First convert the number into an unsigned long with the ZigZag algorithm `(v << 1) ^ (v >> 63)`, which reduces the cost of
+      small negative numbers, then encode it as an unsigned long.
+
+#### Float
-### float
 - size: 4 byte
 - format: convert float to 4 bytes int by `Float.floatToRawIntBits`, then write as binary by little endian order.
-### double
+#### Double
+
 - size: 8 byte
 - format: convert double to 8 bytes int by `Double.doubleToRawLongBits`, then write as binary by little endian order.
-## String
+### String
+
 Format:
+
 - one byte for encoding: 0 for `latin`, 1 for `utf-16`, 2 for `utf-8`.
 - positive varint for encoded string binary length.
 - encoded string binary data based on encoding: `latin/utf-16/utf-8`.
 Which encoding to choose:
-- For JDK8: fury detect `latin` at runtime, if string is `latin` string, then use `latin` encoding, otherwise use `utf-16`.
+
+- For JDK8: Fury detects `latin` at runtime; if the string is a `latin` string, it uses `latin` encoding, otherwise
+  it uses `utf-16`.
 - For JDK9+: fury use `coder` in `String` object for encoding, `latin`/`utf-16` will be used for encoding.
-- If the string is encoded by `utf-8`, then fury will use `utf-8` to decode the data. But currently fury doesn't enable utf-8 encoding by default for java. Cross-language string serialization of fury use `utf-8` by default.
+- If the string is encoded by `utf-8`, then Fury will use `utf-8` to decode the data. But currently Fury doesn't enable
+  utf-8 encoding by default for Java. Cross-language string serialization of Fury uses `utf-8` by default.
-## Array
+### Collection
-## Collection
 > All collection serializer must extends `io.fury.serializer.collection.CollectionSerializer`.
 Format:
-```java
-length(positive varint) | collection header | elements header | elements data
+
 ```
+length(unsigned varint) | collection header | elements header | elements data
+```
+
+#### Collection header
-### collection header
 - For `ArrayList/LinkedArrayList/HashSet/LinkedHashSet`, this will be empty.
 - For `TreeSet`, this will be `Comparator`
 - For subclass of `ArrayList`, this may be extra object field info.
-### elements header
-In most cases, all collection elements are same type and not null, elements header will encode those homogeneous
-information to avoid the cost of writing it for every elements. Specifically, there are four kinds of information
+#### Elements header
+
+In most cases, all collection elements are of the same type and not null; the elements header encodes this homogeneous
+information to avoid the cost of writing it for every element. Specifically, there are four kinds of information
 which will be encoded by elements header, each use one bit:
+
 - Whether track elements ref, use first bit `0b1` of header to flag it.
-- Whether collection has null, use second bit `0b10` of header to flag it. If ref tracking is enabled for this
-element type, this flag is invalid.
-- Whether collection elements type is not declare type, use 3rd bit `0b100` of header to flag it.
+- Whether collection has null, use second bit `0b10` of header to flag it.
If ref tracking is enabled for this
+  element type, this flag is invalid.
+- Whether the collection element type is not the declared type, use 3rd bit `0b100` of header to flag it.
 - Whether collection elements type different, use 4rd bit `0b1000` of header to flag it.
-By default, all bits are unset, which means all elements won't track ref, all elements are same type,, not null and the
+By default, all bits are unset, which means that elements are not ref-tracked, are all of the same type, are not null, and the
 actual element is the declare type in custom class field.
-### elements data
+#### Elements data
+
 Based on the elements header, the serialization of elements data may skip `ref flag`/`null flag`/`element class info`.
 `io.fury.serializer.collection.CollectionSerializer#write/read` can be taken as an example.
-## Map
-
+### Array
-## Object
+#### Primitive array
+#### Object array
+### Map
+### Enum
+Enum are serialized as an
+### Object
+### Class
+## Implementation guidelines
+- Try to merge multiple bytes into a single int/long write to reduce memory IO and bounds-check cost.
+- Read multiple bytes as an int/long, then split it into multiple bytes to reduce memory IO and bounds-check cost.
+- Try to use one varint/long to write flags and length together to save one byte and reduce memory IO.
+- Conditional branches are less expensive than memory IO unless there are too many branches.
\ No newline at end of file
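The first two implementation guidelines can be illustrated with a small Java sketch that packs four bytes into one little-endian int write and splits them again on read. `MergedWriteSketch` and its methods are hypothetical names used only for illustration, and the use of `java.nio.ByteBuffer` is an assumption made to keep the example self-contained; it says nothing about how Fury's own buffer is implemented.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical sketch of merging several byte writes into one int write. */
final class MergedWriteSketch {
  private MergedWriteSketch() {}

  /** Little-endian buffer, matching the overall byte order of the format. */
  static ByteBuffer newBuffer(int capacity) {
    return ByteBuffer.allocate(capacity).order(ByteOrder.LITTLE_ENDIAN);
  }

  /** Writes four bytes with a single putInt, so only one bounds check is paid. */
  static void writeFourBytes(ByteBuffer buf, byte b0, byte b1, byte b2, byte b3) {
    int packed = (b0 & 0xFF)
        | (b1 & 0xFF) << 8
        | (b2 & 0xFF) << 16
        | (b3 & 0xFF) << 24;
    buf.putInt(packed); // in little-endian order b0 lands at the lowest address
  }

  /** Reads one int and splits it back into the four original bytes. */
  static byte[] readFourBytes(ByteBuffer buf) {
    int packed = buf.getInt();
    return new byte[] {
      (byte) packed,
      (byte) (packed >>> 8),
      (byte) (packed >>> 16),
      (byte) (packed >>> 24)
    };
  }
}
```

Because the buffer is little-endian, the bytes end up in the same positions as four separate single-byte writes would produce, so the merged write does not change the wire layout.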