[Spec] ascii optimized enumerated string encoding algorithm #1243

chaokunyang · 2023-12-22T08:12:20Z

Is your feature request related to a problem? Please describe.

Enumerated strings are meta strings such as class names, field names, and enum values. The set of such strings is limited, so their encoding results can be cached in memory. This means we can spend more time encoding them in exchange for a smaller size.

Describe the solution you'd like

Here we propose a new enumerated string encoding format. The format consists of a header and a string binary.

The header is written in little-endian order. Fury can read the flag in the header first to determine how to deserialize the data.

Header

Write by data

If the string hasn't been written before, the data is written as follows:

| unsigned int: string binary size + 1 bit: not written before | 61 bits: murmur hash + 3 bits: encoding flags | string binary |

The murmur hash can be omitted if the caller passes a flag. In that case, the format is:

| unsigned int: string binary size + 1 bit: not written before | 8 bits: encoding flags | string binary |

Only 3 of the 8 encoding-flag bits are used; the remaining 5 bits are left empty.
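The two "write by data" layouts above can be sketched as follows. This is a minimal illustration, not Fury's implementation: it assumes the `unsigned int` is a fixed 4-byte little-endian value, that the 1-bit marker sits in the lowest bit with `0` meaning "not written before", and that the 3 flag bits occupy the low bits of the 64-bit hash word. None of those bit positions are fixed by the proposal, and `write_by_data` is a hypothetical name.

```python
import struct
from typing import Optional

NOT_WRITTEN_BEFORE = 0  # assumed value of the 1-bit marker


def write_by_data(binary: bytes, flags: int, hash64: Optional[int]) -> bytes:
    """Lay out header + flags (+ hash) + string binary for a first-time string."""
    # Assumed layout: size in the upper bits, marker in the lowest bit.
    head = struct.pack("<I", (len(binary) << 1) | NOT_WRITTEN_BEFORE)
    if hash64 is None:
        # Hash omitted: a single byte carries the 3 flag bits, 5 bits stay empty.
        meta = bytes([flags & 0b111])
    else:
        # Assumed layout: top 61 bits of the hash, low 3 bits hold the flags.
        word = ((hash64 & ~0b111) | (flags & 0b111)) & 0xFFFFFFFFFFFFFFFF
        meta = struct.pack("<Q", word)
    return head + meta + binary
```

For example, `write_by_data(b"\x01\x02", 2, None)` produces a 4-byte header holding `2 << 1`, one flag byte, and the two payload bytes.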

Encoding flags:

| Encoding Flag | Pattern | Encoding Action |
|---|---|---|
| 0 | every char is in `a-z._$\|` | `LOWER_SPECIAL` |
| 1 | every char is in `a-z._$`, except that the first char is upper case | replace the first upper-case char with its lower-case form, then use `LOWER_SPECIAL` |
| 2 | every char is in `a-zA-Z._$` | replace every upper-case char with `\|` + its lower-case form, then use `LOWER_SPECIAL`; use this encoding if it's smaller than Encoding `3` |
| 3 | every char is in `a-zA-Z._$` | use `LOWER_UPPER_DIGIT_SPECIAL` encoding if it's smaller than Encoding `2` |
| 4 | any UTF-8 char | use `UTF-8` encoding |
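As a rough illustration of how a writer might pick among these flags, the sketch below follows the table literally. It is only a sketch under assumptions: `choose_encoding_flag` and `packed_size` are hypothetical names, sizes are compared in whole bytes, and a tie between Encoding `2` and Encoding `3` falls back to `3` because the proposal does not say which one wins.

```python
import re

LOWER_SPECIAL_RE = re.compile(r"[a-z._$|]*")  # pattern for flag 0
LOWER_RE = re.compile(r"[a-z._$]*")           # tail pattern for flag 1
MIXED_RE = re.compile(r"[a-zA-Z._$]*")        # pattern for flags 2 and 3


def packed_size(num_chars: int, bits_per_char: int) -> int:
    """Bytes needed to hold num_chars fixed-width codes."""
    return (num_chars * bits_per_char + 7) // 8


def choose_encoding_flag(s: str) -> int:
    if LOWER_SPECIAL_RE.fullmatch(s):
        return 0
    if "A" <= s[0] <= "Z" and LOWER_RE.fullmatch(s[1:]):
        return 1
    if MIXED_RE.fullmatch(s):
        upper = sum(1 for ch in s if "A" <= ch <= "Z")
        # Flag 2 spends one extra 5-bit escape code per upper-case char.
        size2 = packed_size(len(s) + upper, 5)
        size3 = packed_size(len(s), 6)
        return 2 if size2 < size3 else 3  # tie -> 3 is an assumption
    return 4
```

With these rules, a mostly lower-case name such as `serializerFactory` selects flag `2`, while an all-upper-case name such as `ABCDEF` selects flag `3`.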

Write by ref

If the string has been written before, the data is written as follows:

| unsigned int: written string id + 1 bit: written before |
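A writer needs to remember which strings it has already emitted in order to switch between the two forms. The following is a minimal sketch of that bookkeeping, assuming ids are handed out in first-write order starting at 0, that the `unsigned int` is a fixed 4-byte little-endian value, and that the marker is the lowest bit (`1` for "written before"). It uses the hash-less header variant, and `EnumStringWriter` is a hypothetical name rather than a Fury class.

```python
import struct


class EnumStringWriter:
    """Tracks written strings and emits either the data form or the ref form."""

    def __init__(self):
        self._ids = {}  # string -> id, assigned in first-write order (assumed)

    def write(self, s: str, binary: bytes, flags: int) -> bytes:
        if s in self._ids:
            # Write by ref: id in the upper bits, marker bit set to 1.
            return struct.pack("<I", (self._ids[s] << 1) | 1)
        self._ids[s] = len(self._ids)
        # Write by data (hash omitted): size header, flag byte, string binary.
        header = struct.pack("<I", len(binary) << 1)
        return header + bytes([flags & 0b111]) + binary
```

The first call for a given string returns the full data form; every later call for the same string returns only the 4-byte reference.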

String binary

String binary encoding:

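| Algorithm | Pattern | Description |
|---|---|---|
| LOWER_SPECIAL | `a-z._$\|` | every char is written using 5 bits, `a-z`: `0b00000~0b11001`, `._$\|`: `0b11010~0b11101` |
| LOWER_UPPER_DIGIT_SPECIAL | `a-zA-Z0-9._$` | every char is written using 6 bits, `a-z`: `0b000000~0b011001`, `A-Z`: `0b011010~0b110011`, `0-9`: `0b110100~0b111101`, `._$`: `0b111110~0b1000000` |
| UTF-8 | any chars | UTF-8 encoding |

Note that `0b1000000` needs 7 bits: after the 62 alphanumeric codes, a 6-bit encoding has room for only two of the three special chars `._$`.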
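The `LOWER_SPECIAL` row can be made concrete with a small bit-packing sketch. The code values follow the table (`a-z` map to 0..25, then `._$|` map to 26..29), but the bit order (most significant bit first) and the zero padding of the last byte are assumptions, and the function names are hypothetical. Because the padded byte length does not always determine the character count (7 and 8 chars both need 5 bytes), the decoder here takes the count as an explicit argument.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz._$|"  # index in this string == 5-bit code


def encode_lower_special(s: str) -> bytes:
    """Pack each char into 5 bits, MSB first, zero-padding the final byte."""
    bits = 0
    for ch in s:
        bits = (bits << 5) | ALPHABET.index(ch)  # ValueError if ch is not allowed
    nbits = 5 * len(s)
    pad = -nbits % 8
    return (bits << pad).to_bytes((nbits + pad) // 8, "big")


def decode_lower_special(data: bytes, num_chars: int) -> str:
    """Inverse of encode_lower_special; num_chars must be supplied by the caller."""
    bits = int.from_bytes(data, "big") >> (len(data) * 8 - 5 * num_chars)
    chars = []
    for i in range(num_chars):
        shift = 5 * (num_chars - 1 - i)
        chars.append(ALPHABET[(bits >> shift) & 0b11111])
    return "".join(chars)
```

Under this packing, a 15-character name such as `org.apache.fury` takes 10 bytes instead of the 15 bytes it needs in UTF-8.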

Additional context

#1238

chaokunyang added the enhancement New feature or request label Dec 22, 2023

chaokunyang changed the title ~~[Format] ascii optimized enumerated string encoding algorithm~~ [Spec] ascii optimized enumerated string encoding algorithm Dec 22, 2023

This was referenced Dec 22, 2023

[Doc][Spec] fury java serialization spec #1238

Merged

[Java][Doc] Java object graph specification #1239

Closed
