Is your feature request related to a problem? Please describe.
Enumerated strings are meta strings such as class names, field names, and enum values. The set of such strings is limited, so their encoded form can be cached in memory. That lets us spend more time encoding each string in exchange for a smaller size.
Describe the solution you'd like
Here we propose a new enumerated string encoding format. The format consists of a header followed by the string binary.
The header is written in little-endian order. Fury reads the flag in the header first to determine how to deserialize the data.
Header
Write by data
If the string hasn't been written before, the data is written as follows:
| unsigned int: string binary size + 1bit: not written before | 61bits: murmur hash + 3 bits encoding flags | string binary |
The murmur hash can be omitted if the caller passes a flag. In that case, the format is:
| unsigned int: string binary size + 1bit: not written before | 8 bits encoding flags | string binary |
5 of the 8 encoding-flag bits are left unused.
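To make the two header layouts concrete, here is a minimal sketch in Python. The proposal does not fix several details, so the following are assumptions made only for illustration: the unsigned int is a fixed 4-byte value, the "written before" bit is the lowest bit (0 meaning not written before), and the 3 encoding-flag bits sit in the low bits of the 64-bit hash field. `pack_data_header` is a hypothetical helper name.

```python
import struct
from typing import Optional


def pack_data_header(size: int, encoding_flag: int,
                     murmur_hash: Optional[int] = None) -> bytes:
    """Pack the 'write by data' header (bit placement is an assumption)."""
    if not 0 <= encoding_flag < 8:
        raise ValueError("encoding flag must fit in 3 bits")
    # Assumption: low bit 0 = "not written before", size in the upper bits.
    first = (size << 1) | 0
    # Assumption: fixed 4-byte little-endian unsigned int.
    head = struct.pack("<I", first)
    if murmur_hash is None:
        # Hash omitted: a single flags byte, upper 5 bits left unused.
        return head + struct.pack("<B", encoding_flag)
    # 61 bits of hash plus 3 bits of encoding flag, little-endian.
    hash61 = murmur_hash & ((1 << 61) - 1)
    return head + struct.pack("<Q", (hash61 << 3) | encoding_flag)
```

With the hash present the header is 12 bytes; without it, 5 bytes. The string binary follows directly after.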
Encoding flags:
| Encoding Flag | Pattern | Encoding Action |
|---|---|---|
| 0 | every char is in `a-z._$\|` | `LOWER_SPECIAL` |
| 1 | every char is in `a-z._$`, except the first char is upper case | replace the first upper-case char with its lower-case form, then use `LOWER_SPECIAL` |
| 2 | every char is in `a-zA-Z._$` | replace every upper-case char with `\|` + lower case, then use `LOWER_SPECIAL`; use this encoding if it's smaller than Encoding 3 |
| 3 | every char is in `a-zA-Z._$` | use `LOWER_UPPER_DIGIT_SPECIAL` encoding if it's smaller than Encoding 2 |
| 4 | any UTF-8 char | use UTF-8 encoding |
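The flag selection in the table can be sketched as follows. This is an illustration, not Fury's implementation: `choose_encoding_flag` is a hypothetical name, sizes are compared in whole bytes, and ties between Encoding 2 and Encoding 3 go to Encoding 3. The table does not specify either of those last two points, so both are assumptions.

```python
import re

_LOWER_SPECIAL = re.compile(r"[a-z._$|]*")
_LOWER = re.compile(r"[a-z._$]*")
_LOWER_UPPER = re.compile(r"[a-zA-Z._$]*")


def choose_encoding_flag(s: str) -> int:
    """Return the encoding flag (0-4) for a meta string."""
    if _LOWER_SPECIAL.fullmatch(s):
        return 0
    # s is non-empty here, because the empty string matches flag 0.
    if "A" <= s[0] <= "Z" and _LOWER.fullmatch(s[1:]):
        return 1
    if _LOWER_UPPER.fullmatch(s):
        n_upper = sum("A" <= c <= "Z" for c in s)
        # Flag 2: each upper-case char becomes two 5-bit symbols.
        size2 = (5 * (len(s) + n_upper) + 7) // 8
        # Flag 3: every char is one 6-bit symbol.
        size3 = (6 * len(s) + 7) // 8
        # Assumption: flag 2 only when strictly smaller, otherwise flag 3.
        return 2 if size2 < size3 else 3
    return 4
```

For example, `org.apache.fury.Foo` has one upper-case char in 19, so Encoding 2 takes 13 bytes against 15 for Encoding 3, while `ABC` takes 4 bytes under Encoding 2 and 3 bytes under Encoding 3.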
Write by ref
If the string has been written before, the data is written as follows:
| unsigned int: written string id + 1bit: written before |
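A minimal sketch of how a writer might switch between the two forms. The proposal does not say how ids are assigned or where the 1-bit flag sits, so this assumes ids are handed out sequentially from 0 in first-write order, the flag is the lowest bit (1 meaning written before), and the unsigned int is a fixed 4-byte little-endian value. `EnumStringWriter` is a hypothetical class, and the first write uses plain UTF-8 (flag 4) with the hash omitted to keep the example short.

```python
import struct


class EnumStringWriter:
    """Tracks which strings were already written (illustrative only)."""

    def __init__(self) -> None:
        self._ids = {}  # string -> id assigned on first write

    def write(self, s: str) -> bytes:
        if s in self._ids:
            # Write by ref: id in the upper bits, low bit 1 = written before.
            return struct.pack("<I", (self._ids[s] << 1) | 1)
        # Assumption: ids are sequential, starting at 0.
        self._ids[s] = len(self._ids)
        payload = s.encode("utf-8")  # stand-in for the real string binary
        # Write by data: size in the upper bits, low bit 0, then flags byte 4.
        return struct.pack("<I", len(payload) << 1) + bytes([4]) + payload
```

The reader would keep the mirror-image table, appending each newly read string so that a later id resolves to the same string.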
String binary
String binary encoding:
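| Algorithm | Pattern | Description |
|---|---|---|
| `LOWER_SPECIAL` | `a-z._$\|` | every char is written using 5 bits: `a-z`: `0b00000~0b11001`, `._$\|`: `0b11010~0b11101` |
| `LOWER_UPPER_DIGIT_SPECIAL` | `a-zA-Z0~9._$` | every char is written using 6 bits: `a-z`: `0b000000~0b011001`, `A-Z`: `0b011010~0b110011`, `0~9`: `0b110100~0b111101`, `._$`: `0b111110~0b1000000` |

Note that `a-zA-Z0~9._$` is 65 symbols, while 6 bits can represent only 64. The last code, `0b1000000`, does not fit in 6 bits, so the range for the special chars still needs to be resolved.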
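As an illustration of the 5-bit packing, the sketch below encodes and decodes `LOWER_SPECIAL` using the code values from the table (`a-z` as 0 to 25, `._$|` as 26 to 29). Bit order and padding are not specified in the proposal. This sketch assumes symbols are packed most-significant-bit first and the final byte is zero-padded. Since zero padding can look like a trailing `a`, the decoder here takes the char count as an explicit argument, which is also an assumption.

```python
LOWER_SPECIAL_ALPHABET = "abcdefghijklmnopqrstuvwxyz._$|"


def encode_lower_special(s: str) -> bytes:
    """Pack each char into 5 bits, MSB first (assumed bit order)."""
    acc = 0
    for ch in s:
        # str.index raises ValueError for chars outside the alphabet.
        acc = (acc << 5) | LOWER_SPECIAL_ALPHABET.index(ch)
    nbits = 5 * len(s)
    pad = (-nbits) % 8  # zero bits needed to reach a byte boundary
    return (acc << pad).to_bytes((nbits + pad) // 8, "big")


def decode_lower_special(data: bytes, count: int) -> str:
    """Unpack `count` 5-bit symbols from `data`."""
    acc = int.from_bytes(data, "big")
    total = 8 * len(data)
    return "".join(
        LOWER_SPECIAL_ALPHABET[(acc >> (total - 5 * (i + 1))) & 0b11111]
        for i in range(count)
    )
```

A 15-char name such as `org.apache.fury` takes 75 bits, so 10 bytes instead of 15 in UTF-8.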
Additional context
#1238