[Spec] ascii optimized enumerated string encoding algorithm #1243

Open
chaokunyang opened this issue Dec 22, 2023 · 0 comments
Labels
enhancement New feature or request

Comments

@chaokunyang
Collaborator

Is your feature request related to a problem? Please describe.

Enumerated strings are meta strings such as class names, field names, and enum values. The set of such strings is limited, so their encoded results can be cached in memory. That makes it acceptable to spend more time encoding them in exchange for a smaller size.

Describe the solution you'd like

Here we propose a new enumerated string encoding format. The format consists of a header and a string binary.

The header is written in little-endian order. Fury reads the header first to determine how to deserialize the data.

Header

Write by data

If string hasn't been written before, the data will be written as follows:

| unsigned int: string binary size + 1bit: not written before | 61bits: murmur hash + 3 bits encoding flags | string binary |

The murmur hash can be omitted if the caller passes a flag. In that case, the format is:

| unsigned int: string binary size + 1bit: not written before | 8 bits encoding flags | string binary |

Only 3 of the 8 encoding-flag bits are used; the other 5 bits are left empty.
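The second header field can be sketched as follows. This is only an illustration under stated assumptions: the issue does not say where the 3 flag bits sit inside the 64-bit word, so the sketch puts them in the low 3 bits and the hash in the upper 61 bits. The function names are hypothetical, and the hash value is passed in rather than computed.

```python
def pack_hash_and_flags(murmur_hash, encoding_flag):
    """Pack a 61-bit hash and a 3-bit encoding flag into 8 little-endian bytes.

    Assumption: flags occupy the low 3 bits, the hash the upper 61 bits.
    """
    if not 0 <= encoding_flag < 8:
        raise ValueError("encoding flag must fit in 3 bits")
    hash61 = murmur_hash & ((1 << 61) - 1)  # keep only 61 bits of the hash
    return ((hash61 << 3) | encoding_flag).to_bytes(8, "little")


def unpack_hash_and_flags(data):
    """Inverse of pack_hash_and_flags: returns (hash61, encoding_flag)."""
    word = int.from_bytes(data[:8], "little")
    return word >> 3, word & 0b111


def pack_flags_only(encoding_flag):
    """Hash-less variant: one byte, 3 flag bits used, 5 bits left empty."""
    if not 0 <= encoding_flag < 8:
        raise ValueError("encoding flag must fit in 3 bits")
    return bytes([encoding_flag])
```

With this layout a reader can recover the encoding flag from the first byte alone in both variants, which may or may not match the layout the final spec settles on.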

Encoding flags:

| Encoding flag | Pattern | Encoding action |
|---|---|---|
| 0 | every char is in a-z._$\| | use LOWER_SPECIAL |
| 1 | every char is in a-z._$, except that the first char is upper case | replace the first upper-case char with its lower-case form, then use LOWER_SPECIAL |
| 2 | every char is in a-zA-Z._$ | replace every upper-case char with \| + its lower-case form, then use LOWER_SPECIAL; use this encoding if it is smaller than encoding 3 |
| 3 | every char is in a-zA-Z._$ | use LOWER_UPPER_DIGIT_SPECIAL; use this encoding if it is smaller than encoding 2 |
| 4 | any UTF-8 char | use UTF-8 encoding |
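The flag selection above can be sketched as a small function. This is a rough illustration, not the proposed implementation: `choose_encoding_flag` and `escape_upper` are hypothetical names, sizes are compared in whole bytes, and ties between encodings 2 and 3 go to encoding 3, which the issue does not specify.

```python
import math
import string

LOWER_SPECIAL_CHARS = set(string.ascii_lowercase + "._$|")
LOWER_PLAIN_CHARS = set(string.ascii_lowercase + "._$")
MIXED_CASE_CHARS = set(string.ascii_letters + "._$")


def choose_encoding_flag(s):
    """Return the encoding flag (0-4) for s, following the table's patterns."""
    if all(c in LOWER_SPECIAL_CHARS for c in s):
        return 0
    if s[0] in string.ascii_uppercase and all(c in LOWER_PLAIN_CHARS for c in s[1:]):
        return 1
    if all(c in MIXED_CASE_CHARS for c in s):
        upper_count = sum(c in string.ascii_uppercase for c in s)
        # Encoding 2: each upper-case char costs two 5-bit symbols.
        escaped_bytes = math.ceil(5 * (len(s) + upper_count) / 8)
        # Encoding 3: every char costs one 6-bit symbol.
        six_bit_bytes = math.ceil(6 * len(s) / 8)
        # Assumption: a tie goes to encoding 3.
        return 2 if escaped_bytes < six_bit_bytes else 3
    return 4


def escape_upper(s):
    """Transformation used by encoding 2: 'fooBar' becomes 'foo|bar'."""
    return "".join("|" + c.lower() if c in string.ascii_uppercase else c for c in s)
```

For example, a 20-character name with a single upper-case letter needs 14 bytes with encoding 2 and 15 bytes with encoding 3, so encoding 2 wins; names with many upper-case letters tend to favor encoding 3.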

Write by ref

If string has been written before, the data will be written as follows:

| unsigned int: written string id + 1bit: written before |
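A minimal sketch of how a writer might switch between the two forms. Several details here are assumptions not stated in the issue: the marker bit is taken to be the lowest bit of the unsigned int (0 for "not written before", 1 for "written before"), ids are assigned sequentially from 0 in first-write order, and the header is returned as a plain integer because the issue does not say how the unsigned int is serialized. `EnumStringWriter` is a hypothetical name.

```python
class EnumStringWriter:
    """Tracks written strings and produces (header value, payload) pairs."""

    def __init__(self):
        self._ids = {}  # string -> id, assigned in first-write order

    def header_for(self, s, binary):
        """Return (header, payload) for string s with encoded form binary."""
        if s in self._ids:
            # Write by ref: id in the upper bits, marker bit set.
            return (self._ids[s] << 1) | 1, b""
        self._ids[s] = len(self._ids)
        # Write by data: binary size in the upper bits, marker bit clear.
        return len(binary) << 1, binary
```

A reader would keep a matching list of decoded strings so that an id seen in a by-ref header can be resolved without decoding the string binary again.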

String binary

String binary encoding:

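| Algorithm | Pattern | Description |
|---|---|---|
| LOWER_SPECIAL | a-z._$\| | every char is written using 5 bits; a-z: 0b00000~0b11001, ._$\|: 0b11010~0b11101 |
| LOWER_UPPER_DIGIT_SPECIAL | a-zA-Z0~9._$ | every char is written using 6 bits; a-z: 0b000000~0b011001, A-Z: 0b011010~0b110011, 0~9: 0b110100~0b111101, ._$: 0b111110~0b1000000 |
| UTF-8 | any chars | UTF-8 encoding |

Note that a-zA-Z0~9._$ contains 65 symbols while 6 bits can address only 64 values, so the last code above, 0b1000000, does not fit in 6 bits as written.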
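The 5-bit LOWER_SPECIAL packing can be sketched as follows. The issue does not fix the bit order or the padding, so this sketch assumes symbols are packed most-significant-bit first and the final byte is zero-padded. It also assumes the decoder is told the character count, because the byte length alone is ambiguous (2 and 3 characters both occupy 2 bytes) and the issue does not say how that is resolved. Function names are hypothetical.

```python
import string

# Code points follow the table: a-z -> 0..25, then . _ $ | -> 26..29.
LOWER_SPECIAL_ALPHABET = string.ascii_lowercase + "._$|"


def encode_lower_special(s):
    """Pack each char of s into 5 bits, MSB first, zero-padding the last byte."""
    acc = 0
    for c in s:
        # index() raises ValueError for chars outside the alphabet.
        acc = (acc << 5) | LOWER_SPECIAL_ALPHABET.index(c)
    total_bits = 5 * len(s)
    pad = -total_bits % 8  # bits needed to reach a byte boundary
    return (acc << pad).to_bytes((total_bits + pad) // 8, "big")


def decode_lower_special(data, char_count):
    """Inverse of encode_lower_special; char_count must be supplied."""
    acc = int.from_bytes(data, "big")
    acc >>= len(data) * 8 - 5 * char_count  # drop the padding bits
    chars = []
    for i in range(char_count):
        shift = 5 * (char_count - 1 - i)
        chars.append(LOWER_SPECIAL_ALPHABET[(acc >> shift) & 0b11111])
    return "".join(chars)
```

A 15-character package name such as org.apache.fury takes 10 bytes this way instead of 15 bytes in UTF-8.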

Additional context

#1238

@chaokunyang chaokunyang added the enhancement New feature or request label Dec 22, 2023
@chaokunyang chaokunyang changed the title [Format] ascii optimized enumerated string encoding algorithm [Spec] ascii optimized enumerated string encoding algorithm Dec 22, 2023