[Python] support latin1/utf16 encoding for python string #1967

Open
chaokunyang opened this issue Dec 4, 2024 · 2 comments
@chaokunyang
Collaborator

chaokunyang commented Dec 4, 2024

Feature Request

Fury Java serializes strings using one of three encodings:

  • latin1: used when all chars are latin1. On JDK 11+ this is just a memory copy.
  • utf16: used when 50% or more of the chars are non-ASCII.
  • utf8: used when 50% or more of the chars are ASCII.
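
The selection rule in the list above could be sketched in Python roughly as follows. This is only an illustration, not Fury's actual code: `choose_encoding` is a hypothetical name, and sending the exact 50% tie to utf8 is an assumption, since the list does not say which side wins.

```python
def choose_encoding(s: str) -> str:
    """Pick an encoding name using the heuristic described above.

    Hypothetical helper for illustration; not part of pyfury.
    """
    # latin1 applies only when every code point fits in one byte.
    if all(ord(ch) < 0x100 for ch in s):
        return "latin1"
    ascii_count = sum(1 for ch in s if ord(ch) < 0x80)
    # Assumption: an exact 50/50 split goes to utf8.
    return "utf8" if ascii_count * 2 >= len(s) else "utf16"


if __name__ == "__main__":
    for sample in ["hello", "caf\u00e9", "hello \u4e16\u754c", "\u4f60\u597d\u4e16\u754ca"]:
        print(repr(sample), "->", choose_encoding(sample))
```

A real implementation would want to do this scan in a single pass in native code rather than iterating over characters in Python.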

Fury Java also uses superword operations and a bitmask to check and write 8 bytes of ASCII at a time, which makes encoding faster.
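
To make the 8-bytes-at-once check concrete, here is a minimal Python sketch of the bitmask idea: load 8 bytes as one 64-bit word and test the high bit of every byte with a single AND. The function name `is_ascii_words` is made up for this example, and in pure Python this is not faster than the built-in `bytes.isascii()`; it only shows the logic that a Java or C++ version would apply to machine words.

```python
# High bit of each of the 8 bytes in a 64-bit word.
ASCII_MASK = 0x8080808080808080


def is_ascii_words(data: bytes) -> bool:
    """Return True if every byte is < 0x80, checking 8 bytes per step."""
    n = len(data)
    i = 0
    while i + 8 <= n:
        word = int.from_bytes(data[i:i + 8], "little")
        # Any set high bit means at least one non-ASCII byte in this word.
        if word & ASCII_MASK:
            return False
        i += 8
    # Check the remaining 0..7 bytes one at a time.
    for b in data[i:]:
        if b & 0x80:
            return False
    return True


if __name__ == "__main__":
    print(is_ascii_words(b"hello world 1234"))
    print(is_ascii_words("hello w\u00f6rld".encode("utf-8")))
```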

Here are the Fury benchmark results compared against the JDK, Kryo, and Flink string serializers:
[image: benchmark results]

For pyfury, we should do something similar. Since pyfury can call into C++ at low cost, we could implement the string encodings with SIMD in C++ and expose them to pyfury through Cython.
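
Before any C++/SIMD work, a pure-Python reference version might be useful as a baseline to test the native code against. The sketch below is only that: the function names, the one-byte encoding tag, and the 4-byte length prefix are all assumptions made for this example and are not Fury's wire format.

```python
import struct

# Hypothetical tag values; not taken from the Fury protocol.
_CODECS = {0: "latin-1", 1: "utf-16-le", 2: "utf-8"}
_HEADER = "<BI"  # 1-byte tag + 4-byte little-endian payload length


def write_string(s: str) -> bytes:
    """Encode s with latin1, utf16 or utf8 and prepend a small header."""
    try:
        # Fast path: succeeds only if every char fits in one byte.
        tag, payload = 0, s.encode("latin-1")
    except UnicodeEncodeError:
        ascii_count = sum(ch < "\x80" for ch in s)
        if ascii_count * 2 >= len(s):  # assumption: ties go to utf8
            tag, payload = 2, s.encode("utf-8")
        else:
            tag, payload = 1, s.encode("utf-16-le")
    return struct.pack(_HEADER, tag, len(payload)) + payload


def read_string(buf: bytes) -> str:
    """Decode a buffer produced by write_string."""
    tag, size = struct.unpack_from(_HEADER, buf, 0)
    start = struct.calcsize(_HEADER)
    return buf[start:start + size].decode(_CODECS[tag])


if __name__ == "__main__":
    for sample in ["hello", "h\u00e9llo", "hello \u4e16\u754c", "\u4f60\u597d\u4e16\u754ca"]:
        assert read_string(write_string(sample)) == sample
```

A Cython wrapper around a C++ implementation could then be checked for byte-for-byte agreement with a reference like this one.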

Is your feature request related to a problem? Please describe

No response

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

#1732
#1754
#1890
#1964

@chaokunyang
Collaborator Author

Hi @penguin-wwy, are you interested in this issue?

@penguin-wwy
Contributor

> Hi @penguin-wwy, are you interested in this issue?

Okay, I will implement it.
