Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Default to Arrow String type instead of LargeString #15047

Closed
kevinjqliu opened this issue Mar 14, 2024 · 4 comments
Closed

Default to Arrow String type instead of LargeString #15047

kevinjqliu opened this issue Mar 14, 2024 · 4 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

@kevinjqliu
Copy link

Description

Context

I noticed that Polars converts Python string to Arrow LargeString type when the Dataframe's .to_arrow() function is called. This can be seen in this example.

Digging into the Rust code, string is converted to the Arrow LargeUtf8 data type. (1, 2)
It looks like in Rust, there's a pl_flavor boolean flag that can be set to use the regular Arrow string instead (1, 2) but this is not available in Python.

According to the Arrow docs, LargeString "may not be supported by all Arrow implementations. Unless you need to represent data larger than 2GB, you should prefer string()." (doc).

Because of the LargeString Arrow data type, I encountered issues integrating the Python Iceberg (pyiceberg) library with Polars. See #9795 (comment). Granted the issue can be fixed upstream in the pyiceberg library by integrating LargeString apache/iceberg-python#520.

Ask

Given the above, I think it's a good idea to default to the regular Arrow string when converting from Dataframe to Arrow.

@kevinjqliu kevinjqliu added the enhancement New feature or an improvement of an existing feature label Mar 14, 2024
@ritchie46
Copy link
Member

We will not do that. Arrow default string can only hold 2GB of data per column, leading to all kinds of slicing requirements. We deem the default string utterly unusable for our use cases. You can always cast from LargeString to String and implement your own slicing if required.

@kevinjqliu
Copy link
Author

kevinjqliu commented Mar 14, 2024

Thanks for the quick reply!

We deem the default string utterly unusable for our use cases

I'm still new to Polars. What are some use cases of LargeString?

You can always cast from LargeString to String and implement your own slicing if required.

We will probably do this for pyiceberg. apache/iceberg-python#520

It looks like in Rust, there's a pl_flavor boolean flag that can be set to use the regular Arrow string instead (1, 2) but this is not available in Python.

Is it feasible to expose this boolean flag in py-polars as well?

@deanm0000
Copy link
Collaborator

pl_flavor doesn't refer to the difference between a large_string and a string. It refers to the difference between a large_string and a utf8_view which doesn't seem to be implemented in pyarrow yet.

It seems @ritchie46 intended to close this as not planned so I'll do that now. Sorry if I'm mistaken on that point.

@deanm0000 deanm0000 closed this as not planned Won't fix, can't repro, duplicate, stale Mar 18, 2024
@ritchie46
Copy link
Member

I'm still new to Polars. What are some use cases of LargeString?

Our in-memory engine favors large chunks (often single chunked dataframes). It is pretty easy to reach the 2GB string limit on user data that way.

Is it feasible to expose this boolean flag in py-polars as well?

This is to convert to string_view and is only temporary until arrow consumers implement binview.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature
Projects
None yet
Development

No branches or pull requests

3 participants