Default to Arrow String type instead of LargeString #15047
Comments
We will not do that. Arrow default string can only hold 2GB of data per column, leading to all kinds of slicing requirements. We deem the default string utterly unusable for our use cases. You can always cast from
Thanks for the quick reply!
I'm still new to Polars. What are some use cases of
We will probably do this for pyiceberg. apache/iceberg-python#520
Is it feasible to expose this boolean flag in py-polars as well?
It seems @ritchie46 intended to close this as not planned so I'll do that now. Sorry if I'm mistaken on that point.
Our in-memory engine favors large chunks (often single chunked dataframes). It is pretty easy to reach the 2GB string limit on user data that way.
This is to convert to string_view and is only temporary until arrow consumers implement
Description
Context
I noticed that Polars converts Python string to the Arrow `LargeString` type when the DataFrame's `.to_arrow()` function is called. This can be seen in this example.

Digging into the Rust code, string is converted to the Arrow `LargeUtf8` data type (1, 2).

It looks like in Rust, there's a `pl_flavor` boolean flag that can be set to use the regular Arrow string instead (1, 2), but this is not available in Python.

According to the Arrow docs, LargeString "may not be supported by all Arrow implementations. Unless you need to represent data larger than 2GB, you should prefer string()." (doc).

Because of the `LargeString` Arrow data type, I encountered issues integrating the Python Iceberg (pyiceberg) library with Polars. See #9795 (comment). Granted, the issue can be fixed upstream in the pyiceberg library by integrating LargeString: apache/iceberg-python#520.

Ask
Given the above, I think it's a good idea to default to the regular Arrow string when converting from a DataFrame to Arrow.