Default to Arrow String type instead of LargeString #15047

kevinjqliu · 2024-03-14T03:50:45Z

Description

Context

I noticed that Polars converts Python strings to the Arrow LargeString type when the DataFrame's .to_arrow() method is called. This can be seen in this example.

Digging into the Rust code, I found that strings are converted to the Arrow LargeUtf8 data type (1, 2).
It looks like in Rust, there's a pl_flavor boolean flag that can be set to use the regular Arrow string instead (1, 2) but this is not available in Python.

According to the Arrow docs, LargeString "may not be supported by all Arrow implementations. Unless you need to represent data larger than 2GB, you should prefer string()." (doc).

Because of the LargeString Arrow data type, I encountered issues integrating the Python Iceberg (pyiceberg) library with Polars. See #9795 (comment). Granted, the issue can also be fixed upstream in the pyiceberg library by adding LargeString support (apache/iceberg-python#520).

Ask

Given the above, I think it's a good idea to default to the regular Arrow string when converting from a DataFrame to Arrow.


ritchie46 · 2024-03-14T08:34:32Z

We will not do that. Arrow default string can only hold 2GB of data per column, leading to all kinds of slicing requirements. We deem the default string utterly unusable for our use cases. You can always cast from LargeString to String and implement your own slicing if required.

kevinjqliu · 2024-03-14T15:05:24Z

Thanks for the quick reply!

> We deem the default string utterly unusable for our use cases

I'm still new to Polars. What are some use cases of LargeString?

> You can always cast from LargeString to String and implement your own slicing if required.

We will probably do this for pyiceberg. apache/iceberg-python#520

> It looks like in Rust, there's a pl_flavor boolean flag that can be set to use the regular Arrow string instead (1, 2) but this is not available in Python.

Is it feasible to expose this boolean flag in py-polars as well?

deanm0000 · 2024-03-18T17:27:15Z

pl_flavor doesn't refer to the difference between a large_string and a string. It refers to the difference between a large_string and a utf8_view, which doesn't seem to be implemented in pyarrow yet.

It seems @ritchie46 intended to close this as not planned so I'll do that now. Sorry if I'm mistaken on that point.

ritchie46 · 2024-03-18T17:36:17Z

> I'm still new to Polars. What are some use cases of LargeString?

Our in-memory engine favors large chunks (often single chunked dataframes). It is pretty easy to reach the 2GB string limit on user data that way.
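To make the limit concrete: the Arrow String type stores 32-bit signed offsets, so a single array (chunk) can address at most 2**31 - 1 bytes of string data, while LargeString uses 64-bit offsets. The back-of-the-envelope sketch below uses made-up row counts and average lengths, and the helper names are hypothetical. The chunk count is a lower bound, since real slicing has to split on row boundaries.

```python
INT32_MAX = 2**31 - 1  # largest byte offset a String array (32-bit offsets) can store


def fits_in_string_array(total_bytes: int) -> bool:
    """Return True if the string data fits in one Arrow String array."""
    return total_bytes <= INT32_MAX


def min_string_chunks(total_bytes: int) -> int:
    """Lower bound on the number of String chunks needed (ceiling division)."""
    return max(1, -(-total_bytes // INT32_MAX))


# Hypothetical single-chunk column: 50 million rows averaging 60 bytes each.
rows = 50_000_000
avg_bytes_per_value = 60
total_bytes = rows * avg_bytes_per_value  # 3_000_000_000 bytes

print(fits_in_string_array(total_bytes))  # False: exceeds the 32-bit offset range
print(min_string_chunks(total_bytes))     # 2
```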

> Is it feasible to expose this boolean flag in py-polars as well?

This is to convert to string_view and is only temporary until arrow consumers implement binview.

kevinjqliu added the enhancement label Mar 14, 2024

deanm0000 closed this as not planned Mar 18, 2024

Sara-ShiHo mentioned this issue Apr 10, 2024: polars.DataFrame.to_arrow() unnecessarily sets string columns as large_string type #15589 (Closed)

adamreeve mentioned this issue Jul 14, 2024: [C#] Support new data types apache/arrow#34736 (Open)

cmdlineluser mentioned this issue Jul 15, 2024: write_parquet(..., pyarrow_options={'partition_cols': [..., ]}) munges partition column #17619 (Open)
