SOLUTION IN COMMENTS: write_parquet() using pyarrow with a "partition_cols" of type "str" maps all partition values to "...". This is NOT an issue in pyarrow.parquet.write_to_dataset #15181

Open
2 tasks done
fpbeekhof1977 opened this issue Mar 20, 2024 · 2 comments
Labels
bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars

Comments

@fpbeekhof1977

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

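import polars as pl
import pyarrow as pa
import pyarrow.parquet as pq  # import the submodule explicitly rather than relying on pa.parquet being loaded

data = {'k': [10, 20], 'col1': [1, 2], 'col2': [3, 4], 'p': ['a', 'b']}
# Polars path: every partition directory is named 'p=...'
d = pl.DataFrame(data)
d.write_parquet('/tmp/d-polars', use_pyarrow=True, pyarrow_options={'partition_cols': ['p']})
# Pure pyarrow path: partition directories are named 'p=a' and 'p=b'
pq.write_to_dataset(pa.Table.from_pydict(data), '/tmp/d-pyarrow', partition_cols=['p'])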

Now try:

$ ls -lR /tmp/d-polars
'p=...'
$ ls -lR /tmp/d-pyarrow
'p=a'  'p=b'

Log output

No response

Issue description

All values of the partition column are mapped onto the string "..." rather than their value.

Expected behavior

Partition columns of string type are correctly written with their value intact.

Installed versions

--------Version info---------
Polars:               0.20.16
Index type:           UInt32
Platform:             Linux-5.15.0-1055-aws-x86_64-with-glibc2.35
Python:               3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               2024.3.1
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           3.7.0
numpy:                1.23.5
openpyxl:             <not installed>
pandas:               1.5.3
pyarrow:              8.0.0
pydantic:             1.10.6
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
@fpbeekhof1977 added the labels bug, needs triage, and python on Mar 20, 2024
@fpbeekhof1977 changed the title from 'write_parquet() using pyarrow a partition_col of type "str" maps all partition values to "...". This is not a pyarrow issue.' to 'write_parquet() using pyarrow with a "partition_cols" of type "str" maps all partition values to "...". This is not a pyarrow issue.' on Mar 20, 2024
@fpbeekhof1977 changed the title from 'write_parquet() using pyarrow with a "partition_cols" of type "str" maps all partition values to "...". This is not a pyarrow issue.' to 'write_parquet() using pyarrow with a "partition_cols" of type "str" maps all partition values to "...". This is an issue in pyarrow.parquet.write_to_dataset' on Mar 26, 2024
@fpbeekhof1977 changed the title from 'write_parquet() using pyarrow with a "partition_cols" of type "str" maps all partition values to "...". This is an issue in pyarrow.parquet.write_to_dataset' to 'write_parquet() using pyarrow with a "partition_cols" of type "str" maps all partition values to "...". This is NOT an issue in pyarrow.parquet.write_to_dataset' on Mar 26, 2024
@fpbeekhof1977
Author

The issue does not appear when the dataset is created directly in pyarrow.
However, it does appear when a Polars DataFrame is converted to a pyarrow Table and then written with pyarrow's parquet.write_to_dataset method.

@fpbeekhof1977
Author

fpbeekhof1977 commented Mar 26, 2024

The reason is that polars' ".to_arrow()" converts "str" columns to pyarrow's "large_string" rather than "string".
The partitions are correctly written if the pyarrow type is "string" rather than "large_string".

SOLUTION: "to_arrow()" should map columns of type "str" onto pyarrow's "string", not "large_string".


@fpbeekhof1977 changed the title from 'write_parquet() using pyarrow with a "partition_cols" of type "str" maps all partition values to "...". This is NOT an issue in pyarrow.parquet.write_to_dataset' to 'SOLUTION IN COMMENTS: write_parquet() using pyarrow with a "partition_cols" of type "str" maps all partition values to "...". This is NOT an issue in pyarrow.parquet.write_to_dataset' on Mar 26, 2024