Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

write_parquet(..., pyarrow_options={'partition_cols': [..., ]}) munges partition column #17619

Open
2 tasks done
wasade opened this issue Jul 13, 2024 · 5 comments
Open
2 tasks done
Labels
bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars

Comments

@wasade
Copy link

wasade commented Jul 13, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

In [93]: pl.__version__
Out[93]: '1.1.0'

In [94]: df = pl.DataFrame([[0, 1, 2, 'a'], [3,4,5, 'b'], [6,7,8, 'c']], orient='row')

In [95]: df
Out[95]: 
shape: (3, 4)
┌──────────┬──────────┬──────────┬──────────┐
│ column_0column_1column_2column_3 │
│ ------------      │
│ i64i64i64str      │
╞══════════╪══════════╪══════════╪══════════╡
│ 012a        │
│ 345b        │
│ 678c        │
└──────────┴──────────┴──────────┴──────────┘

In [96]: df.write_parquet('../ex', use_pyarrow=True, pyarrow_options={'partition_cols': ['column_3', ]})

In [97]: ls ../ex
total 192K
drwxr-xr-x 2 dtmcdonald 4.0K Jul 13 15:08 column_3=[?  "a"?]
drwxr-xr-x 2 dtmcdonald 4.0K Jul 13 15:08 column_3=[?  "b"?]
drwxr-xr-x 2 dtmcdonald 4.0K Jul 13 15:08 column_3=[?  "c"?]

In [98]: pl.scan_parquet('../ex').collect()
Out[98]: 
shape: (3, 4)
┌──────────┬──────────┬──────────┬──────────┐
│ column_0column_1column_2column_3 │
│ ------------      │
│ i64i64i64str      │
╞══════════╪══════════╪══════════╪══════════╡
│ 012        ┆ [        │
│          ┆          ┆          ┆   "a"    │
│          ┆          ┆          ┆ ]        │
│ 345        ┆ [        │
│          ┆          ┆          ┆   "b"    │
│          ┆          ┆          ┆ ]        │
│ 678        ┆ [        │
│          ┆          ┆          ┆   "c"    │
│          ┆          ┆          ┆ ]        │
└──────────┴──────────┴──────────┴──────────┘

Log output

No output generated

Issue description

Use of pyarrow_options={'partition_cols': [..., ]} appears to mangle the column being partitioned. As far as I can tell, the example is consistent with the documentation.

Not sure if relevant, but it appears that the partitioned column here is regarded as a large_string even though each value is a single character.

In [9]: pa = df.to_arrow()

In [10]: pa
Out[10]: 
pyarrow.Table
column_0: int64
column_1: int64
column_2: int64
column_3: large_string
----
column_0: [[0,3,6]]
column_1: [[1,4,7]]
column_2: [[2,5,8]]
column_3: [["a","b","c"]]

Thank you for your time and effort. Please let me know if additional information is helpful, or if there is a suggested direction which I could follow to issue a PR (if one is warranted).

Expected behavior

I would expect the output directory names to correspond to the values on write, and I would expect the column values to be unchanged on read.

Installed versions

In [8]: pl.show_versions()
--------Version info---------
Polars:               1.1.0
Index type:           UInt32
Platform:             Linux-3.10.0-1160.el7.x86_64-x86_64-with-glibc2.17
Python:               3.11.3 (main, Apr 19 2023, 23:54:32) [GCC 11.2.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            0.18.0
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
great_tables:         <not installed>
hvplot:               <not installed>
matplotlib:           3.8.3
nest_asyncio:         <not installed>
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.1
pyarrow:              10.0.1
pydantic:             <not installed>
pyiceberg:            <not installed>
sqlalchemy:           <not installed>
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
@wasade wasade added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Jul 13, 2024
@cmdlineluser
Copy link
Contributor

Just a mention that .write_parquet_partitioned() was added in 1.1.0 which may be a possible workaround.

@wasade
Copy link
Author

wasade commented Jul 13, 2024

Thank you @cmdlineluse! I'll take it for a spin at the next chance

@Julian-J-S
Copy link
Contributor

Just a mention that .write_parquet_partitioned() was added in 1.1.0 which may be a possible workaround.

Not sure if this gets removed again but partitioning was added to the original functions in #17586

@wasade
Copy link
Author

wasade commented Jul 15, 2024

On the surface, this work around seems to work although I don't believe it supports append right now which would be a nice to have.

@cmdlineluser
Copy link
Contributor

The problem does seem to be large_string as you mentioned.

If we use pyarrow directly:

import pyarrow as pa
import pyarrow.parquet as pq

tbl = pa.table(
    {
        "year": [2020, 2022, 2021],                
        "animal": ["a", "b", "c"]
    }, 
    schema=pa.schema([("year", pa.int16()), ("animal", pa.large_string())])
)

pq.write_to_dataset(tbl, root_path="17619", partition_cols=["animal"])
>>> !ls 17619
animal=[?  "a"?] animal=[?  "b"?] animal=[?  "c"?]

Someone else seems to have had the same problem in #15181 (although they say they get ... instead which I cannot replicate.)

It seems some others have asked why large_string is used.

I can't seem to find any pyarrow docs/issues but it seems partition_cols must not be of type large_string?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars
Projects
None yet
Development

No branches or pull requests

3 participants