[Python] Converting data frame to Table with large nested column fails Invalid Struct child array has length smaller than expected #32440

Closed
asfimport opened this issue Jul 20, 2022 · 1 comment

asfimport commented Jul 20, 2022

Hey, 

I have a data frame in which one column is a nested struct array. Converting it to a pyarrow.Table fails once the data frame gets too big. I can reproduce the bug with the minimal example below, which uses anonymized data roughly similar to mine. With N_ROWS set to 500_000 or smaller, the conversion works fine.

import pandas as pd
import pyarrow as pa

N_ROWS = 800_000
item_record = {
    "someImportantAssets": [
        {
            "square": "https://some.super.loooooooooong.link.com/withmany/lorem/upload/ipsum/stilllooooooooooonger/lorem/{someparameter}/156fdjjf644984dfdfaera648/specificLink-i15348891"
        }
    ],
    "id": "i15348891",
    "title": "Some Long Item Title i15348891",
}

user_record = {
    "userId": "faa4648-4964drf-64648fafa648-4648falj",
    "data": [item_record for _ in range(24)],
}

df = pd.DataFrame([user_record for _ in range(N_ROWS)])
table = pa.Table.from_pandas(df)

The conversion fails with the following traceback:

Traceback (most recent call last):
    table = pa.Table.from_pandas(df)
  File "pyarrow/table.pxi", line 1658, in pyarrow.lib.Table.from_pandas
  File "pyarrow/table.pxi", line 1702, in pyarrow.lib.Table.from_arrays
  File "pyarrow/table.pxi", line 1314, in pyarrow.lib.Table.validate
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Column 1: In chunk 0: Invalid: List child array invalid: Invalid: Struct child array #1 invalid: Invalid: List child array invalid: Invalid: Struct child array #0 has length smaller than expected for struct array (13256071 < 13256072)

The length is always smaller than expected by 1.
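
A possible explanation, offered as an assumption rather than a confirmed diagnosis: Arrow's default `string` type uses 32-bit offsets, so a single array can address at most 2^31 - 1 bytes of character data. The sketch below checks the numbers from the error message against that limit. The per-string size of 162 bytes is inferred from the error message rather than measured, and the helper names are hypothetical.

```python
# Rough arithmetic check. The 162-byte string size is an assumption inferred
# from the error message, not a measured value.
INT32_MAX = 2**31 - 1  # largest offset a 32-bit-offset string array can store

ITEMS_PER_ROW = 24  # length of the "data" list in each user record


def max_strings_per_array(bytes_per_string: int) -> int:
    """Number of equal-sized strings that fit under the 32-bit offset limit."""
    return INT32_MAX // bytes_per_string


def total_string_bytes(n_rows: int, bytes_per_string: int) -> int:
    """Total bytes of the innermost "square" strings for n_rows user records."""
    return n_rows * ITEMS_PER_ROW * bytes_per_string


assumed_len = 162
print(max_strings_per_array(assumed_len))                      # 13256071
print(total_string_bytes(500_000, assumed_len) <= INT32_MAX)   # True
print(total_string_bytes(800_000, assumed_len) <= INT32_MAX)   # False
```

If this reading is right, 13256071 is exactly the number of 162-byte strings that fit before the offsets would overflow, which would account for the child array being one element short and for 500_000 rows converting without error.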

Expected behavior:

Run without errors or fail with a better error message.

System Info and Versions:

Apple M1 Pro, but it also happened on an amd64 Linux machine on AWS.

arrow-cpp                 7.0.0           py39h8a997f0_8_cpu    conda-forge
pyarrow                   7.0.0           py39h3a11367_8_cpu    conda-forge

python                    3.9.7           h54d631c_3_cpython    conda-forge

I could also reproduce it with pyarrow 8.0.0.

Reporter: Simon Weiß

Note: This issue was originally created as ARROW-17138. Please see the migration documentation for further details.

hadim:
I can confirm the same bug with pyarrow 6, 7, 8 and 9.

Is there a workaround while waiting for a fix?
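
One possible workaround, sketched here under assumptions and untested against the full 800_000-row data: convert the data frame in row slices small enough to succeed on their own, then concatenate the resulting tables. `row_slices` and `table_from_pandas_in_batches` are hypothetical helper names, and the default `batch_rows` is a guess based on the report that 500_000 rows convert fine.

```python
from typing import Iterator, Tuple


def row_slices(n_rows: int, batch_rows: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, stop) row bounds covering n_rows in steps of batch_rows."""
    if batch_rows <= 0:
        raise ValueError("batch_rows must be positive")
    for start in range(0, n_rows, batch_rows):
        yield start, min(start + batch_rows, n_rows)


def table_from_pandas_in_batches(df, batch_rows: int = 250_000):
    """Convert df slice by slice, then concatenate the per-slice tables.

    batch_rows is a guess; pick it so that each slice stays well below the
    size at which the single-call conversion starts to fail.
    """
    import pyarrow as pa  # imported lazily so row_slices has no dependencies

    tables = [
        # preserve_index=False drops the pandas index, which keeps the
        # per-slice schemas identical for concat_tables.
        pa.Table.from_pandas(df.iloc[start:stop], preserve_index=False)
        for start, stop in row_slices(len(df), batch_rows)
    ]
    if not tables:
        # Empty frame: nothing to concatenate, so convert it directly.
        return pa.Table.from_pandas(df, preserve_index=False)
    return pa.concat_tables(tables)
```

With the example from the report, this would be called as `table = table_from_pandas_in_batches(df)`. The columns of the result are chunked arrays, so code that expects a single contiguous chunk per column may need adjusting.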
