
[Python] Converting data frame to Table with large nested column fails Invalid Struct child array has length smaller than expected #32439

Closed
asfimport opened this issue Jul 20, 2022 · 9 comments · Fixed by #37376

Comments

@asfimport (Collaborator) commented Jul 20, 2022

Hey,

I have a data frame in which one column is a nested struct array. Converting it to a pyarrow.Table fails once the data frame gets too big. I could reproduce the bug with a minimal example using anonymized data that is roughly similar to mine. With N_ROWS=500_000 or smaller, it works fine.

 

```python
import pandas as pd
import pyarrow as pa

N_ROWS = 800_000
item_record = {
    "someImportantAssets": [
        {
            "square": "https://some.super.loooooooooong.link.com/withmany/lorem/upload/"
            "ipsum/stilllooooooooooonger/lorem/{someparameter}/156fdjjf644984dfdfaera64"
            "/specificLink-i15348891"
        }
    ],
    "id": "i15348891",
    "title": "Some Long Item Title i15348891",
}
user_record = {
    "userId": "faa4648-4964drf-64648fafa648-4648falj",
    "data": [item_record for _ in range(24)],
}
df = pd.DataFrame([user_record for _ in range(N_ROWS)])
table = pa.Table.from_pandas(df)
```

 

```
Traceback (most recent call last):
  File "/.../scratch/experiment_pq.py", line 23, in <module>
    table = pa.Table.from_pandas(df)
  File "pyarrow/table.pxi", line 3472, in pyarrow.lib.Table.from_pandas
  File "pyarrow/table.pxi", line 3574, in pyarrow.lib.Table.from_arrays
  File "pyarrow/table.pxi", line 2793, in pyarrow.lib.Table.validate
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Column 1: In chunk 0: Invalid: List child array invalid: Invalid: Struct child array #1 invalid: Invalid: List child array invalid: Invalid: Struct child array #0 has length smaller than expected for struct array (13338407 < 13338408)
```

The length is always smaller than expected by 1.

Expected behavior:

Run without errors or fail with a better error message.

System Info and Versions:

Apple M1 Pro, but it also happened on an amd64 Linux machine on AWS:

 

```
arrow-cpp                 7.0.0           py39h8a997f0_8_cpu    conda-forge
pyarrow                   7.0.0           py39h3a11367_8_cpu    conda-forge
python                    3.9.7           h54d631c_3_cpython    conda-forge
```

I could also reproduce with pyarrow 8.0.0.

Reporter: Simon Weiß


Note: This issue was originally created as ARROW-17137. Please see the migration documentation for further details.


Joris Van den Bossche / @jorisvandenbossche:
[~SimonCW] thanks for the report! I can confirm the error on the latest master branch as well (on Linux).


hadim:
I can also confirm the bug, for the same reasons, with pyarrow 6, 7, 8, and 9.

Is there a workaround while waiting for a fix?


Joris Van den Bossche / @jorisvandenbossche:
A possible workaround is to do the conversion in chunks. Something like:

```python
N = 100_000  # chunk size
pa.Table.from_batches([pa.record_batch(df.iloc[i:i+N]) for i in range(0, len(df), N)])
```
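For reference, here is a self-contained sketch of that workaround applied to the reproducer above (the helper name and default chunk size are illustrative, not part of pyarrow; the chunk size just needs to be small enough that no slice exceeds the 32-bit string capacity):

```python
import pandas as pd
import pyarrow as pa

def table_from_pandas_chunked(df: pd.DataFrame, chunk_size: int = 100_000) -> pa.Table:
    # Convert the frame slice by slice so that no single conversion
    # accumulates a string child array longer than 2**31 characters.
    batches = [
        pa.record_batch(df.iloc[i : i + chunk_size])
        for i in range(0, len(df), chunk_size)
    ]
    return pa.Table.from_batches(batches)
```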

@CribberSix commented Jun 5, 2023

The same error appears when converting a list of dicts in Python (JSON data) to a pa.Table with pa.Table.from_pylist(mapping=messages, schema=schema).

I will test the workaround now. Is there any chance this will be fixed soon?

Workaround in my case is quite similar:

```python
pa.Table.from_batches(
    batches=[
        pa.record_batch(data=messages[i : i + chunk_size], schema=schema)
        for i in range(0, len(messages), chunk_size)
    ],
    schema=schema,
)
```

@westonpace (Member) commented:

The issue is going to happen any time a single string column ends up with more than 2^31 characters. So, in the OP's reproduction, the square column has 161 characters per string and 800,000 * 24 strings, which is 3,091,200,000 characters; 2^31 is 2,147,483,648. At this point we have to split the resulting array into chunks (or use the large_string data type, but that has issues of its own).
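For a flat string column, the large_string alternative can be sketched like this (illustrative only; it allocates roughly 3 GB, and since large_string uses 64-bit offsets I'd expect a single chunk):

```python
import pandas as pd
import pyarrow as pa

x = "0" * 1_000_000_000
df = pd.DataFrame({"strings": [x, x, x]})  # > 2**31 characters in total

# Default string type (32-bit offsets): the column is split into chunks.
print(pa.Table.from_pandas(df).column(0).num_chunks)  # > 1

# large_string (64-bit offsets): expect a single chunk.
schema = pa.schema([("strings", pa.large_string())])
print(pa.Table.from_pandas(df, schema=schema).column(0).num_chunks)  # 1
```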

This "breaking unexpectedly large columns into chunks" behavior is rather tricky and it appears we are doing something wrong when working with lists of struct arrays. Here's a compact reproducer (that only has 3 rows):

```python
import pyarrow as pa
import pandas as pd

x = "0" * 1000000000
df = pd.DataFrame({"strings": [x, x, x]})
tab = pa.Table.from_pandas(df)
print(tab.column(0).num_chunks)

struct = {"struct_field": x}
df = pd.DataFrame({"structs": [struct, struct, struct]})
tab = pa.Table.from_pandas(df)
print(tab.column(0).num_chunks)

lists = [x]
df = pd.DataFrame({"lists": [lists, lists, lists]})
tab = pa.Table.from_pandas(df)
print(tab.column(0).num_chunks)

los = [struct]
df = pd.DataFrame({"los": [los, los, los]})
tab = pa.Table.from_pandas(df)
print(tab.column(0).num_chunks)
```

It seems the struct array has length 3, while its child, the string array, has length 2 (because it had to be broken into two chunks: the first chunk has the first two values and the second chunk has the third).

So if someone wanted to investigate this, I would recommend starting with the conversion-from-pandas code and looking at how the struct and list arrays handle the case where their children are converted into multiple chunks.

@phyyou commented Jul 24, 2023

👀 Any updates for this issue?

@CribberSix commented:

The workaround works, but it definitely slows things down.

@mikelui (Contributor) commented Aug 22, 2023

I think the root cause is here:

In the blanket implementation of Array::Slice, only the offset and length of the underlying ArrayData are modified.

For ListArrays, for example, this means that when looking at the array's length and traversing its internal value offsets (its ArrayData), things will look as expected. But when we check the underlying child_data (the actual values ArrayData), we won't have the same sliced view. This is especially pertinent (I think) when rewinding is needed after hitting capacity limits. I surmise the final string child_data element for the struct (in the example) was never populated (length 2) even though the StructArray had already been populated with 3 elements. Then we slice the top-level ListArray while holding onto the StructArray data that has 3 elements, the last one being invalid.
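The shallow-slice behavior is easy to observe from Python (a small sketch; note that ListArray.values ignores the parent's slice, while flatten() applies it):

```python
import pyarrow as pa

arr = pa.array([[1, 2], [3], [4, 5, 6]])
sliced = arr.slice(1)         # only the top-level ArrayData's offset/length change
print(len(sliced))            # 2  (the logical view)
print(len(sliced.values))     # 6  (child_data still holds all six values)
print(len(sliced.flatten()))  # 4  (flatten() honors the parent's slice)
```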

The resulting ChunkedArray column from this conversion is passed back up to pyarrow and directly passed to Table::Make, with no other handling for chunked-array awareness.

I did a janky sanity test that makes Array::Slice virtual and overrides it in ListArray to also slice the offset/length of the child_data, and that seemed to work. But it's probably not what we want as an actual fix.

I'm guessing this will be the same for all nested types. Hence the validation error, which probably isn't expecting slices that carry extra data.

I'm not sure what the best path forward is here. It's not clear to me:

  • whether the Slice is actually what we want (do we want to keep the extra data around?)
  • how to support "deep slicing" for all the other less common types I don't have experience with (Unions, Dictionary, etc)

@westonpace can you advise if this sounds like the right place, and what next steps should look like?


EDIT: Actually I think I can make it work by overriding ToArray(length) in the PyConverters and doing some logic to check if shrinking is necessary. It still would be essentially a custom Slice implementation. But, it at least constrains the scope to just the python -> arrow conversions. I'll work on a PR soon.

@westonpace (Member) commented:

@mikelui

> In the blanket implementation of Array::Slice, only the offset and length of the underlying ArrayData are modified.

> whether the Slice is actually what we want (do we want to keep the extra data around?)

If memory serves, this is correct (I am not quite 100% certain of this paragraph, but "fairly confident" :) maybe someone else will confirm) and intended behavior. The length of a parent array may be shorter than the length of a child array. Offsets are cumulative: if a parent array has offset 10 and a child array has offset 5, then the memory location examined for the 0th value will be the 15th slot.
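A small sketch of that cumulative-offset rule, using a one-field struct (the field name is made up):

```python
import pyarrow as pa

backing = pa.array(range(100))
child = backing.slice(5)      # child ArrayData carries offset 5
struct = pa.StructArray.from_arrays([child], names=["f"])
view = struct.slice(10)       # parent ArrayData carries offset 10 on top

# field() applies the parent's offset to the child's own offset, so the
# 0th visible value lives at physical slot 15 of the backing buffer.
print(view.field("f")[0])     # 15
```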

> EDIT: Actually I think I can make it work by overriding ToArray(length) in the PyConverters and doing some logic to check if shrinking is necessary. It still would be essentially a custom Slice implementation. But, it at least constrains the scope to just the python -> arrow conversions. I'll work on a PR soon.

I'm not very familiar with the Python converters. However, I think we have a good number of regression tests. If you can get them to pass while also handling this case, then you are probably on the right track.

pitrou added a commit that referenced this issue Oct 10, 2023
…37376)

### Rationale for this change

See: #32439 

### What changes are included in this PR?

During conversion from Python to Arrow, when a struct's child hits a capacity error and chunking is triggered, this can leave the Finish'd chunk in an invalid state since the struct's length does not match the length of its children.

This change simply tries to Append the children first, and only if that succeeds does it Append the struct. This is safe because the order of Append'ing between the struct and its children is not specified; it is only specified that they must be consistent with each other.

This is per: 

https://github.com/apache/arrow/blob/86b7a84c9317fa08222eb63f6930bbb54c2e6d0b/cpp/src/arrow/array/builder_nested.h#L507-L508

### Are these changes tested?

A unit test is added that previously failed with an invalid data error.

```
>       tab = pa.Table.from_pandas(df)

pyarrow/tests/test_pandas.py:4970: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pyarrow/table.pxi:3788: in pyarrow.lib.Table.from_pandas
    return cls.from_arrays(arrays, schema=schema)
pyarrow/table.pxi:3890: in pyarrow.lib.Table.from_arrays
    result.validate()
pyarrow/table.pxi:3170: in pyarrow.lib.Table.validate
    check_status(self.table.Validate())

# ...

FAILED pyarrow/tests/test_pandas.py::test_nested_chunking_valid - pyarrow.lib.ArrowInvalid: Column 0: In chunk 0: Invalid: List child array invalid: Invalid: Struct child array #0 has length smaller than expected for struct array (2 < 3)
```

NOTE: This unit test uses about 7 GB of memory (max RSS) on my MacBook Pro. This might make CI challenging; I'm open to suggestions to limit it.

### Are there any user-facing changes?

No
* Closes: #32439

Lead-authored-by: Mike Lui <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
@pitrou pitrou added this to the 14.0.0 milestone Oct 10, 2023
JerAguilon pushed a commit to JerAguilon/arrow that referenced this issue Oct 23, 2023
loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024