
[Python] Converting data frame to Table with large nested column fails Invalid Struct child array has length smaller than expected #32439

Closed
asfimport opened this issue Jul 20, 2022 · 9 comments · Fixed by #37376

Comments

@asfimport (Collaborator) commented Jul 20, 2022

Hey,

I have a data frame in which one column is a nested struct array. Converting it to a pyarrow.Table fails once the data frame gets too big. I could reproduce the bug with a minimal example using anonymized data that is roughly similar to mine. With N_ROWS=500_000 or smaller, it works fine.

 

```python
import pandas as pd
import pyarrow as pa

N_ROWS = 800_000
item_record = {
    "someImportantAssets": [
        {
            "square": "https://some.super.loooooooooong.link.com/withmany/lorem/upload/"
            "ipsum/stilllooooooooooonger/lorem/{someparameter}/156fdjjf644984dfdfaera64"
            "/specificLink-i15348891"
        }
    ],
    "id": "i15348891",
    "title": "Some Long Item Title i15348891",
}
user_record = {
    "userId": "faa4648-4964drf-64648fafa648-4648falj",
    "data": [item_record for _ in range(24)],
}
df = pd.DataFrame([user_record for _ in range(N_ROWS)])
table = pa.Table.from_pandas(df)
```

 

```
Traceback (most recent call last):
  File "/.../scratch/experiment_pq.py", line 23, in <module>
    table = pa.Table.from_pandas(df)
  File "pyarrow/table.pxi", line 3472, in pyarrow.lib.Table.from_pandas
  File "pyarrow/table.pxi", line 3574, in pyarrow.lib.Table.from_arrays
  File "pyarrow/table.pxi", line 2793, in pyarrow.lib.Table.validate
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Column 1: In chunk 0: Invalid: List child array invalid: Invalid: Struct child array #1 invalid: Invalid: List child array invalid: Invalid: Struct child array #0 has length smaller than expected for struct array (13338407 < 13338408)
```

The length is always smaller than expected by 1.

Expected behavior:

Run without errors or fail with a better error message.

System Info and Versions:

Apple M1 Pro, but it also happened on an amd64 Linux machine on AWS:

 

```
arrow-cpp                 7.0.0           py39h8a997f0_8_cpu    conda-forge
pyarrow                   7.0.0           py39h3a11367_8_cpu    conda-forge
python                    3.9.7           h54d631c_3_cpython    conda-forge
```

I could also reproduce with pyarrow 8.0.0.

Reporter: Simon Weiß


Note: This issue was originally created as ARROW-17137. Please see the migration documentation for further details.


Joris Van den Bossche / @jorisvandenbossche:
[~SimonCW] thanks for the report! I can confirm the error on the latest master branch as well (on Linux).


hadim:
I can also confirm the bug, for the same reasons, with pyarrow 6, 7, 8, and 9.

Is there a workaround while waiting for a fix?


Joris Van den Bossche / @jorisvandenbossche:
A possible workaround is to do the conversion in chunks. Something like:

```python
N = 100_000  # chunk size
pa.Table.from_batches([pa.record_batch(df.iloc[i:i+N]) for i in range(0, len(df), N)])
```
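For reference, here is a self-contained sketch of that workaround applied to the reproducer above (the helper name and default chunk size are illustrative, not part of pyarrow; the chunk size just needs to be small enough that no slice exceeds the 32-bit string capacity):

```python
import pandas as pd
import pyarrow as pa

def table_from_pandas_chunked(df: pd.DataFrame, chunk_size: int = 100_000) -> pa.Table:
    # Convert the frame slice by slice so that no single conversion
    # accumulates a string child array longer than 2**31 characters.
    batches = [
        pa.record_batch(df.iloc[i : i + chunk_size])
        for i in range(0, len(df), chunk_size)
    ]
    return pa.Table.from_batches(batches)
```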

@CribberSix commented Jun 5, 2023

The same error appears when converting a list of dicts in Python (JSON data) to a pa.Table with pa.Table.from_pylist(mapping=messages, schema=schema).

I will test the workaround now. Is there any chance this will be fixed soon?

Workaround in my case is quite similar:

```python
pa.Table.from_batches(
    batches=[
        pa.record_batch(data=messages[i : i + chunk_size], schema=schema)
        for i in range(0, len(messages), chunk_size)
    ],
    schema=schema,
)
```

@westonpace (Member) commented:

The issue is going to happen any time a single string column ends up with more than 2^31 characters. So, in the OP's reproduction, the square column has 161 characters per string and 800,000 * 24 strings, which is 3,091,200,000 characters; 2^31 is 2,147,483,648. At this point we have to split the resulting array into chunks (or use the large_string data type, but that has issues of its own).
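For a flat string column, the large_string alternative can be sketched like this (illustrative only; it allocates roughly 3 GB, and since large_string uses 64-bit offsets I'd expect a single chunk):

```python
import pandas as pd
import pyarrow as pa

x = "0" * 1_000_000_000
df = pd.DataFrame({"strings": [x, x, x]})  # > 2**31 characters in total

# Default string type (32-bit offsets): the column is split into chunks.
print(pa.Table.from_pandas(df).column(0).num_chunks)  # > 1

# large_string (64-bit offsets): expect a single chunk.
schema = pa.schema([("strings", pa.large_string())])
print(pa.Table.from_pandas(df, schema=schema).column(0).num_chunks)  # 1
```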

This "breaking unexpectedly large columns into chunks" behavior is rather tricky and it appears we are doing something wrong when working with lists of struct arrays. Here's a compact reproducer (that only has 3 rows):

```python
import pyarrow as pa
import pandas as pd

x = "0" * 1000000000
df = pd.DataFrame({"strings": [x, x, x]})
tab = pa.Table.from_pandas(df)
print(tab.column(0).num_chunks)

struct = {"struct_field": x}
df = pd.DataFrame({"structs": [struct, struct, struct]})
tab = pa.Table.from_pandas(df)
print(tab.column(0).num_chunks)

lists = [x]
df = pd.DataFrame({"lists": [lists, lists, lists]})
tab = pa.Table.from_pandas(df)
print(tab.column(0).num_chunks)

los = [struct]
df = pd.DataFrame({"los": [los, los, los]})
tab = pa.Table.from_pandas(df)
print(tab.column(0).num_chunks)
```

It seems the struct array has length 3, while its child, the string array, has length 2 (because it had to be broken into two chunks: the first chunk has the first two values and the second chunk has the third).

So if someone wanted to investigate this, I would recommend starting with the conversion-from-pandas code and looking at how the struct and list arrays handle the case where their children are converted into multiple chunks.

@phyyou commented Jul 24, 2023

👀 Any updates for this issue?

@CribberSix commented:

The workaround works, but it definitely slows things down.

@mikelui (Contributor) commented Aug 22, 2023

I think the root cause is here:

In the blanket implementation of Array::Slice, only the offset and length of the underlying ArrayData are modified.

For ListArrays, for example, this means that when looking at the array's length and traversing its internal value offsets (its ArrayData), things will look as expected. But when we check the underlying child_data (the actual values ArrayData), we won't have the same sliced view. This is especially pertinent (I think) when rewinding is needed after hitting capacity limits. I surmise the final string child_data element for the struct (in the example) was never populated (length 2) even though the StructArray had already been populated with 3 elements. Then we slice the top-level ListArray while holding onto the StructArray data that has 3 elements, the last one being invalid.
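The shallow-slice behavior is easy to observe from Python (a small sketch; note that ListArray.values ignores the parent's slice, while flatten() applies it):

```python
import pyarrow as pa

arr = pa.array([[1, 2], [3], [4, 5, 6]])
sliced = arr.slice(1)         # only the top-level ArrayData's offset/length change
print(len(sliced))            # 2  (the logical view)
print(len(sliced.values))     # 6  (child_data still holds all six values)
print(len(sliced.flatten()))  # 4  (flatten() honors the parent's slice)
```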

The resulting ChunkedArray column from this conversion is passed back up to pyarrow and directly passed to Table::Make, with no other handling for chunked-array awareness.

I did a janky sanity test that makes Array::Slice virtual and overrides it in ListArray to also slice the offset/length of the child_data, and that seemed to work. But it's probably not what we want as an actual fix.

I'm guessing this will be the same for all nested types. Hence the validation error, which probably isn't expecting slices that carry extra data.

I'm not sure what the best path forward is here. It's not clear to me:

  • whether the Slice is actually what we want (do we want to keep the extra data around?)
  • how to support "deep slicing" for all the other less common types I don't have experience with (Unions, Dictionary, etc)

@westonpace can you advise if this sounds like the right place, and what next steps should look like?


EDIT: Actually I think I can make it work by overriding ToArray(length) in the PyConverters and doing some logic to check if shrinking is necessary. It still would be essentially a custom Slice implementation. But, it at least constrains the scope to just the python -> arrow conversions. I'll work on a PR soon.

@westonpace (Member) commented:

@mikelui

> In the blanket implementation of Array::Slice, only the offset and length of the underlying ArrayData are modified.

> whether the Slice is actually what we want (do we want to keep the extra data around?)

If memory serves, this is correct (I am not quite 100% certain of this paragraph, but "fairly confident" :) maybe someone else will confirm) and intended behavior. The length of a parent array may be shorter than the length of a child array. Offsets are cumulative: if a parent array has offset 10 and a child array has offset 5, then the memory location examined for the 0th value will be the 15th slot.
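A small sketch of that cumulative-offset rule, using a one-field struct (the field name is made up):

```python
import pyarrow as pa

backing = pa.array(range(100))
child = backing.slice(5)      # child ArrayData carries offset 5
struct = pa.StructArray.from_arrays([child], names=["f"])
view = struct.slice(10)       # parent ArrayData carries offset 10 on top

# field() applies the parent's offset to the child's own offset, so the
# 0th visible value lives at physical slot 15 of the backing buffer.
print(view.field("f")[0])     # 15
```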

> EDIT: Actually I think I can make it work by overriding ToArray(length) in the PyConverters and doing some logic to check if shrinking is necessary. It still would be essentially a custom Slice implementation. But, it at least constrains the scope to just the python -> arrow conversions. I'll work on a PR soon.

I'm not very familiar with the Python converters. However, I think we have a good number of regression tests. If you can get them to pass while also handling this case, then you are probably on the right track.

pitrou added a commit that referenced this issue Oct 10, 2023
…37376)

### Rationale for this change

See: #32439 

### What changes are included in this PR?

During conversion from Python to Arrow, when a struct's child hits a capacity error and chunking is triggered, this can leave the Finish'd chunk in an invalid state since the struct's length does not match the length of its children.

This change simply tries to Append the children first, and only if that succeeds does it Append the struct. This is safe because the order of Append'ing between the struct and its children is not specified; it is only specified that they must be consistent with each other.

This is per: 

https://github.com/apache/arrow/blob/86b7a84c9317fa08222eb63f6930bbb54c2e6d0b/cpp/src/arrow/array/builder_nested.h#L507-L508

### Are these changes tested?

A unit test is added that previously failed with an invalid data error.

```
>       tab = pa.Table.from_pandas(df)

pyarrow/tests/test_pandas.py:4970: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pyarrow/table.pxi:3788: in pyarrow.lib.Table.from_pandas
    return cls.from_arrays(arrays, schema=schema)
pyarrow/table.pxi:3890: in pyarrow.lib.Table.from_arrays
    result.validate()
pyarrow/table.pxi:3170: in pyarrow.lib.Table.validate
    check_status(self.table.Validate())

# ...

FAILED pyarrow/tests/test_pandas.py::test_nested_chunking_valid - pyarrow.lib.ArrowInvalid: Column 0: In chunk 0: Invalid: List child array invalid: Invalid: Struct child array #0 has length smaller than expected for struct array (2 < 3)
```

NOTE: This unit test uses about 7 GB of memory (max RSS) on my MacBook Pro. This might make CI challenging; I'm open to suggestions to limit it.

### Are there any user-facing changes?

No
* Closes: #32439

Lead-authored-by: Mike Lui <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
@pitrou pitrou added this to the 14.0.0 milestone Oct 10, 2023
JerAguilon pushed a commit to JerAguilon/arrow that referenced this issue Oct 23, 2023
loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024