GH-32439: [Python] Fix off by one bug when chunking nested structs #37376

Merged
2 commits merged on Oct 10, 2023


mikelui
Contributor

@mikelui mikelui commented Aug 25, 2023

Rationale for this change

See: #32439

What changes are included in this PR?

During conversion from Python to Arrow, when a struct's child hits a capacity error and chunking is triggered, this can leave the Finish'd chunk in an invalid state since the struct's length does not match the length of its children.

This change simply tries to Append the children first, and only if successful will Append the struct. This is safe because the order of Append'ing between the struct and its child is not specified. It is only specified that they must be consistent with each other.

This is per:

/// Append an element to the Struct. All child-builders' Append method must
/// be called independently to maintain data-structure consistency.
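
The effect of the append order can be sketched with a toy model. This is not Arrow's builder API: `CapacityError`, `ToyChildBuilder`, `ToyStructBuilder`, and the 10-byte capacity are all hypothetical names and values chosen for illustration. The sketch only assumes that the child can fail with a capacity error and that the struct keeps its own length counter.

```python
class CapacityError(Exception):
    """Toy stand-in for a capacity error (hypothetical, not Arrow's class)."""


class ToyChildBuilder:
    """Child builder with a fixed byte budget (hypothetical)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0
        self.length = 0

    def append(self, value):
        if self.used + len(value) > self.capacity:
            raise CapacityError("child builder is full")
        self.used += len(value)
        self.length += 1


class ToyStructBuilder:
    """Struct builder that tracks its own length separately from its child."""

    def __init__(self, child):
        self.child = child
        self.length = 0

    def append_struct_first(self, value):
        # Old order: the struct slot is counted before the child is tried.
        self.length += 1
        self.child.append(value)

    def append_children_first(self, value):
        # New order: the struct slot is counted only if the child succeeded.
        self.child.append(value)
        self.length += 1


def fill(method_name):
    """Append until the child overflows; return (struct length, child length)."""
    builder = ToyStructBuilder(ToyChildBuilder(capacity=10))
    append = getattr(builder, method_name)
    try:
        for value in ["aaaa", "bbbb", "cccc"]:
            append(value)
    except CapacityError:
        pass  # a real converter would finish the current chunk here
    return builder.length, builder.child.length


old = fill("append_struct_first")
new = fill("append_children_first")
print(old)  # (3, 2): struct is one longer than its child
print(new)  # (2, 2): lengths stay consistent
```

In the toy model the old order ends with a struct length of 3 and a child length of 2, the same kind of mismatch the validation error reports.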

Are these changes tested?

A unit test is added that previously failed with an invalid data error:

>       tab = pa.Table.from_pandas(df)

pyarrow/tests/test_pandas.py:4970: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pyarrow/table.pxi:3788: in pyarrow.lib.Table.from_pandas
    return cls.from_arrays(arrays, schema=schema)
pyarrow/table.pxi:3890: in pyarrow.lib.Table.from_arrays
    result.validate()
pyarrow/table.pxi:3170: in pyarrow.lib.Table.validate
    check_status(self.table.Validate())

# ...

FAILED pyarrow/tests/test_pandas.py::test_nested_chunking_valid - pyarrow.lib.ArrowInvalid: Column 0: In chunk 0: Invalid: List child array invalid: Invalid: Struct child array #0 has length smaller than expected for struct array (2 < 3)

NOTE: This unit test uses about 7 GB of memory (max RSS) on my MacBook Pro. This might make CI challenging; I'm open to suggestions to limit it.

Are there any user-facing changes?

No

@github-actions

⚠️ GitHub issue #32439 has been automatically assigned in GitHub to PR creator.

@mikelui
Contributor Author

mikelui commented Aug 25, 2023

Failures are due to high memory from the unit test 😮‍💨

@mikelui
Contributor Author

mikelui commented Aug 25, 2023

Lowered the size of the binary array to be within mem limits (barely scraping by 🥹)

720,000,000 * 3 = 2,160,000,000, which triggers chunking over 2,147,483,647
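
The arithmetic can be checked directly. This is a minimal sketch, assuming the limit comes from the signed 32-bit offsets used by non-large binary and string arrays; the variable names are illustrative only.

```python
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit offset can hold

value_size = 720_000_000  # bytes per binary value, from the comment above
num_values = 3

total_bytes = value_size * num_values
values_per_chunk = INT32_MAX // value_size  # whole values that fit under the limit

print(f"{total_bytes:,}")       # 2,160,000,000
print(total_bytes > INT32_MAX)  # True
print(values_per_chunk)         # 2
```

Two values fit in one chunk and the third pushes the total past the limit, which is what forces a second chunk.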


edit: aaarg Python 3.8 w/ Pandas 1.0 works, while Python 3.10 with Pandas latest hits mem limits.

This test is specifically for cases with high memory and chunking requirements. I'm open to other ideas for testing

@mikelui
Contributor Author

mikelui commented Aug 27, 2023

I removed the additional test, which was blocking CI because of its intensive memory requirements.

For posterity, I left the test in as the initial commit (with the removal being a successive commit). Folks can confirm the behavior fix independently.

I don't think we should block fixing a critical bug on the memory intensive test.


EDIT: Realized I can mark tests as large memory.
Ran all the large memory tests locally and found that I needed to set PyStructConverter to not rewind on a capacity error, since now we wait for confirmation that child builders succeeded before appending there.

The test that exposed this is:

@pytest.mark.large_memory
@pytest.mark.parametrize(('ty', 'char'), [
    (pa.string(), 'x'),
    (pa.binary(), b'x'),
])
def test_nested_auto_chunking(ty, char):
    v1 = char * 100000000
    v2 = char * 147483646
    struct_type = pa.struct([
        pa.field('bool', pa.bool_()),
        pa.field('integer', pa.int64()),
        pa.field('string-like', ty),
    ])
    data = [{'bool': True, 'integer': 1, 'string-like': v1}] * 20
    data.append({'bool': True, 'integer': 1, 'string-like': v2})
    arr = pa.array(data, type=struct_type)
    assert isinstance(arr, pa.Array)
    data.append({'bool': True, 'integer': 1, 'string-like': char})
    arr = pa.array(data, type=struct_type)
    assert isinstance(arr, pa.ChunkedArray)
    assert arr.num_chunks == 2
    assert len(arr.chunk(0)) == 21
    assert len(arr.chunk(1)) == 1
    assert arr.chunk(1)[0].as_py() == {
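
The point about not rewinding on a capacity error can be illustrated with a toy model. This is a sketch under assumptions, not Arrow's actual chunking code: `ToyStructBuilder`, `convert`, and the `rewind_on_overflow` parameter are hypothetical names, and the capacity is a made-up byte budget. It assumes that "rewinding" means dropping the last entry of the chunk before finishing it, which only makes sense if that entry was half-appended.

```python
class CapacityError(Exception):
    """Toy stand-in for a capacity error (hypothetical, not Arrow's class)."""


class ToyStructBuilder:
    """Children-first struct builder with a byte budget (hypothetical)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.child = []   # values held by the single child
        self.length = 0   # number of struct slots

    def append(self, value):
        used = sum(len(v) for v in self.child)
        if used + len(value) > self.capacity:
            # The child fails before the struct slot is counted.
            raise CapacityError("child builder is full")
        self.child.append(value)
        self.length += 1

    def finish(self):
        # A finished chunk is valid only if struct and child lengths agree.
        assert self.length == len(self.child)
        chunk = list(self.child)
        self.child.clear()
        self.length = 0
        return chunk


def convert(values, capacity, rewind_on_overflow):
    """Chunk values, starting a new chunk whenever the builder overflows."""
    builder = ToyStructBuilder(capacity)
    chunks = []
    for value in values:
        try:
            builder.append(value)
        except CapacityError:
            if rewind_on_overflow:
                # Drops the last entry, assuming it was half-appended.
                builder.length -= 1
                builder.child.pop()
            chunks.append(builder.finish())
            builder.append(value)  # retry in the fresh chunk
    chunks.append(builder.finish())
    return chunks


values = ["aaaa", "bbbb", "cccc"]
without_rewind = convert(values, capacity=10, rewind_on_overflow=False)
with_rewind = convert(values, capacity=10, rewind_on_overflow=True)
print(without_rewind)  # [['aaaa', 'bbbb'], ['cccc']]
print(with_rewind)     # [['aaaa'], ['cccc']]
```

With children appended first, nothing is half-appended when the overflow happens, so rewinding in this toy model silently discards a fully appended value (`'bbbb'`). The sketch does not handle a single value larger than the whole capacity.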

@mikelui mikelui changed the title from "GH-32439: [Python] Change order of Append during PyStructConverter" to "GH-32439: [Python] Fix off by one when chunking nested structs" on Aug 29, 2023
@mikelui mikelui changed the title from "GH-32439: [Python] Fix off by one when chunking nested structs" to "GH-32439: [Python] Fix off by one bug when chunking nested structs" on Aug 29, 2023
@c0g

c0g commented Sep 10, 2023

I’m hitting this error too, any chance this PR could be merged?

@mikelui
Contributor Author

mikelui commented Sep 11, 2023

@westonpace since you seem to be the point of contact on the GH issue, can you take a look or direct us to the right person to review this? There seems to be a bit of a backlog in reviewing [Python] PRs, anyway.

@mikelui
Contributor Author

mikelui commented Oct 3, 2023

@AlenkaF @wjones127 There is a backlog of PRs; is there any chance of getting this reviewed?

@AlenkaF
Member

AlenkaF commented Oct 4, 2023

Sorry for the slow response from us, @mikelui, and thank you for keeping it going!
I plan to look at the PR today.

PS: The mark for the tests with large memory is the correct way to go 👍

Member

@AlenkaF AlenkaF left a comment


LGTM +1

The solution is elegant IMO. The added tests also give good coverage. I would only like another sanity check of the C++ change from somebody else, and then I am happy to merge!

@github-actions github-actions bot added the "awaiting committer review" label and removed the "awaiting review" label on Oct 4, 2023
@mikelui
Contributor Author

mikelui commented Oct 10, 2023

cc @westonpace @wjones127 @mapleFU @pitrou can someone take a look here? 🥹

Mike Lui and others added 2 commits October 10, 2023 18:23
During conversion from Python to Arrow, when a struct's child
hits a capacity error and chunking is triggered, this can leave
the Finish'd chunk in an invalid state since the struct's length
does not match the length of its children.

This change simply tries to Append the children first, and only
if successful will Append the struct. This is safe because the
order of Append'ing between the struct and its child is not
specified. It is only specified that they must be consistent
with each other.
Member

@pitrou pitrou left a comment


I pushed small test improvements and rebased from latest git main. Thanks for this @mikelui !

@pitrou pitrou merged commit 8cdce28 into apache:main Oct 10, 2023
12 checks passed
@pitrou pitrou removed the "awaiting committer review" label on Oct 10, 2023
@conbench-apache-arrow

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 8cdce28.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 2 possible false positives for unstable benchmarks that are known to sometimes produce them.

JerAguilon pushed a commit to JerAguilon/arrow that referenced this pull request Oct 23, 2023
…cts (apache#37376)

### Rationale for this change

See: apache#32439 

### What changes are included in this PR?

During conversion from Python to Arrow, when a struct's child hits a capacity error and chunking is triggered, this can leave the Finish'd chunk in an invalid state since the struct's length does not match the length of its children.

This change simply tries to Append the children first, and only if successful will Append the struct. This is safe because the order of Append'ing between the struct and its child is not specified. It is only specified that they must be consistent with each other.

This is per: 

https://github.com/apache/arrow/blob/86b7a84c9317fa08222eb63f6930bbb54c2e6d0b/cpp/src/arrow/array/builder_nested.h#L507-L508

### Are these changes tested?

A unit test is added that would previously have an invalid data error.

```
>       tab = pa.Table.from_pandas(df)

pyarrow/tests/test_pandas.py:4970: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pyarrow/table.pxi:3788: in pyarrow.lib.Table.from_pandas
    return cls.from_arrays(arrays, schema=schema)
pyarrow/table.pxi:3890: in pyarrow.lib.Table.from_arrays
    result.validate()
pyarrow/table.pxi:3170: in pyarrow.lib.Table.validate
    check_status(self.table.Validate())

# ...

FAILED pyarrow/tests/test_pandas.py::test_nested_chunking_valid - pyarrow.lib.ArrowInvalid: Column 0: In chunk 0: Invalid: List child array invalid: Invalid: Struct child array #0 has length smaller than expected for struct array (2 < 3)
```

NOTE: This unit test uses about 7GB of memory (max RSS) on my macbook pro. This might make CI challenging; I'm open to suggestions to limit it.

### Are there any user-facing changes?

No
* Closes: apache#32439

Lead-authored-by: Mike Lui <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
@mikelui mikelui deleted the fix-GH-32439 branch October 23, 2023 18:31
loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
…cts (apache#37376)

@anjakefala anjakefala added the "Critical Fix" label (Bugfixes for security vulnerabilities, crashes, or invalid data.) on Nov 14, 2023
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…cts (apache#37376)

Labels
Component: Python, Critical Fix (Bugfixes for security vulnerabilities, crashes, or invalid data.)
Projects
None yet
5 participants