
Debugging obscure MapBuilder error in builder_nested.cc #44640

Open
snakingfire opened this issue Nov 4, 2024 · 0 comments
Describe the usage question you have. Please include as many useful details as possible.

I'm asking this as a usage question rather than filing a bug report because it is more likely a usage issue on my end than a library problem, but I don't know for sure.

I'm using Arrow indirectly as a dependency of pyarrow, which in turn is used by the AWS Wrangler SDK (the library my application interfaces with directly).

I have run into an error condition when converting a Pandas DataFrame to an Arrow table to be written as Parquet to S3. The specific error comes from builder_nested.cc.

When the error occurs, I can see the following log printed, immediately followed by a SIGABRT and process crash:

/arrow/cpp/src/arrow/array/builder_nested.cc:103:  Check failed: (item_builder_->length()) == (key_builder_->length()) keys and items builders don't have the same size in MapBuilder
Aborted (core dumped)
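
For context, here is a rough sketch of the kind of conversion I believe is being exercised. The column name, schema, and direct pyarrow calls below are made up for illustration; in my application the conversion is triggered indirectly through AWS Wrangler:

```python
import pandas as pd
import pyarrow as pa

# Hypothetical stand-in for the real data: each row carries a
# string -> string mapping, stored as a list of (key, value) tuples
# so pyarrow can build a map<string, string> column from it.
df = pd.DataFrame({
    "attributes": [
        [("color", "red"), ("size", "L")],
        [("color", "blue")],
        None,  # some rows have no mapping at all
    ]
})

schema = pa.schema([("attributes", pa.map_(pa.string(), pa.string()))])

# As far as I can tell, this conversion step is what exercises MapBuilder
# on the C++ side before the table is written out as Parquet.
table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)
print(table.column("attributes"))
```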

Given that this code is many layers of abstraction away from my application code, I am having a very hard time tracking down the source of the issue.

What I know / have been able to track down so far:

  1. The issue has something to do with a specific string<>string Map column. It is likely due to the values in that column, since the error does not happen when the column is excluded or when its contents are replaced with dummy values.

  2. The issue requires a certain number of rows in the dataset. When I manually partition my input data into small chunks and serialize each partition individually, the error does not occur and every partition serializes successfully. The error only shows up when serializing a sufficiently large number of records (in my case, ~8M rows). (See the bisection sketch after this list.)
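
Since the failed check ends in a SIGABRT, each conversion attempt has to run in a child process so the parent survives and can keep narrowing the range. A sketch of one way to bisect toward the offending rows (function names are mine; run it under an `if __name__ == "__main__":` guard if the spawn start method is in use):

```python
import multiprocessing as mp

import pandas as pd
import pyarrow as pa

def _try_convert(df: pd.DataFrame) -> None:
    # Runs in a child process; if the MapBuilder check fires, the child
    # aborts with SIGABRT and the parent only sees a non-zero exit code.
    pa.Table.from_pandas(df)

def converts_ok(df: pd.DataFrame) -> bool:
    proc = mp.Process(target=_try_convert, args=(df,))
    proc.start()
    proc.join()
    return proc.exitcode == 0

def bisect_failure(df: pd.DataFrame, min_rows: int = 10_000) -> pd.DataFrame:
    """Repeatedly halve the frame, descending into a half that still fails."""
    while len(df) > min_rows:
        mid = len(df) // 2
        lower, upper = df.iloc[:mid], df.iloc[mid:]
        if not converts_ok(lower):
            df = lower
        elif not converts_ok(upper):
            df = upper
        else:
            # Neither half fails on its own, which matches the observation
            # that small partitions serialize fine; stop shrinking here.
            break
    return df
```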

I have been working to narrow down a minimal reproduction, but it has been slow going. In the meantime, I would like to ask for help identifying steps I could take to narrow down the cause. As a first step, it would be helpful to understand what scenario this protection check is intended to guard against, so I can see whether anything I am doing with the data I am trying to serialize is likely to be tripping it.
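
From the error message alone, my reading is that MapBuilder keeps separate child builders for map keys and map items, and every appended map entry is supposed to add exactly one element to each, so the check fires when the two child builders drift out of sync. A minimal sketch of the same invariant at the Python level (this is a reconstruction from the message, not from reading the Arrow sources, and it assumes MapArray.from_arrays validates the key/item lengths up front):

```python
import pyarrow as pa

# Two maps with entry boundaries at offsets [0, 2, 3]:
# the first map should hold two entries, the second one entry.
offsets = pa.array([0, 2, 3], type=pa.int32())
keys = pa.array(["a", "b", "c"], type=pa.string())
items = pa.array(["1", "2"], type=pa.string())  # one item short of the keys

# With keys and items of different lengths, the construction should be
# rejected up front instead of aborting the process.
try:
    pa.MapArray.from_arrays(offsets, keys, items)
except pa.ArrowInvalid as exc:
    print("rejected:", exc)
```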

Unfortunately, it seems that hitting this check is a very uncommon occurrence, so there are next to no existing reports or discussions online about the conditions under which it fails.

Any help is much appreciated.

Component(s)

C++, Parquet, Python
