
Debugging obscure MapBuilder error in builder_nested.cc #44640

Open
snakingfire opened this issue Nov 4, 2024 · 0 comments
Describe the usage question you have. Please include as many useful details as possible.

I'm asking this as a usage question rather than filing a bug report because it is more likely a usage issue on my end than a library problem, but I don't know for sure.

I'm using Arrow indirectly as a dependency of pyarrow, which in turn is used by the AWS Wrangler SDK (the library my application interfaces with directly).

I have run into an error condition when converting a Pandas DataFrame to an Arrow table to be written as Parquet to S3. The specific error comes from builder_nested.cc.

When the error occurs, I can see the following log printed, immediately followed by a SIGABRT and process crash:

/arrow/cpp/src/arrow/array/builder_nested.cc:103:  Check failed: (item_builder_->length()) == (key_builder_->length()) keys and items builders don't have the same size in MapBuilder
Aborted (core dumped)
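
For context, here is a rough sketch of the kind of conversion I believe is being exercised. The column name, schema, and direct pyarrow calls below are made up for illustration; in my application the conversion is triggered indirectly through AWS Wrangler:

```python
import pandas as pd
import pyarrow as pa

# Hypothetical stand-in for the real data: each row carries a
# string -> string mapping, stored as a list of (key, value) tuples
# so pyarrow can build a map<string, string> column from it.
df = pd.DataFrame({
    "attributes": [
        [("color", "red"), ("size", "L")],
        [("color", "blue")],
        None,  # some rows have no mapping at all
    ]
})

schema = pa.schema([("attributes", pa.map_(pa.string(), pa.string()))])

# As far as I can tell, this conversion step is what exercises MapBuilder
# on the C++ side before the table is written out as Parquet.
table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)
print(table.column("attributes"))
```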

Given that this code is many layers of abstraction away from my application code, I am having a very hard time tracking down the source of the issue.

What I know / have been able to track down so far:

  1. The issue has something to do with a specific string<>string Map column. It is likely due to the values in that column, since the error does not happen when the column is excluded or when its contents are replaced with dummy values.

  2. The issue requires a certain number of rows in the dataset. When I manually partition my input data into small chunks and serialize each partition individually, the error does not occur and every partition serializes successfully. The error only shows up when serializing a sufficiently large number of records (in my case, ~8M rows). (See the bisection sketch after this list.)
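
Since the failed check ends in a SIGABRT, each conversion attempt has to run in a child process so the parent survives and can keep narrowing the range. A sketch of one way to bisect toward the offending rows (function names are mine; run it under an `if __name__ == "__main__":` guard if the spawn start method is in use):

```python
import multiprocessing as mp

import pandas as pd
import pyarrow as pa

def _try_convert(df: pd.DataFrame) -> None:
    # Runs in a child process; if the MapBuilder check fires, the child
    # aborts with SIGABRT and the parent only sees a non-zero exit code.
    pa.Table.from_pandas(df)

def converts_ok(df: pd.DataFrame) -> bool:
    proc = mp.Process(target=_try_convert, args=(df,))
    proc.start()
    proc.join()
    return proc.exitcode == 0

def bisect_failure(df: pd.DataFrame, min_rows: int = 10_000) -> pd.DataFrame:
    """Repeatedly halve the frame, descending into a half that still fails."""
    while len(df) > min_rows:
        mid = len(df) // 2
        lower, upper = df.iloc[:mid], df.iloc[mid:]
        if not converts_ok(lower):
            df = lower
        elif not converts_ok(upper):
            df = upper
        else:
            # Neither half fails on its own, which matches the observation
            # that small partitions serialize fine; stop shrinking here.
            break
    return df
```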

I have been working to narrow down a minimal reproduction, but it has been slow going. In the meantime, I would like to ask for help identifying steps I could take to narrow down the cause. As a first step, it would be helpful to understand what scenario this protection check is intended to guard against, so I can see whether anything I am doing with the data I am trying to serialize is likely to be tripping it.
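
From the error message alone, my reading is that MapBuilder keeps separate child builders for map keys and map items, and every appended map entry is supposed to add exactly one element to each, so the check fires when the two child builders drift out of sync. A minimal sketch of the same invariant at the Python level (this is a reconstruction from the message, not from reading the Arrow sources, and it assumes MapArray.from_arrays validates the key/item lengths up front):

```python
import pyarrow as pa

# Two maps with entry boundaries at offsets [0, 2, 3]:
# the first map should hold two entries, the second one entry.
offsets = pa.array([0, 2, 3], type=pa.int32())
keys = pa.array(["a", "b", "c"], type=pa.string())
items = pa.array(["1", "2"], type=pa.string())  # one item short of the keys

# With keys and items of different lengths, the construction should be
# rejected up front instead of aborting the process.
try:
    pa.MapArray.from_arrays(offsets, keys, items)
except pa.ArrowInvalid as exc:
    print("rejected:", exc)
```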

Unfortunately, it seems that hitting this check is a very uncommon occurrence, so there are next to no existing reports or discussions online about the conditions under which it fails.

Any help is much appreciated.

Component(s)

C++, Parquet, Python
