Describe the usage question you have. Please include as many useful details as possible.
Asking this as a usage question instead of a bug report because it is more likely this is a usage issue than a library problem, but I don't know for sure.
I'm using Arrow indirectly: it's a dependency of pyarrow, which in turn is used by the AWS Wrangler SDK (the library my application interfaces with directly).
I have run into an error condition when attempting to convert a pandas DataFrame to an Arrow table to be written as Parquet to S3. The specific error comes from builder_nested.cc.
When the error occurs, I can see the following log printed, immediately followed by a SIGABRT and process crash:
/arrow/cpp/src/arrow/array/builder_nested.cc:103: Check failed: (item_builder_->length()) == (key_builder_->length()) keys and items builders don't have the same size in MapBuilder
Aborted (core dumped)
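For context, the conversion that fails is, as far as I can tell, roughly equivalent to the following standalone snippet. The column names and values here are placeholders; the real table has many more columns and ~8M rows.

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical stand-in for the real dataset: one string<>string map column
# ("attributes" is a made-up name) alongside an ordinary scalar column.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "attributes": [
        [("color", "red"), ("size", "L")],
        [("color", "blue")],
        None,
    ],
})

schema = pa.schema([
    ("id", pa.int64()),
    ("attributes", pa.map_(pa.string(), pa.string())),
])

# awswrangler ultimately performs a conversion along these lines before
# uploading to S3; the MapBuilder in builder_nested.cc is exercised while
# the map column is built.
table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)
pq.write_table(table, "sample.parquet")
```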
Given this code is many layers of abstraction away from my application code, I am having a very hard time tracking down the source of the issue.
What I know / have been able to track down so far:
The issue has something to do with a specific string<>string map column. It is likely caused by the values in that column, since the error doesn't happen when the column is excluded or when its contents are set to dummy values.
The issue only appears once the dataset reaches a certain size. When I manually partition my input data into small chunks and serialize each partition separately, every partition serializes successfully; the error only shows up when serializing a sufficiently large number of records at once (in my case, ~8M rows). A sketch of that chunked serialization is included below.
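This is roughly what I did to confirm that every individual partition serializes cleanly on its own. The chunk size and output paths are arbitrary, and a failing chunk would still SIGABRT the whole process, since the failed check aborts rather than raising a Python exception.

```python
import pyarrow as pa
import pyarrow.parquet as pq

def serialize_in_chunks(df, schema, chunk_size=500_000):
    # Convert and write each slice of the DataFrame separately instead of
    # serializing all ~8M rows in one call.
    for start in range(0, len(df), chunk_size):
        chunk = df.iloc[start:start + chunk_size]
        table = pa.Table.from_pandas(chunk, schema=schema, preserve_index=False)
        pq.write_table(table, f"chunk_{start}.parquet")
```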
I have been working to narrow down a minimal reproduction, but it has been slow going. In the meantime, I would like to ask for help with any steps I could take to narrow down the potential causes of the issue. As a first step, it would be helpful to understand what scenario this protection check is intended to guard against, so I can see whether anything I am doing with the data I am trying to serialize is likely to be tripping it.
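For reference, the most stripped-down conversion I can try without awswrangler in the loop is something like the following (again with placeholder values; in practice I would feed in the real column, e.g. df["attributes"].tolist()). If this also aborts on the real data, it would at least rule out the wrapper layers.

```python
import pyarrow as pa

# Isolate just the suspect map column and build the Arrow array directly
# with pyarrow, taking awswrangler and the rest of the schema out of the
# picture. Each row is a list of (key, value) pairs, or None for a null map.
rows = [
    [("color", "red"), ("size", "L")],
    [("color", "blue")],
    None,
]
arr = pa.array(rows, type=pa.map_(pa.string(), pa.string()))
print(arr.type, len(arr))
```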
Unfortunately, it seems that hitting this check condition is very uncommon, so there are next to no existing reports or discussions online about the conditions under which the Arrow check fails.
Any help is much appreciated.
Component(s)
C++, Parquet, Python