[Python][C++] Pyarrow Parquet reader overflows INT96 timestamps when converting to Arrow Array (timestamp[ns]) #27920
Comments
Antoine Pitrou / @pitrou: |
Karik Isichei / @isichei: Happy to give it a go myself. Is there a prescribed process for tackling this? Would a good start be to look through a similar PR (like this one: https://github.com/apache/arrow/pull/4597/files) to get an idea of how a new option for the Parquet reader is exposed in pyarrow? |
Antoine Pitrou / @pitrou:
|
Karik Isichei / @isichei: |
Micah Kornfield / @emkornfield: |
Karik Isichei / @isichei: Have had some time to look at this / get my head around the CPP codebase. What makes sense to me following @pitrou's advice (exposing the
Some Questions on the above:
|
Karik Isichei / @isichei:
Let me know if there are any problems or possible improvements. I considered doing C++ and Python together, but thought it better to solve the C++ functionality first and then open a second PR to expose it in Python. |
Antoine Pitrou / @pitrou: |
Antoine Pitrou / @pitrou: |
When reading Parquet data with timestamps stored as INT96, pyarrow assumes the timestamp unit is nanoseconds. Converting to an Arrow table then overflows if the Parquet column holds values outside the range representable as timestamp[ns] (64-bit nanoseconds since the Unix epoch, roughly the years 1677 to 2262).
The above example just tries to demonstrate the bug by getting pyarrow to write a Parquet file in a state similar to the original file where the bug was discovered. The bug was originally found when reading Parquet output from Amazon Athena with pyarrow, where we can't control the format of the Parquet files. Context.
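The arithmetic behind the overflow can be sketched without pyarrow. The snippet below is only an illustration, not the reader's actual code: it assumes the usual INT96 layout (8 bytes of nanoseconds within the day followed by a 4-byte Julian day number, both little-endian), and the helper names (`encode_int96`, `int96_to_epoch_nanos`, `wrap_to_int64`, `to_datetime`) are made up for this example. It emulates what happens when the decoded value is forced into a signed 64-bit nanosecond count.

```python
import struct
from datetime import date, datetime, timedelta, timezone

JULIAN_DAY_OF_UNIX_EPOCH = 2_440_588      # Julian day number of 1970-01-01
NANOS_PER_DAY = 86_400 * 10**9
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def encode_int96(day: date, nanos_of_day: int = 0) -> bytes:
    """Pack a date into 12 bytes: int64 nanos-of-day, then int32 Julian day."""
    julian_day = (day - date(1970, 1, 1)).days + JULIAN_DAY_OF_UNIX_EPOCH
    return struct.pack("<qi", nanos_of_day, julian_day)


def int96_to_epoch_nanos(raw: bytes) -> int:
    """Exact nanoseconds since the Unix epoch (Python ints never overflow)."""
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    return (julian_day - JULIAN_DAY_OF_UNIX_EPOCH) * NANOS_PER_DAY + nanos_of_day


def wrap_to_int64(value: int) -> int:
    """Emulate two's-complement wraparound of a signed 64-bit integer."""
    return (value + 2**63) % 2**64 - 2**63


def to_datetime(epoch_nanos: int) -> datetime:
    """Convert epoch nanoseconds to a datetime, truncating to microseconds."""
    return EPOCH + timedelta(microseconds=epoch_nanos // 1000)


# The range that timestamp[ns] can represent.
print("min:", to_datetime(INT64_MIN))
print("max:", to_datetime(INT64_MAX))

# A value that is valid as INT96 but out of bounds for nanoseconds.
raw = encode_int96(date(9999, 12, 31))
exact = int96_to_epoch_nanos(raw)
wrapped = wrap_to_int64(exact)
print("fits in int64:", INT64_MIN <= exact <= INT64_MAX)
print("after wraparound:", to_datetime(wrapped))
```

Values inside the representable range round-trip unchanged, so the problem only shows up for far-past or far-future dates, which is why it can go unnoticed until a sentinel value such as 9999-12-31 appears in the data.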
I found some existing issues that might also be related:
Environment: macos mojave 10.14.6
Python 3.8.3
pyarrow 3.0.0
pandas 1.2.3
Reporter: Karik Isichei / @isichei
Assignee: Karik Isichei / @isichei
Related issues:
PRs and other links:
Note: This issue was originally created as ARROW-12096. Please see the migration documentation for further details.