Not able to read nano-second timestamp columns in 1.0 parquet files written by pyarrow #455
Comments
I'd have expected the ns timestamp to be written to @emkornfield do you know what we should do on the Rust side to roundtrip the file correctly?
pyarrow parquet writer has an option that one can set to write ns timestamp as
Sorry, I would have to dig in the code to have a better understanding. In general we try to avoid int96 because it is a deprecated type. If this is an issue with pyarrow can we open a separate bug to track that?
This may be related to which schema we give priority to when converting: the parquet schema or the arrow schema. I would expect pyarrow to write them in a consistent manner though, so, as @nevi-me mentioned, an arrow schema in ns with a parquet schema in ms does seem odd.
https://github.com/apache/arrow/blob/8e43f23dcc6a9e630516228f110c48b64d13cec6/cpp/src/parquet/arrow/schema.cc#L215 is probably the relevant writing code for 1.0, so that might explain the discrepancy. Let me see if I can dig up the reader code.
Here is where C++ attempts to reuse the original metadata. It doesn't look like we would use the arrow schema in this case. I'm guessing conversion back to the pandas type happens correctly because microseconds would be cast back at the arrow-to-pandas layer.
So I think there is probably a bug in pyarrow here, since we wouldn't round trip a pure timestamp[ns] type, but we don't run into it when coming from pandas because the casting just works.
Describe the bug
Here is a pandas dataframe with a nanosecond timestamp Date index:
When this dataframe is stored in the parquet 1.0 format, pyarrow writes the Date column in microsecond units. pyarrow is able to load the Date column with microsecond precision as well:
But when the file is loaded using the arrow parquet crate, the column is incorrectly loaded as a nanosecond timestamp type.
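To illustrate why the unit label matters, here is a minimal standard-library sketch. The `decode` helper and the sample date are hypothetical and are not taken from the attached file or from either library. It only shows what would happen if an integer stored as microseconds since the epoch were labeled as nanoseconds without rescaling; whether the crate rescales the values is not established in this report.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def decode(raw: int, unit: str) -> datetime:
    """Interpret a raw integer count since the Unix epoch in the given unit.

    Hypothetical helper for illustration only; datetime has microsecond
    resolution, so nanosecond input is truncated to microseconds.
    """
    if unit == "ns":
        return EPOCH + timedelta(microseconds=raw // 1000)
    if unit == "us":
        return EPOCH + timedelta(microseconds=raw)
    if unit == "ms":
        return EPOCH + timedelta(microseconds=raw * 1000)
    raise ValueError(f"unsupported unit: {unit}")


# Assumed sample value: a timestamp as it would be stored in microseconds.
original = datetime(2021, 6, 4, tzinfo=timezone.utc)
stored_us = (original - EPOCH) // timedelta(microseconds=1)

as_us = decode(stored_us, "us")  # unit matches the stored values
as_ns = decode(stored_us, "ns")  # same integers, wrong unit label

print(as_us.isoformat())
print(as_ns.isoformat())
```

With the matching unit the original instant is recovered; with the nanosecond label the same integer is read as a value 1000 times smaller and lands in January 1970.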
To Reproduce
Here is a sample file to reproduce the issue: https://github.com/roapi/roapi/files/6599704/msft.parquet.zip.
The file can be reproduced with the following Python code:
Expected behavior
The Date column should be loaded with microsecond precision.
Additional context
Arrow parquet crate handles parquet 2.0 files without any issue.
Initially reported in roapi/roapi#42.
Here is the decoded IPC field from the 'ARROW:schema' metadata for the Date column in the arrow crate: