
[Python][C++] Pyarrow Parquet reader overflows INT96 timestamps when converting to Arrow Array (timestamp[ns]) #27920

Closed
asfimport opened this issue Mar 26, 2021 · 9 comments


asfimport commented Mar 26, 2021

When reading Parquet data with timestamps stored as INT96, pyarrow assumes the timestamp type should be nanoseconds; when the data is converted into an Arrow table, any stored values that are out of bounds for nanosecond precision overflow.

# Round Trip Example
import datetime
import pandas as pd
import pyarrow as pa
from pyarrow import parquet as pq

df = pd.DataFrame({"a": [datetime.datetime(1000,1,1), datetime.datetime(2000,1,1), datetime.datetime(3000,1,1)]})
a_df = pa.Table.from_pandas(df)
a_df.schema # a: timestamp[us] 

pq.write_table(a_df, "test_round_trip.parquet", use_deprecated_int96_timestamps=True, version="1.0")
pfile = pq.ParquetFile("test_round_trip.parquet")
pfile.schema_arrow # a: timestamp[ns]
pq.read_table("test_round_trip.parquet").to_pandas()
# Results in values:
# 2169-02-08 23:09:07.419103232
# 2000-01-01 00:00:00
# 1830-11-23 00:50:52.580896768
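
For clarity, the wrapped values above are plain int64 overflow: timestamp[ns] stores nanoseconds since the Unix epoch in a signed 64-bit integer, which only covers roughly 1677-09-21 to 2262-04-11. A minimal sketch (plain Python, pyarrow not required, ns_since_epoch is just a local helper):

# Sketch: the year-1000 and year-3000 values do not fit in int64 nanoseconds
import datetime

def ns_since_epoch(dt):
    return int((dt - datetime.datetime(1970, 1, 1)).total_seconds()) * 1_000_000_000

for year in (1000, 2000, 3000):
    ns = ns_since_epoch(datetime.datetime(year, 1, 1))
    print(year, ns, abs(ns) > 2**63 - 1)  # True means out of int64 range, so the value wraps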

The above example is just meant to demonstrate the bug by getting pyarrow to write a Parquet file in a state similar to the original file where the bug was discovered. The bug was originally found when trying to read Parquet output from Amazon Athena with pyarrow, where we can't control the output format of the Parquet files (context).

I found some existing issues that might also be related:

  • ARROW-10444
  • ARROW-6779 (this reports similar behaviour, although reproducing it on pyarrow v3 raises an out-of-bounds error)

Environment: macOS Mojave 10.14.6
Python 3.8.3
pyarrow 3.0.0
pandas 1.2.3
Reporter: Karik Isichei / @isichei
Assignee: Karik Isichei / @isichei

Note: This issue was originally created as ARROW-12096. Please see the migration documentation for further details.


Antoine Pitrou / @pitrou:
Hmm... ideally Athena would stop emitting the deprecated INT96 type.
In the meantime, solving this would require adding a new option to choose the mapped timestamp type.
cc @emkornfield


Karik Isichei / @isichei:
@pitrou yeah, definitely agree. But I am not sure when they are going to do that (I haven't seen or heard anything that suggests it will be soon, unfortunately).

Happy to give it a go myself. Is there a prescribed process for tackling this? Would a good start be to look through a similar PR (like this one: https://github.com/apache/arrow/pull/4597/files) to get an idea of how a new option for the Parquet reader would be exposed in pyarrow?


Antoine Pitrou / @pitrou:
@isichei You first need to solve the issue on the C++ side. There are probably several places to look into:

  • ArrowReaderProperties in cpp/src/parquet/properties.h: you'll need to add the new option here

  • GetArrowType in cpp/src/parquet/arrow/schema_internal.cc: you'll need to take the option into account here

  • TransferInt96 in cpp/src/parquet/arrow/reader_internal.cc: ditto

You'll also need to expand the existing tests in cpp/src/parquet/arrow/arrow_reader_writer_test.cc.

Once that is done you'll be able to expose the option in Python.
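
Once exposed, usage from Python might look something like the sketch below. The keyword name coerce_int96_timestamp_unit is only an assumption here for illustration (whatever name the option ends up with); the point is that the reader maps INT96 to the requested unit instead of always to ns.

# Hypothetical sketch of the Python-level option (parameter name assumed, not final)
import pyarrow.parquet as pq

table = pq.read_table("test_round_trip.parquet", coerce_int96_timestamp_unit="us")
print(table.schema)  # expected: a: timestamp[us], so the year-1000/3000 values survive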


Karik Isichei / @isichei:
Thanks @pitrou  for pointing me in the right direction, appreciate it!


Micah Kornfield / @emkornfield:
Agree, IIUC int96 should be a deprecated type at this point.


Karik Isichei / @isichei:
Hi all,

Have had some time to look at this and get my head around the C++ codebase. What makes sense to me, following @pitrou's advice (exposing an ArrowReaderProperties option and taking it into account in GetArrowType and TransferInt96), would be to do the following:

  1. Add an option along the lines of int96_timestamp_type_as = … (defaulting to NS).

  2. Change Int96GetNanoSeconds (in cpp/src/parquet/types.h) to something like Int96GetSeconds, with an additional parameter for the interval size (again defaulting to NS). If the interval is anything other than NS, users may get truncated timestamps / data loss when the nanosecond component of the INT96 timestamp is converted to the specified interval (a rough sketch of that conversion is included after the questions below).

Some Questions on the above:

  • Does this sound like an OK approach to take? I'm not sure whether renaming functions in your codebase goes against any contribution guidelines; as far as I can tell that function isn't used much, so a rename wouldn't be too much effort. Otherwise, would the preferred approach be to create a separate function for each INT96 -> timestamp interval?

  • Do I need to raise any warnings if timestamps are being truncated in the C++ code, or is it something that would just be expressed in the docs? I.e. if you specify your own timestamp unit for INT96, that is at your own risk.
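
For reference, a rough Python sketch (not the Arrow C++ code) of what a unit-parameterised INT96 conversion does, assuming the usual Parquet INT96 layout of 8 bytes of nanoseconds-within-day followed by a 4-byte Julian day number; the integer division at the end is where the truncation mentioned in point 2 happens.

# Sketch only: decode a 12-byte INT96 value into a timestamp of the chosen unit
import struct

JULIAN_EPOCH_DAY = 2440588                 # Julian day number of 1970-01-01
NS_PER_DAY = 86_400 * 1_000_000_000
DIVISORS = {"ns": 1, "us": 1_000, "ms": 1_000_000, "s": 1_000_000_000}

def int96_to_unit(raw12, unit="ns"):
    nanos_of_day, julian_day = struct.unpack("<qI", raw12)
    ns_since_epoch = (julian_day - JULIAN_EPOCH_DAY) * NS_PER_DAY + nanos_of_day
    return ns_since_epoch // DIVISORS[unit]  # sub-unit precision is dropped here

# 2000-01-01 00:00:00.000000001 UTC: Julian day 2451545, 1 ns into the day
raw = struct.pack("<qI", 1, 2451545)
print(int96_to_unit(raw, "ns"))  # 946684800000000001
print(int96_to_unit(raw, "us"))  # 946684800000000  (the trailing nanosecond is lost)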

     


Karik Isichei / @isichei:
Have created a PR for the fix (only on the C++ functionality).

#10461

 

Let me know if there are any problems or improvements. I was thinking of doing C++ and Python together, but thought it better to solve the C++ functionality first and then do a secondary PR exposing the functionality in Python.


Antoine Pitrou / @pitrou:
Thanks for the PR @isichei. I've answered there.


Antoine Pitrou / @pitrou:
Issue resolved by pull request 10461
#10461
