
[Python][C++] Pyarrow Parquet reader overflows INT96 timestamps when converting to Arrow Array (timestamp[ns]) #27920

Closed
asfimport opened this issue Mar 26, 2021 · 9 comments


asfimport commented Mar 26, 2021

When reading Parquet data with timestamps stored as INT96, pyarrow assumes the timestamp type should be nanoseconds; when the data is converted into an Arrow table, any stored values that are out of bounds for nanosecond precision overflow.

# Round Trip Example
import datetime
import pandas as pd
import pyarrow as pa
from pyarrow import parquet as pq

df = pd.DataFrame({"a": [datetime.datetime(1000,1,1), datetime.datetime(2000,1,1), datetime.datetime(3000,1,1)]})
a_df = pa.Table.from_pandas(df)
a_df.schema # a: timestamp[us] 

pq.write_table(a_df, "test_round_trip.parquet", use_deprecated_int96_timestamps=True, version="1.0")
pfile = pq.ParquetFile("test_round_trip.parquet")
pfile.schema_arrow # a: timestamp[ns]
pq.read_table("test_round_trip.parquet").to_pandas()
# Results in values:
# 2169-02-08 23:09:07.419103232
# 2000-01-01 00:00:00
# 1830-11-23 00:50:52.580896768
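
For clarity, the wrapped values above are plain int64 overflow: timestamp[ns] stores nanoseconds since the Unix epoch in a signed 64-bit integer, which only covers roughly 1677-09-21 to 2262-04-11. A minimal sketch (plain Python, pyarrow not required, ns_since_epoch is just a local helper):

# Sketch: the year-1000 and year-3000 values do not fit in int64 nanoseconds
import datetime

def ns_since_epoch(dt):
    return int((dt - datetime.datetime(1970, 1, 1)).total_seconds()) * 1_000_000_000

for year in (1000, 2000, 3000):
    ns = ns_since_epoch(datetime.datetime(year, 1, 1))
    print(year, ns, abs(ns) > 2**63 - 1)  # True means out of int64 range, so the value wraps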

The above example is just meant to demonstrate the bug by getting pyarrow to write a Parquet file in a state similar to the original file where the bug was discovered. The bug was originally found when trying to read Parquet output from Amazon Athena with pyarrow, where we can't control the output format of the Parquet files (context).

I found some existing issues that might also be related:

  • ARROW-10444
  • ARROW-6779 (this reports similar behaviour, although reproducing it on pyarrow v3 raises an out-of-bounds error)

Environment: macOS Mojave 10.14.6
Python 3.8.3
pyarrow 3.0.0
pandas 1.2.3
Reporter: Karik Isichei / @isichei
Assignee: Karik Isichei / @isichei

Note: This issue was originally created as ARROW-12096. Please see the migration documentation for further details.


Antoine Pitrou / @pitrou:
Hmm... ideally Athena would stop emitting the deprecated INT96 type.
In the meantime, solving this would require adding a new option to choose the mapped timestamp type.
cc @emkornfield


Karik Isichei / @isichei:
@pitrou yeah, definitely agree. But I am not sure when they are going to do that (I haven't seen or heard anything that suggests it will be soon, unfortunately).

Happy to give it a go myself. Is there a prescribed process for tackling this? Would a good start be to look through a similar PR (like this one: https://github.com/apache/arrow/pull/4597/files) to get an idea of how a new option for the Parquet reader would be exposed in pyarrow?


Antoine Pitrou / @pitrou:
@isichei You first need to solve the issue on the C++ side. There are probably several places to look into:

  • ArrowReaderProperties in cpp/src/parquet/properties.h: you'll need to add the new option here

  • GetArrowType in cpp/src/parquet/arrow/schema_internal.cc: you'll need to take the option into account here

  • TransferInt96 in cpp/src/parquet/arrow/reader_internal.cc: ditto

You'll also need to expand the existing tests in cpp/src/parquet/arrow/arrow_reader_writer_test.cc.

Once that is done you'll be able to expose the option in Python.
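
Once exposed, usage from Python might look something like the sketch below. The keyword name coerce_int96_timestamp_unit is only an assumption here for illustration (whatever name the option ends up with); the point is that the reader maps INT96 to the requested unit instead of always to ns.

# Hypothetical sketch of the Python-level option (parameter name assumed, not final)
import pyarrow.parquet as pq

table = pq.read_table("test_round_trip.parquet", coerce_int96_timestamp_unit="us")
print(table.schema)  # expected: a: timestamp[us], so the year-1000/3000 values survive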


Karik Isichei / @isichei:
Thanks @pitrou  for pointing me in the right direction, appreciate it!


Micah Kornfield / @emkornfield:
Agree, IIUC int96 should be a deprecated type at this point.


Karik Isichei / @isichei:
Hi all,

Have had some time to look at this and get my head around the C++ codebase. What makes sense to me, following @pitrou's advice (exposing an ArrowReaderProperties option and taking it into account in GetArrowType and TransferInt96), would be to do the following:

  1. Add an option along the lines of int96_timestamp_type_as = … (defaulting to NS).

  2. Change Int96GetNanoSeconds (in cpp/src/parquet/types.h) to something like Int96GetSeconds, with an additional parameter for the interval size (again defaulting to NS). If the interval is anything other than NS, users may get truncated timestamps / data loss when the nanosecond component of the INT96 timestamp is converted to the specified interval (a rough sketch of that conversion is included after the questions below).

Some Questions on the above:

  • Does this sound like an OK approach to take? I'm not sure whether renaming functions in your codebase goes against any contribution guidelines; as far as I can tell that function isn't used much, so a rename wouldn't be too much effort. Otherwise, would the preferred approach be to create a separate function for each INT96 -> timestamp interval?

  • Do I need to raise any warnings if timestamps are being truncated in the C++ code, or is it something that would just be expressed in the docs? I.e. if you specify your own timestamp unit for INT96, that is at your own risk.
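
For reference, a rough Python sketch (not the Arrow C++ code) of what a unit-parameterised INT96 conversion does, assuming the usual Parquet INT96 layout of 8 bytes of nanoseconds-within-day followed by a 4-byte Julian day number; the integer division at the end is where the truncation mentioned in point 2 happens.

# Sketch only: decode a 12-byte INT96 value into a timestamp of the chosen unit
import struct

JULIAN_EPOCH_DAY = 2440588                 # Julian day number of 1970-01-01
NS_PER_DAY = 86_400 * 1_000_000_000
DIVISORS = {"ns": 1, "us": 1_000, "ms": 1_000_000, "s": 1_000_000_000}

def int96_to_unit(raw12, unit="ns"):
    nanos_of_day, julian_day = struct.unpack("<qI", raw12)
    ns_since_epoch = (julian_day - JULIAN_EPOCH_DAY) * NS_PER_DAY + nanos_of_day
    return ns_since_epoch // DIVISORS[unit]  # sub-unit precision is dropped here

# 2000-01-01 00:00:00.000000001 UTC: Julian day 2451545, 1 ns into the day
raw = struct.pack("<qI", 1, 2451545)
print(int96_to_unit(raw, "ns"))  # 946684800000000001
print(int96_to_unit(raw, "us"))  # 946684800000000  (the trailing nanosecond is lost)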

     


Karik Isichei / @isichei:
Have created a PR for the fix (only on the C++ functionality).

#10461

 

Let me know if there are any problems or improvements. I was thinking of doing C++ and Python together, but thought it better to solve the C++ functionality first and then do a secondary PR exposing the functionality in Python.


Antoine Pitrou / @pitrou:
Thanks for the PR @isichei. I've answered there.


Antoine Pitrou / @pitrou:
Issue resolved by pull request 10461
#10461
