ARROW-13086: [Python] Expose Parquet ArrowReaderProperties::coerce_int96_timestamp_unit_ #10575

isichei · 2021-06-23T06:55:28Z

No description provided.

github-actions · 2021-06-23T06:55:46Z

https://issues.apache.org/jira/browse/ARROW-13086

pitrou · 2021-06-23T16:27:47Z

@isichei Can you take a look at the CI failures?

isichei · 2021-06-24T07:15:46Z

@pitrou - will take a look over the next few days. Bit confused as I am seeing tests pass on my side that are failing on CI (for example the tests I added) 🤔

Will try to replicate the docker CI build on my machine and get back to you

jorisvandenbossche · 2021-06-24T07:22:30Z

You might have built Arrow/pyarrow without DATASET enabled?

The failures are from _dataset.pyx, you will need to add the new option to ParquetReadOptions / _PARQUET_READ_OPTIONS, I think.

isichei · 2021-06-29T09:01:32Z

I've gotten stuck on the dataset API and I'm not sure what is missing:

pq.ParquetFile definitely works with the new coerce_int96_timestamp_unit parameter.
I've added the new parameter to ParquetScanOptions, since in _dataset.pyx that class has access to the ArrowReaderProperties C++ class, which has the setter for the parameter.
I have also added the parameter to _ParquetDatasetV2 (in parquet.py), allowing it to be passed down to ParquetFileFormat and ParquetScanOptions.
I've added a test (test_dataset.py::test_parquet_scan_options) to check that the parameter is actually being set properly, and as far as I can tell it is.
However, when I call pq.read_table, which uses _ParquetDatasetV2, the test fails: it looks like coerce_int96_timestamp_unit is not being set when reading the parquet file, and I can't figure out why (see parquet/test_datetime.py::test_coerce_int96_timestamp_overflow[read_table]).
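For readers wondering why a test named test_coerce_int96_timestamp_overflow exists at all: INT96 timestamps are read at nanosecond resolution by default, and a signed 64-bit count of nanoseconds since the Unix epoch only reaches the year 2262. The sketch below shows that bound using only the standard library. The helper name `max_year_for_unit` is hypothetical and is not part of pyarrow; it is only an illustration of the arithmetic, not of how the reader is implemented.

```python
from datetime import datetime, timedelta

INT64_MAX = 2**63 - 1
EPOCH = datetime(1970, 1, 1)

# Ticks per second for each timestamp unit.
TICKS_PER_SECOND = {"s": 1, "ms": 10**3, "us": 10**6, "ns": 10**9}


def max_year_for_unit(unit):
    """Hypothetical helper: last year an int64 timestamp in `unit` can reach.

    Returns None when the limit lies beyond what Python's datetime can hold
    (year 9999), which is the case for every unit coarser than nanoseconds.
    """
    seconds = INT64_MAX // TICKS_PER_SECOND[unit]
    try:
        return (EPOCH + timedelta(seconds=seconds)).year
    except OverflowError:
        return None


print(max_year_for_unit("ns"))  # 2262
print(max_year_for_unit("us"))  # None (far beyond year 9999)
```

A value such as the year 3000 therefore cannot be represented at nanosecond resolution, but fits comfortably once a coarser unit is requested.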

### Next Steps

If nobody else has time to look at this in more detail, it might be beneficial for me to make this PR expose the parameter only on ParquetFile and drop it from the Dataset API (so it is ready to merge for the V5 release). I can then create a new feature request on JIRA to expose the parameter in the Dataset API.

Let me know what you think.

jorisvandenbossche · 2021-06-29T10:05:55Z

@isichei I will take a look

jorisvandenbossche · 2021-06-29T11:23:24Z

I think the option needed to be added to ParquetReadOptions instead of ParquetFragmentScanOptions, because it is an option that determines the schema of the resulting dataset, and not just how an individual fragment gets scanned.
To do that, it required a small addition in the C++ code for the Parquet Dataset.

isichei · 2021-06-30T15:59:25Z

@jorisvandenbossche thanks for the changes. I made a minor addition and added a test. The test might be a little redundant, so I'm happy to take it out.

I've noted the failing Pandas test and will push a fix later this evening.

isichei · 2021-07-02T09:31:57Z

OK, I think the tests are all passing now as far as I can tell, so is this ready to merge?

My apologies to your CI/CD resources there...

jorisvandenbossche · 2021-07-16T10:00:33Z

@isichei I rebased this and did a small clean-up of the test. Windows was failing because of a wrong format string (%Y-%m-%s instead of %Y-%m-%d), but in the end I removed the comparison as strings altogether.
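The format-string slip described above is easy to miss. `%d` is the zero-padded day of the month, whereas `%s` is not among the directives documented for Python's `strftime` and gets handed to the platform's C library, so its behavior differs between operating systems. A minimal sketch of the intended format, using an arbitrary example date rather than anything from the actual test:

```python
from datetime import datetime

ts = datetime(2021, 7, 16, 10, 0, 33)  # arbitrary example value

# %Y-%m-%d: four-digit year, zero-padded month, zero-padded day of month.
formatted = ts.strftime("%Y-%m-%d")
print(formatted)  # 2021-07-16

# Parsing with the same format recovers the date part.
parsed = datetime.strptime(formatted, "%Y-%m-%d")
```

Comparing the datetime values directly, rather than their string renderings, sidesteps platform-specific formatting entirely.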

isichei · 2021-07-16T13:40:24Z

Thanks @jorisvandenbossche much appreciated!

kszucs · 2021-07-21T11:29:07Z

Thanks @isichei!

Followup to PR #10575 Closes #10766 from pitrou/ARROW-13086-refactor Authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Krisztián Szűcs <[email protected]>

github-actions bot added the Component: Python label Jun 23, 2021

pitrou requested a review from jorisvandenbossche June 23, 2021 16:26

github-actions bot added the Component: C++ label Jun 29, 2021

isichei and others added 8 commits July 16, 2021 11:25

ARROW-13086: Exposing coerce_int96_timestamp_unit param to Python parquet reader.

43b2b64

Extended coerce_int96_timestamp_unit to the Dataset API but still not being set on parquet read in. Have added failing test to demonstrate issue. Will add more details on PR.

c73a757

move from ParquetFragmentScanOptions to ParquetReadOptions

c91596c

Added minor addition that returns read options with set coerce_int96 param. Also added a test but the test feels slightly redundant so happy to take out.

67dfe64

Fixing Pandas backwards compatibility in test

030b644

Fixing Windows test failure due to windows not supporting %f in datetime format

fc565af

Windows build still failing for strftime so skipping as this test is mainly to act as additional documentation as to why / when you want to use this new parameter in the parquet reader.

5c9eb31

simplify test

34898fd

jorisvandenbossche force-pushed the ARROW-13086 branch from a1e1ced to 34898fd July 16, 2021 09:59

kszucs reviewed Jul 20, 2021

python/pyarrow/_parquet.pyx (resolved)

kszucs reviewed Jul 20, 2021

python/pyarrow/parquet.py (outdated, resolved)

Minor improvements

66b2f8a

kszucs approved these changes Jul 21, 2021

kszucs closed this in 6323c12 Jul 21, 2021

pitrou added a commit to pitrou/arrow that referenced this pull request Jul 21, 2021

ARROW-13086: [Python] De-duplicate time unit conversion code

ffbbb89

Followup to PR apache#10575

pitrou mentioned this pull request Jul 21, 2021

ARROW-13086: [Python] De-duplicate time unit conversion code #10766

Closed

asfimport mentioned this pull request Jul 21, 2021

[Python] Expose Parquet ArrowReaderProperties::coerce_int96_timestamp_unit_ #28792

Closed
