Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-13086: [Python] Expose Parquet ArrowReaderProperties::coerce_int96_timestamp_unit_ #10575

Closed
wants to merge 9 commits into from

Conversation

isichei
Copy link
Contributor

@isichei isichei commented Jun 23, 2021

No description provided.

@github-actions
Copy link

@pitrou
Copy link
Member

pitrou commented Jun 23, 2021

@isichei Can you take a look at the CI failures?

@isichei
Copy link
Contributor Author

isichei commented Jun 24, 2021

@pitrou - will take a look over the next few days. Bit confused as I am seeing tests pass on my side that are failing on CI (for example the tests I added) 🤔

Will try to replicate the docker CI build on my machine and get back to you

@jorisvandenbossche
Copy link
Member

You migth have built Arrow/pyarrow without DATASET enabled?

The failures are from _dataset.pyx, you will need to add the new option to ParquetReadOptions / _PARQUET_READ_OPTIONS, I think.

@isichei
Copy link
Contributor Author

isichei commented Jun 29, 2021

I've gotten stuck on the dataset API and not sure what is missing:

  • pq.ParquetFile definitely works with the new coerce_int96_timestamp_unit parameter.
  • I've added the new parameter the ParquetScanOptions as in _dataset.pyx that class has access to the ArrowReaderProperties cpp class which has the setter for the parameter.
  • I have also added the parameter to the _ParquetDatasetV2 (in parquet.py) allowing it to be passed down to ParquetFileFormat and ParquetScanOptions.
  • I've added a test (in test_dataset.py::test_parquet_scan_options) to check that this is actually being set properly and as far as I can tell it is being set.
  • However, when i call pq.read_table which uses _ParquetDatasetV2 the test fails at it looks like coerce_int96_timestamp_unit is not being set on reading the parquet file and I can't figure out why (see parquet/test_datetime.py::test_coerce_int96_timestamp_overflow[read_table]).

### Next Steps

If someone else doesn't have time to look at this in more detail. It might be beneficial for me to just make this PR expose the parameter to ParquetFile and drop it from the Dataset API (so it is ready to merge for V5 release). Then I can create a new feature request on JIRA to expose the Dataset API to the new parameter.

Let me know what you think.

@jorisvandenbossche
Copy link
Member

@isichei I will take a look

@jorisvandenbossche
Copy link
Member

I think the option needed to be added to ParquetReadOptions instead of ParquetFragmentScanOptions, because it is an option that determines the schema of the resulting dataset, and not just how an individual fragment gets scanned.
To do that, it required a small addition in the C++ code for the Parquet Dataset.

@isichei
Copy link
Contributor Author

isichei commented Jun 30, 2021

@jorisvandenbossche thanks for the changes. I added a minor addition and test. The test might be a little redudent so happy to take out.

I've noted the failing Pandas test will push a fix later this evening.

@isichei
Copy link
Contributor Author

isichei commented Jul 2, 2021

OK, think tests are all now passing as far as I can tell and is ready to merge?

My apologies to your CI/CD resources there...

isichei and others added 8 commits July 16, 2021 11:25
… being set on parquet read in. Have added failing test to demonstrate issue. Will add more details on PR.
…param. Also added a test but the test feels slightly redundant so happy to take out.
…mainly to act as additional documentation as to why / when you want to use this new parameter in the parquet reader.
@jorisvandenbossche
Copy link
Member

@isichei I rebased this and did a small clean-up of the test (the reason Windows was failing was because of a wrong format string (%Y-%m-%s instead of %Y-%m-%d), but in the end I removed the comparison as strings alltogether).

@isichei
Copy link
Contributor Author

isichei commented Jul 16, 2021

Thanks @jorisvandenbossche much appreciated!

python/pyarrow/parquet.py Outdated Show resolved Hide resolved
@kszucs kszucs closed this in 6323c12 Jul 21, 2021
@kszucs
Copy link
Member

kszucs commented Jul 21, 2021

Thanks @isichei!

pitrou added a commit to pitrou/arrow that referenced this pull request Jul 21, 2021
kszucs pushed a commit that referenced this pull request Jul 21, 2021
Followup to PR #10575

Closes #10766 from pitrou/ARROW-13086-refactor

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Krisztián Szűcs <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants