GH-36284: [Python][Parquet] Support write page index in Python API #36290

Merged
merged 10 commits into from
Jul 10, 2023

Conversation

mapleFU
Member

@mapleFU mapleFU commented Jun 25, 2023

Rationale for this change

Support write_page_index in Parquet Python API

What changes are included in this PR?

Support `write_page_index` in the Parquet writer properties.

Are these changes tested?

Currently not

Are there any user-facing changes?

Users can now write a page index from the Python API.

@github-actions

⚠️ GitHub issue #36284 has been automatically assigned in GitHub to PR creator.

@mapleFU
Member Author

mapleFU commented Jun 25, 2023

@jorisvandenbossche @pitrou Would you mind taking a look? I'm not very familiar with the Python part, so I may have gotten something wrong.

@mapleFU mapleFU marked this pull request as ready for review June 26, 2023 16:31
@mapleFU mapleFU requested a review from AlenkaF as a code owner June 26, 2023 16:31
@mapleFU mapleFU force-pushed the parquet/enable-write-page-index branch 2 times, most recently from d758a74 to 39553b5 Compare July 3, 2023 05:15
@mapleFU
Member Author

mapleFU commented Jul 3, 2023

@pitrou @westonpace Would you mind taking a look? This patch adds support for writing the page index from Python.

Member

@AlenkaF AlenkaF left a comment

I only have a minor suggestion about the write_page_index docstrings.

python/pyarrow/parquet/core.py Outdated Show resolved Hide resolved
@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jul 4, 2023
Member

@AlenkaF AlenkaF left a comment

Thanks!

@mapleFU
Member Author

mapleFU commented Jul 5, 2023

Can this patch be merged? Or should I wait for review from other committers?

@pitrou
Member

pitrou commented Jul 5, 2023

@github-actions crossbow submit -g python

@mapleFU
Member Author

mapleFU commented Jul 5, 2023

  • test-conda-python-3.10-spark-master
  • test-cuda-python
  • test-conda-python-3.8-spark-v3.1.2
  • test-conda-python-3.10-spark-master

These jobs failed. How can I fix them?

@pitrou
Member

pitrou commented Jul 5, 2023

@mapleFU Those are unrelated to this PR. Can you try to rebase?

@AlenkaF
Member

AlenkaF commented Jul 5, 2023

  • test-conda-python-3.10-spark-master
  • test-conda-python-3.8-spark-v3.1.2
  • test-conda-python-3.9-spark-v3.2.0

Spark failures are known and have an issue opened.

  • test-conda-python-3.11-hypothesis

Hypothesis failure is a new one but I do not see how it could be related to this PR.

  • test-cuda-python

I have seen nightlies fail with this error today already, so this is not related to the PR either.

@jorisvandenbossche
Member

> Hypothesis failure is a new one but I do not see how it could be related to this PR.

Hmm, that seems very similar to the one that I fixed last week (#36349, but now with another unknown timezone). In any case, you can ignore it here.

@@ -867,6 +867,10 @@ def _sanitize_table(table, new_schema, flavor):
it will restore the timezone (Parquet only stores the UTC values without
timezone), or columns with duration type will be restored from the int64
Parquet column.
write_page_index : bool, default False
Member

Side question: should we consider making this turned on by default at some point?

Member Author

Currently not. I found it hard to implement page index pruning in the current implementation. If we implement it, maybe we can enable it by default.

Member

Even if it's not used already, it would probably be beneficial to write files with the index enabled, for future use.
Is there a performance issue with enabling it?

Member Author

I guess most of the time there is no performance issue. But when the user has extremely long strings, we might write too much data.

Member

We are allowed to trim the min/max values, right?

Member Author

@mapleFU mapleFU Jul 5, 2023

Sure. For now, statistics that are too long are "discarded", and so is the page index. I will implement truncation in the future.
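
To illustrate what trimming min/max values could look like, here is a rough sketch. It is not the Arrow C++ implementation (truncation is not implemented yet, per the comment above). The function names `truncate_min` and `truncate_max` and the `max_len` parameter are hypothetical, and the sketch assumes values are compared as unsigned bytes in lexicographic order.

```python
from typing import Optional


def truncate_min(value: bytes, max_len: int) -> bytes:
    """Return a lower bound for `value` that is at most `max_len` bytes long."""
    # Any prefix of a byte string compares <= the full string.
    return value[:max_len]


def truncate_max(value: bytes, max_len: int) -> Optional[bytes]:
    """Return an upper bound of at most `max_len` bytes, or None if none exists."""
    if len(value) <= max_len:
        return value
    prefix = bytearray(value[:max_len])
    # Trailing 0xFF bytes cannot be incremented, so drop them first.
    while prefix and prefix[-1] == 0xFF:
        prefix.pop()
    if not prefix:
        # Every kept byte was 0xFF: no shorter upper bound exists.
        return None
    # Bump the last remaining byte so the result sorts after `value`.
    prefix[-1] += 1
    return bytes(prefix)
```

The truncated bounds are looser than the exact ones, so a reader may scan a few extra pages, but it never skips a page that contains matching values.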

Member

So if I understand correctly, we are currently not yet using the PageIndex when reading files (through the python APIs) for pruning pages when given a filter?

Should we mention that in the docstring to note that you can already write a PageIndex, but it will not yet be used when reading using pyarrow?

Member Author

Sure, I'll do that.

Member Author

@jorisvandenbossche I've done that. By the way, pyarrow cannot yet filter using the page index, but parquet-rs and parquet-mr can use it to optimize reads.
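
As background for this thread, the sketch below shows the general idea of page pruning with per-page min/max statistics. It is a conceptual illustration only, not how pyarrow, parquet-rs, or parquet-mr implement it. The `PageStats` class and `pages_matching_range` function are hypothetical names, and integer values are assumed for simplicity.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageStats:
    """Hypothetical per-page entry: value bounds plus the page's first row."""
    min_value: int
    max_value: int
    first_row: int


def pages_matching_range(pages: List[PageStats], low: int, high: int) -> List[int]:
    """Return indices of pages whose [min, max] range overlaps [low, high]."""
    return [
        i for i, page in enumerate(pages)
        if page.max_value >= low and page.min_value <= high
    ]


# Three pages of 100 rows each, holding sorted values 0..299.
pages = [PageStats(0, 99, 0), PageStats(100, 199, 100), PageStats(200, 299, 200)]

# A filter such as 150 <= x <= 210 only needs pages 1 and 2.
selected = pages_matching_range(pages, 150, 210)
```

A reader that supports this can skip page 0 entirely, which is why writing the index can pay off even before pyarrow itself uses it.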

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting committer review Awaiting committer review labels Jul 5, 2023
@pitrou
Member

pitrou commented Jul 5, 2023

@github-actions crossbow submit -g python

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting merge Awaiting merge labels Jul 5, 2023
@mapleFU
Member Author

mapleFU commented Jul 5, 2023

Still these failed, lol

python/pyarrow/_parquet.pyx Outdated Show resolved Hide resolved
python/pyarrow/tests/parquet/test_metadata.py Outdated Show resolved Hide resolved
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Jul 6, 2023
@mapleFU
Member Author

mapleFU commented Jul 7, 2023

@pitrou @jorisvandenbossche I've tried to address the comments here. Would you mind taking a look?

@pitrou pitrou force-pushed the parquet/enable-write-page-index branch from 2f764cd to 5f789a6 Compare July 10, 2023 15:31
@pitrou
Member

pitrou commented Jul 10, 2023

@github-actions crossbow submit -g python

@github-actions

Revision: 9840291

Submitted crossbow builds: ursacomputing/crossbow @ actions-f780c64692

Task Status
test-conda-python-3.10 Github Actions
test-conda-python-3.10-hdfs-2.9.2 Github Actions
test-conda-python-3.10-hdfs-3.2.1 Github Actions
test-conda-python-3.10-pandas-latest Github Actions
test-conda-python-3.10-pandas-nightly Github Actions
test-conda-python-3.10-spark-master Github Actions
test-conda-python-3.10-substrait Github Actions
test-conda-python-3.11 Github Actions
test-conda-python-3.11-dask-latest Github Actions
test-conda-python-3.11-dask-upstream_devel Github Actions
test-conda-python-3.11-hypothesis Github Actions
test-conda-python-3.11-pandas-upstream_devel Github Actions
test-conda-python-3.8 Github Actions
test-conda-python-3.8-pandas-1.0 Github Actions
test-conda-python-3.8-spark-v3.1.2 Github Actions
test-conda-python-3.9 Github Actions
test-conda-python-3.9-pandas-latest Github Actions
test-conda-python-3.9-spark-v3.2.0 Github Actions
test-cuda-python Github Actions
test-debian-11-python-3 Azure
test-fedora-35-python-3 Azure
test-ubuntu-20.04-python-3 Azure
test-ubuntu-22.04-python-3 Github Actions

@pitrou pitrou merged commit 12f45ba into apache:main Jul 10, 2023
@pitrou pitrou removed the awaiting change review Awaiting change review label Jul 10, 2023
raulcd pushed a commit that referenced this pull request Jul 11, 2023
…36290)

### Rationale for this change

Support `write_page_index` in Parquet Python API

### What changes are included in this PR?

support `write_page_index` in properties

### Are these changes tested?

Currently not

### Are there any user-facing changes?

User can generate page index here.

* Closes: #36284

Lead-authored-by: mwish <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Co-authored-by: mwish <[email protected]>
Co-authored-by: Alenka Frim <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
@conbench-apache-arrow

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 12f45ba.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.
