-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-36284: [Python][Parquet] Support write page index in Python API #36290
Conversation
|
28d6e68
to
789abf8
Compare
@jorisvandenbossche @pitrou Mind take a look? I'm not so familiar with Python part, so maybe make something wrong |
d758a74
to
39553b5
Compare
@pitrou @westonpace Would you mind take a look? This patch support Python to write |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only have a minor suggestion about the write_page_index
docstrings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
Can this patch be merged? Or should I wait for other committers review? |
@github-actions crossbow submit -g python |
This comment was marked as outdated.
This comment was marked as outdated.
These cases failed, how can I try to fix them? |
@mapleFU Those are unrelated to this PR. Can you try to rebase? |
Spark failures are known and have an issue opened.
Hypothesis failure is a new one but I do not see how it could be related to this PR.
I have seen nightlies fail with this error today already, so this is not related to the PR either. |
Hmm, that seems very similar to the one that I fixed last week (#36349, but now with another unknown timezone). In any case, you can ignore it here. |
@@ -867,6 +867,10 @@ def _sanitize_table(table, new_schema, flavor): | |||
it will restore the timezone (Parquet only stores the UTC values without | |||
timezone), or columns with duration type will be restored from the int64 | |||
Parquet column. | |||
write_page_index : bool, default False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Side question: should we consider making this turned on by default at some point?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently not, I found it's hard to implement page index pruning in current implementions. If we implements it, maybe we can change it to default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even if it's not used already, it would probably be beneficial to write files with the index enabled, for future use.
Is there a performance issue with enabling it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess most time there is no performance issue. But when user has extremly long string, we might write to much data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are allowed to trim the min/max values, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. Here it will "discard" too long statistics, and discard the page index. I will implement truncate in the future
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if I understand correctly, we are currently not yet using the PageIndex when reading files (through the python APIs) for pruning pages when given a filter?
Should we mention that in the docstring to note that you can already write a PageIndex, but it will not yet be used when reading using pyarrow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I'll
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorisvandenbossche I've done that. By the way, we cannot filter using pyarrow, but parquet-rs
and parquet-mr
can optimize by it.
@github-actions crossbow submit -g python |
This comment was marked as outdated.
This comment was marked as outdated.
Still these failed, lol |
@pitrou @jorisvandenbossche I've tried to fix the comment here. Would you mind take a look? |
Co-authored-by: Alenka Frim <[email protected]>
2f764cd
to
5f789a6
Compare
@github-actions crossbow submit -g python |
Revision: 9840291 Submitted crossbow builds: ursacomputing/crossbow @ actions-f780c64692 |
…36290) ### Rationale for this change Support `write_page_index` in Parquet Python API ### What changes are included in this PR? support `write_page_index` in properties ### Are these changes tested? Currently not ### Are there any user-facing changes? User can generate page index here. * Closes: #36284 Lead-authored-by: mwish <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Co-authored-by: mwish <[email protected]> Co-authored-by: Alenka Frim <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 12f45ba. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. |
Rationale for this change
Support
write_page_index
in Parquet Python APIWhat changes are included in this PR?
support
write_page_index
in propertiesAre these changes tested?
Currently not
Are there any user-facing changes?
User can generate page index here.