[Python] Remove redundant S3 call #33972
The datasets feature went through considerable change a while back when it moved from a parquet-only feature to a format-agnostic one. It looks like this connection came loose in the conversion. If you just want to read one file, the approach is normally something more like:
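(The snippet itself didn't survive on this page; a minimal sketch of the single-file approach, assuming a Parquet file on S3 with an illustrative bucket and path:)

```python
import pyarrow.parquet as pq
from pyarrow.fs import S3FileSystem

fs = S3FileSystem()
# Reads the footer once, then fetches only the referenced column chunks.
table = pq.read_table("my-bucket/data/file.parquet", filesystem=fs)
```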
If you're looking to read a collection of files you would normally use:
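(Again a sketch, with illustrative names, of the usual multi-file path through the datasets API:)

```python
import pyarrow.dataset as ds
from pyarrow.fs import S3FileSystem

fs = S3FileSystem()
# Discovers all files under the prefix and exposes them as one dataset;
# each file's footer is read lazily as it is scanned.
dataset = ds.dataset("my-bucket/data/", format="parquet", filesystem=fs)
table = dataset.to_table()
```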
I suspect (though am not entirely certain) both of the above paths will only read the metadata once. However, your usage is legitimate, and it even affects the normal datasets path when you scan the dataset multiple times (because we should be caching the metadata on the first scan and reusing it on the second). So I would consider this a bug. I don't know for sure, but my guess is the problem is here. The fragment is opening a reader and should pass the metadata to the reader, if already populated.
You mentioned in the other issue that you want to reuse the connection. Could you clarify your larger goal a little bit? Or perhaps do you have some example code somewhere of how you're planning on using this? For example, are you bringing in an S3 connection from outside of pyarrow? Or do you start with a path? Are you reading from the same dataset multiple times, or is this a one-shot operation (or does the list of files change from call to call)?
@westonpace sure thing! We need to make projections, and we need to have the schema before loading the data. For example, if you have an Iceberg table and you rename a column, you don't want to rewrite your multi-petabyte table. Iceberg uses IDs to identify columns, and if you filter or project on a renamed column, it will select the old column name in the files that were written before the rename. The current code is over here: https://github.com/apache/iceberg/blob/master/python/pyiceberg/io/pyarrow.py#L486-L522
Ok, that helps. In the short term I think you should use the approach sketched below. Longer term, you can probably just specify a custom evolution strategy (using parquet column IDs) and let pyarrow handle the expression conversion for you. Sadly, this feature is not yet ready (I'm working on it when I can; 🤞 for 12.0.0).
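(A minimal sketch of one way to get both the schema and the data from a single footer parse, assuming an S3 filesystem; bucket and path are illustrative: open the file with `pyarrow.parquet.ParquetFile`.)

```python
import pyarrow.parquet as pq
from pyarrow.fs import S3FileSystem

fs = S3FileSystem()
with fs.open_input_file("my-bucket/data/file.parquet") as f:
    pf = pq.ParquetFile(f)     # parses the footer once
    schema = pf.schema_arrow   # inspect for renamed columns before reading
    table = pf.read()          # reuses the already-parsed footer
```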
That's not a problem, as long as it stays cached in the fragment. The reads from the end of the file to fetch the footer are rather expensive (in terms of time), so we would love to eliminate that call. I went through the code and was able to pass the metadata down from the fragment to the reader: #34015
I agree, we need to have predicate pushdown 👍🏻
Let me know when something is ready, happy to test 👍🏻
Closes #33972

### Rationale for this change

### What changes are included in this PR?

### Are these changes tested?

### Are there any user-facing changes?

* Closes: #33972

Lead-authored-by: Fokko Driesprong <[email protected]>
Co-authored-by: Weston Pace <[email protected]>
Co-authored-by: Fokko Driesprong <[email protected]>
Signed-off-by: Weston Pace <[email protected]>
Describe the enhancement requested
Hey all,
First of all, thanks everyone for working on PyArrow! Really loving it so far. I'm currently working on PyIceberg, which will load an Iceberg table into PyArrow. For those unfamiliar with Apache Iceberg: it is a table format that focuses on huge tables (petabyte size). PyIceberg makes your life easier by taking care of statistics to boost performance, and of all the schema maintenance. For example, if you change the partitioning of an Iceberg table, you don't have to rewrite all the files right away; you can do this in an incremental way.
Now I'm running into some performance issues, and I noticed that PyArrow is doing more requests to S3 than required. I went down the rabbit hole and was able to narrow it down to:
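(A sketch of the kind of code that triggers it, based on the rest of the thread; the MinIO endpoint and path are illustrative:)

```python
import pyarrow.dataset as ds
from pyarrow.fs import S3FileSystem

fs = S3FileSystem(endpoint_override="http://localhost:9000")  # MinIO, assumed
fragment = ds.ParquetFileFormat().make_fragment(
    "my-bucket/data/file.parquet", filesystem=fs
)
schema = fragment.physical_schema  # fetches and parses the Parquet footer
table = fragment.to_table()        # fetches the footer again: the redundant call
```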
I need the schema first, because it can be that a column got renamed but the file hasn't been rewritten against the latest schema. The same goes for filtering: if you change a column name and the file still has the old name in it, then you would like to leverage PyArrow's predicate pushdown to avoid loading the data into memory at all.
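(For illustration, a sketch of that pattern, assuming the file still carries the pre-rename name `col_old` while the current table schema calls it `col_new`:)

```python
import pyarrow.dataset as ds

# Filter and project on the name stored in the file, so PyArrow can push the
# predicate into the Parquet reader and skip row groups via their statistics.
dataset = ds.dataset("my-bucket/data/", format="parquet")
table = dataset.to_table(columns=["col_old"], filter=ds.field("col_old") > 100)
table = table.rename_columns(["col_new"])  # expose it under the current name
```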
When looking into the MinIO logs, I can see that it does four requests.
Looking at the tests, we shouldn't fetch the footer twice:
Any thoughts or advice? I went through the code a bit already, but my C++ is a bit rusty.
Component(s)
Python