-
Notifications
You must be signed in to change notification settings - Fork 364
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve performance find zip archive #1664
Improve performance find zip archive #1664
Conversation
Suggestions from @martindurant in code review. Co-authored-by: Martin Durant <[email protected]>
Thank you for the review @martindurant! I refactored the code quite a bit, to make use of the Regarding removing the I implemented it as suggested by you now, but just filtering for it at the end if it is set. |
I wonder, should this code go up into AbstractArchiveFileSystem, since it may have a similar performance benefit for libarchive or tar filesystems? I'm not sure if find is slow there or if there's anything else to worry about. |
@martindurant Let me know if you are happy with this or if there is anything else you'd like to see. When it comes to moving to evaluating this libarchive/tar file systems I unfortunately don't think that I will have the bandwidth to take that on. |
Understood - maybe someone else gets the itch if indeed speed is a problem for the other ones. |
@martindurant thank you for a very enjoyable review process, and thanks from brining this in. It will help our use-case immensely. |
This is an attempt to address #1662. Working on a code base that makes extensive us of zip archives containing parquet files that need to be read on the fly to load data, I noticed that
find
was a major bottleneck in that process.I have made a zip specific implementation of
find
here that relies on the file list of the zip file, rather than explicitly walking.Here is how I measured the performance:
Performance for find on current
master
Performance for find with this fix applied:
Let me know what you think.