Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Uses buffered input for inline images to speed up reading #390

Conversation

archivsozialebewegungen

There are pdf files out there in the wild which use large inline images - although this is not recommended by the pdf specification. If you try to open one of these files with PyPDF2 the process seems to be stuck because for inline images the read is done byte by byte. This patch introduces buffering for inline images, the actual (huge) buffer size has been found experimentally to get the best reading speed.

@MartinThoma MartinThoma added nf-performance Non-functional change: Performance Tiny Pull requests that make a tiny change - and thus should be easy to merge labels Apr 6, 2022
@MartinThoma
Copy link
Member

Thank you for your contribution!

@MartinThoma
Copy link
Member

#740 which was merged+released yesterday, probably already did this

@MartinThoma
Copy link
Member

I know it's been a long time since you created this PR. Would you mind to check if your PR still adds value (and potentially fix the merge conflicts?)

@MartinThoma MartinThoma added the needs-discussion The PR/issue needs more discussion before we can continue label Apr 16, 2022
@archivsozialebewegungen
Copy link
Author

After a first glance over the code changed according to #740, the buffering introduced should mitigate the problem. I'm not really sure if the find/seek solution for detecting the end of the image data stream is faster or slower than my regex solution, but this should not make a big difference. More of a concern might be the buffersize of only 8k, I used 1m for a reason, so 8k might still result in poor performance for large images. I will create a performance test next week when I'm in the office again where I have real life data samples.

@MartinThoma
Copy link
Member

Thank you so much! I would love having performance tests in the test suite! (maybe even in CI?)

@MartinThoma
Copy link
Member

Hi @archivsozialebewegungen! Did you have the time to run performance tests?

@MartinThoma
Copy link
Member

The main point of this PR was to use a buffer / read in bigger chunks.

We do read 8kB chunks now: https://github.com/py-pdf/PyPDF2/blob/main/PyPDF2/generic.py#L1176

As the code base has changed quite a bit, I'm closing this PR now. Feel free to submit another PR (I'll handle that one quicker 🤞 )

@archivsozialebewegungen
Copy link
Author

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs-discussion The PR/issue needs more discussion before we can continue nf-performance Non-functional change: Performance Tiny Pull requests that make a tiny change - and thus should be easy to merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants