Uses buffered input for inline images to speed up reading #390
Thank you for your contribution!
#740, which was merged and released yesterday, probably already did this.
I know it's been a long time since you created this PR. Would you mind checking whether your PR still adds value (and potentially fixing the merge conflicts)?
After a first glance over the code changed in #740, the buffering it introduced should mitigate the problem. I'm not sure whether the find/seek approach for detecting the end of the image data stream is faster or slower than my regex approach, but this should not make a big difference. More of a concern might be the buffer size of only 8k. I used 1m for a reason, so 8k might still result in poor performance for large images. I will create a performance test next week when I'm back in the office, where I have real-life data samples.
Thank you so much! I would love to have performance tests in the test suite (maybe even in CI)!
Hi @archivsozialebewegungen! Did you have time to run the performance tests?
The main point of this PR was to use a buffer / read in bigger chunks. We do read 8 kB chunks now: https://github.com/py-pdf/PyPDF2/blob/main/PyPDF2/generic.py#L1176 As the code base has changed quite a bit, I'm closing this PR now. Feel free to submit another PR (I'll handle that one quicker 🤞 )
I'm also sorry for answering late. Yesterday I tried to find one of the
PDF files that caused trouble for us and could not find one. Finally I
looped over all our PDF files and found no performance issues, even with
version 1.26. I have two possible explanations for this puzzling
behaviour:
- A performance boost from better hardware (unlikely to this degree)
- Much better memory management when reading files with Python 3.9
compared to Python 3.2, where the problem arose and made me introduce
buffering
I think the latter is the most likely explanation, although I do not
have proof. In short: the problem seems to be solved both with and
without buffering, and I fully agree with closing the PR.
Kind regards
Michael
On Tuesday, 14.06.2022 at 12:48 -0700, Martin Thoma wrote:
… The main point of this PR was to use a buffer / read in bigger
chunks.
We do read 8kB chunks now:
https://github.com/py-pdf/PyPDF2/blob/main/PyPDF2/generic.py#L1176
As the code base has changed quite a bit, I'm closing this PR now.
Feel free to submit another PR (I'll handle that one quicker 🤞 )
There are PDF files in the wild that use large inline images, although the PDF specification recommends against this. If you try to open one of these files with PyPDF2, the process appears to hang, because inline image data is read byte by byte. This patch introduces buffered reading for inline images; the (large) buffer size was determined experimentally to give the best reading speed.
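
To illustrate the idea, here is a minimal sketch of chunked reading for inline image data. It is not the code from this PR or from PyPDF2: the function name `read_inline_image_data` and the `chunk_size` parameter are hypothetical, and the end-of-image detection is a simplified heuristic (an `EI` operator surrounded by PDF whitespace) that a real parser would need to make more robust, since binary image data can contain those bytes by chance.

```python
import io
import re

# Simplified assumption: the image data ends at "EI" with PDF whitespace
# on both sides. Real-world files may need a stricter check.
_END_MARKER = re.compile(rb"[\x00\t\n\x0c\r ]EI[\x00\t\n\x0c\r ]")


def read_inline_image_data(stream, chunk_size=8192):
    """Read inline image bytes in chunks instead of one byte at a time.

    Returns the bytes before the end marker and leaves the stream
    positioned directly after "EI". Hypothetical helper for illustration.
    """
    start = stream.tell()
    data = bytearray()
    search_from = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            raise ValueError("stream ended before the EI marker was found")
        data += chunk
        match = _END_MARKER.search(data, search_from)
        if match:
            # Skip the leading whitespace byte plus "EI" (3 bytes in total).
            stream.seek(start + match.start() + 3)
            return bytes(data[:match.start()])
        # The 4-byte marker may straddle a chunk boundary, so the next
        # search starts 3 bytes before the end of what was read so far.
        search_from = max(0, len(data) - 3)


if __name__ == "__main__":
    payload = b"\xff\xd8" + b"A" * 20000
    pdf_stream = io.BytesIO(payload + b"\nEI\nQ")
    image = read_inline_image_data(pdf_stream, chunk_size=8192)
    print(len(image), pdf_stream.read())
```

A larger `chunk_size` means fewer `read` calls at the cost of reading further past the marker before seeking back; which value performs best depends on the files and the environment, as the discussion above suggests.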