-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PI: Avoid string concatenation with large embedded base64-encoded images #1350
PI: Avoid string concatenation with large embedded base64-encoded images #1350
Conversation
The correct behaviour is tested by existing unit tests. I am not sure if you want performance tests on large files in your code base or not. Runtime on a 4mb PDF with a single embedded base64 image was several minutes with the old code, whereas with the new code it is roughly a second. Sadly i did not yet manage to create a sample PDF yet, and the one PDF i have contains sensitive data. |
01c1956
to
a41c497
Compare
Certain PDF libraries do embed images as base64 strings. This causes performance issues in `read_string_from_stream` due to incremental string concatenation, byte by byte. PDF Lib in our case is ``` <xmp:CreatorTool>Canon iR-ADV C256 PDF</xmp:CreatorTool> <pdf:Producer>PDF Annotator 8.0.0.826 [Adobe PSL 1.3e for Canon</pdf:Producer> ```
a41c497
to
28306de
Compare
Codecov ReportBase: 94.63% // Head: 94.63% // No change to project coverage 👍
Additional details and impacted files@@ Coverage Diff @@
## main #1350 +/- ##
=======================================
Coverage 94.63% 94.63%
=======================================
Files 30 30
Lines 5140 5140
Branches 1058 1058
=======================================
Hits 4864 4864
Misses 164 164
Partials 112 112
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
Thank you for the PR ❤️ |
Yes, I want that! We have the https://github.com/py-pdf/sample-files repository included as a git submodule to ensure the PyPDF2 repository itself remains small (that is important if people install directly from source). I also want tests that are expected to be slow to be marked with a pytest marker. Currently we only have Additionally, we have a benchmark over several PDF libraries: https://github.com/py-pdf/benchmarks |
https://stackoverflow.com/a/65019934/562769 is a micro-benchmark that essentially shows this improvement. |
Thank you! It is merged to |
Thank you for the speedy merge @MartinThoma. We did not yet manage to create offending testdata, but will give it another try. |
New Features (ENH): - Add rotation property and transfer_rotate_to_content (#1348) Performance Improvements (PI): - Avoid string concatenation with large embedded base64-encoded images (#1350) Bug Fixes (BUG): - Format floats using their intrinsic decimal precision (#1267) Robustness (ROB): - Fix merge_page for pages without resources (#1349) Full Changelog: 2.10.8...2.10.9
Note to self: found out how the original PDF with embedded base64 image was created, will file a test later. |
There is a saftey margin of a factor of 10 in both directions, so the test should be fairly stable. Tests py-pdf#1350.
Source: py-pdf/pypdf#1350 (comment) Co-authored-by: Michael Karlen <[email protected]>
I've added the example file: https://github.com/py-pdf/sample-files |
There is a saftey margin of a factor of 10 in both directions, so the test should be fairly stable. Tests #1350. Co-authored-by: Michael Karlen <[email protected]>
Certain PDF libraries do embed images as base64 strings. This causes performance issues in
read_string_from_stream
due to incremental string concatenation, byte by byte.PDF Lib in our case is