Fix PDF reader getting stuck when trying to read large files without xref marker #808
Codecov Report
@@           Coverage Diff           @@
##             main     #808   +/-   ##
=======================================
  Coverage   75.22%   75.22%
=======================================
  Files          11       11
  Lines        3516     3516
  Branches      810      810
=======================================
  Hits         2645     2645
  Misses        658      658
  Partials      213      213
@dsk7 Amazing work! I will feature this in the next release notes - I absolutely love it! Thank you for adjusting the PR and for putting in the extra effort of creating a regression test ❤️ It's people like you who make open source absolutely amazing 🤗
Trying to give you a little bit of the honor you deserve: https://twitter.com/_martinthoma/status/1517915519906729985 :-)
That's too funny! I JUST tweeted this very thread a few seconds ago myself because it happened to be the one in my face. I'm going to pump your CONTRIBUTORS md out there, too. This GitHub has been something else to watch the last few weeks. It has turned into such an instantly nice, supportive spot out on the Web here. 👏 🥳
A change I would like to highlight is the performance improvement for large PDF files (#808) 🎉

New Features (ENH):
- Add papersizes (#800)
- Allow setting permission flags when encrypting (#803)
- Allow setting form field flags (#802)

Bug Fixes (BUG):
- TypeError in xmp._converter_date (#813)
- Improve spacing for text extraction (#806)
- Fix PDFDocEncoding Character Set (#809)

Robustness (ROB):
- Use null ID when encrypted but no ID given (#812)
- Handle recursion error (#804)

Documentation (DOC):
- CMaps (#811)
- The PDF Format + commit prefixes (#810)
- Add compression example (#792)

Developer Experience (DEV):
- Add Benchmark for Performance Testing (#781)

Maintenance (MAINT):
- Validate PDF magic byte in strict mode (#814)
- Make PdfFileMerger.addBookmark() behave like PdfFileWriter's (#339)
- Quadratic runtime while parsing reduced to linear (#808)

Testing (TST):
- Newlines in text extraction (#807)

Full Changelog: 1.27.8...1.27.9
When the PdfFileReader tries to find the xref marker, the readNextEndLine method builds a so-called line by reading the file byte by byte. Every time a new byte is read, it is concatenated with the line read so far. Because Python strings (and byte strings) are immutable, each concatenation copies the whole line, so the runtime is quadratic, O(n²), where n is the size of the file. For files where the xref marker cannot be found at the end, this takes an enormous amount of time:
* 1 MB of zeros at the end: 45.54 seconds
* 2 MB of zeros at the end: 357.04 seconds
(measured on a laptop made in 2015)
This pull request changes the relevant section of the code to run in linear time, O(n), which brings the runtime below one second for both cases mentioned above. Furthermore, this PR adds a regression test.
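To illustrate the difference between the two strategies, here is a minimal sketch. It is not the actual readNextEndLine code from PyPDF2; the function names `read_previous_line_quadratic` and `read_previous_line_linear` are hypothetical, and the sketch assumes a seekable binary stream such as `io.BytesIO`. Both functions read backwards from the current position until they hit a line ending; only the way the line is assembled differs.

```python
import io


def read_previous_line_quadratic(stream):
    """Hypothetical sketch: prepend each byte to an immutable bytes object."""
    line = b""
    while stream.tell() > 0:
        stream.seek(-1, io.SEEK_CUR)   # step back onto the previous byte
        byte = stream.read(1)          # read it (moves forward by one)
        stream.seek(-1, io.SEEK_CUR)   # step back again so we keep moving left
        if byte in (b"\n", b"\r"):
            break
        line = byte + line             # copies the whole line on every byte
    return line


def read_previous_line_linear(stream):
    """Hypothetical sketch: collect bytes in a list and join once at the end."""
    parts = []
    while stream.tell() > 0:
        stream.seek(-1, io.SEEK_CUR)
        byte = stream.read(1)
        stream.seek(-1, io.SEEK_CUR)
        if byte in (b"\n", b"\r"):
            break
        parts.append(byte)             # amortized constant time per byte
    parts.reverse()                    # bytes were collected back to front
    return b"".join(parts)             # single copy of the final line


if __name__ == "__main__":
    data = b"%PDF-1.4\nstartxref\n123\n%%EOF"
    stream = io.BytesIO(data)
    stream.seek(0, io.SEEK_END)
    print(read_previous_line_linear(stream))  # b'%%EOF'
    print(read_previous_line_linear(stream))  # b'123'
```

Both variants return the same bytes. The list-and-join version avoids re-copying the partial line for each byte, which is where the quadratic cost comes from when a file ends in a long run of bytes without a line ending.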
Thanks for your kind words. I'm happy to be able to help this cool project!
When trying to find the xref marker, the PDF reader code reads the file backwards and builds a so-called line by concatenating strings in a loop.
This leads to O(n^2) performance. For files where the xref marker cannot be found at the end, this takes an enormous amount of time:
1 MB of zeros at the end: 45.54 seconds
2 MB of zeros at the end: 357.04 seconds
(measured on a laptop made in 2015)
This pull request changes the relevant section of the code to become O(n), leading to a run time of less than 1 second for both cases mentioned above. Furthermore, this PR adds a test to prevent regression.
Unit tests have been run manually on Python 2.7.18 and Python 3.8.10.
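For readers who want to see the scaling on their own machine, the following is a rough micro-benchmark sketch (Python 3 only, since it uses `time.perf_counter`). It does not use PyPDF2 at all; `build_by_prepending`, `build_by_joining` and `time_call` are hypothetical helpers that only mimic the two ways of assembling a line from zero padding read back to front. The input sizes are kept small so that the slow variant still finishes quickly; absolute timings will vary by machine.

```python
import time


def build_by_prepending(data):
    """Assemble data back to front by prepending to a bytes object."""
    line = b""
    for i in range(len(data) - 1, -1, -1):
        line = data[i:i + 1] + line    # new bytes object on every step
    return line


def build_by_joining(data):
    """Assemble data back to front with a list and a single join."""
    parts = []
    for i in range(len(data) - 1, -1, -1):
        parts.append(data[i:i + 1])
    parts.reverse()
    return b"".join(parts)


def time_call(func, data):
    """Return the wall-clock seconds taken by one call of func(data)."""
    start = time.perf_counter()
    func(data)
    return time.perf_counter() - start


if __name__ == "__main__":
    for size in (20000, 40000):
        padding = b"\x00" * size       # stands in for zeros at the end of a file
        slow = time_call(build_by_prepending, padding)
        fast = time_call(build_by_joining, padding)
        print("{} bytes: prepend {:.4f}s, join {:.4f}s".format(size, slow, fast))
```

If the prepending variant takes disproportionately longer when the input size doubles while the joining variant roughly doubles, that is consistent with the quadratic versus linear behaviour described above.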