Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix pdf reader getting stuck when trying to read large files wihhout xref marker #808

Conversation

dsk7
Copy link
Contributor

@dsk7 dsk7 commented Apr 23, 2022

When trying to find the xref marker, the PDF reader code the file backwards and builds a so called line by concatenating strings in a loop.
This leads to O(n^2) performance. For files where the xref marker can not be found at the end this takes a enormous amount of time:

1mb of zeros at the end: 45.54 seconds
2mb of zeros at the end: 357.04 seconds
(measured on a laptop made in 2015)

This pull request changes the relevant section of the code to become O(n), leading to a run time of less then 1 second for both cases mentioned above. Furthermore this PR adds a test to prevent regression.

Unit tests have been run manually on Python 2.7.18 and Python 3.8.10.

@codecov-commenter
Copy link

codecov-commenter commented Apr 23, 2022

Codecov Report

Merging #808 (699e1ad) into main (3d65938) will not change coverage.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main     #808   +/-   ##
=======================================
  Coverage   75.22%   75.22%           
=======================================
  Files          11       11           
  Lines        3516     3516           
  Branches      810      810           
=======================================
  Hits         2645     2645           
  Misses        658      658           
  Partials      213      213           
Impacted Files Coverage Δ
PyPDF2/pdf.py 81.85% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3d65938...699e1ad. Read the comment docs.

@dsk7 dsk7 force-pushed the fix-pdf-reader-getting-stuck-on-large-files-without-startxref-marker branch from 49ed48a to 3be7ec7 Compare April 23, 2022 15:28
@dsk7 dsk7 changed the title fix pdf reader getting stuck when trying to read large files wihhout xref marker draft: fix pdf reader getting stuck when trying to read large files wihhout xref marker Apr 23, 2022
@dsk7 dsk7 force-pushed the fix-pdf-reader-getting-stuck-on-large-files-without-startxref-marker branch from 3be7ec7 to e4c4b9a Compare April 23, 2022 15:39
@dsk7 dsk7 force-pushed the fix-pdf-reader-getting-stuck-on-large-files-without-startxref-marker branch from e4c4b9a to 699e1ad Compare April 23, 2022 15:44
@dsk7 dsk7 changed the title draft: fix pdf reader getting stuck when trying to read large files wihhout xref marker fix pdf reader getting stuck when trying to read large files wihhout xref marker Apr 23, 2022
@MartinThoma MartinThoma merged commit c6c56f5 into py-pdf:main Apr 23, 2022
@MartinThoma
Copy link
Member

@dsk7 Amazing work! I will feature this in the next release notes - I absolutely love it! Thank you for adjusting the PR and for putting in the extra effort of creating a regression test ❤️ It's people like you who make open source absolutely amazing 🤗

@MartinThoma
Copy link
Member

Trying to give you a little bit of the honor you deserve: https://twitter.com/_martinthoma/status/1517915519906729985 :-)

@Butterfly
Copy link

@dsk7 Amazing work! I will feature this in the next release notes - I absolutely love it! Thank you for adjusting the PR and for putting in the extra effort of creating a regression test heart It's people like you who make open source absolutely amazing hugs

That's too funny! I JUST tweeted this very thread a few seconds ago myself because it happened to be the one in my face. I'm going to pump your CONTRIBUTORS md out there, too. This Github has been something else to watch the last few weeks. It has turned into such an instantly nice, supportive spot out on the Web here. 👏 🥳

MartinThoma added a commit that referenced this pull request Apr 24, 2022
A change I would like to highlight is the performance improvement for
large PDF files (#808) 🎉

New Features (ENH):
-  Add papersizes (#800)
-  Allow setting permission flags when encrypting (#803)
-  Allow setting form field flags (#802)

Bug Fixes (BUG):
-  TypeError in xmp._converter_date (#813)
-  Improve spacing for text extraction (#806)
-  Fix PDFDocEncoding Character Set (#809)

Robustness (ROB):
-  Use null ID when encrypted but no ID given (#812)
-  Handle recursion error (#804)

Documentation (DOC):
-  CMaps (#811)
-  The PDF Format + commit prefixes (#810)
-  Add compression example (#792)

Developer Experience (DEV):
-  Add Benchmark for Performance Testing (#781)

Maintenance (MAINT):
-  Validate PDF magic byte in strict mode (#814)
-  Make PdfFileMerger.addBookmark() behave life PdfFileWriters\' (#339)
-  Quadratic runtime while parsing reduced to linear  (#808)

Testing (TST):
-  Newlines in text extraction (#807)

Full Changelog: 1.27.8...1.27.9
VictorCarlquist pushed a commit to VictorCarlquist/PyPDF2 that referenced this pull request Apr 29, 2022
When the PdfFileReader tries to find the xref marker, the readNextEndLine methods builds a so called line by reading byte-for-byte. Every time a new byte is read, it is concatenated with the currently read line. This leads to quadratic runtime O(n²) behavior as Python strings (also byte-strings) are immutable and have to be copied where n is the size of the file.
For files where the xref marker can not be found at the end this takes a enormous amount of time:

* 1mb of zeros at the end: 45.54 seconds
* 2mb of zeros at the end: 357.04 seconds
(measured on a laptop made in 2015)

This pull request changes the relevant section of the code to become linear runtime O(n), leading to a run time of less then a second for both cases mentioned above. Furthermore this PR adds a regression test.
VictorCarlquist pushed a commit to VictorCarlquist/PyPDF2 that referenced this pull request Apr 29, 2022
A change I would like to highlight is the performance improvement for
large PDF files (py-pdf#808) 🎉

New Features (ENH):
-  Add papersizes (py-pdf#800)
-  Allow setting permission flags when encrypting (py-pdf#803)
-  Allow setting form field flags (py-pdf#802)

Bug Fixes (BUG):
-  TypeError in xmp._converter_date (py-pdf#813)
-  Improve spacing for text extraction (py-pdf#806)
-  Fix PDFDocEncoding Character Set (py-pdf#809)

Robustness (ROB):
-  Use null ID when encrypted but no ID given (py-pdf#812)
-  Handle recursion error (py-pdf#804)

Documentation (DOC):
-  CMaps (py-pdf#811)
-  The PDF Format + commit prefixes (py-pdf#810)
-  Add compression example (py-pdf#792)

Developer Experience (DEV):
-  Add Benchmark for Performance Testing (py-pdf#781)

Maintenance (MAINT):
-  Validate PDF magic byte in strict mode (py-pdf#814)
-  Make PdfFileMerger.addBookmark() behave life PdfFileWriters\' (py-pdf#339)
-  Quadratic runtime while parsing reduced to linear  (py-pdf#808)

Testing (TST):
-  Newlines in text extraction (py-pdf#807)

Full Changelog: py-pdf/pypdf@1.27.8...1.27.9
@dsk7
Copy link
Contributor Author

dsk7 commented May 2, 2022

Trying to give you a little bit of the honor you deserve: https://twitter.com/_martinthoma/status/1517915519906729985 :-)

@dsk7 Amazing work! I will feature this in the next release notes - I absolutely love it! Thank you for adjusting the PR and for putting in the extra effort of creating a regression test heart It's people like you who make open source absolutely amazing hugs

Thanks for your kind words. I'm happy to be able to help this cool project!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
nf-performance Non-functional change: Performance nf-security Non-functional change: Security
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants