fix pdf reader getting stuck when trying to read large files wihhout xref marker #808

dsk7 · 2022-04-23T15:23:47Z

When trying to find the xref marker, the PDF reader code the file backwards and builds a so called line by concatenating strings in a loop.
This leads to O(n^2) performance. For files where the xref marker can not be found at the end this takes a enormous amount of time:

1mb of zeros at the end: 45.54 seconds
2mb of zeros at the end: 357.04 seconds
(measured on a laptop made in 2015)

This pull request changes the relevant section of the code to become O(n), leading to a run time of less then 1 second for both cases mentioned above. Furthermore this PR adds a test to prevent regression.

Unit tests have been run manually on Python 2.7.18 and Python 3.8.10.

codecov-commenter · 2022-04-23T15:26:53Z

Codecov Report

Merging #808 (699e1ad) into main (3d65938) will not change coverage.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main     #808   +/-   ##
=======================================
  Coverage   75.22%   75.22%           
=======================================
  Files          11       11           
  Lines        3516     3516           
  Branches      810      810           
=======================================
  Hits         2645     2645           
  Misses        658      658           
  Partials      213      213

Impacted Files	Coverage Δ
PyPDF2/pdf.py	`81.85% <100.00%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3d65938...699e1ad. Read the comment docs.

…hout xref marker

MartinThoma · 2022-04-23T17:13:35Z

@dsk7 Amazing work! I will feature this in the next release notes - I absolutely love it! Thank you for adjusting the PR and for putting in the extra effort of creating a regression test ❤️ It's people like you who make open source absolutely amazing 🤗

MartinThoma · 2022-04-23T17:17:52Z

Trying to give you a little bit of the honor you deserve: https://twitter.com/_martinthoma/status/1517915519906729985 :-)

Butterfly · 2022-04-23T17:51:37Z

@dsk7 Amazing work! I will feature this in the next release notes - I absolutely love it! Thank you for adjusting the PR and for putting in the extra effort of creating a regression test heart It's people like you who make open source absolutely amazing hugs

That's too funny! I JUST tweeted this very thread a few seconds ago myself because it happened to be the one in my face. I'm going to pump your CONTRIBUTORS md out there, too. This Github has been something else to watch the last few weeks. It has turned into such an instantly nice, supportive spot out on the Web here. 👏 🥳

A change I would like to highlight is the performance improvement for large PDF files (#808) 🎉 New Features (ENH): - Add papersizes (#800) - Allow setting permission flags when encrypting (#803) - Allow setting form field flags (#802) Bug Fixes (BUG): - TypeError in xmp._converter_date (#813) - Improve spacing for text extraction (#806) - Fix PDFDocEncoding Character Set (#809) Robustness (ROB): - Use null ID when encrypted but no ID given (#812) - Handle recursion error (#804) Documentation (DOC): - CMaps (#811) - The PDF Format + commit prefixes (#810) - Add compression example (#792) Developer Experience (DEV): - Add Benchmark for Performance Testing (#781) Maintenance (MAINT): - Validate PDF magic byte in strict mode (#814) - Make PdfFileMerger.addBookmark() behave life PdfFileWriters\' (#339) - Quadratic runtime while parsing reduced to linear (#808) Testing (TST): - Newlines in text extraction (#807) Full Changelog: 1.27.8...1.27.9

When the PdfFileReader tries to find the xref marker, the readNextEndLine methods builds a so called line by reading byte-for-byte. Every time a new byte is read, it is concatenated with the currently read line. This leads to quadratic runtime O(n²) behavior as Python strings (also byte-strings) are immutable and have to be copied where n is the size of the file. For files where the xref marker can not be found at the end this takes a enormous amount of time: * 1mb of zeros at the end: 45.54 seconds * 2mb of zeros at the end: 357.04 seconds (measured on a laptop made in 2015) This pull request changes the relevant section of the code to become linear runtime O(n), leading to a run time of less then a second for both cases mentioned above. Furthermore this PR adds a regression test.

A change I would like to highlight is the performance improvement for large PDF files (py-pdf#808) 🎉 New Features (ENH): - Add papersizes (py-pdf#800) - Allow setting permission flags when encrypting (py-pdf#803) - Allow setting form field flags (py-pdf#802) Bug Fixes (BUG): - TypeError in xmp._converter_date (py-pdf#813) - Improve spacing for text extraction (py-pdf#806) - Fix PDFDocEncoding Character Set (py-pdf#809) Robustness (ROB): - Use null ID when encrypted but no ID given (py-pdf#812) - Handle recursion error (py-pdf#804) Documentation (DOC): - CMaps (py-pdf#811) - The PDF Format + commit prefixes (py-pdf#810) - Add compression example (py-pdf#792) Developer Experience (DEV): - Add Benchmark for Performance Testing (py-pdf#781) Maintenance (MAINT): - Validate PDF magic byte in strict mode (py-pdf#814) - Make PdfFileMerger.addBookmark() behave life PdfFileWriters\' (py-pdf#339) - Quadratic runtime while parsing reduced to linear (py-pdf#808) Testing (TST): - Newlines in text extraction (py-pdf#807) Full Changelog: py-pdf/pypdf@1.27.8...1.27.9

dsk7 · 2022-05-02T21:19:26Z

Trying to give you a little bit of the honor you deserve: https://twitter.com/_martinthoma/status/1517915519906729985 :-)

@dsk7 Amazing work! I will feature this in the next release notes - I absolutely love it! Thank you for adjusting the PR and for putting in the extra effort of creating a regression test heart It's people like you who make open source absolutely amazing hugs

Thanks for your kind words. I'm happy to be able to help this cool project!

dsk7 force-pushed the fix-pdf-reader-getting-stuck-on-large-files-without-startxref-marker branch from 49ed48a to 3be7ec7 Compare April 23, 2022 15:28

dsk7 changed the title ~~fix pdf reader getting stuck when trying to read large files wihhout xref marker~~ draft: fix pdf reader getting stuck when trying to read large files wihhout xref marker Apr 23, 2022

dsk7 force-pushed the fix-pdf-reader-getting-stuck-on-large-files-without-startxref-marker branch from 3be7ec7 to e4c4b9a Compare April 23, 2022 15:39

BUG: fix pdf reader getting stuck when trying to read large files wit…

699e1ad

…hout xref marker

dsk7 force-pushed the fix-pdf-reader-getting-stuck-on-large-files-without-startxref-marker branch from e4c4b9a to 699e1ad Compare April 23, 2022 15:44

dsk7 changed the title ~~draft: fix pdf reader getting stuck when trying to read large files wihhout xref marker~~ fix pdf reader getting stuck when trying to read large files wihhout xref marker Apr 23, 2022

MartinThoma merged commit c6c56f5 into py-pdf:main Apr 23, 2022

MartinThoma added the nf-performance Non-functional change: Performance label Apr 23, 2022

MartinThoma mentioned this pull request Apr 23, 2022

fix pdf reader getting stuck on large files without startxref marker #295

Closed

ztravis mentioned this pull request Jun 8, 2022

Optimize read_next_end_line #646

Merged

MartinThoma mentioned this pull request Nov 1, 2022

Use process_time instead of just time for measuring test performance. #1413

Closed

MartinThoma mentioned this pull request Jun 30, 2023

Quadratic runtime with malformed PDF missing xref marker #582

Closed

MartinThoma added the nf-security Non-functional change: Security label Jun 30, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix pdf reader getting stuck when trying to read large files wihhout xref marker #808

fix pdf reader getting stuck when trying to read large files wihhout xref marker #808

dsk7 commented Apr 23, 2022

codecov-commenter commented Apr 23, 2022 •

edited

Loading

MartinThoma commented Apr 23, 2022

MartinThoma commented Apr 23, 2022

Butterfly commented Apr 23, 2022

dsk7 commented May 2, 2022

fix pdf reader getting stuck when trying to read large files wihhout xref marker #808

fix pdf reader getting stuck when trying to read large files wihhout xref marker #808

Conversation

dsk7 commented Apr 23, 2022

codecov-commenter commented Apr 23, 2022 • edited Loading

Codecov Report

MartinThoma commented Apr 23, 2022

MartinThoma commented Apr 23, 2022

Butterfly commented Apr 23, 2022

dsk7 commented May 2, 2022

codecov-commenter commented Apr 23, 2022 •

edited

Loading