PI: Avoid string concatenation with large embedded base64-encoded images #1350

mergezalot · 2022-09-16T11:35:13Z

Certain PDF libraries embed images as base64 strings. This causes performance issues in `read_string_from_stream` because the string is built by incremental concatenation, byte by byte.

The PDF library in our case is:

<xmp:CreatorTool>Canon iR-ADV C256  PDF</xmp:CreatorTool>
<pdf:Producer>PDF Annotator 8.0.0.826 [Adobe PSL 1.3e for Canon</pdf:Producer>
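
To illustrate the pattern described above, here is a minimal sketch that is not the actual PyPDF2 code. The function names `read_by_concatenation` and `read_by_join` are hypothetical. Both read a stream one byte at a time; the first grows a `bytes` object with `+=`, the second collects the pieces in a list and joins them once at the end.

```python
import io


def read_by_concatenation(stream: io.BytesIO) -> bytes:
    """Slow pattern (hypothetical helper): grow a bytes object byte by byte."""
    result = b""
    while True:
        tok = stream.read(1)
        if not tok:
            break
        # bytes are immutable, so each += can copy everything read so far
        result += tok
    return result


def read_by_join(stream: io.BytesIO) -> bytes:
    """Faster pattern (hypothetical helper): collect pieces, join once."""
    parts = []
    while True:
        tok = stream.read(1)
        if not tok:
            break
        parts.append(tok)  # appending to a list is cheap
    return b"".join(parts)  # a single copy at the end


payload = b"QUJD" * 10_000  # stand-in for a long base64 string
same = read_by_concatenation(io.BytesIO(payload)) == read_by_join(io.BytesIO(payload))
```

Both functions return the same bytes; only the cost of building the result differs, and the gap should widen as the payload grows.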

mergezalot · 2022-09-16T11:37:46Z

The correct behaviour is tested by existing unit tests. I am not sure if you want performance tests on large files in your code base or not. Runtime on a 4 MB PDF with a single embedded base64 image was several minutes with the old code, whereas with the new code it is roughly a second.

Sadly, I have not yet managed to create a sample PDF, and the one PDF I have contains sensitive data.

codecov · 2022-09-16T11:50:37Z

Codecov Report

Base: 94.63% // Head: 94.63% // No change to project coverage 👍

Coverage data is based on head (28306de) compared to base (7c96d13).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #1350   +/-   ##
=======================================
  Coverage   94.63%   94.63%           
=======================================
  Files          30       30           
  Lines        5140     5140           
  Branches     1058     1058           
=======================================
  Hits         4864     4864           
  Misses        164      164           
  Partials      112      112

Impacted Files	Coverage Δ
PyPDF2/generic/_utils.py	`100.00% <100.00%> (ø)`

MartinThoma · 2022-09-17T09:45:40Z

Thank you for the PR ❤️

MartinThoma · 2022-09-17T09:50:30Z

> I am not sure if you want performance tests on large files in your code base or not.

Yes, I want that! We have the https://github.com/py-pdf/sample-files repository included as a git submodule to ensure the PyPDF2 repository itself remains small (that is important if people install directly from source).

I also want tests that are expected to be slow to be marked with a pytest marker. Currently we only have `external`, but we could add a `slow` marker as well.
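
A minimal sketch of what such a `slow` marker could look like, assuming pytest is installed. The test name and the registration text are hypothetical and not taken from the PyPDF2 repository; `pytest_configure` and `config.addinivalue_line` are standard pytest hooks and would normally live in `conftest.py`.

```python
import pytest


# conftest.py (sketch): register the marker so pytest does not warn about it.
def pytest_configure(config):
    config.addinivalue_line(
        "markers", "slow: test is expected to take a long time"
    )


# test module (sketch): a hypothetical test tagged as slow.
@pytest.mark.slow
def test_large_base64_image():
    data = b"A" * 1_000_000  # stand-in for a large embedded image
    assert len(b"".join([data])) == 1_000_000
```

With a marker like this, `pytest -m "not slow"` deselects the slow tests and `pytest -m slow` runs only them.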

Additionally, we have a benchmark over several PDF libraries: https://github.com/py-pdf/benchmarks
It might make sense to add a few "special" cases there (PRs are welcome :-) )

MartinThoma · 2022-09-17T10:03:44Z

https://stackoverflow.com/a/65019934/562769 is a micro-benchmark that essentially shows this improvement.

MartinThoma · 2022-09-17T10:06:30Z

Thank you! It is merged to main and will be part of the next release to PyPI (likely today or tomorrow)

mergezalot · 2022-09-17T17:18:28Z

Thank you for the speedy merge @MartinThoma. We have not yet managed to create offending test data, but will give it another try.

New Features (ENH):
- Add rotation property and transfer_rotate_to_content (#1348)

Performance Improvements (PI):
- Avoid string concatenation with large embedded base64-encoded images (#1350)

Bug Fixes (BUG):
- Format floats using their intrinsic decimal precision (#1267)

Robustness (ROB):
- Fix merge_page for pages without resources (#1349)

Full Changelog: 2.10.8...2.10.9

mergezalot · 2022-09-20T06:26:55Z

Note to self: found out how the original PDF with embedded base64 image was created, will file a test later.
base64image.pdf

MartinThoma · 2022-09-24T05:05:13Z

I've added the example file: https://github.com/py-pdf/sample-files

mergezalot force-pushed the fix-read_string_from_stream-performance branch from 01c1956 to a41c497 Compare September 16, 2022 11:42

mergezalot force-pushed the fix-read_string_from_stream-performance branch from a41c497 to 28306de Compare September 16, 2022 11:43

mergezalot changed the title ~~Fix performance issues with large embedded base64 images~~ BUG: fix performance issues with large embedded base64 images Sep 16, 2022

mergezalot changed the title ~~BUG: fix performance issues with large embedded base64 images~~ BUG: fix performance issues with large embedded base64 Sep 16, 2022

MartinThoma changed the title ~~BUG: fix performance issues with large embedded base64~~ PERF: Avoid repeated string concatenation with large embedded base64 images Sep 17, 2022

MartinThoma changed the title ~~PERF: Avoid repeated string concatenation with large embedded base64 images~~ PERF: Avoid string concatenation with large embedded base64-encoded images Sep 17, 2022

MartinThoma added the nf-performance Non-functional change: Performance label Sep 17, 2022

MartinThoma changed the title ~~PERF: Avoid string concatenation with large embedded base64-encoded images~~ PI: Avoid string concatenation with large embedded base64-encoded images Sep 17, 2022

MartinThoma merged commit 3be01fd into py-pdf:main Sep 17, 2022

mergezalot added a commit to mergezalot/PyPDF2 that referenced this pull request Sep 20, 2022

Test read_string_from_stream-performance

258d6c4

There is a safety margin of a factor of 10 in both directions, so the test should be fairly stable. Tests py-pdf#1350.
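
The commit message does not spell out the numbers. One plausible reading, given the runtimes reported earlier in this thread (roughly one second with the new code, several minutes with the old), is a time limit about ten times above the fast runtime and still far below the slow one. The sketch below shows that idea with a stand-in workload; the constants and the names `elapsed_seconds` and `workload` are assumptions, not the contents of the actual test.

```python
import time

FAST_RUNTIME_SECONDS = 1.0  # assumed: "roughly a second" with the new code
SAFETY_FACTOR = 10          # assumed: the factor named in the commit message
LIMIT_SECONDS = FAST_RUNTIME_SECONDS * SAFETY_FACTOR


def elapsed_seconds(func) -> float:
    """Run func once and return the wall-clock time it took."""
    start = time.perf_counter()
    func()
    return time.perf_counter() - start


def workload() -> bytes:
    """Stand-in for parsing the PDF: join many one-byte pieces."""
    parts = [b"A"] * 1_000_000
    return b"".join(parts)


def test_runtime_stays_below_limit():
    # Fails only if the runtime regresses by about an order of magnitude.
    assert elapsed_seconds(workload) < LIMIT_SECONDS
```

A wide margin like this trades sensitivity for stability: small slowdowns go unnoticed, but the test should not fail merely because a CI machine is busy.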

mergezalot mentioned this pull request Sep 20, 2022

TST: read_string_from_stream performance #1355

Merged

MartinThoma added a commit to py-pdf/sample-files that referenced this pull request Sep 24, 2022

ENH: Add PDF with base64-encoded image

fee1d72

Source: py-pdf/pypdf#1350 (comment) Co-authored-by: Michael Karlen <[email protected]>
