ENH: add decode_as_image() to ContentStreams #2615

pubpub-zz · 2024-05-01T12:18:36Z

codecov · 2024-05-01T12:25:58Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.14%. Comparing base (3c9f449) to head (604e2b8).
Report is 57 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #2615   +/-   ##
=======================================
  Coverage   95.13%   95.14%           
=======================================
  Files          51       51           
  Lines        8538     8547    +9     
  Branches     1702     1703    +1     
=======================================
+ Hits         8123     8132    +9     
  Misses        261      261           
  Partials      154      154


stefan6419846 · 2024-05-01T14:32:26Z

Should we really expect the users to basically call decode_image on every object with arbitrary nesting as there might be a "hidden" image somewhere? This feels rather strange.

Additionally, what happens when it is no image? We log a warning, but is there an exception as well due to invalid image data? If yes, why both?

pubpub-zz · 2024-05-01T16:28:24Z

> Should we really expect the users to basically call decode_image on every object with arbitrary nesting as there might be a "hidden" image somewhere? This feels rather strange.

Why is it strange? This offers a way to get an image from a stream when the image is present in the document but is not part of `.images` (for example, an image used in a pattern, as in B2.pdf, or in an annotation).

> Additionally, what happens when it is no image? We log a warning, but is there an exception as well due to invalid image data? If yes, why both?

I thought about this, and my concern is that it may hide some actual issues. I've completed the annotation.
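A caller could handle the "not an image" case raised above by checking `/Subtype` first and treating any decoding failure as "no image". The following is only a sketch under stated assumptions: `try_decode_as_image` is a hypothetical helper, not part of pypdf, and it relies only on the stream behaving like a dictionary and exposing the `decode_as_image()` method added by this PR. The thread does not say which exceptions the method raises, so the sketch catches `Exception` broadly.

```python
import logging
from typing import Any, Optional

logger = logging.getLogger(__name__)


def try_decode_as_image(stream: Any) -> Optional[Any]:
    """Return the decoded image, or None if the stream is not a usable image.

    Hypothetical helper for illustration; it is not part of pypdf.
    """
    # Skip streams that do not declare themselves as images.
    if stream.get("/Subtype") != "/Image":
        return None
    try:
        return stream.decode_as_image()
    except Exception as exc:  # assumption: the exact exception types are unknown
        logger.warning("Could not decode stream as image: %s", exc)
        return None
```

Whether to swallow the exception like this is exactly the trade-off discussed here: it is convenient when scanning many streams, but it can hide real decoding problems.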

stefan6419846 · 2024-05-02T13:50:00Z

I am still not sure whether we can really expect the user to examine every content stream for a possible image. Personally, I would prefer a clean solution, thus I am going to leave this PR open for further discussion.

pubpub-zz · 2024-06-09T09:53:25Z

I quickly reviewed the PDF 1.7 spec, and there are many objects that are not part of the current `.images[]`. Within pages I've found thumbnails, alternate images, patterns (the present case), and possibly mask images (as independent images). There are also images that are not stored in pages: thumbnails within linearized documents, and images within annotations (such as stamps, where images are stored within [/AP][/N][/Resources][/XObject]).
I may also have missed some elements.

At the very least, providing a function to ease image extraction for other developers should be an improvement.
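As an illustration of the annotation case described above, the path [/AP][/N][/Resources][/XObject] could be walked roughly as follows. This is a minimal sketch, not code from the PR: `iter_annotation_images` and `_resolve` are hypothetical names, the page is only assumed to behave like a dictionary (as pypdf page objects do), and only the case where /N is a single appearance stream is handled.

```python
from typing import Any, Iterator, Tuple


def _resolve(obj: Any) -> Any:
    """Follow an indirect reference if the object exposes get_object()."""
    return obj.get_object() if hasattr(obj, "get_object") else obj


def iter_annotation_images(page: Any) -> Iterator[Tuple[str, Any]]:
    """Yield (name, stream) for image XObjects found in annotation appearances.

    Hypothetical helper: it looks only at /AP -> /N -> /Resources -> /XObject.
    """
    for annot_ref in _resolve(page.get("/Annots")) or []:
        annot = _resolve(annot_ref)
        appearance = _resolve(annot.get("/AP")) or {}
        normal = _resolve(appearance.get("/N"))
        if normal is None:
            continue  # annotation without a normal appearance
        resources = _resolve(normal.get("/Resources")) or {}
        xobjects = _resolve(resources.get("/XObject")) or {}
        for name, ref in xobjects.items():
            xobj = _resolve(ref)
            if xobj.get("/Subtype") == "/Image":
                yield name, xobj


# Possible usage with pypdf (assumes pypdf with decode_as_image() and a
# local file such as test_stamp.pdf):
#
#     from pypdf import PdfReader
#     reader = PdfReader("test_stamp.pdf")
#     for name, stream in iter_annotation_images(reader.pages[0]):
#         image = stream.decode_as_image()
#         print(name, image.size)
```

Form XObjects and appearance dictionaries with several states are skipped here; a fuller traversal would need to recurse into them.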

stefan6419846 · 2024-06-09T09:55:08Z

In this case, could you please fix the merge conflicts and add a basic example to the docs?

pubpub-zz · 2024-06-09T10:10:05Z

test doc for example in documentation:
test_stamp.pdf

## What's new

### New Features (ENH)
- Accept ETen-B5 and UniCNS-UTF16 encodings (#2721) by @pubpub-zz
- Add decode_as_image() to ContentStreams (#2615) by @pubpub-zz
- context manager for PdfReader (#2666) by @tibor-reiss
- Add capability to set font and size in fields (#2636) by @pubpub-zz
- Allow to pass input file without named argument (#2576) by @pubpub-zz

### Bug Fixes (BUG)
- Fix deprecation for Ressources when using old constants (#2705) by @stefan6419846
- Fix images issue 4 bits encoding and LUT starting with UTF16_BOM (#2675) by @pubpub-zz
- Reading large compressed images takes huge time to process (#2644) by @snanda85
- Highlighted Text Cannot Be Printed (#2604) by @Nifury
- Fix UnboundLocalError on malformed pdf (#2619) by @farjasju

### Documentation (DOC)
- Various improvements on docstrings and examples by @j-t-1

### Robustness (ROB)
- Cope with missing Standard 14 fonts in fields (#2677) by @pubpub-zz
- Improve inline image extraction (#2622) by @pubpub-zz
- Cope with loops in Fields tree (#2656) by @pubpub-zz
- Discard /I in choice fields for compatibility with Acrobat (#2614) by @pubpub-zz
- Cope with some issues in pillow (#2595) by @pubpub-zz
- Cope with some image extraction issues (#2591) by @pubpub-zz

### Maintenance (MAINT)
- Deprecate interiour_color with replacement interior_color (#2706) by @j-t-1
- Add deprecate_with_replacement to PdfWriter.find_bookmark (#2674) by @j-t-1

### Code Style (STY)
- Change Link to be a non-markup annotation (#2714) by @j-t-1

[Full Changelog](4.2.0...4.3.0)

ENH: add decode_as_image() to ContentStreams

854c467

closes py-pdf#2613

add annotation about exceptions

0fb2a73

pubpub-zz added 2 commits May 1, 2024 22:50

fix doc

6b83ef5

Merge branch 'main' into iss2613

b68b907

pubpub-zz requested a review from stefan6419846 May 2, 2024 12:51

Merge branch 'main' into iss2613

8d76f9d

stefan6419846 reviewed Jun 9, 2024

pypdf/generic/_data_structures.py (outdated, resolved)

stefan6419846 reviewed Jun 9, 2024

pypdf/generic/_data_structures.py (outdated, resolved)

pubpub-zz and others added 2 commits June 9, 2024 11:56

Update pypdf/generic/_data_structures.py

57c787c

Co-authored-by: Stefan <[email protected]>

Update pypdf/generic/_data_structures.py

66aa394

Co-authored-by: Stefan <[email protected]>

fix test + add documentation

9ced094

stefan6419846 reviewed Jun 9, 2024

docs/user/extract-images.md (resolved)

stefan6419846 reviewed Jun 9, 2024

docs/user/extract-images.md (outdated, resolved)

pubpub-zz and others added 2 commits June 9, 2024 12:33

Update docs/user/extract-images.md

561412d

Co-authored-by: Stefan <[email protected]>

style

604e2b8

pubpub-zz requested a review from stefan6419846 June 9, 2024 10:42

stefan6419846 approved these changes Jun 9, 2024

stefan6419846 merged commit 26d1615 into py-pdf:main Jun 9, 2024
16 checks passed

stefan6419846 mentioned this pull request Jun 25, 2024

ENH: consider images inside PDF made with onlyoffice #2637

Open
