-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Add PageObject.images attribute #1330
Conversation
Codecov ReportBase: 94.71% // Head: 94.55% // Decreases project coverage by
Additional details and impacted files@@ Coverage Diff @@
## main #1330 +/- ##
==========================================
- Coverage 94.71% 94.55% -0.16%
==========================================
Files 30 28 -2
Lines 5181 5016 -165
Branches 1060 1033 -27
==========================================
- Hits 4907 4743 -164
Misses 164 164
+ Partials 110 109 -1
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
@pubpub-zz @MasterOdin What do you think about this PR? While I wrote it, I realized that PyPDF2 does something wrong with image extraction in some cases. I marked those tests with xfail. The point of this PR is not to fix those issues, but to provide a convenient interface for getting images from PDF pages. That means:
@pubpub-zz You mentioned that this method might not get all images of a page. For this PR, this would be acceptable to me. We can fix that later. As a follow-up step we might use the I'm uncertain about the The reason why I chose mime-type were spelling inconsistencies like this:
Additionally, I'm uncertain if using extension vs mime_type makes a difference if we use the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good.
Perhaps We should add in TODO about inline images (not extracted yet)
Also add a warning message if unknown type
mime type looks more appropriate for me too
PyPDF2/_page.py
Outdated
filename = f"{obj[1:]}.{File._mime2extension(mime_type)}" | ||
images_extracted.append( | ||
File(name=filename, data=byte_stream, mime_type=mime_type) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This strikes me as an odd abstraction, where we are passing in the mime_type
as part of the File
constructor, but we also need to construct the full filename, using a private static function to boot, but also that the file_extension
method doesn't correspond to the extension of the passed in name
, but rather mime_type
.
If we go the route of passing in the mime_type
for the File, I'd advocate for just passing in name
sans extension altogether and we can have a special property function that does the concatenation of name + extension to give a "filename" on demand as needed by users.
The only caveat would be for attachments, it may make sense to pass in the full filename, but I'm not well versed on that part of the spec to even know how that API might look.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'll go with 'File only has name + data (no mime_type)' for the moment, because it seems to have only advantages:
- Less clutter / less code to maintain
- No potential to discover the wrong mime type
- We could make
_xobj_to_image
just pass the file extension as before - As
_xobj_to_image
is a private function, we can easily change the behavior if we see a clear advantage
This is consistent with Pillow: https://pillow.readthedocs.io/en/latest/reference/Image.html#PIL.Image.Image.format Co-authored-by: Matthew Peveler <[email protected]>
21b54ae
to
c491881
Compare
New Features (ENH): - Addition of optional visitor-functions in extract_text() (#1252) - Add metadata.creation_date and modification_date (#1364) - Add PageObject.images attribute (#1330) Bug Fixes (BUG): - Lookup index in _xobj_to_image can be ByteStringObject (#1366) - \'IndexError: index out of range\' when using extract_text (#1361) - Errors in transfer_rotation_to_content() (#1356) Robustness (ROB): - Ensure update_page_form_field_values does not fail if no fields (#1346) Testing (TST): - read_string_from_stream performance (#1355) Full Changelog: 2.10.9...2.11.0
No description provided.