This issue was moved to a discussion.

Can't extract images for this PDF #3936

Closed
bbfrog opened this issue Oct 10, 2024 · 5 comments
Labels
not a bug (user error / unable to reproduce)

Comments

bbfrog commented Oct 10, 2024

Description of the bug

Attached file: Monaleesa_full.pdf
PyMuPDF can't extract the images on page 2 and page 4 of this PDF.

How to reproduce the bug

import pymupdf

doc = pymupdf.open("Monaleesa_full.pdf")

# Count the embedded images that each page references (1-based page numbers).
for page_num, page in enumerate(doc, start=1):
    images = page.get_images(full=True)
    print(f"page {page_num}: {len(images)} images")

PyMuPDF version

1.24.11

Operating system

macOS

Python version

3.12

JorjMcKie added the "not a bug" label Oct 10, 2024

JorjMcKie (Collaborator) commented Oct 10, 2024

Except for page 7 (0-based), none of the pages contains an image.
What you see on the other pages are vector graphics, not images.
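
To check this distinction on your own file, a minimal sketch along the following lines may help. It assumes a PyMuPDF page object and relies only on `page.get_images()` and `page.get_drawings()`; the helper names `count_images_and_drawings` and `describe_page` are made up for illustration and are not part of PyMuPDF.

```python
def count_images_and_drawings(page):
    """Return (n_images, n_drawings) for one PyMuPDF page.

    page.get_images() lists the embedded images the page references;
    page.get_drawings() lists its vector drawing paths.
    """
    n_images = len(page.get_images(full=True))
    n_drawings = len(page.get_drawings())
    return n_images, n_drawings


def describe_page(page):
    """Build a one-line summary using the 0-based page number."""
    n_images, n_drawings = count_images_and_drawings(page)
    return f"page {page.number}: {n_images} images, {n_drawings} vector paths"
```

Calling `print(describe_page(page))` for each `page` in `pymupdf.open("Monaleesa_full.pdf")` would be expected to report zero images but a non-zero number of vector paths on the pages in question.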

JorjMcKie (Collaborator) commented

Vector graphics cannot be extracted. All you can do is make a "photo" of the respective page area ...

bbfrog (Author) commented Oct 12, 2024

The Acrobat API can extract the vector graphics and save them as PNG or SVG. How does it do this? Would it be hard to support in PyMuPDF? Thanks!

JorjMcKie (Collaborator) commented

You can try this script. Or do this:

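import pymupdf

doc = pymupdf.open("input.pdf")
for page in doc:
    # cluster_drawings() returns rectangles, each wrapping a group of neighboring vector drawings
    for i, bbox in enumerate(page.cluster_drawings()):
        # Render only that rectangle of the page to a raster image
        pix = page.get_pixmap(clip=bbox, dpi=150)
        pix.save(f"{doc.name}-{page.number}-{i}.png")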

bbfrog (Author) commented Oct 15, 2024

Thanks very much, @JorjMcKie. It works and extracts the image I want. But it also extracts the tables in this PDF as drawings. Is there any field that can differentiate tables from other drawings? Thanks!
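
One possible approach, not confirmed anywhere in this thread, is to ask PyMuPDF for the tables it detects with `page.find_tables()` and skip any drawing cluster that mostly overlaps one of them. The sketch below assumes that table detection succeeds on the file; the helper names `overlap_fraction` and `non_table_clusters` and the `min_overlap` threshold of 0.5 are hypothetical choices for illustration.

```python
def overlap_fraction(rect, other):
    """Fraction of rect's area lying inside other; both are (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = rect
    ox0, oy0, ox1, oy1 = other
    width = min(x1, ox1) - max(x0, ox0)
    height = min(y1, oy1) - max(y0, oy0)
    area = (x1 - x0) * (y1 - y0)
    if width <= 0 or height <= 0 or area <= 0:
        return 0.0
    return (width * height) / area


def non_table_clusters(page, min_overlap=0.5):
    """Return drawing clusters of a PyMuPDF page that do not coincide with a detected table.

    min_overlap is an assumed threshold: a cluster is dropped when at least
    that fraction of its area falls inside some table's bounding box.
    """
    table_boxes = [tuple(table.bbox) for table in page.find_tables().tables]
    kept = []
    for rect in page.cluster_drawings():
        box = tuple(rect)
        if all(overlap_fraction(box, tb) < min_overlap for tb in table_boxes):
            kept.append(box)
    return kept
```

Each returned box could then be passed as `clip=` to `page.get_pixmap()` in place of the raw `cluster_drawings()` output. Whether this separates tables cleanly depends on how well the table finder works on a given PDF.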

pymupdf locked and limited conversation to collaborators Oct 15, 2024
JorjMcKie converted this issue into discussion #3948 Oct 15, 2024
